Low-Latency Tuning¶
Architectural optimizations for sub-millisecond tick-to-trade latency.
Overview¶
Achieving consistent low latency requires optimizing four areas:
| Area | Problem | Solution |
|---|---|---|
| Thread Scheduling | OS scheduler jitter adds milliseconds | Core pinning |
| Memory Allocation | Dynamic allocation causes pauses | Object pools |
| CPU Cache | False sharing degrades performance | Cache-line alignment |
| I/O Patterns | Blocking receives waste latency budget | Busy-polling |
Thread Affinity¶
Pin critical threads to dedicated CPU cores to eliminate scheduler migration.
Configuration¶
```rust
pub struct ThreadConfig {
    pub market_data_core: usize, // e.g., 0
    pub execution_core: usize,   // e.g., 1
    pub metrics_core: usize,     // e.g., 2
}
```
Pinning a Thread¶
```rust
use core_affinity::CoreId;

// `core_id` comes from the enclosing scope (e.g., ThreadConfig above)
std::thread::spawn(move || {
    // Pin thread to core - fail loudly if pinning fails
    if !core_affinity::set_for_current(CoreId { id: core_id }) {
        panic!("Failed to pin thread to core {}", core_id);
    }

    // Set high priority (requires CAP_SYS_NICE on Linux)
    #[cfg(target_os = "linux")]
    {
        use thread_priority::{ThreadPriority, set_current_thread_priority};
        let _ = set_current_thread_priority(ThreadPriority::Max);
    }

    // Run hot path loop
    run_loop();
});
```
Why Core Pinning Matters¶
| Benefit | Impact |
|---|---|
| Consistent TSC readings | TSC values can be offset between cores, skewing timestamps |
| No context switch overhead | Eliminates scheduler migration latency |
| Better cache utilization | L1/L2 cache stays hot |
| Predictable latency | Removes scheduler-induced variance |
Thread Architecture¶
```text
Core 0 (Pinned)           Core 1 (Pinned)         Core 2 (Lower Priority)
┌─────────────────┐       ┌─────────────────┐     ┌─────────────────┐
│ Market Data     │──────>│ Execution       │     │ Metrics         │
│ WebSocket recv  │       │ Order submit    │     │ Aggregation     │
│ Parse & validate│       │ State machine   │     │ Prometheus      │
└─────────────────┘       └─────────────────┘     └─────────────────┘
```
Busy-Polling¶
Use non-blocking receives with adaptive backoff for minimal latency.
Why Not std::sync::mpsc?¶
| Channel | Latency | Notes |
|---|---|---|
| `std::sync::mpsc` | 100-300ns | System allocator involvement |
| `crossbeam::channel` | 20-50ns | Lock-free, cache-friendly |

Always use `crossbeam::channel` for hot path communication.
Adaptive Busy-Poll Loop¶
```rust
use crossbeam::channel::{Receiver, TryRecvError};

pub struct BusyPollConfig {
    /// Max spin iterations before yielding
    pub max_spin_iterations: u32,
    /// Use PAUSE instruction during spin
    pub use_spin_hint: bool,
}

impl Default for BusyPollConfig {
    fn default() -> Self {
        Self {
            max_spin_iterations: 1000, // ~1-10us depending on CPU
            use_spin_hint: true,
        }
    }
}

pub fn run_loop<T>(rx: Receiver<T>, config: BusyPollConfig) {
    let mut spin_count = 0u32;
    loop {
        match rx.try_recv() {
            Ok(msg) => {
                process(msg);
                spin_count = 0; // Reset on work
            }
            Err(TryRecvError::Empty) => {
                if spin_count < config.max_spin_iterations {
                    spin_count += 1;
                    if config.use_spin_hint {
                        std::hint::spin_loop(); // PAUSE on x86
                    }
                } else {
                    std::thread::yield_now(); // Brief yield
                    spin_count = 0;
                }
            }
            Err(TryRecvError::Disconnected) => break,
        }
    }
}
```
Spin Loop Behavior¶
| Phase | Action | CPU Usage |
|---|---|---|
| Active | `try_recv()` succeeds | Normal |
| Spinning | `spin_loop()` hint | 100%, but power-aware |
| Yielding | `yield_now()` | Reduced |
The `std::hint::spin_loop()` hint maps to:
- x86: `PAUSE` instruction (reduces power, prevents pipeline stalls)
- ARM: `YIELD` instruction (hints to the processor that this is a spin-wait)
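The same hint applies to any bounded spin-wait, not just channel polling. A minimal std-only sketch of a spin-then-yield wait on an atomic flag; `wait_for` and the spin limit are illustrative, not part of this codebase:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Spin-wait on `flag` with the PAUSE/YIELD hint, falling back to an OS
/// yield after `spin_limit` iterations so a long wait cannot monopolize a core.
fn wait_for(flag: &AtomicBool, spin_limit: u32) {
    let mut spins = 0u32;
    while !flag.load(Ordering::Acquire) {
        if spins < spin_limit {
            spins += 1;
            std::hint::spin_loop(); // PAUSE on x86, YIELD on ARM
        } else {
            thread::yield_now();
            spins = 0;
        }
    }
}

fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    let waiter = {
        let flag = Arc::clone(&flag);
        thread::spawn(move || wait_for(&flag, 1000))
    };
    flag.store(true, Ordering::Release);
    waiter.join().unwrap();
    println!("flag observed");
}
```

The Acquire/Release pairing ensures the waiter observes writes made before the flag was set.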
Object Pools¶
Pre-allocate all hot path objects to eliminate allocation latency.
Fixed-Size Pool with Slab¶
```rust
use slab::Slab;

pub struct OrderPool {
    orders: Slab<Order>,
    max_capacity: usize,
}

impl OrderPool {
    pub fn new(capacity: usize) -> Self {
        let mut orders = Slab::with_capacity(capacity);
        // Pre-warm: fill and release to fault pages
        let keys: Vec<_> = (0..capacity)
            .map(|_| orders.insert(Order::default()))
            .collect();
        for key in keys {
            orders.remove(key);
        }
        Self { orders, max_capacity: capacity }
    }

    /// Allocate from pool (O(1), no allocation)
    #[inline]
    pub fn allocate(&mut self) -> Option<usize> {
        if self.orders.len() >= self.max_capacity {
            return None; // Reject, don't grow
        }
        Some(self.orders.insert(Order::default()))
    }

    /// Return to pool for reuse (O(1))
    #[inline]
    pub fn release(&mut self, key: usize) {
        self.orders.remove(key);
    }

    /// Access pooled object
    #[inline]
    pub fn get_mut(&mut self, key: usize) -> Option<&mut Order> {
        self.orders.get_mut(key)
    }
}
```
Pool Sizing Guidelines¶
| Pool | Capacity | Rationale |
|---|---|---|
| OrderPool | 1000 | Peak concurrent orders |
| MarketDataPool | 100 | Buffer for incoming messages |
| OpportunityPool | 50 | Detected arbitrage opportunities |
Size pools for 2x expected peak to avoid exhaustion.
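The allocate/release discipline can also be shown without the `slab` dependency. A self-contained free-list sketch of the same pattern; the `Order` fields here are placeholders, and a real pool would use `Slab` as above:

```rust
/// Minimal fixed-capacity pool: a Vec of slots plus a free-list of indices.
/// allocate/release are O(1) and never touch the system allocator after new().
#[derive(Default, Clone)]
struct Order { price: f64, qty: f64 }

struct OrderPool {
    slots: Vec<Order>,
    free: Vec<usize>, // indices of available slots
}

impl OrderPool {
    fn new(capacity: usize) -> Self {
        Self {
            // Pre-warm: all slots are allocated (and pages faulted) here
            slots: vec![Order::default(); capacity],
            free: (0..capacity).rev().collect(),
        }
    }

    fn allocate(&mut self) -> Option<usize> {
        self.free.pop() // None = exhausted; reject rather than grow
    }

    fn release(&mut self, key: usize) {
        self.slots[key] = Order::default();
        self.free.push(key);
    }

    fn get_mut(&mut self, key: usize) -> Option<&mut Order> {
        self.slots.get_mut(key)
    }
}

fn main() {
    let mut pool = OrderPool::new(2);
    let a = pool.allocate().unwrap();
    pool.get_mut(a).unwrap().price = 100.0;
    let b = pool.allocate().unwrap();
    assert!(pool.allocate().is_none()); // capacity 2 exhausted: rejected
    pool.release(b);
    assert!(pool.allocate().is_some()); // slot reusable after release
    println!("pool ok");
}
```

Rejecting on exhaustion rather than growing keeps the hot path allocation-free; the caller decides how to handle backpressure.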
Pre-Warming¶
Pre-warming ensures memory pages are faulted before trading:
```rust
// Fill the pool completely, collecting the slot keys
let keys: Vec<_> = (0..capacity)
    .map(|_| {
        let key = orders.insert(Order::default());
        // Touch the memory to fault the page
        orders.get_mut(key).unwrap().price = 0.0;
        key
    })
    .collect();

// Release all slots (pages remain in memory)
for key in keys {
    orders.remove(key);
}
```
Cache-Line Alignment¶
Prevent false sharing by aligning data to 64-byte cache lines.
The Problem¶
Without alignment, adjacent atomics share a cache line:
```rust
use std::sync::atomic::AtomicU64;

// BAD: Both fields in the same 64-byte cache line
struct Counters {
    thread1_counter: AtomicU64, // bytes 0-7
    thread2_counter: AtomicU64, // bytes 8-15
}
// Thread 1 writes → Thread 2's cached copy is invalidated
```
The Solution¶
```rust
/// Cache-line aligned wrapper
#[repr(C, align(64))]
pub struct CacheAligned<T> {
    pub value: T,
}

impl<T> CacheAligned<T> {
    pub fn new(value: T) -> Self {
        Self { value }
    }
}

// GOOD: Each counter on its own cache line
struct Counters {
    thread1_counter: CacheAligned<AtomicU64>, // bytes 0-63
    thread2_counter: CacheAligned<AtomicU64>, // bytes 64-127
}
```
Per-Thread Counters Example¶
```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub struct ThreadCounters {
    pub messages_processed: CacheAligned<AtomicU64>,
    pub orders_submitted: CacheAligned<AtomicU64>,
    pub latency_sum_ns: CacheAligned<AtomicU64>,
}

impl ThreadCounters {
    pub fn new() -> Self {
        Self {
            messages_processed: CacheAligned::new(AtomicU64::new(0)),
            orders_submitted: CacheAligned::new(AtomicU64::new(0)),
            latency_sum_ns: CacheAligned::new(AtomicU64::new(0)),
        }
    }

    #[inline]
    pub fn inc_messages(&self) {
        self.messages_processed.value.fetch_add(1, Ordering::Relaxed);
    }
}
```
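A usage sketch: two threads bump a shared counter through `Relaxed` atomics, then the total is read back after joining. Self-contained (it re-declares minimal versions of the types above); the thread and iteration counts are arbitrary:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

#[repr(C, align(64))]
pub struct CacheAligned<T> {
    pub value: T,
}

pub struct ThreadCounters {
    pub messages_processed: CacheAligned<AtomicU64>,
    pub orders_submitted: CacheAligned<AtomicU64>,
}

fn main() {
    let c = Arc::new(ThreadCounters {
        messages_processed: CacheAligned { value: AtomicU64::new(0) },
        orders_submitted: CacheAligned { value: AtomicU64::new(0) },
    });

    // Two threads increment concurrently; Relaxed is enough for counters
    // because no other memory is synchronized through them.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let c = Arc::clone(&c);
            thread::spawn(move || {
                for _ in 0..1000 {
                    c.messages_processed.value.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    assert_eq!(c.messages_processed.value.load(Ordering::Relaxed), 2000);
    println!("counters ok");
}
```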
Environment Configuration¶
Linux Performance Settings¶
```bash
# Disable CPU frequency scaling (run as root)
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Increase file descriptor limits
ulimit -n 65536

# Enable transparent huge pages (optional)
echo always > /sys/kernel/mm/transparent_hugepage/enabled
```
Thread Priority (CAP_SYS_NICE)¶
To use high thread priority on Linux:
```bash
# Option 1: setcap on the binary
sudo setcap cap_sys_nice+ep target/release/arbiter-engine

# Option 2: Run as root (not recommended for production)
```
Benchmarking¶
Measure Baseline¶
```bash
# Run with default settings
cargo run --release -- --paper-trade

# Check metrics
curl localhost:9090/metrics | grep tick_to_trade
```
Compare Configurations¶
| Configuration | p50 | p99 | p99.99 |
|---|---|---|---|
| Default (no optimizations) | 200μs | 2ms | 15ms |
| + Core pinning | 150μs | 800μs | 5ms |
| + Busy-polling | 80μs | 400μs | 2ms |
| + Object pools | 70μs | 300μs | 1ms |
| + Cache alignment | 60μs | 250μs | 800μs |
Values are illustrative; actual results depend on hardware and workload.
Best Practices¶
Do¶
- Pin hot path threads to dedicated cores
- Use `crossbeam::channel` instead of `std::sync::mpsc`
- Pre-allocate all hot path objects
- Align frequently-accessed atomics to cache lines
- Fail loudly if core pinning fails
Don't¶
- Allocate memory on the hot path
- Use blocking channel receives
- Share cache lines between threads
- Spin indefinitely without yielding
- Assume default thread scheduling is optimal
Troubleshooting¶
Core Pinning Fails¶
Causes:
- Core doesn't exist (check `nproc`)
- Running in a container without CPU affinity
- Insufficient permissions

Solutions:
- Verify core count: `grep -c processor /proc/cpuinfo`
- Docker: use `--cpuset-cpus`
- Grant the CAP_SYS_NICE capability
Pool Exhaustion¶
Solutions:
- Increase pool capacity
- Investigate why orders aren't being released
- Add monitoring for pool utilization
High Tail Latency Despite Optimizations¶
Check for:
- Garbage collection in dependent services
- Network latency variance
- Disk I/O from logging
- Other processes on pinned cores
Related Documentation¶
- ADR-013: Low-Latency Optimizations - Architecture decision
- Performance Monitoring - Measurement and metrics
- CLI Reference - Runtime configuration