Low-Latency Tuning

Architectural optimizations for sub-millisecond tick-to-trade latency.

Overview

Achieving consistent low latency requires optimizing four areas:

| Area | Problem | Solution |
|------|---------|----------|
| Thread Scheduling | OS scheduler jitter adds milliseconds | Core pinning |
| Memory Allocation | Dynamic allocation causes pauses | Object pools |
| CPU Cache | False sharing degrades performance | Cache-line alignment |
| I/O Patterns | Blocking receives waste the latency budget | Busy-polling |

Thread Affinity

Pin critical threads to dedicated CPU cores to eliminate scheduler migration.

Configuration

use core_affinity::CoreId;

pub struct ThreadConfig {
    pub market_data_core: usize,  // e.g., 0
    pub execution_core: usize,    // e.g., 1
    pub metrics_core: usize,      // e.g., 2
}

Pinning a Thread

use core_affinity::CoreId;

std::thread::spawn(move || {
    // Pin thread to core - fail loudly if pinning fails
    if !core_affinity::set_for_current(CoreId { id: core_id }) {
        panic!("Failed to pin thread to core {}", core_id);
    }

    // Set high priority (requires CAP_SYS_NICE on Linux)
    #[cfg(target_os = "linux")]
    {
        use thread_priority::{ThreadPriority, set_current_thread_priority};
        let _ = set_current_thread_priority(ThreadPriority::Max);
    }

    // Run hot path loop
    run_loop();
});

Why Core Pinning Matters

| Benefit | Impact |
|---------|--------|
| Consistent TSC readings | TSC frequency can vary between cores |
| No context-switch overhead | Eliminates scheduler migration latency |
| Better cache utilization | L1/L2 cache stays hot |
| Predictable latency | Removes scheduler-induced variance |

Thread Architecture

Core 0 (Pinned)           Core 1 (Pinned)           Core 2 (Lower Priority)
┌─────────────────┐       ┌─────────────────┐       ┌─────────────────┐
│ Market Data     │──────>│ Execution       │       │ Metrics         │
│ WebSocket recv  │       │ Order submit    │       │ Aggregation     │
│ Parse & validate│       │ State machine   │       │ Prometheus      │
└─────────────────┘       └─────────────────┘       └─────────────────┘
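The three-stage layout above can be sketched with plain std threads and channels. The message types and the validation rule are illustrative, and a real deployment would pin each thread as shown earlier and use crossbeam::channel (see Busy-Polling below) rather than std::sync::mpsc:

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative message types; real structs carry much more state.
struct Tick { price: f64 }
struct Order { price: f64 }

/// Wire up the three-stage pipeline and return how many orders the
/// metrics stage observed.
fn run_pipeline(ticks: Vec<Tick>) -> usize {
    let (md_tx, md_rx) = mpsc::channel::<Tick>();
    let (ex_tx, ex_rx) = mpsc::channel::<Order>();

    // Core 0: market data -- parse & validate, forward to execution
    let market_data = thread::spawn(move || {
        for tick in ticks {
            if tick.price > 0.0 {          // toy validation rule
                md_tx.send(tick).unwrap();
            }
        }
    });

    // Core 1: execution -- turn validated ticks into orders
    let execution = thread::spawn(move || {
        for tick in md_rx {
            ex_tx.send(Order { price: tick.price }).unwrap();
        }
    });

    // Core 2: metrics -- aggregate (here: just count)
    let metrics = thread::spawn(move || ex_rx.iter().count());

    market_data.join().unwrap();
    execution.join().unwrap();
    metrics.join().unwrap()
}
```

Each channel sender is dropped when its stage finishes, which closes the downstream receiver and lets the pipeline drain cleanly.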

Busy-Polling

Use non-blocking receives with adaptive backoff for minimal latency.

Why Not std::sync::mpsc?

| Channel | Latency | Notes |
|---------|---------|-------|
| std::sync::mpsc | 100-300ns | System allocator involvement |
| crossbeam::channel | 20-50ns | Lock-free, cache-friendly |

Always use crossbeam::channel for hot path communication.

Adaptive Busy-Poll Loop

use crossbeam::channel::{Receiver, TryRecvError};

pub struct BusyPollConfig {
    /// Max spin iterations before yielding
    pub max_spin_iterations: u32,
    /// Use PAUSE instruction during spin
    pub use_spin_hint: bool,
}

impl Default for BusyPollConfig {
    fn default() -> Self {
        Self {
            max_spin_iterations: 1000,  // ~1-10us depending on CPU
            use_spin_hint: true,
        }
    }
}

pub fn run_loop<T>(rx: Receiver<T>, config: BusyPollConfig) {
    let mut spin_count = 0u32;

    loop {
        match rx.try_recv() {
            Ok(msg) => {
                process(msg);
                spin_count = 0;  // Reset on work
            }
            Err(TryRecvError::Empty) => {
                if spin_count < config.max_spin_iterations {
                    spin_count += 1;
                    if config.use_spin_hint {
                        std::hint::spin_loop();  // PAUSE on x86
                    }
                } else {
                    std::thread::yield_now();  // Brief yield
                    spin_count = 0;
                }
            }
            Err(TryRecvError::Disconnected) => break,
        }
    }
}

Spin Loop Behavior

| Phase | Action | CPU Usage |
|-------|--------|-----------|
| Active | try_recv() succeeds | Normal |
| Spinning | spin_loop() hint | 100% but power-aware |
| Yielding | yield_now() | Reduced |

The spin_loop() hint:

  • x86: PAUSE instruction (reduces power, prevents pipeline stalls)
  • ARM: YIELD instruction (hints to the processor)
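The same spin-then-yield phases apply outside of channels. A stripped-down version waiting on an AtomicBool instead of a receiver (MAX_SPIN and the flag are illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Illustrative spin budget; tune per CPU as with BusyPollConfig.
const MAX_SPIN: u32 = 1000;

/// Block until `flag` becomes true: spin with the PAUSE/YIELD hint for
/// MAX_SPIN iterations, then yield to the scheduler and start over.
fn wait_for(flag: &AtomicBool) {
    let mut spin_count = 0u32;
    while !flag.load(Ordering::Acquire) {
        if spin_count < MAX_SPIN {
            spin_count += 1;
            std::hint::spin_loop();   // PAUSE on x86, YIELD on ARM
        } else {
            thread::yield_now();      // back off once the spin budget is spent
            spin_count = 0;
        }
    }
}
```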

Object Pools

Pre-allocate all hot path objects to eliminate allocation latency.

Fixed-Size Pool with Slab

use slab::Slab;

pub struct OrderPool {
    orders: Slab<Order>,
    max_capacity: usize,
}

impl OrderPool {
    pub fn new(capacity: usize) -> Self {
        let mut orders = Slab::with_capacity(capacity);

        // Pre-warm: fill and release to fault pages
        let keys: Vec<_> = (0..capacity)
            .map(|_| orders.insert(Order::default()))
            .collect();

        for key in keys {
            orders.remove(key);
        }

        Self { orders, max_capacity: capacity }
    }

    /// Allocate from pool (O(1), no allocation)
    #[inline]
    pub fn allocate(&mut self) -> Option<usize> {
        if self.orders.len() >= self.max_capacity {
            return None;  // Reject, don't grow
        }
        Some(self.orders.insert(Order::default()))
    }

    /// Return to pool for reuse (O(1))
    #[inline]
    pub fn release(&mut self, key: usize) {
        self.orders.remove(key);
    }

    /// Access pooled object
    #[inline]
    pub fn get_mut(&mut self, key: usize) -> Option<&mut Order> {
        self.orders.get_mut(key)
    }
}

Pool Sizing Guidelines

| Pool | Capacity | Rationale |
|------|----------|-----------|
| OrderPool | 1000 | Peak concurrent orders |
| MarketDataPool | 100 | Buffer for incoming messages |
| OpportunityPool | 50 | Detected arbitrage opportunities |

Size pools for 2x expected peak to avoid exhaustion.
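The "2x peak, reject instead of grow" policy does not depend on slab. A std-only sketch using a plain free list (BoundedPool and its Order are illustrative names, not part of the codebase):

```rust
#[derive(Default)]
struct Order { price: f64 }

/// Fixed-capacity pool: a slot vector plus a free list of indices.
struct BoundedPool {
    slots: Vec<Order>,
    free: Vec<usize>,   // indices of available slots
}

impl BoundedPool {
    /// Pre-allocate every slot up front; nothing grows after this.
    fn new(expected_peak: usize) -> Self {
        let capacity = expected_peak * 2;   // 2x headroom per the guideline
        let mut slots = Vec::with_capacity(capacity);
        slots.resize_with(capacity, Order::default);
        Self { slots, free: (0..capacity).collect() }
    }

    /// O(1) allocate; returns None instead of growing when exhausted.
    fn allocate(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// O(1) release back to the free list.
    fn release(&mut self, key: usize) {
        self.slots[key].price = 0.0;   // reset the slot for reuse
        self.free.push(key);
    }
}
```

Exhaustion shows up as allocate() returning None, which the caller must handle explicitly rather than silently allocating.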

Pre-Warming

Pre-warming ensures memory pages are faulted before trading:

// Fill pool completely, keeping each key
let keys: Vec<_> = (0..capacity)
    .map(|_| {
        let key = orders.insert(Order::default());
        // Touch the memory to fault the page
        orders.get_mut(key).unwrap().price = 0.0;
        key
    })
    .collect();

// Release all slots (pages remain in memory)
for key in keys {
    orders.remove(key);
}

Cache-Line Alignment

Prevent false sharing by aligning data to 64-byte cache lines.

The Problem

Without alignment, adjacent atomics share a cache line:

// BAD: Both fields in same cache line
struct Counters {
    thread1_counter: AtomicU64,  // bytes 0-7
    thread2_counter: AtomicU64,  // bytes 8-15
}
// Thread 1 writes → Thread 2's cache line invalidated

The Solution

/// Cache-line aligned wrapper
#[repr(C, align(64))]
pub struct CacheAligned<T> {
    pub value: T,
}

impl<T> CacheAligned<T> {
    pub fn new(value: T) -> Self {
        Self { value }
    }
}

// GOOD: Each counter on its own cache line
struct Counters {
    thread1_counter: CacheAligned<AtomicU64>,  // bytes 0-63
    thread2_counter: CacheAligned<AtomicU64>,  // bytes 64-127
}
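The layout can be checked cheaply with std::mem: each wrapped counter rounds up to a full 64-byte line, so the two fields of Counters can never share one. (The structs are repeated here so the check is self-contained.)

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

#[repr(C, align(64))]
pub struct CacheAligned<T> {
    pub value: T,
}

struct Counters {
    thread1_counter: CacheAligned<AtomicU64>,
    thread2_counter: CacheAligned<AtomicU64>,
}

/// Returns (alignment of wrapper, size of wrapper, size of Counters).
/// align(64) forces the size up to 64 bytes even though AtomicU64 is 8,
/// so Counters occupies two full cache lines.
fn layout() -> (usize, usize, usize) {
    (
        align_of::<CacheAligned<AtomicU64>>(),
        size_of::<CacheAligned<AtomicU64>>(),
        size_of::<Counters>(),
    )
}
```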

Per-Thread Counters Example

use std::sync::atomic::{AtomicU64, Ordering};

pub struct ThreadCounters {
    pub messages_processed: CacheAligned<AtomicU64>,
    pub orders_submitted: CacheAligned<AtomicU64>,
    pub latency_sum_ns: CacheAligned<AtomicU64>,
}

impl ThreadCounters {
    pub fn new() -> Self {
        Self {
            messages_processed: CacheAligned::new(AtomicU64::new(0)),
            orders_submitted: CacheAligned::new(AtomicU64::new(0)),
            latency_sum_ns: CacheAligned::new(AtomicU64::new(0)),
        }
    }

    #[inline]
    pub fn inc_messages(&self) {
        self.messages_processed.value.fetch_add(1, Ordering::Relaxed);
    }
}

Environment Configuration

Linux Performance Settings

# Disable CPU frequency scaling (run as root)
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Increase file descriptor limits
ulimit -n 65536

# Enable transparent huge pages (optional)
echo always > /sys/kernel/mm/transparent_hugepage/enabled

Thread Priority (CAP_SYS_NICE)

To use high thread priority on Linux:

# Option 1: setcap on binary
sudo setcap cap_sys_nice+ep target/release/arbiter-engine

# Option 2: Run as root (not recommended for production)

Benchmarking

Measure Baseline

# Run with default settings
cargo run --release -- --paper-trade

# Check metrics
curl localhost:9090/metrics | grep tick_to_trade

Compare Configurations

| Configuration | p50 | p99 | p99.99 |
|---------------|-----|-----|--------|
| Default (no optimizations) | 200μs | 2ms | 15ms |
| + Core pinning | 150μs | 800μs | 5ms |
| + Busy-polling | 80μs | 400μs | 2ms |
| + Object pools | 70μs | 300μs | 1ms |
| + Cache alignment | 60μs | 250μs | 800μs |

Values are illustrative; actual results depend on hardware and workload.
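If you record raw tick-to-trade samples rather than relying on exporter histograms, the percentile columns can be computed with a simple nearest-rank method. This sketch assumes a pre-sorted slice of nanosecond samples:

```rust
/// Nearest-rank percentile over a sorted slice of latency samples (ns).
/// rank = ceil(p/100 * N), clamped to [1, N], then converted to an index.
fn percentile(sorted_ns: &[u64], p: f64) -> u64 {
    assert!(!sorted_ns.is_empty(), "need at least one sample");
    let rank = ((p / 100.0) * sorted_ns.len() as f64).ceil() as usize;
    sorted_ns[rank.max(1).min(sorted_ns.len()) - 1]
}
```

Nearest-rank always returns an observed sample, which is usually what you want for tail latency; interpolating methods can report values that never occurred.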

Best Practices

Do

  • Pin hot path threads to dedicated cores
  • Use crossbeam::channel instead of std::sync::mpsc
  • Pre-allocate all hot path objects
  • Align frequently-accessed atomics to cache lines
  • Fail loudly if core pinning fails

Don't

  • Allocate memory on the hot path
  • Use blocking channel receives
  • Share cache lines between threads
  • Spin indefinitely without yielding
  • Assume default thread scheduling is optimal

Troubleshooting

Core Pinning Fails

Failed to pin thread to core 0

Causes:

  • Core doesn't exist (check nproc)
  • Running in a container without CPU affinity
  • Insufficient permissions

Solutions:

  • Verify core count: cat /proc/cpuinfo | grep processor
  • Docker: use --cpuset-cpus
  • Grant the CAP_SYS_NICE capability

Pool Exhaustion

OrderPool: capacity exhausted (1000/1000)

Solutions:

  • Increase pool capacity
  • Investigate why orders aren't being released
  • Add monitoring for pool utilization
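For the monitoring suggestion, a minimal utilization gauge might look like the following; PoolStats and the 0.8 threshold are hypothetical, not part of the codebase:

```rust
/// Hypothetical snapshot of a pool's occupancy, sampled off the hot path.
struct PoolStats {
    in_use: usize,
    capacity: usize,
}

impl PoolStats {
    /// Fraction of slots currently allocated (0.0..=1.0).
    fn utilization(&self) -> f64 {
        self.in_use as f64 / self.capacity as f64
    }

    /// Warn well before exhaustion; 0.8 is an arbitrary example threshold.
    fn near_exhaustion(&self) -> bool {
        self.utilization() >= 0.8
    }
}
```

Exporting utilization() as a gauge lets you alert on sustained high occupancy instead of discovering exhaustion from rejected allocations.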

High Tail Latency Despite Optimizations

Check for:

  • Garbage collection in dependent services
  • Network latency variance
  • Disk I/O from logging
  • Other processes on pinned cores