ADR 013: Low-Latency Optimizations¶
Status¶
Accepted
Context¶
Arbiter-Bot is a statistical arbitrage engine where tick-to-trade latency directly impacts profitability. Arbitrage opportunities exist for milliseconds; slow execution means missed profits or adverse fills.
While Rust's lack of garbage collection provides a strong foundation, achieving consistent sub-millisecond latency requires explicit optimization of:
- Thread scheduling - OS scheduler jitter can add milliseconds of latency
- Memory allocation - Dynamic allocation on the hot path causes unpredictable pauses
- CPU cache behavior - False sharing between threads degrades performance
- I/O patterns - Blocking on channel receives wastes latency budget
This ADR covers architectural optimizations to the trading engine. Performance measurement and observability are covered in ADR-012.
Decision¶
Implement four key optimizations for the trading hot path.
1. Thread Affinity with Core Pinning¶
Pin critical threads to dedicated CPU cores to eliminate scheduler migration and ensure consistent TSC readings:
use core_affinity::CoreId;
use crossbeam::channel::Receiver; // channel type per section 2

pub struct TradingThreadConfig {
    /// Pin market data thread to dedicated core
    pub market_data_core: usize,
    /// Pin execution thread to dedicated core
    pub execution_core: usize,
    /// Pin metrics aggregator to separate core (lower priority)
    pub metrics_core: usize,
}
impl TradingRuntime {
    pub fn spawn_market_data_thread(config: &TradingThreadConfig, rx: Receiver<MarketData>) {
        // Copy the core id out of the borrowed config before moving into the thread
        let core_id = config.market_data_core;
        std::thread::spawn(move || {
            // Pin to core; fail loudly if pinning fails
            if !core_affinity::set_for_current(CoreId { id: core_id }) {
                panic!(
                    "Failed to pin market data thread to core {}. \
                     TSC calibration assumes no core migration.",
                    core_id
                );
            }
            // Set thread priority (requires CAP_SYS_NICE on Linux)
            #[cfg(target_os = "linux")]
            {
                let _ = thread_priority::set_current_thread_priority(
                    thread_priority::ThreadPriority::Max,
                );
            }
            // BusyPollConfig and the two-argument loop are defined in section 2 below
            run_market_data_loop(rx, BusyPollConfig::default());
        });
    }
}
Rationale:
- TSC frequency and synchronization can differ between cores on some systems; pinning ensures consistent timing
- Eliminates context-switch overhead from scheduler migrations
- Reduces L1/L2 cache invalidation from core hopping
- Fails loudly rather than silently degrading performance (see the startup check sketch below)
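To support the fail-loud requirement and the startup health check listed under Risks, a minimal sketch could verify that the configured cores actually exist before any trading thread is spawned. The function name and error handling here are illustrative assumptions, not part of the current codebase:

use core_affinity::get_core_ids;

/// Illustrative startup check: reject a misconfigured core layout at boot
/// instead of panicking inside a trading thread later.
pub fn verify_core_config(config: &TradingThreadConfig) -> Result<(), String> {
    let available = get_core_ids().ok_or("could not enumerate CPU cores")?;
    let max_id = available.iter().map(|c| c.id).max().unwrap_or(0);
    for &core in &[config.market_data_core, config.execution_core, config.metrics_core] {
        if core > max_id {
            return Err(format!("configured core {core} not present (max core id {max_id})"));
        }
    }
    Ok(())
}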
2. Adaptive Busy-Polling¶
Use busy-polling with adaptive backoff for the market data receive loop.
Important: Use crossbeam::channel instead of std::sync::mpsc for microsecond-scale latency. The standard library's MPSC channel has ~100-300ns overhead per operation; crossbeam's bounded channel achieves ~20-50ns.
use crossbeam::channel::{Receiver, TryRecvError};
pub struct BusyPollConfig {
    /// Maximum spin iterations before yielding (0 = always yield)
    pub max_spin_iterations: u32,
    /// Whether to use PAUSE instruction during spin (reduces power)
    pub use_spin_hint: bool,
}

impl Default for BusyPollConfig {
    fn default() -> Self {
        Self {
            max_spin_iterations: 1000, // ~1-10μs depending on CPU
            use_spin_hint: true,
        }
    }
}

pub fn run_market_data_loop(rx: Receiver<MarketData>, config: BusyPollConfig) {
    let mut spin_count = 0u32;
    loop {
        match rx.try_recv() {
            Ok(msg) => {
                process_market_data(msg);
                spin_count = 0; // Reset on successful work
            }
            Err(TryRecvError::Empty) => {
                if spin_count < config.max_spin_iterations {
                    spin_count += 1;
                    if config.use_spin_hint {
                        std::hint::spin_loop(); // PAUSE on x86, YIELD on ARM
                    }
                } else {
                    // Yield to the OS briefly to reduce power/thermal impact
                    std::thread::yield_now();
                    spin_count = 0;
                }
            }
            Err(TryRecvError::Disconnected) => break,
        }
    }
}
Rationale:
- try_recv() avoids blocking syscall overhead
- Spinning keeps the thread hot and ready for immediate processing
- Adaptive backoff prevents 100% CPU usage during idle periods
- spin_loop() hint reduces power consumption while spinning
- Configurable to tune for latency vs. power trade-off
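To tie sections 1 and 2 together, here is a minimal wiring sketch that creates the bounded crossbeam channel recommended above and hands the receiver to the pinned market data thread. The channel capacity of 1024 and the function name are illustrative assumptions:

use crossbeam::channel;

/// Illustrative wiring: producer side stays with the WebSocket reader,
/// consumer side is the pinned thread running the adaptive poll loop.
pub fn wire_market_data(config: &TradingThreadConfig) -> channel::Sender<MarketData> {
    let (tx, rx) = channel::bounded::<MarketData>(1024);
    // Consumer: the pinned thread from section 1 runs the busy-poll loop on rx
    TradingRuntime::spawn_market_data_thread(config, rx);
    // Producer: the WebSocket read task keeps `tx` and pushes parsed ticks;
    // a bounded send applies backpressure if the consumer falls behind
    tx
}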
3. Fixed-Size Object Pools¶
Pre-allocate all objects used on the hot path to eliminate allocation latency:
use slab::Slab;

/// Fixed-capacity object pool with explicit recycling
pub struct OrderPool {
    /// Pre-allocated order slots
    orders: Slab<Order>,
    /// Maximum capacity (will reject, not grow)
    max_capacity: usize,
}

impl OrderPool {
    pub fn new(capacity: usize) -> Self {
        let mut orders = Slab::with_capacity(capacity);
        let mut keys = Vec::with_capacity(capacity);
        // Fill the slab completely: writing a valid Order into every slot
        // faults the backing pages before trading starts (safe Rust, no UB
        // from zero-writing typed memory).
        for _ in 0..capacity {
            keys.push(orders.insert(Order::default()));
        }
        // Note: mlock() on Slab is tricky because Slab uses an Entry<T> enum
        // internally, not contiguous T storage. For production, consider:
        // - Using a custom allocator with mlock
        // - Using a Vec-backed pool instead of Slab
        // - Accepting that pages may be swapped (acceptable for many use cases)
        // Release all slots (memory remains faulted in the kernel page tables)
        for key in keys {
            orders.remove(key);
        }
        Self { orders, max_capacity: capacity }
    }

    /// Allocate from pool; returns None if at capacity (never allocates)
    #[inline]
    pub fn allocate(&mut self) -> Option<usize> {
        if self.orders.len() >= self.max_capacity {
            return None; // Reject rather than allocate
        }
        Some(self.orders.insert(Order::default()))
    }

    /// Return a slot to the pool for reuse
    #[inline]
    pub fn release(&mut self, key: usize) {
        self.orders.remove(key);
    }

    /// Get a mutable reference to a pooled order
    #[inline]
    pub fn get_mut(&mut self, key: usize) -> Option<&mut Order> {
        self.orders.get_mut(key)
    }
}
Rationale:
- Slab provides O(1) insert/remove with stable keys
- Fixed capacity prevents unbounded memory growth
- Pre-warming ensures pages are faulted in before trading starts
- Returns None on capacity exhaustion instead of allocating (fail-fast)
Pool Sizing Guidelines:
| Pool | Recommended Capacity | Rationale |
|---|---|---|
| OrderPool | 1000 | Peak concurrent orders across both exchanges |
| MarketDataPool | 100 | Buffer for market data messages |
| ArbitrageOpportunityPool | 50 | Detected opportunities awaiting execution |
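As a usage sketch, the execution thread takes a slot, fills it in place, and releases it after hand-off; a None from allocate() is treated as a rejection rather than a reason to allocate. The build_order and submit_order helpers and the ArbitrageOpportunity argument are placeholders, not existing APIs:

/// Illustrative hot-path usage of OrderPool (helpers are placeholders)
fn submit_from_pool(pool: &mut OrderPool, opportunity: &ArbitrageOpportunity) -> bool {
    let key = match pool.allocate() {
        Some(key) => key,
        // Fail fast: reject the opportunity rather than allocate on the hot path
        None => return false,
    };
    if let Some(order) = pool.get_mut(key) {
        build_order(order, opportunity); // fill the pre-allocated slot in place
        submit_order(order);             // hand the order to the exchange gateway
    }
    pool.release(key); // return the slot for reuse
    true
}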
4. Cache-Line Alignment¶
Prevent false sharing by aligning hot data structures to cache line boundaries:
use std::sync::atomic::AtomicU64;

/// Cache-line aligned wrapper for hot path data
#[repr(C, align(64))]
pub struct CacheAligned<T> {
    pub value: T,
}

impl<T> CacheAligned<T> {
    pub fn new(value: T) -> Self {
        Self { value }
    }
}

/// Example: per-thread counters without false sharing
pub struct ThreadCounters {
    pub messages_processed: CacheAligned<AtomicU64>,
    pub orders_submitted: CacheAligned<AtomicU64>,
    pub latency_sum_ns: CacheAligned<AtomicU64>,
}

impl ThreadCounters {
    pub fn new() -> Self {
        Self {
            messages_processed: CacheAligned::new(AtomicU64::new(0)),
            orders_submitted: CacheAligned::new(AtomicU64::new(0)),
            latency_sum_ns: CacheAligned::new(AtomicU64::new(0)),
        }
    }
}
Rationale:
- x86_64 cache lines are 64 bytes
- Without alignment, writes to adjacent atomics cause cache line bouncing
- #[repr(C)] ensures predictable field layout
- Padding is implicit from alignment requirement
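As a compile-time sanity check (not part of the current codebase), const assertions can confirm that the wrapper really occupies a full 64-byte cache line, so adjacent counters can never share one:

use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

// With align(64), the struct's size is rounded up to 64 bytes as well,
// so two CacheAligned<AtomicU64> values can never share a cache line.
const _: () = assert!(align_of::<CacheAligned<AtomicU64>>() == 64);
const _: () = assert!(size_of::<CacheAligned<AtomicU64>>() == 64);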
Thread Architecture Diagram¶
┌────────────────────────────────────────────────────────────────┐
│ Core 0 (Pinned) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Market Data Thread │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │WebSocket│──>│ Busy │──>│ Parse & │──> To Arbiter │ │
│ │ │ Recv │ │ Poll │ │ Validate│ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ Core 1 (Pinned) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Execution Thread │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │Arbiter │──>│ Order │──>│ Submit │──> Exchange │ │
│ │ │ Detect │ │ Pool │ │ Order │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ Core 2 (Unpinned, Lower Priority) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Metrics Thread │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ SPSC │──>│Histogram│──>│Prometheus│ │ │
│ │ │Consumer │ │ Merge │ │ Export │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
Implementation Phases¶
| Phase | Scope | Deliverables |
|---|---|---|
| 1 | Object pools | OrderPool, MarketDataPool with pre-warming |
| 2 | Cache alignment | CacheAligned<T> wrapper, aligned counters |
| 3 | Busy-polling | Adaptive poll loop with configurable backoff |
| 4 | Thread affinity | Core pinning with fail-loud semantics |
Consequences¶
Positive¶
- Predictable latency - No allocation jitter on hot path
- Reduced tail latency - Core pinning eliminates scheduler-induced spikes
- Lower p99.99 - Cache alignment prevents false sharing stalls
- Configurable trade-offs - Busy-poll parameters tunable per deployment
Negative¶
- Increased complexity - Manual memory management with pools
- Resource dedication - Pinned cores unavailable for other work
- Platform dependencies - Core affinity requires OS-specific code
- Capacity planning - Pool sizes must be tuned for workload
Risks¶
| Risk | Mitigation |
|---|---|
| Pool exhaustion under load | Alert on pool utilization > 80% (see sketch below); size for 2x expected peak |
| Core pinning fails silently | Panic on failure; verify in startup health check |
| Busy-polling wastes power | Adaptive backoff; configurable spin limit |
| Wrong pool capacity | Expose metrics; make capacity configurable |
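One way the pool-utilization alert might be backed is a small accessor on the OrderPool from section 3, exported as a gauge; the method name utilization is illustrative, not an existing API:

impl OrderPool {
    /// Fraction of slots currently in use (0.0..=1.0), suitable for export
    /// as a gauge that the >80% utilization alert can fire on.
    #[inline]
    pub fn utilization(&self) -> f64 {
        self.orders.len() as f64 / self.max_capacity as f64
    }
}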
Alternatives Considered¶
Alternative 1: Tokio Runtime Only¶
- Pro: Simpler async/await model
- Con: Tokio's work-stealing scheduler adds latency variance
- Decision: Use dedicated threads for hot path; Tokio for cold path
Alternative 2: Growing Arena (typed_arena)¶
- Pro: Fast bump allocation
- Con: Memory grows unboundedly; no recycling until arena dropped
- Decision: Use fixed-size Slab pools with explicit recycling
Alternative 3: Lock-Free Allocator (jemalloc/mimalloc)¶
- Pro: Lower allocation overhead than system allocator
- Con: Still has overhead; doesn't eliminate allocation entirely
- Decision: Avoid allocation entirely on hot path; use pools
Alternative 4: Always Busy-Poll (No Yield)¶
- Pro: Lowest possible latency
- Con: 100% CPU usage; thermal throttling risk; noisy neighbor impact
- Decision: Adaptive backoff balances latency and resource usage
References¶
- crossbeam crate - Fast lock-free channels for inter-thread communication
- core_affinity crate
- slab crate
- Intel Optimization Manual - Cache Line Size
- Linux thread_priority
- Rust Performance Book