Low-Latency Tuning

Architectural optimizations for sub-millisecond tick-to-trade latency.

Overview

Achieving consistent low latency requires optimizing four areas:

| Area | Problem | Solution |
|------|---------|----------|
| Thread Scheduling | OS scheduler jitter adds milliseconds | Core pinning |
| Memory Allocation | Dynamic allocation causes pauses | Object pools |
| CPU Cache | False sharing degrades performance | Cache-line alignment |
| I/O Patterns | Blocking receives waste the latency budget | Busy-polling |

Thread Affinity

Pin critical threads to dedicated CPU cores to eliminate scheduler migration.

Configuration

use core_affinity::CoreId;

pub struct ThreadConfig {
    pub market_data_core: usize,  // e.g., 0
    pub execution_core: usize,    // e.g., 1
    pub metrics_core: usize,      // e.g., 2
}

Pinning a Thread

use core_affinity::CoreId;

std::thread::spawn(move || {
    // Pin thread to core - fail loudly if pinning fails
    if !core_affinity::set_for_current(CoreId { id: core_id }) {
        panic!("Failed to pin thread to core {}", core_id);
    }

    // Set high priority (requires CAP_SYS_NICE on Linux)
    #[cfg(target_os = "linux")]
    {
        use thread_priority::{ThreadPriority, set_current_thread_priority};
        let _ = set_current_thread_priority(ThreadPriority::Max);
    }

    // Run hot path loop
    run_loop();
});

Why Core Pinning Matters

| Benefit | Impact |
|---------|--------|
| Consistent TSC readings | TSC frequency can vary between cores |
| No context-switch overhead | Eliminates scheduler migration latency |
| Better cache utilization | L1/L2 cache stays hot |
| Predictable latency | Removes scheduler-induced variance |

Thread Architecture

Core 0 (Pinned)           Core 1 (Pinned)           Core 2 (Lower Priority)
┌─────────────────┐       ┌─────────────────┐       ┌─────────────────┐
│ Market Data     │──────>│ Execution       │       │ Metrics         │
│ WebSocket recv  │       │ Order submit    │       │ Aggregation     │
│ Parse & validate│       │ State machine   │       │ Prometheus      │
└─────────────────┘       └─────────────────┘       └─────────────────┘
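The three-stage layout above can be sketched with plain std threads and channels. The message types and the validation rule are illustrative, and a real deployment would pin each thread as shown earlier and use crossbeam::channel (see Busy-Polling below) rather than std::sync::mpsc:

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative message types; real structs carry much more state.
struct Tick { price: f64 }
struct Order { price: f64 }

/// Wire up the three-stage pipeline and return how many orders the
/// metrics stage observed.
fn run_pipeline(ticks: Vec<Tick>) -> usize {
    let (md_tx, md_rx) = mpsc::channel::<Tick>();
    let (ex_tx, ex_rx) = mpsc::channel::<Order>();

    // Core 0: market data -- parse & validate, forward to execution
    let market_data = thread::spawn(move || {
        for tick in ticks {
            if tick.price > 0.0 {          // toy validation rule
                md_tx.send(tick).unwrap();
            }
        }
    });

    // Core 1: execution -- turn validated ticks into orders
    let execution = thread::spawn(move || {
        for tick in md_rx {
            ex_tx.send(Order { price: tick.price }).unwrap();
        }
    });

    // Core 2: metrics -- aggregate (here: just count)
    let metrics = thread::spawn(move || ex_rx.iter().count());

    market_data.join().unwrap();
    execution.join().unwrap();
    metrics.join().unwrap()
}
```

Each channel sender is dropped when its stage finishes, which closes the downstream receiver and lets the pipeline drain cleanly.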

Busy-Polling

Use non-blocking receives with adaptive backoff for minimal latency.

Why Not std::sync::mpsc?

| Channel | Latency | Notes |
|---------|---------|-------|
| std::sync::mpsc | 100-300ns | System allocator involvement |
| crossbeam::channel | 20-50ns | Lock-free, cache-friendly |

Always use crossbeam::channel for hot path communication.

Adaptive Busy-Poll Loop

use crossbeam::channel::{Receiver, TryRecvError};

pub struct BusyPollConfig {
    /// Max spin iterations before yielding
    pub max_spin_iterations: u32,
    /// Use PAUSE instruction during spin
    pub use_spin_hint: bool,
}

impl Default for BusyPollConfig {
    fn default() -> Self {
        Self {
            max_spin_iterations: 1000,  // ~1-10us depending on CPU
            use_spin_hint: true,
        }
    }
}

pub fn run_loop<T>(rx: Receiver<T>, config: BusyPollConfig) {
    let mut spin_count = 0u32;

    loop {
        match rx.try_recv() {
            Ok(msg) => {
                process(msg);
                spin_count = 0;  // Reset on work
            }
            Err(TryRecvError::Empty) => {
                if spin_count < config.max_spin_iterations {
                    spin_count += 1;
                    if config.use_spin_hint {
                        std::hint::spin_loop();  // PAUSE on x86
                    }
                } else {
                    std::thread::yield_now();  // Brief yield
                    spin_count = 0;
                }
            }
            Err(TryRecvError::Disconnected) => break,
        }
    }
}

Spin Loop Behavior

| Phase | Action | CPU Usage |
|-------|--------|-----------|
| Active | try_recv() succeeds | Normal |
| Spinning | spin_loop() hint | 100% but power-aware |
| Yielding | yield_now() | Reduced |

The spin_loop() hint:

  • x86: PAUSE instruction (reduces power, prevents pipeline stalls)
  • ARM: YIELD instruction (hints to the processor)
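The same spin-then-yield phases apply outside of channels. A stripped-down version waiting on an AtomicBool instead of a receiver (MAX_SPIN and the flag are illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Illustrative spin budget; tune per CPU as with BusyPollConfig.
const MAX_SPIN: u32 = 1000;

/// Block until `flag` becomes true: spin with the PAUSE/YIELD hint for
/// MAX_SPIN iterations, then yield to the scheduler and start over.
fn wait_for(flag: &AtomicBool) {
    let mut spin_count = 0u32;
    while !flag.load(Ordering::Acquire) {
        if spin_count < MAX_SPIN {
            spin_count += 1;
            std::hint::spin_loop();   // PAUSE on x86, YIELD on ARM
        } else {
            thread::yield_now();      // back off once the spin budget is spent
            spin_count = 0;
        }
    }
}
```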

Object Pools

Pre-allocate all hot path objects to eliminate allocation latency.

Fixed-Size Pool with Slab

use slab::Slab;

pub struct OrderPool {
    orders: Slab<Order>,
    max_capacity: usize,
}

impl OrderPool {
    pub fn new(capacity: usize) -> Self {
        let mut orders = Slab::with_capacity(capacity);

        // Pre-warm: fill and release to fault pages
        let keys: Vec<_> = (0..capacity)
            .map(|_| orders.insert(Order::default()))
            .collect();

        for key in keys {
            orders.remove(key);
        }

        Self { orders, max_capacity: capacity }
    }

    /// Allocate from pool (O(1), no allocation)
    #[inline]
    pub fn allocate(&mut self) -> Option<usize> {
        if self.orders.len() >= self.max_capacity {
            return None;  // Reject, don't grow
        }
        Some(self.orders.insert(Order::default()))
    }

    /// Return to pool for reuse (O(1))
    #[inline]
    pub fn release(&mut self, key: usize) {
        self.orders.remove(key);
    }

    /// Access pooled object
    #[inline]
    pub fn get_mut(&mut self, key: usize) -> Option<&mut Order> {
        self.orders.get_mut(key)
    }
}

Pool Sizing Guidelines

| Pool | Capacity | Rationale |
|------|----------|-----------|
| OrderPool | 1000 | Peak concurrent orders |
| MarketDataPool | 100 | Buffer for incoming messages |
| OpportunityPool | 50 | Detected arbitrage opportunities |

Size pools for 2x expected peak to avoid exhaustion.
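The "2x peak, reject instead of grow" policy does not depend on slab. A std-only sketch using a plain free list (BoundedPool and its Order are illustrative names, not part of the codebase):

```rust
#[derive(Default)]
struct Order { price: f64 }

/// Fixed-capacity pool: a slot vector plus a free list of indices.
struct BoundedPool {
    slots: Vec<Order>,
    free: Vec<usize>,   // indices of available slots
}

impl BoundedPool {
    /// Pre-allocate every slot up front; nothing grows after this.
    fn new(expected_peak: usize) -> Self {
        let capacity = expected_peak * 2;   // 2x headroom per the guideline
        let mut slots = Vec::with_capacity(capacity);
        slots.resize_with(capacity, Order::default);
        Self { slots, free: (0..capacity).collect() }
    }

    /// O(1) allocate; returns None instead of growing when exhausted.
    fn allocate(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// O(1) release back to the free list.
    fn release(&mut self, key: usize) {
        self.slots[key].price = 0.0;   // reset the slot for reuse
        self.free.push(key);
    }
}
```

Exhaustion shows up as allocate() returning None, which the caller must handle explicitly rather than silently allocating.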

Pre-Warming

Pre-warming ensures memory pages are faulted before trading:

// Fill pool completely, keeping each key
let keys: Vec<_> = (0..capacity)
    .map(|_| {
        let key = orders.insert(Order::default());
        // Touch the memory to fault the page
        orders.get_mut(key).unwrap().price = 0.0;
        key
    })
    .collect();

// Release all slots (pages remain in memory)
for key in keys {
    orders.remove(key);
}

Cache-Line Alignment

Prevent false sharing by aligning data to 64-byte cache lines.

The Problem

Without alignment, adjacent atomics share a cache line:

// BAD: Both fields in same cache line
struct Counters {
    thread1_counter: AtomicU64,  // bytes 0-7
    thread2_counter: AtomicU64,  // bytes 8-15
}
// Thread 1 writes → Thread 2's cache line invalidated

The Solution

/// Cache-line aligned wrapper
#[repr(C, align(64))]
pub struct CacheAligned<T> {
    pub value: T,
}

impl<T> CacheAligned<T> {
    pub fn new(value: T) -> Self {
        Self { value }
    }
}

// GOOD: Each counter on its own cache line
struct Counters {
    thread1_counter: CacheAligned<AtomicU64>,  // bytes 0-63
    thread2_counter: CacheAligned<AtomicU64>,  // bytes 64-127
}
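The layout can be checked cheaply with std::mem: each wrapped counter rounds up to a full 64-byte line, so the two fields of Counters can never share one. (The structs are repeated here so the check is self-contained.)

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

#[repr(C, align(64))]
pub struct CacheAligned<T> {
    pub value: T,
}

struct Counters {
    thread1_counter: CacheAligned<AtomicU64>,
    thread2_counter: CacheAligned<AtomicU64>,
}

/// Returns (alignment of wrapper, size of wrapper, size of Counters).
/// align(64) forces the size up to 64 bytes even though AtomicU64 is 8,
/// so Counters occupies two full cache lines.
fn layout() -> (usize, usize, usize) {
    (
        align_of::<CacheAligned<AtomicU64>>(),
        size_of::<CacheAligned<AtomicU64>>(),
        size_of::<Counters>(),
    )
}
```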

Per-Thread Counters Example

use std::sync::atomic::{AtomicU64, Ordering};

pub struct ThreadCounters {
    pub messages_processed: CacheAligned<AtomicU64>,
    pub orders_submitted: CacheAligned<AtomicU64>,
    pub latency_sum_ns: CacheAligned<AtomicU64>,
}

impl ThreadCounters {
    pub fn new() -> Self {
        Self {
            messages_processed: CacheAligned::new(AtomicU64::new(0)),
            orders_submitted: CacheAligned::new(AtomicU64::new(0)),
            latency_sum_ns: CacheAligned::new(AtomicU64::new(0)),
        }
    }

    #[inline]
    pub fn inc_messages(&self) {
        self.messages_processed.value.fetch_add(1, Ordering::Relaxed);
    }
}

Environment Configuration

Linux Performance Settings

# Disable CPU frequency scaling (run as root)
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Increase file descriptor limits
ulimit -n 65536

# Enable transparent huge pages (optional)
echo always > /sys/kernel/mm/transparent_hugepage/enabled

Thread Priority (CAP_SYS_NICE)

To use high thread priority on Linux:

# Option 1: setcap on binary
sudo setcap cap_sys_nice+ep target/release/arbiter-engine

# Option 2: Run as root (not recommended for production)

Benchmarking

Measure Baseline

# Run with default settings
cargo run --release -- --paper-trade

# Check metrics
curl localhost:9090/metrics | grep tick_to_trade

Compare Configurations

| Configuration | p50 | p99 | p99.99 |
|---------------|-----|-----|--------|
| Default (no optimizations) | 200μs | 2ms | 15ms |
| + Core pinning | 150μs | 800μs | 5ms |
| + Busy-polling | 80μs | 400μs | 2ms |
| + Object pools | 70μs | 300μs | 1ms |
| + Cache alignment | 60μs | 250μs | 800μs |

Values are illustrative; actual results depend on hardware and workload.
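If you record raw tick-to-trade samples rather than relying on exporter histograms, the percentile columns can be computed with a simple nearest-rank method. This sketch assumes a pre-sorted slice of nanosecond samples:

```rust
/// Nearest-rank percentile over a sorted slice of latency samples (ns).
/// rank = ceil(p/100 * N), clamped to [1, N], then converted to an index.
fn percentile(sorted_ns: &[u64], p: f64) -> u64 {
    assert!(!sorted_ns.is_empty(), "need at least one sample");
    let rank = ((p / 100.0) * sorted_ns.len() as f64).ceil() as usize;
    sorted_ns[rank.max(1).min(sorted_ns.len()) - 1]
}
```

Nearest-rank always returns an observed sample, which is usually what you want for tail latency; interpolating methods can report values that never occurred.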

Best Practices

Do

  • Pin hot path threads to dedicated cores
  • Use crossbeam::channel instead of std::sync::mpsc
  • Pre-allocate all hot path objects
  • Align frequently-accessed atomics to cache lines
  • Fail loudly if core pinning fails

Don't

  • Allocate memory on the hot path
  • Use blocking channel receives
  • Share cache lines between threads
  • Spin indefinitely without yielding
  • Assume default thread scheduling is optimal

Troubleshooting

Core Pinning Fails

Failed to pin thread to core 0

Causes:

  • Core doesn't exist (check nproc)
  • Running in a container without CPU affinity
  • Insufficient permissions

Solutions:

  • Verify core count: cat /proc/cpuinfo | grep processor
  • Docker: use --cpuset-cpus
  • Grant the CAP_SYS_NICE capability

Pool Exhaustion

OrderPool: capacity exhausted (1000/1000)

Solutions:

  • Increase pool capacity
  • Investigate why orders aren't being released
  • Add monitoring for pool utilization
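For the monitoring suggestion, a minimal utilization gauge might look like the following; PoolStats and the 0.8 threshold are hypothetical, not part of the codebase:

```rust
/// Hypothetical snapshot of a pool's occupancy, sampled off the hot path.
struct PoolStats {
    in_use: usize,
    capacity: usize,
}

impl PoolStats {
    /// Fraction of slots currently allocated (0.0..=1.0).
    fn utilization(&self) -> f64 {
        self.in_use as f64 / self.capacity as f64
    }

    /// Warn well before exhaustion; 0.8 is an arbitrary example threshold.
    fn near_exhaustion(&self) -> bool {
        self.utilization() >= 0.8
    }
}
```

Exporting utilization() as a gauge lets you alert on sustained high occupancy instead of discovering exhaustion from rejected allocations.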

High Tail Latency Despite Optimizations

Check for:

  • Garbage collection in dependent services
  • Network latency variance
  • Disk I/O from logging
  • Other processes on pinned cores