ADR 013: Low-Latency Optimizations¶
Status¶
Accepted
Context¶
Arbiter-Bot is a statistical arbitrage engine where tick-to-trade latency directly impacts profitability. Arbitrage opportunities exist for milliseconds; slow execution means missed profits or adverse fills.
While Rust's lack of garbage collection provides a strong foundation, achieving consistent sub-millisecond latency requires explicit optimization of:
- Thread scheduling - OS scheduler jitter can add milliseconds of latency
- Memory allocation - Dynamic allocation on the hot path causes unpredictable pauses
- CPU cache behavior - False sharing between threads degrades performance
- I/O patterns - Blocking on channel receives wastes latency budget
This ADR covers architectural optimizations to the trading engine. Performance measurement and observability are covered in ADR-012.
Decision¶
Implement four key optimizations for the trading hot path.
1. Thread Affinity with Core Pinning¶
Pin critical threads to dedicated CPU cores to eliminate scheduler migration and ensure consistent TSC readings:
use core_affinity::CoreId;
use crossbeam::channel::Receiver; // channel type per section 2

pub struct TradingThreadConfig {
    /// Pin market data thread to dedicated core
    pub market_data_core: usize,
    /// Pin execution thread to dedicated core
    pub execution_core: usize,
    /// Pin metrics aggregator to separate core (lower priority)
    pub metrics_core: usize,
}
impl TradingRuntime {
    pub fn spawn_market_data_thread(config: &TradingThreadConfig, rx: Receiver<MarketData>) {
        // Copy the core id out of the borrowed config before moving into the thread
        let core_id = config.market_data_core;
        std::thread::spawn(move || {
            // Pin to core; fail loudly if pinning fails
            if !core_affinity::set_for_current(CoreId { id: core_id }) {
                panic!(
                    "Failed to pin market data thread to core {}. \
                     TSC calibration assumes no core migration.",
                    core_id
                );
            }
            // Set thread priority (requires CAP_SYS_NICE on Linux)
            #[cfg(target_os = "linux")]
            {
                let _ = thread_priority::set_current_thread_priority(
                    thread_priority::ThreadPriority::Max,
                );
            }
            // BusyPollConfig and the two-argument loop are defined in section 2 below
            run_market_data_loop(rx, BusyPollConfig::default());
        });
    }
}
Rationale:
- TSC frequency and synchronization can differ between cores on some systems; pinning ensures consistent timing
- Eliminates context-switch overhead from scheduler migrations
- Reduces L1/L2 cache invalidation from core hopping
- Fails loudly rather than silently degrading performance (see the startup check sketch below)
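To support the fail-loud requirement and the startup health check listed under Risks, a minimal sketch could verify that the configured cores actually exist before any trading thread is spawned. The function name and error handling here are illustrative assumptions, not part of the current codebase:

use core_affinity::get_core_ids;

/// Illustrative startup check: reject a misconfigured core layout at boot
/// instead of panicking inside a trading thread later.
pub fn verify_core_config(config: &TradingThreadConfig) -> Result<(), String> {
    let available = get_core_ids().ok_or("could not enumerate CPU cores")?;
    let max_id = available.iter().map(|c| c.id).max().unwrap_or(0);
    for &core in &[config.market_data_core, config.execution_core, config.metrics_core] {
        if core > max_id {
            return Err(format!("configured core {core} not present (max core id {max_id})"));
        }
    }
    Ok(())
}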
2. Adaptive Busy-Polling¶
Use busy-polling with adaptive backoff for the market data receive loop.
Important: Use crossbeam::channel instead of std::sync::mpsc for microsecond-scale latency. The standard library's MPSC channel has ~100-300ns overhead per operation; crossbeam's bounded channel achieves ~20-50ns.
use crossbeam::channel::{Receiver, TryRecvError};
pub struct BusyPollConfig {
    /// Maximum spin iterations before yielding (0 = always yield)
    pub max_spin_iterations: u32,
    /// Whether to use PAUSE instruction during spin (reduces power)
    pub use_spin_hint: bool,
}

impl Default for BusyPollConfig {
    fn default() -> Self {
        Self {
            max_spin_iterations: 1000, // ~1-10μs depending on CPU
            use_spin_hint: true,
        }
    }
}

pub fn run_market_data_loop(rx: Receiver<MarketData>, config: BusyPollConfig) {
    let mut spin_count = 0u32;
    loop {
        match rx.try_recv() {
            Ok(msg) => {
                process_market_data(msg);
                spin_count = 0; // Reset on successful work
            }
            Err(TryRecvError::Empty) => {
                if spin_count < config.max_spin_iterations {
                    spin_count += 1;
                    if config.use_spin_hint {
                        std::hint::spin_loop(); // PAUSE on x86, YIELD on ARM
                    }
                } else {
                    // Yield to the OS briefly to reduce power/thermal impact
                    std::thread::yield_now();
                    spin_count = 0;
                }
            }
            Err(TryRecvError::Disconnected) => break,
        }
    }
}
Rationale:
- try_recv() avoids blocking syscall overhead
- Spinning keeps the thread hot and ready for immediate processing
- Adaptive backoff prevents 100% CPU usage during idle periods
- spin_loop() hint reduces power consumption while spinning
- Configurable to tune for latency vs. power trade-off
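To tie sections 1 and 2 together, here is a minimal wiring sketch that creates the bounded crossbeam channel recommended above and hands the receiver to the pinned market data thread. The channel capacity of 1024 and the function name are illustrative assumptions:

use crossbeam::channel;

/// Illustrative wiring: producer side stays with the WebSocket reader,
/// consumer side is the pinned thread running the adaptive poll loop.
pub fn wire_market_data(config: &TradingThreadConfig) -> channel::Sender<MarketData> {
    let (tx, rx) = channel::bounded::<MarketData>(1024);
    // Consumer: the pinned thread from section 1 runs the busy-poll loop on rx
    TradingRuntime::spawn_market_data_thread(config, rx);
    // Producer: the WebSocket read task keeps `tx` and pushes parsed ticks;
    // a bounded send applies backpressure if the consumer falls behind
    tx
}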
3. Fixed-Size Object Pools¶
Pre-allocate all objects used on the hot path to eliminate allocation latency:
use slab::Slab;

/// Fixed-capacity object pool with explicit recycling
pub struct OrderPool {
    /// Pre-allocated order slots
    orders: Slab<Order>,
    /// Maximum capacity (will reject, not grow)
    max_capacity: usize,
}

impl OrderPool {
    pub fn new(capacity: usize) -> Self {
        let mut orders = Slab::with_capacity(capacity);
        let mut keys = Vec::with_capacity(capacity);
        // Fill the slab completely: writing a valid Order into every slot
        // faults the backing pages before trading starts (safe Rust, no UB
        // from zero-writing typed memory).
        for _ in 0..capacity {
            keys.push(orders.insert(Order::default()));
        }
        // Note: mlock() on Slab is tricky because Slab uses an Entry<T> enum
        // internally, not contiguous T storage. For production, consider:
        // - Using a custom allocator with mlock
        // - Using a Vec-backed pool instead of Slab
        // - Accepting that pages may be swapped (acceptable for many use cases)
        // Release all slots (memory remains faulted in the kernel page tables)
        for key in keys {
            orders.remove(key);
        }
        Self { orders, max_capacity: capacity }
    }

    /// Allocate from pool; returns None if at capacity (never allocates)
    #[inline]
    pub fn allocate(&mut self) -> Option<usize> {
        if self.orders.len() >= self.max_capacity {
            return None; // Reject rather than allocate
        }
        Some(self.orders.insert(Order::default()))
    }

    /// Return a slot to the pool for reuse
    #[inline]
    pub fn release(&mut self, key: usize) {
        self.orders.remove(key);
    }

    /// Get a mutable reference to a pooled order
    #[inline]
    pub fn get_mut(&mut self, key: usize) -> Option<&mut Order> {
        self.orders.get_mut(key)
    }
}
Rationale:
- Slab provides O(1) insert/remove with stable keys
- Fixed capacity prevents unbounded memory growth
- Pre-warming ensures pages are faulted in before trading starts
- Returns None on capacity exhaustion instead of allocating (fail-fast)
Pool Sizing Guidelines:
| Pool | Recommended Capacity | Rationale |
|---|---|---|
| OrderPool | 1000 | Peak concurrent orders across both exchanges |
| MarketDataPool | 100 | Buffer for market data messages |
| ArbitrageOpportunityPool | 50 | Detected opportunities awaiting execution |
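As a usage sketch, the execution thread takes a slot, fills it in place, and releases it after hand-off; a None from allocate() is treated as a rejection rather than a reason to allocate. The build_order and submit_order helpers and the ArbitrageOpportunity argument are placeholders, not existing APIs:

/// Illustrative hot-path usage of OrderPool (helpers are placeholders)
fn submit_from_pool(pool: &mut OrderPool, opportunity: &ArbitrageOpportunity) -> bool {
    let key = match pool.allocate() {
        Some(key) => key,
        // Fail fast: reject the opportunity rather than allocate on the hot path
        None => return false,
    };
    if let Some(order) = pool.get_mut(key) {
        build_order(order, opportunity); // fill the pre-allocated slot in place
        submit_order(order);             // hand the order to the exchange gateway
    }
    pool.release(key); // return the slot for reuse
    true
}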
4. Cache-Line Alignment¶
Prevent false sharing by aligning hot data structures to cache line boundaries:
use std::sync::atomic::AtomicU64;

/// Cache-line aligned wrapper for hot path data
#[repr(C, align(64))]
pub struct CacheAligned<T> {
    pub value: T,
}

impl<T> CacheAligned<T> {
    pub fn new(value: T) -> Self {
        Self { value }
    }
}

/// Example: per-thread counters without false sharing
pub struct ThreadCounters {
    pub messages_processed: CacheAligned<AtomicU64>,
    pub orders_submitted: CacheAligned<AtomicU64>,
    pub latency_sum_ns: CacheAligned<AtomicU64>,
}

impl ThreadCounters {
    pub fn new() -> Self {
        Self {
            messages_processed: CacheAligned::new(AtomicU64::new(0)),
            orders_submitted: CacheAligned::new(AtomicU64::new(0)),
            latency_sum_ns: CacheAligned::new(AtomicU64::new(0)),
        }
    }
}
Rationale:
- x86_64 cache lines are 64 bytes
- Without alignment, writes to adjacent atomics cause cache line bouncing
- #[repr(C)] ensures predictable field layout
- Padding is implicit from alignment requirement
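As a compile-time sanity check (not part of the current codebase), const assertions can confirm that the wrapper really occupies a full 64-byte cache line, so adjacent counters can never share one:

use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

// With align(64), the struct's size is rounded up to 64 bytes as well,
// so two CacheAligned<AtomicU64> values can never share a cache line.
const _: () = assert!(align_of::<CacheAligned<AtomicU64>>() == 64);
const _: () = assert!(size_of::<CacheAligned<AtomicU64>>() == 64);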
Thread Architecture Diagram¶
┌────────────────────────────────────────────────────────────────┐
│ Core 0 (Pinned) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Market Data Thread │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │WebSocket│──>│ Busy │──>│ Parse & │──> To Arbiter │ │
│ │ │ Recv │ │ Poll │ │ Validate│ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ Core 1 (Pinned) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Execution Thread │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │Arbiter │──>│ Order │──>│ Submit │──> Exchange │ │
│ │ │ Detect │ │ Pool │ │ Order │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ Core 2 (Unpinned, Lower Priority) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Metrics Thread │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ SPSC │──>│Histogram│──>│Prometheus│ │ │
│ │ │Consumer │ │ Merge │ │ Export │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
Implementation Phases¶
| Phase | Scope | Deliverables |
|---|---|---|
| 1 | Object pools | OrderPool, MarketDataPool with pre-warming |
| 2 | Cache alignment | CacheAligned<T> wrapper, aligned counters |
| 3 | Busy-polling | Adaptive poll loop with configurable backoff |
| 4 | Thread affinity | Core pinning with fail-loud semantics |
Consequences¶
Positive¶
- Predictable latency - No allocation jitter on hot path
- Reduced tail latency - Core pinning eliminates scheduler-induced spikes
- Lower p99.99 - Cache alignment prevents false sharing stalls
- Configurable trade-offs - Busy-poll parameters tunable per deployment
Negative¶
- Increased complexity - Manual memory management with pools
- Resource dedication - Pinned cores unavailable for other work
- Platform dependencies - Core affinity requires OS-specific code
- Capacity planning - Pool sizes must be tuned for workload
Risks¶
| Risk | Mitigation |
|---|---|
| Pool exhaustion under load | Alert on pool utilization > 80% (see sketch below); size for 2x expected peak |
| Core pinning fails silently | Panic on failure; verify in startup health check |
| Busy-polling wastes power | Adaptive backoff; configurable spin limit |
| Wrong pool capacity | Expose metrics; make capacity configurable |
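One way the pool-utilization alert might be backed is a small accessor on the OrderPool from section 3, exported as a gauge; the method name utilization is illustrative, not an existing API:

impl OrderPool {
    /// Fraction of slots currently in use (0.0..=1.0), suitable for export
    /// as a gauge that the >80% utilization alert can fire on.
    #[inline]
    pub fn utilization(&self) -> f64 {
        self.orders.len() as f64 / self.max_capacity as f64
    }
}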
Alternatives Considered¶
Alternative 1: Tokio Runtime Only¶
- Pro: Simpler async/await model
- Con: Tokio's work-stealing scheduler adds latency variance
- Decision: Use dedicated threads for hot path; Tokio for cold path
Alternative 2: Growing Arena (typed_arena)¶
- Pro: Fast bump allocation
- Con: Memory grows unboundedly; no recycling until arena dropped
- Decision: Use fixed-size Slab pools with explicit recycling
Alternative 3: Lock-Free Allocator (jemalloc/mimalloc)¶
- Pro: Lower allocation overhead than system allocator
- Con: Still has overhead; doesn't eliminate allocation entirely
- Decision: Avoid allocation entirely on hot path; use pools
Alternative 4: Always Busy-Poll (No Yield)¶
- Pro: Lowest possible latency
- Con: 100% CPU usage; thermal throttling risk; noisy neighbor impact
- Decision: Adaptive backoff balances latency and resource usage
References¶
- crossbeam crate - Fast lock-free channels for inter-thread communication
- core_affinity crate
- slab crate
- Intel Optimization Manual - Cache Line Size
- Linux thread_priority
- Rust Performance Book