# Implementing Low-Latency Performance Infrastructure
This post covers the implementation of ADR-012 (Performance Monitoring) and ADR-013 (Low-Latency Optimizations) for the Arbiter-Bot statistical arbitrage engine.
## The Problem
Arbitrage opportunities exist for milliseconds. Slow execution means missed profits or adverse fills. We needed:
- Microsecond-precision timing that doesn't degrade the hot path
- Full latency distribution capture (p99.99, not just averages)
- Zero-allocation recording on the critical path
- Consistent scheduling to eliminate jitter
## Platform-Specific Timing
Standard `Instant::now()` has ~20-30 ns of overhead on Linux (via the vDSO). For hot-path timing, we use platform-specific instructions:
**x86_64:** `RDTSCP` provides a serializing timestamp read. We pair it with `LFENCE` to prevent instruction reordering:
```rust
pub fn read_start() -> Timestamp {
    unsafe {
        core::arch::x86_64::_mm_lfence(); // Serialize prior instructions
        let tsc = core::arch::x86_64::_rdtsc();
        Timestamp { tsc }
    }
}

pub fn read_end() -> Timestamp {
    let mut _aux: u32 = 0;
    unsafe {
        let tsc = core::arch::x86_64::__rdtscp(&mut _aux); // Self-serializing
        core::arch::x86_64::_mm_lfence(); // Prevent subsequent reordering
        Timestamp { tsc }
    }
}
```
**ARM (aarch64):** uses the `CNTVCT_EL0` counter with `ISB` barriers for serialization:
```rust
pub fn read_start() -> Timestamp {
    let cnt: u64;
    unsafe {
        core::arch::asm!(
            "isb",                   // Instruction sync barrier
            "mrs {cnt}, cntvct_el0", // Read the virtual counter
            cnt = out(reg) cnt,
            options(nostack, nomem, preserves_flags)
        );
    }
    Timestamp { cnt }
}
```
**Fallback:** for other platforms, and for Miri testing, we use `std::time::Instant`.
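As a rough sketch (the field layout and function names here are illustrative, not the actual cfg-gated implementation), the fallback path just wraps `std::time::Instant`:

```rust
use std::time::Instant;

// Illustrative fallback timestamp for platforms without a usable raw
// counter (or when running under Miri).
pub struct Timestamp {
    instant: Instant,
}

pub fn read_start() -> Timestamp {
    Timestamp { instant: Instant::now() }
}

// Nanoseconds elapsed since `start`; `as_nanos` returns u128, which we
// truncate to u64 to match the TSC-based representation.
pub fn elapsed_nanos(start: &Timestamp) -> u64 {
    start.instant.elapsed().as_nanos() as u64
}
```

The cost is higher than a raw counter read, but the monotonicity guarantee of `Instant` makes it a safe default.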
## Double-Buffered Histograms
Recording to a single shared histogram creates contention every time it is exported. Our solution: double-buffering.
```rust
pub struct ThreadLocalHistogram {
    active: UnsafeCell<Histogram<u64>>,              // Hot path writes here
    spare: UnsafeCell<Histogram<u64>>,               // Pre-allocated for swap
    sample_count: UnsafeCell<u64>,
    producer: UnsafeCell<Producer<HistogramExport>>,
}
```
**Recording:** an O(1) write to the active histogram, with no cross-thread operations.
**Export:** swap active/spare (an O(1) pointer swap), then send the old active histogram to a background aggregator via an SPSC ring buffer. The swap happens at natural batch boundaries, not on every sample.
The key insight: quantile computation (`value_at_quantile`) is O(N) and must happen off the hot path. The background thread handles aggregation and quantile calculation.
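The swap itself can be sketched minimally like this, using a plain `Vec<u64>` of raw samples as a stand-in for the HDR-style `Histogram<u64>` and ignoring the `UnsafeCell`/SPSC plumbing (the type and method names are illustrative):

```rust
use std::mem;

// Stand-in for the real double-buffered histogram: `active` takes hot-path
// writes, `spare` is pre-allocated so the swap never allocates.
pub struct DoubleBuffered {
    active: Vec<u64>,
    spare: Vec<u64>,
}

impl DoubleBuffered {
    pub fn with_capacity(cap: usize) -> Self {
        Self {
            active: Vec::with_capacity(cap),
            spare: Vec::with_capacity(cap),
        }
    }

    // Hot path: O(1) append, no allocation while capacity holds.
    pub fn record(&mut self, value: u64) {
        self.active.push(value);
    }

    // Export: O(1) swap, then hand the filled buffer off. The O(N)
    // quantile work happens on whoever receives the returned samples,
    // never on the recording thread.
    pub fn export(&mut self) -> Vec<u64> {
        mem::swap(&mut self.active, &mut self.spare);
        self.spare.drain(..).collect()
    }
}
```

In the real implementation the exported buffer travels through the SPSC ring to the aggregator rather than being returned by value, but the swap-at-batch-boundary structure is the same.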
## Object Pool Design
Dynamic allocation on the hot path causes unpredictable pauses. We use fixed-size `Slab` pools:
**Pre-warming:** at startup, allocate all slots to fault their pages into memory, then release them to the free list. This ensures no page faults during trading.
**Fail-fast:** when exhausted, return `Err(PoolExhausted)` instead of allocating. Better to reject an order than introduce unpredictable latency.
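A minimal sketch of both behaviors, with a `Vec`-backed free list standing in for the real `Slab` (the `FixedPool` and `PoolExhausted` names are illustrative):

```rust
#[derive(Debug, PartialEq)]
pub struct PoolExhausted;

pub struct FixedPool<T> {
    slots: Vec<Option<T>>, // fixed backing storage, allocated once
    free: Vec<usize>,      // indices of available slots
}

impl<T> FixedPool<T> {
    // Pre-warm: every slot exists from the start, so no allocation
    // (and no first-touch page fault) happens on the hot path.
    pub fn new(capacity: usize) -> Self {
        let mut slots = Vec::with_capacity(capacity);
        slots.resize_with(capacity, || None);
        let free = (0..capacity).rev().collect();
        Self { slots, free }
    }

    // Fail fast: when no slot is free, return an error rather than
    // falling back to dynamic allocation.
    pub fn acquire(&mut self, value: T) -> Result<usize, PoolExhausted> {
        let idx = self.free.pop().ok_or(PoolExhausted)?;
        self.slots[idx] = Some(value);
        Ok(idx)
    }

    pub fn release(&mut self, idx: usize) -> Option<T> {
        let value = self.slots[idx].take();
        if value.is_some() {
            self.free.push(idx); // slot becomes reusable
        }
        value
    }
}
```

Returning an index (rather than a reference) keeps the borrow checker out of the hot path; the real pool's API may differ.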
## Busy-Polling with Adaptive Backoff
`std::sync::mpsc` has ~100-300 ns of overhead per operation. We use `crossbeam::channel` (~20-50 ns) with busy-polling:
```rust
pub fn recv(&self) -> Option<T> {
    // Phase 1: spin, keeping the thread hot
    for _ in 0..self.config.spin_iterations {
        match self.receiver.try_recv() {
            Ok(msg) => return Some(msg),
            Err(TryRecvError::Empty) => spin_loop(),
            Err(TryRecvError::Disconnected) => return None,
        }
    }
    // Phase 2: fall back to a blocking receive
    self.receiver.recv().ok()
}
```
Spinning keeps the thread hot and ready. Adaptive backoff (configurable spin count, then yield) balances latency against power consumption.
## Cache-Line Alignment
False sharing occurs when threads write to different variables that share a cache line. Our wrapper ensures 64-byte alignment:
This is critical for per-thread counters and metrics that are written frequently.
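The wrapper itself can be sketched in a few lines (the `CacheAligned` and `PerThreadCounters` names are illustrative, not the actual types):

```rust
// Force 64-byte alignment so each wrapped value starts on its own
// cache line; the compiler also pads its size up to 64 bytes.
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

// Each frequently-written per-thread counter gets its own cache line,
// so writes from different threads never invalidate each other.
pub struct PerThreadCounters {
    pub messages: CacheAligned<u64>,
    pub errors: CacheAligned<u64>,
}
```

Note the trade-off: a `CacheAligned<u64>` occupies 64 bytes instead of 8, so this wrapper belongs on hot, contended counters, not on every field.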
## Thread Affinity
Core migration invalidates caches and causes TSC drift (frequencies can vary between cores). We pin critical threads:
```rust
pub fn pin_to_core(core_id: usize) -> Result<(), AffinityError> {
    if core_affinity::set_for_current(CoreId { id: core_id }) {
        Ok(())
    } else {
        Err(AffinityError {
            message: format!("Failed to pin to core {}", core_id),
        })
    }
}
```
Fail-loud semantics: If pinning fails, we error immediately rather than silently degrading performance.
## Test Coverage
The implementation includes 17 new tests covering:
- Timing monotonicity and reasonableness
- Histogram recording, export, and buffer swapping
- Aggregator merge and quantile computation
- Cache-line alignment verification
- Pool allocation, release, and exhaustion
- Busy-poll message processing and adaptive backoff
- Affinity configuration validation
## What's Next
This implementation covers Phase 1 of ADR-012 (hot path instrumentation). Future phases include:
- Phase 2: `tracing` integration for the warm path
- Phase 3: Prometheus metrics endpoint
- Phase 4: Alert rules for KPI thresholds
Integration with the existing actors (`ExecutionActor`, `ArbiterActor`) is out of scope for this PR but follows naturally from the modular design.