Implementing Low-Latency Performance Infrastructure

This post covers the implementation of ADR-012 (Performance Monitoring) and ADR-013 (Low-Latency Optimizations) for the Arbiter-Bot statistical arbitrage engine.

The Problem

Arbitrage opportunities exist for milliseconds. Slow execution means missed profits or adverse fills. We needed:

  1. Microsecond-precision timing that doesn't degrade the hot path
  2. Full latency distribution capture (p99.99, not just averages)
  3. Zero-allocation recording on the critical path
  4. Consistent scheduling to eliminate jitter

Platform-Specific Timing

Standard Instant::now() has ~20-30ns overhead on Linux (via vDSO). For hot path timing, we use platform-specific instructions:

x86_64: RDTSCP waits for all prior instructions to execute before reading the time-stamp counter. We pair each read with LFENCE so that surrounding instructions cannot be reordered across the measurement boundary:

pub fn read_start() -> Timestamp {
    unsafe {
        core::arch::x86_64::_mm_lfence();  // Serialize prior instructions
        let tsc = core::arch::x86_64::_rdtsc();
        Timestamp { tsc }
    }
}

pub fn read_end() -> Timestamp {
    let mut _aux: u32 = 0;
    unsafe {
        let tsc = core::arch::x86_64::__rdtscp(&mut _aux);  // Self-serializing
        core::arch::x86_64::_mm_lfence();  // Prevent subsequent reordering
        Timestamp { tsc }
    }
}

ARM (aarch64): Uses CNTVCT_EL0 counter with ISB barriers for serialization:

pub fn read_start() -> Timestamp {
    let cnt: u64;
    unsafe {
        core::arch::asm!(
            "isb",                    // Instruction sync barrier
            "mrs {cnt}, cntvct_el0",  // Read timer
            cnt = out(reg) cnt,
            options(nostack, nomem, preserves_flags)
        );
    }
    Timestamp { cnt }
}

Fallback: For other platforms or Miri testing, we use std::time::Instant.
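A minimal sketch of how that fallback can be gated at compile time. The cfg(miri) flag and std::time::Instant are real mechanisms; the module layout and the nanos_since helper are illustrative, not the codebase's actual API:

// Illustrative fallback module; used when neither fast counter is available
// or when running under Miri.
#[cfg(any(miri, not(any(target_arch = "x86_64", target_arch = "aarch64"))))]
mod fallback {
    use std::time::Instant;

    pub struct Timestamp {
        instant: Instant,
    }

    pub fn read_start() -> Timestamp {
        Timestamp { instant: Instant::now() }
    }

    pub fn read_end() -> Timestamp {
        Timestamp { instant: Instant::now() }
    }

    impl Timestamp {
        // Nanoseconds elapsed since an earlier timestamp (hypothetical helper).
        pub fn nanos_since(&self, start: &Timestamp) -> u64 {
            self.instant.duration_since(start.instant).as_nanos() as u64
        }
    }
}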

Double-Buffered Histograms

Recording to a single histogram creates contention when exporting. Our solution: double-buffering.

pub struct ThreadLocalHistogram {
    active: UnsafeCell<Histogram<u64>>,   // Hot path writes here
    spare: UnsafeCell<Histogram<u64>>,    // Pre-allocated for swap
    sample_count: UnsafeCell<u64>,
    producer: UnsafeCell<Producer<HistogramExport>>,
}

Recording: O(1) write to the active histogram, no cross-thread operations.

Export: Swap active/spare (O(1) pointer swap), send the old active to a background aggregator via SPSC ring buffer. The swap happens at natural batch boundaries, not on every sample.

The key insight: quantile computation (value_at_quantile) is O(N) and must happen off the hot path. The background thread handles aggregation and quantile calculation.
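To make the split concrete, here is a sketch of the record and export paths. It assumes the hdrhistogram crate's Histogram (saturating_record is its real method) and strictly thread-local use of the struct above; the actual push onto the SPSC ring is elided because it depends on the ring-buffer crate's API:

impl ThreadLocalHistogram {
    // Hot path: one bounded-cost write into the active buffer. The UnsafeCell
    // access is sound only because this struct is used from a single thread.
    pub fn record(&self, nanos: u64) {
        unsafe {
            (*self.active.get()).saturating_record(nanos);
            *self.sample_count.get() += 1;
        }
    }

    // Batch boundary: O(1) swap, then hand the filled buffer to the
    // background aggregator.
    pub fn export(&self) {
        unsafe {
            // The filled histogram becomes `spare`; the pre-allocated spare
            // becomes the new active buffer, so recording never blocks.
            std::mem::swap(&mut *self.active.get(), &mut *self.spare.get());
            // The filled histogram (now behind `spare`) is then packaged as a
            // HistogramExport and pushed onto the SPSC ring, where the
            // background thread merges it and computes quantiles.
        }
    }
}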

Object Pool Design

Dynamic allocation on the hot path causes unpredictable pauses. We use fixed-size Slab pools:

pub struct ObjectPool<T> {
    slab: Slab<T>,            // Pre-allocated storage for pooled objects
    capacity: usize,          // Fixed upper bound, set at startup
    free_list: Vec<usize>,    // Slot indices currently available for reuse
}

Pre-warming: At startup, allocate all slots to fault pages into memory, then release them to the free list. This ensures no page faults during trading.

Fail-fast: When exhausted, return Err(PoolExhausted) instead of allocating. Better to reject an order than introduce unpredictable latency.
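A sketch of the pool under a few stated assumptions: the slab crate provides Slab (consistent with the struct above), slots are handed out as raw slab keys, T: Default is used for pre-warming, and PoolExhausted is a bare unit struct. The real handle type and error plumbing may differ:

use slab::Slab;

pub struct PoolExhausted;

impl<T: Default> ObjectPool<T> {
    // Pre-warm: fill every slot up front so the pages are faulted in before
    // trading starts, then mark all slots free.
    pub fn with_capacity(capacity: usize) -> Self {
        let mut slab = Slab::with_capacity(capacity);
        let free_list: Vec<usize> =
            (0..capacity).map(|_| slab.insert(T::default())).collect();
        ObjectPool { slab, capacity, free_list }
    }

    // Fail-fast acquire: never allocates; returns a key into the slab or
    // reports exhaustion so the caller can reject the order.
    pub fn acquire(&mut self) -> Result<usize, PoolExhausted> {
        self.free_list.pop().ok_or(PoolExhausted)
    }

    // Return a slot to the free list for reuse.
    pub fn release(&mut self, key: usize) {
        debug_assert!(self.slab.contains(key));
        self.free_list.push(key);
    }

    // Borrow the object behind a key handed out by acquire().
    pub fn get_mut(&mut self, key: usize) -> Option<&mut T> {
        self.slab.get_mut(key)
    }
}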

Busy-Polling with Adaptive Backoff

std::sync::mpsc has ~100-300ns overhead per operation. We use crossbeam::channel (~20-50ns) with busy-polling:

pub fn recv(&self) -> Option<T> {
    // Phase 1: Spin
    for _ in 0..self.config.spin_iterations {
        match self.receiver.try_recv() {
            Ok(msg) => return Some(msg),
            Err(TryRecvError::Empty) => spin_loop(),
            Err(TryRecvError::Disconnected) => return None,
        }
    }
    // Phase 2: Yield and block
    self.receiver.recv().ok()
}

Spinning keeps the thread hot and ready. Adaptive backoff (configurable spin count, then yield) balances latency against power consumption.
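For completeness, the pieces the snippet above leans on might look like this. Only spin_iterations appears in the codebase; the type names and the crossbeam receiver wrapping are assumptions, and the recv() above additionally needs crossbeam::channel::TryRecvError and std::hint::spin_loop in scope:

use crossbeam::channel::Receiver;

pub struct BusyPollConfig {
    // Number of try_recv attempts before falling back to a blocking recv.
    pub spin_iterations: usize,
}

pub struct BusyPollReceiver<T> {
    receiver: Receiver<T>,
    config: BusyPollConfig,
}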

Cache-Line Alignment

False sharing occurs when threads write to different variables that share a cache line. Our wrapper ensures 64-byte alignment:

#[repr(C, align(64))]
pub struct CacheAligned<T> {
    value: T,
}

This is critical for per-thread counters and metrics that are written frequently.
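For example, per-thread counters that different threads bump concurrently each get their own line this way. The metric names and the Deref convenience impl are illustrative (the impl must live in the same module as CacheAligned to see the private field):

use std::ops::Deref;
use std::sync::atomic::{AtomicU64, Ordering};

// Assumed convenience impl so the wrapped value can be used directly.
impl<T> Deref for CacheAligned<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.value
    }
}

// Each counter occupies its own 64-byte line, so writes to one never
// invalidate the cache line holding the other.
pub struct HotPathCounters {
    orders_sent: CacheAligned<AtomicU64>,
    fills_received: CacheAligned<AtomicU64>,
}

impl HotPathCounters {
    pub fn record_order(&self) {
        self.orders_sent.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_fill(&self) {
        self.fills_received.fetch_add(1, Ordering::Relaxed);
    }
}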

Thread Affinity

Core migration invalidates warm caches and can skew TSC-based measurements (the counter is not guaranteed to be synchronized across cores on every system). We pin critical threads:

use core_affinity::CoreId;

pub fn pin_to_core(core_id: usize) -> Result<(), AffinityError> {
    if core_affinity::set_for_current(CoreId { id: core_id }) {
        Ok(())
    } else {
        Err(AffinityError { message: format!("Failed to pin to core {}", core_id) })
    }
}

Fail-loud semantics: If pinning fails, we error immediately rather than silently degrading performance.
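Usage at startup might look like the following sketch; the core index, thread name, and helper function are placeholders:

use std::thread;

// Fail loudly at startup: better to crash than to trade with silent jitter.
fn spawn_hot_path_thread() -> thread::JoinHandle<()> {
    thread::Builder::new()
        .name("execution-hot-path".into())
        .spawn(|| {
            pin_to_core(2).expect("failed to pin hot-path thread");
            // ... run the busy-poll receive loop on this core ...
        })
        .expect("failed to spawn hot-path thread")
}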

Test Coverage

The implementation includes 17 new tests covering:

  • Timing monotonicity and reasonableness
  • Histogram recording, export, and buffer swapping
  • Aggregator merge and quantile computation
  • Cache-line alignment verification (a condensed version appears after this list)
  • Pool allocation, release, and exhaustion
  • Busy-poll message processing and adaptive backoff
  • Affinity configuration validation
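
As a flavor of the alignment check, a condensed version might look like this (the test name is illustrative):

#[cfg(test)]
mod tests {
    use super::CacheAligned;

    #[test]
    fn cache_aligned_occupies_a_full_line() {
        // The wrapper must start on a 64-byte boundary and be padded out to
        // at least one full cache line.
        assert_eq!(std::mem::align_of::<CacheAligned<u64>>(), 64);
        assert!(std::mem::size_of::<CacheAligned<u64>>() >= 64);
    }
}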

What's Next

This implementation covers Phase 1 of ADR-012 (hot path instrumentation). Future phases include:

  • Phase 2: tracing integration for warm path
  • Phase 3: Prometheus metrics endpoint
  • Phase 4: Alert rules for KPI thresholds

Integration with the existing actors (ExecutionActor, ArbiterActor) is out of scope for this PR but follows naturally from the modular design.

References