ADR 012: Performance Monitoring Architecture

Status

Accepted

Context

Arbiter-Bot is a low-latency statistical arbitrage engine where microsecond-level performance is critical. Arbitrage opportunities exist for milliseconds; slow execution means missed profits or adverse fills. We need comprehensive performance observability that:

  1. Captures microsecond-level latency distributions, not just averages
  2. Minimizes measurement overhead on the hot path
  3. Identifies bottlenecks via profiling and metrics
  4. Operates in production without degrading trading performance

This ADR focuses on how to measure and observe performance. Architectural optimizations (thread affinity, memory pools, busy-polling) are covered in ADR-013.

Decision

Implement a multi-layer performance monitoring architecture with distinct strategies based on latency tolerance.

Layer 1: Hot Path Instrumentation (Microsecond-Sensitive)

The hot path (tick-to-trade) uses zero-allocation, non-blocking instrumentation:

| Component         | Tool                              | Rationale                                                               |
| ----------------- | --------------------------------- | ----------------------------------------------------------------------- |
| Timing            | RDTSCP + LFENCE                   | Serializing reads prevent CPU reordering; sub-nanosecond precision      |
| Latency Recording | hdrhistogram (per-thread)         | Captures full distribution including p99.99; merged periodically       |
| Histogram Export  | SPSC ring buffer (snapshots only) | Background thread consumes histogram snapshots, not individual samples |

use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};
use std::sync::atomic::{AtomicU64, Ordering};

/// Exports dropped because the ring buffer was full (exposed to the metrics aggregator)
static DROPPED_EXPORTS: AtomicU64 = AtomicU64::new(0);

/// Cache-line aligned to prevent false sharing
#[repr(C, align(64))]
pub struct HotPathMetrics {
    /// Pre-computed nanoseconds per TSC tick (avoids hot-path division)
    ns_per_tick: f64,
    _pad: [u8; 56],  // Pad to cache line
}

/// Per-thread histogram with double-buffering (no allocation on hot path)
pub struct ThreadLocalHistogram {
    /// Active histogram for recording (hot path writes here)
    active: hdrhistogram::Histogram<u64>,
    /// Spare histogram for swap (pre-allocated, avoids hot-path allocation)
    spare: hdrhistogram::Histogram<u64>,
    sample_count: u64,
}

impl ThreadLocalHistogram {
    pub fn new() -> Self {
        // Bounded range (1 μs .. 10 s) with 3 significant digits. Bounding
        // disables auto-resize, which would otherwise allocate on the hot path.
        let new_hist =
            || hdrhistogram::Histogram::new_with_bounds(1_000, 10_000_000_000, 3).unwrap();
        Self {
            active: new_hist(),
            spare: new_hist(),
            sample_count: 0,
        }
    }
}

impl HotPathMetrics {
    /// Serializing timestamp read at start of measurement
    #[inline(always)]
    pub fn record_start() -> u64 {
        unsafe {
            _mm_lfence();  // Serialize: ensure prior instructions complete
            _rdtsc()
        }
    }

    /// Serializing timestamp read at end of measurement
    #[inline(always)]
    pub fn record_end_tsc() -> u64 {
        let mut aux: u32 = 0;
        unsafe {
            let tsc = __rdtscp(&mut aux);  // RDTSCP waits for prior instructions to complete
            _mm_lfence();  // Prevent subsequent instructions from moving up
            tsc
        }
    }
}
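
impl HotPathMetrics {
    /// Sketch (not in the original ADR): calibrate ns_per_tick at startup.
    /// Measures TSC ticks across a known wall-clock interval; production
    /// code would repeat the measurement and take the median.
    pub fn calibrate() -> Self {
        let start_tsc = Self::record_start();
        let wall = std::time::Instant::now();
        std::thread::sleep(std::time::Duration::from_millis(50));
        let end_tsc = Self::record_end_tsc();
        let elapsed_ns = wall.elapsed().as_nanos() as f64;
        Self {
            ns_per_tick: elapsed_ns / end_tsc.wrapping_sub(start_tsc) as f64,
            _pad: [0; 56],
        }
    }
}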

impl ThreadLocalHistogram {
    /// Hot path: record to active histogram (no cross-thread ops)
    #[inline(always)]
    pub fn record(&mut self, start_tsc: u64, metrics: &HotPathMetrics) {
        let end_tsc = HotPathMetrics::record_end_tsc();
        let elapsed_tsc = end_tsc.wrapping_sub(start_tsc);

        // Multiplication (not division) for conversion
        let nanos = (elapsed_tsc as f64 * metrics.ns_per_tick) as u64;

        // Record to thread-local active histogram (no lock, no contention);
        // saturating_record clamps values outside the bounded range
        self.active.saturating_record(nanos);
        self.sample_count += 1;
    }

    /// Called by the OWNING THREAD at natural batch boundaries.
    /// Swaps active/spare histograms (O(1), NO ALLOCATION).
    /// IMPORTANT: Does NOT compute quantiles on hot path (O(N) operation).
    #[inline]
    pub fn maybe_export_histogram(
        &mut self,
        histogram_tx: &mut rtrb::Producer<ExportedHistogram>,
        export_interval_samples: u64,
    ) {
        if self.sample_count >= export_interval_samples {
            // Reset spare histogram (fast, just clears counts)
            self.spare.reset();

            // Swap active <-> spare (O(1) pointer swap, NO ALLOCATION)
            std::mem::swap(&mut self.active, &mut self.spare);

            let count = self.sample_count;
            self.sample_count = 0;

            // Clone histogram to send to background thread.
            // Trade-off: clone() is O(bucket_count) but happens infrequently (every N samples).
            // Hot path record() remains O(1). For N=1000, export overhead is amortized.
            // Alternative: triple-buffering avoids clone but adds complexity.
            //
            // NOTE: This clone is a known deviation from "zero allocation on every sample".
            // The allocation happens once per export_interval_samples (e.g., every 1000 samples).
            // For stricter requirements, use triple-buffering or raw sample ring buffers.
            // Push may fail when the ring buffer is full; drop the export
            // rather than block the hot path, but count the drop
            if histogram_tx.push(ExportedHistogram {
                histogram: self.spare.clone(),
                count
            }).is_err() {
                DROPPED_EXPORTS.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}

/// Raw histogram sent to background aggregator (quantiles computed there, not on hot path)
pub struct ExportedHistogram {
    pub histogram: hdrhistogram::Histogram<u64>,
    pub count: u64,
}

/// Background aggregator computes quantiles (O(N) is acceptable here)
pub fn aggregate_histograms(rx: &mut rtrb::Consumer<ExportedHistogram>) -> Option<HistogramSnapshot> {
    let mut merged = hdrhistogram::Histogram::<u64>::new(3).unwrap();
    let mut total_count = 0u64;

    while let Ok(exported) = rx.pop() {
        merged.add(&exported.histogram).ok();
        total_count += exported.count;
    }

    if total_count > 0 {
        Some(HistogramSnapshot {
            p50: merged.value_at_quantile(0.50),
            p99: merged.value_at_quantile(0.99),
            p999: merged.value_at_quantile(0.999),
            p9999: merged.value_at_quantile(0.9999),
            count: total_count,
        })
    } else {
        None
    }
}

/// Final snapshot with computed quantiles (computed by background thread)
pub struct HistogramSnapshot {
    pub p50: u64,
    pub p99: u64,
    pub p999: u64,
    pub p9999: u64,
    pub count: u64,
}

Key Design Decisions:

  • Hot path records to histogram only - no ring buffer push per sample (avoids cache contention)
  • Owning thread exports raw histograms - swaps the active histogram with a pre-allocated spare (O(1)) and sends it to the background; no quantile computation on the hot path (value_at_quantile is O(N))
  • Background thread computes quantiles - the expensive O(N) iteration happens off the critical path
  • LFENCE placement prevents instruction reordering: before RDTSC at the start of a measurement, after RDTSCP at the end
  • Pre-compute ns_per_tick at startup (multiply instead of divide on hot path)
  • SPSC ring buffer (rtrb) carries histogram exports to the background aggregator
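A sketch of how these pieces compose on a single hot-path thread; the event loop and the tick-to-trade work are placeholders, not part of this ADR:

// Hypothetical wiring of the Layer 1 pieces on one hot-path thread
fn run_hot_path(
    mut hist: ThreadLocalHistogram,
    metrics: HotPathMetrics,
    mut histogram_tx: rtrb::Producer<ExportedHistogram>,
) {
    loop {
        let start_tsc = HotPathMetrics::record_start();
        // ... tick-to-trade work goes here ...
        hist.record(start_tsc, &metrics);
        // Export at batch boundaries: swap + clone every 1000 samples
        hist.maybe_export_histogram(&mut histogram_tx, 1_000);
    }
}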

Layer 2: Warm Path Instrumentation (Millisecond-Tolerant)

For operations that can tolerate slight overhead (connection management, position tracking):

| Component          | Tool                         | Rationale                               |
| ------------------ | ---------------------------- | --------------------------------------- |
| Structured Logging | tracing crate                | Async-safe, span-based context          |
| Subscriber         | Non-blocking writer          | Prevents I/O blocking on the warm path  |
| Metrics Export     | Prometheus exposition format | Industry standard, scrape-based         |

use tracing::{instrument, info_span, Instrument};

impl ArbiterActor {
    #[instrument(skip(self, update), fields(market_id = %update.market_id))]
    pub async fn process_market_update(&mut self, update: MarketUpdate) {
        let span = info_span!("arbitrage_detection");

        async {
            // Detection logic
        }
        .instrument(span)
        .await
    }
}
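
The non-blocking writer can be wired up once at startup. A minimal sketch, assuming the tracing-appender and tracing-subscriber crates:

use tracing_appender::non_blocking;

/// Sketch: warm-path subscriber with a non-blocking writer. The returned
/// guard must be kept alive for the program's lifetime, or buffered log
/// lines are lost on drop.
fn init_warm_path_tracing() -> tracing_appender::non_blocking::WorkerGuard {
    let (writer, guard) = non_blocking(std::io::stdout());
    tracing_subscriber::fmt()
        .with_writer(writer)  // Log I/O happens on a background worker thread
        .init();
    guard
}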

Layer 3: System-Level Profiling (Development/Diagnostics)

For deep performance analysis during development and incident investigation:

| Tool       | Purpose                                          | When Used                      |
| ---------- | ------------------------------------------------ | ------------------------------ |
| perf       | CPU profiling, cache miss analysis               | Development, post-incident     |
| coz        | Causal profiling for optimization prioritization | Performance tuning sprints     |
| lttng      | Kernel/user space transition analysis            | Investigating syscall overhead |
| flamegraph | Visual call stack analysis                       | Identifying hot functions      |

Platform Portability

TSC is x86_64-specific. For ARM and development environments:

#[cfg(target_arch = "x86_64")]
mod timing {
    use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};

    #[inline(always)]
    pub fn read_start() -> u64 {
        unsafe { _mm_lfence(); _rdtsc() }
    }

    #[inline(always)]
    pub fn read_end() -> u64 {
        let mut aux: u32 = 0;
        unsafe {
            let tsc = __rdtscp(&mut aux);  // Waits for prior instructions to complete
            _mm_lfence();  // Prevent subsequent instructions from moving up
            tsc
        }
    }
}

#[cfg(target_arch = "aarch64")]
mod timing {
    use std::sync::OnceLock;

    /// Timer frequency in Hz (read from CNTFRQ_EL0 at startup)
    static FREQ_HZ: OnceLock<u64> = OnceLock::new();

    /// Read timer frequency from hardware register
    fn get_freq_hz() -> u64 {
        *FREQ_HZ.get_or_init(|| {
            let freq: u64;
            unsafe { std::arch::asm!("mrs {}, cntfrq_el0", out(reg) freq) };
            freq
        })
    }

    /// Nanoseconds per tick (pre-computed for fast conversion)
    pub fn ns_per_tick() -> f64 {
        1_000_000_000.0 / get_freq_hz() as f64
    }

    #[inline(always)]
    pub fn read_start() -> u64 {
        // Single asm block prevents compiler from reordering ISB and MRS
        let cnt: u64;
        unsafe {
            std::arch::asm!(
                "isb",           // Ensure prior instructions complete
                "mrs {}, cntvct_el0",  // Read timer
                out(reg) cnt,
                options(nostack, nomem)
            );
        }
        cnt
    }

    #[inline(always)]
    pub fn read_end() -> u64 {
        // Single asm block prevents compiler from reordering
        let cnt: u64;
        unsafe {
            std::arch::asm!(
                "isb",           // Ensure measured code has completed
                "mrs {}, cntvct_el0",  // Read timer
                out(reg) cnt,
                options(nostack, nomem)
            );
        }
        cnt
    }
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
mod timing {
    use std::sync::OnceLock;
    use std::time::Instant;

    /// Static epoch for monotonic timestamps (initialized on first use)
    static EPOCH: OnceLock<Instant> = OnceLock::new();

    fn epoch() -> &'static Instant {
        EPOCH.get_or_init(Instant::now)
    }

    // Fallback: ~20-30ns overhead via vDSO, but portable
    pub fn read_start() -> u64 {
        epoch().elapsed().as_nanos() as u64
    }

    pub fn read_end() -> u64 {
        epoch().elapsed().as_nanos() as u64
    }
}
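
Regardless of target, callers see the same two functions. A hypothetical wrapper (measure_ticks is illustrative, not part of this ADR) shows the intended call pattern:

/// Hypothetical helper: measure one closure invocation in raw timer ticks.
/// Convert to nanoseconds with the target's pre-computed ns_per_tick factor.
pub fn measure_ticks<F: FnOnce()>(f: F) -> u64 {
    let start = timing::read_start();
    f();
    timing::read_end().wrapping_sub(start)
}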

Key Performance Indicators (KPIs)

| KPI                         | Target   | Alert Threshold |
| --------------------------- | -------- | --------------- |
| Tick-to-Trade p50           | < 100 μs | > 500 μs        |
| Tick-to-Trade p99           | < 500 μs | > 2 ms          |
| Tick-to-Trade p99.99        | < 2 ms   | > 10 ms         |
| Event Loop Lag              | < 1 ms   | > 5 ms          |
| Lock Contention Rate        | < 0.1%   | > 1%            |
| CPU Cache Miss Rate         | < 5%     | > 15%           |
| Kernel Context Switches/sec | < 1000   | > 5000          |
| Missed Snapshot Intervals   | 0        | > 10/min        |

Metrics Export Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Hot Path Thread                          │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐                │
│  │ RDTSCP   │──>│Per-Thread│──>│ Export   │                │
│  │ + LFENCE │   │Histogram │   │ Snapshot │                │
│  └──────────┘   └──────────┘   └──────────┘                │
│                                      │                       │
│  (Owning thread calls maybe_export_histogram()              │
│   at batch boundaries - no cross-thread histogram access)   │
│                                      │                       │
└──────────────────────────────────────┼───────────────────────┘
                                       │ (SPSC ring buffer)
┌─────────────────────────────────────────────────────────────┐
│                   Metrics Aggregator                         │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐                │
│  │ Merged   │   │ Dropped  │   │ Gauges   │                │
│  │Histogram │   │ Counter  │   │          │                │
│  └──────────┘   └──────────┘   └──────────┘                │
│                       │                                      │
└───────────────────────┼──────────────────────────────────────┘
         ┌──────────────┼──────────────┐
         ▼              ▼              ▼
    ┌──────────┐   ┌──────────┐   ┌──────────┐
    │Prometheus│   │ Grafana  │   │ AlertMgr │
    │  Scrape  │   │Dashboard │   │          │
    └──────────┘   └──────────┘   └──────────┘
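
At the aggregator's edge, snapshots can be rendered in the Prometheus exposition format. A minimal sketch; the metric names are illustrative, not specified by this ADR:

/// Render a snapshot as Prometheus gauges (quantiles pre-computed upstream
/// by the background aggregator, never on the hot path)
fn render_prometheus(snap: &HistogramSnapshot) -> String {
    format!(
        "tick_to_trade_latency_ns{{quantile=\"0.5\"}} {}\n\
         tick_to_trade_latency_ns{{quantile=\"0.99\"}} {}\n\
         tick_to_trade_latency_ns{{quantile=\"0.999\"}} {}\n\
         tick_to_trade_latency_ns{{quantile=\"0.9999\"}} {}\n\
         tick_to_trade_samples_total {}\n",
        snap.p50, snap.p99, snap.p999, snap.p9999, snap.count,
    )
}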

Implementation Phases

| Phase | Scope             | Deliverables                                                                 |
| ----- | ----------------- | ---------------------------------------------------------------------------- |
| 1     | Hot path metrics  | TSC timing with serialization, per-thread HDR histograms, SPSC ring buffers |
| 2     | Warm path tracing | tracing integration, non-blocking subscriber                                 |
| 3     | Metrics export    | Prometheus endpoint, Grafana dashboards                                      |
| 4     | Alerting          | Alert rules for KPI thresholds                                               |

Consequences

Positive

  • Sub-microsecond measurement precision via serialized TSC timestamps
  • Full latency distribution visibility with HDR histograms (p99.99)
  • Production-safe profiling with non-blocking instrumentation
  • Actionable optimization data via causal profiling
  • No hot-path contention - histogram snapshots exported asynchronously

Negative

  • Increased complexity in metrics infrastructure
  • Platform-specific code (TSC/RDTSCP is x86-specific; ARM needs CNTVCT_EL0)
  • Memory overhead from histograms (one per thread)
  • Operational burden of maintaining dashboards and alerts

Risks

| Risk                              | Mitigation                                                                     |
| --------------------------------- | ------------------------------------------------------------------------------ |
| TSC frequency varies across cores | Calibrate at startup; pin threads to prevent migration (see ADR-013)          |
| Missed snapshot intervals         | Size snapshot interval appropriately (1 s default); alert on missed intervals |
| Histogram memory per thread       | Bound histogram range (e.g., 1 μs - 10 s); limit precision bits               |

Alternatives Considered

Alternative 1: Standard std::time::Instant

  • Pro: Cross-platform, simple API
  • Con: On Linux, uses vDSO (not syscall) with ~20-30ns overhead; still slower than TSC (~5ns)
  • Decision: Use for warm path and fallback; TSC for hot path on x86_64

Alternative 2: Full tracing Instrumentation Everywhere

  • Pro: Consistent API, excellent tooling
  • Con: Even with non-blocking subscriber, has measurable overhead (~100ns per span)
  • Decision: Use for warm path only; TSC + ring buffer for hot path

Alternative 3: Sampling-Based Profiling Only

  • Pro: Very low overhead
  • Con: Misses tail latency events, insufficient for p99.99 analysis
  • Decision: Use continuous recording with HDR histograms instead

References