ADR 012: Performance Monitoring Architecture

Status

Accepted

Context

Arbiter-Bot is a low-latency statistical arbitrage engine where microsecond-level performance is critical. Arbitrage opportunities exist for milliseconds; slow execution means missed profits or adverse fills. We need comprehensive performance observability that:

  1. Captures microsecond-level latency distributions, not just averages
  2. Minimizes measurement overhead on the hot path
  3. Identifies bottlenecks via profiling and metrics
  4. Operates in production without degrading trading performance

This ADR focuses on how to measure and observe performance. Architectural optimizations (thread affinity, memory pools, busy-polling) are covered in ADR-013.

Decision

Implement a multi-layer performance monitoring architecture with distinct strategies based on latency tolerance.

Layer 1: Hot Path Instrumentation (Microsecond-Sensitive)

The hot path (tick-to-trade) uses zero-allocation, non-blocking instrumentation:

| Component         | Tool                              | Rationale                                                               |
| ----------------- | --------------------------------- | ----------------------------------------------------------------------- |
| Timing            | RDTSCP + LFENCE                   | Serializing reads prevent CPU reordering; sub-nanosecond precision      |
| Latency Recording | hdrhistogram (per-thread)         | Captures full distribution including p99.99; merged periodically       |
| Histogram Export  | SPSC ring buffer (snapshots only) | Background thread consumes histogram snapshots, not individual samples |

use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};
use std::sync::atomic::{AtomicU64, Ordering};

/// Exports dropped because the ring buffer was full (exposed to the metrics aggregator)
static DROPPED_EXPORTS: AtomicU64 = AtomicU64::new(0);

/// Cache-line aligned to prevent false sharing
#[repr(C, align(64))]
pub struct HotPathMetrics {
    /// Pre-computed nanoseconds per TSC tick (avoids hot-path division)
    ns_per_tick: f64,
    _pad: [u8; 56],  // Pad to cache line
}

/// Per-thread histogram with double-buffering (no allocation on hot path)
pub struct ThreadLocalHistogram {
    /// Active histogram for recording (hot path writes here)
    active: hdrhistogram::Histogram<u64>,
    /// Spare histogram for swap (pre-allocated, avoids hot-path allocation)
    spare: hdrhistogram::Histogram<u64>,
    sample_count: u64,
}

impl ThreadLocalHistogram {
    pub fn new() -> Self {
        // Bounded range (1 μs .. 10 s) with 3 significant digits. Bounding
        // disables auto-resize, which would otherwise allocate on the hot path.
        let new_hist =
            || hdrhistogram::Histogram::new_with_bounds(1_000, 10_000_000_000, 3).unwrap();
        Self {
            active: new_hist(),
            spare: new_hist(),
            sample_count: 0,
        }
    }
}

impl HotPathMetrics {
    /// Serializing timestamp read at start of measurement
    #[inline(always)]
    pub fn record_start() -> u64 {
        unsafe {
            _mm_lfence();  // Serialize: ensure prior instructions complete
            _rdtsc()
        }
    }

    /// Serializing timestamp read at end of measurement
    #[inline(always)]
    pub fn record_end_tsc() -> u64 {
        let mut aux: u32 = 0;
        unsafe {
            let tsc = __rdtscp(&mut aux);  // RDTSCP waits for prior instructions to complete
            _mm_lfence();  // Prevent subsequent instructions from moving up
            tsc
        }
    }
}
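
impl HotPathMetrics {
    /// Sketch (not in the original ADR): calibrate ns_per_tick at startup.
    /// Measures TSC ticks across a known wall-clock interval; production
    /// code would repeat the measurement and take the median.
    pub fn calibrate() -> Self {
        let start_tsc = Self::record_start();
        let wall = std::time::Instant::now();
        std::thread::sleep(std::time::Duration::from_millis(50));
        let end_tsc = Self::record_end_tsc();
        let elapsed_ns = wall.elapsed().as_nanos() as f64;
        Self {
            ns_per_tick: elapsed_ns / end_tsc.wrapping_sub(start_tsc) as f64,
            _pad: [0; 56],
        }
    }
}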

impl ThreadLocalHistogram {
    /// Hot path: record to active histogram (no cross-thread ops)
    #[inline(always)]
    pub fn record(&mut self, start_tsc: u64, metrics: &HotPathMetrics) {
        let end_tsc = HotPathMetrics::record_end_tsc();
        let elapsed_tsc = end_tsc.wrapping_sub(start_tsc);

        // Multiplication (not division) for conversion
        let nanos = (elapsed_tsc as f64 * metrics.ns_per_tick) as u64;

        // Record to thread-local active histogram (no lock, no contention);
        // saturating_record clamps values outside the bounded range
        self.active.saturating_record(nanos);
        self.sample_count += 1;
    }

    /// Called by the OWNING THREAD at natural batch boundaries.
    /// Swaps active/spare histograms (O(1), NO ALLOCATION).
    /// IMPORTANT: Does NOT compute quantiles on hot path (O(N) operation).
    #[inline]
    pub fn maybe_export_histogram(
        &mut self,
        histogram_tx: &mut rtrb::Producer<ExportedHistogram>,
        export_interval_samples: u64,
    ) {
        if self.sample_count >= export_interval_samples {
            // Reset spare histogram (fast, just clears counts)
            self.spare.reset();

            // Swap active <-> spare (O(1) pointer swap, NO ALLOCATION)
            std::mem::swap(&mut self.active, &mut self.spare);

            let count = self.sample_count;
            self.sample_count = 0;

            // Clone histogram to send to background thread.
            // Trade-off: clone() is O(bucket_count) but happens infrequently (every N samples).
            // Hot path record() remains O(1). For N=1000, export overhead is amortized.
            // Alternative: triple-buffering avoids clone but adds complexity.
            //
            // NOTE: This clone is a known deviation from "zero allocation on every sample".
            // The allocation happens once per export_interval_samples (e.g., every 1000 samples).
            // For stricter requirements, use triple-buffering or raw sample ring buffers.
            // Push may fail when the ring buffer is full; drop the export
            // rather than block the hot path, but count the drop
            if histogram_tx.push(ExportedHistogram {
                histogram: self.spare.clone(),
                count
            }).is_err() {
                DROPPED_EXPORTS.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}

/// Raw histogram sent to background aggregator (quantiles computed there, not on hot path)
pub struct ExportedHistogram {
    pub histogram: hdrhistogram::Histogram<u64>,
    pub count: u64,
}

/// Background aggregator computes quantiles (O(N) is acceptable here)
pub fn aggregate_histograms(rx: &mut rtrb::Consumer<ExportedHistogram>) -> Option<HistogramSnapshot> {
    let mut merged = hdrhistogram::Histogram::<u64>::new(3).unwrap();
    let mut total_count = 0u64;

    while let Ok(exported) = rx.pop() {
        merged.add(&exported.histogram).ok();
        total_count += exported.count;
    }

    if total_count > 0 {
        Some(HistogramSnapshot {
            p50: merged.value_at_quantile(0.50),
            p99: merged.value_at_quantile(0.99),
            p999: merged.value_at_quantile(0.999),
            p9999: merged.value_at_quantile(0.9999),
            count: total_count,
        })
    } else {
        None
    }
}

/// Final snapshot with computed quantiles (computed by background thread)
pub struct HistogramSnapshot {
    pub p50: u64,
    pub p99: u64,
    pub p999: u64,
    pub p9999: u64,
    pub count: u64,
}

Key Design Decisions:

  • Hot path records to histogram only - no ring buffer push per sample (avoids cache contention)
  • Owning thread exports raw histograms - swaps the active histogram with a pre-allocated spare (O(1)) and sends it to the background; no quantile computation on the hot path (value_at_quantile is O(N))
  • Background thread computes quantiles - the expensive O(N) iteration happens off the critical path
  • LFENCE placement prevents instruction reordering: before RDTSC at the start of a measurement, after RDTSCP at the end
  • Pre-compute ns_per_tick at startup (multiply instead of divide on hot path)
  • SPSC ring buffer (rtrb) carries histogram exports to the background aggregator
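A sketch of how these pieces compose on a single hot-path thread; the event loop and the tick-to-trade work are placeholders, not part of this ADR:

// Hypothetical wiring of the Layer 1 pieces on one hot-path thread
fn run_hot_path(
    mut hist: ThreadLocalHistogram,
    metrics: HotPathMetrics,
    mut histogram_tx: rtrb::Producer<ExportedHistogram>,
) {
    loop {
        let start_tsc = HotPathMetrics::record_start();
        // ... tick-to-trade work goes here ...
        hist.record(start_tsc, &metrics);
        // Export at batch boundaries: swap + clone every 1000 samples
        hist.maybe_export_histogram(&mut histogram_tx, 1_000);
    }
}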

Layer 2: Warm Path Instrumentation (Millisecond-Tolerant)

For operations that can tolerate slight overhead (connection management, position tracking):

| Component          | Tool                         | Rationale                               |
| ------------------ | ---------------------------- | --------------------------------------- |
| Structured Logging | tracing crate                | Async-safe, span-based context          |
| Subscriber         | Non-blocking writer          | Prevents I/O blocking on the warm path  |
| Metrics Export     | Prometheus exposition format | Industry standard, scrape-based         |

use tracing::{instrument, info_span, Instrument};

impl ArbiterActor {
    #[instrument(skip(self, update), fields(market_id = %update.market_id))]
    pub async fn process_market_update(&mut self, update: MarketUpdate) {
        let span = info_span!("arbitrage_detection");

        async {
            // Detection logic
        }
        .instrument(span)
        .await
    }
}
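
The non-blocking writer can be wired up once at startup. A minimal sketch, assuming the tracing-appender and tracing-subscriber crates:

use tracing_appender::non_blocking;

/// Sketch: warm-path subscriber with a non-blocking writer. The returned
/// guard must be kept alive for the program's lifetime, or buffered log
/// lines are lost on drop.
fn init_warm_path_tracing() -> tracing_appender::non_blocking::WorkerGuard {
    let (writer, guard) = non_blocking(std::io::stdout());
    tracing_subscriber::fmt()
        .with_writer(writer)  // Log I/O happens on a background worker thread
        .init();
    guard
}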

Layer 3: System-Level Profiling (Development/Diagnostics)

For deep performance analysis during development and incident investigation:

| Tool       | Purpose                                          | When Used                      |
| ---------- | ------------------------------------------------ | ------------------------------ |
| perf       | CPU profiling, cache miss analysis               | Development, post-incident     |
| coz        | Causal profiling for optimization prioritization | Performance tuning sprints     |
| lttng      | Kernel/user space transition analysis            | Investigating syscall overhead |
| flamegraph | Visual call stack analysis                       | Identifying hot functions      |

Platform Portability

TSC is x86_64-specific. For ARM and development environments:

#[cfg(target_arch = "x86_64")]
mod timing {
    use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};

    #[inline(always)]
    pub fn read_start() -> u64 {
        unsafe { _mm_lfence(); _rdtsc() }
    }

    #[inline(always)]
    pub fn read_end() -> u64 {
        let mut aux: u32 = 0;
        unsafe {
            let tsc = __rdtscp(&mut aux);  // Waits for prior instructions to complete
            _mm_lfence();  // Prevent subsequent instructions from moving up
            tsc
        }
    }
}

#[cfg(target_arch = "aarch64")]
mod timing {
    use std::sync::OnceLock;

    /// Timer frequency in Hz (read from CNTFRQ_EL0 at startup)
    static FREQ_HZ: OnceLock<u64> = OnceLock::new();

    /// Read timer frequency from hardware register
    fn get_freq_hz() -> u64 {
        *FREQ_HZ.get_or_init(|| {
            let freq: u64;
            unsafe { std::arch::asm!("mrs {}, cntfrq_el0", out(reg) freq) };
            freq
        })
    }

    /// Nanoseconds per tick (pre-computed for fast conversion)
    pub fn ns_per_tick() -> f64 {
        1_000_000_000.0 / get_freq_hz() as f64
    }

    #[inline(always)]
    pub fn read_start() -> u64 {
        // Single asm block prevents compiler from reordering ISB and MRS
        let cnt: u64;
        unsafe {
            std::arch::asm!(
                "isb",           // Ensure prior instructions complete
                "mrs {}, cntvct_el0",  // Read timer
                out(reg) cnt,
                options(nostack, nomem)
            );
        }
        cnt
    }

    #[inline(always)]
    pub fn read_end() -> u64 {
        // Single asm block prevents compiler from reordering
        let cnt: u64;
        unsafe {
            std::arch::asm!(
                "isb",           // Ensure measured code has completed
                "mrs {}, cntvct_el0",  // Read timer
                out(reg) cnt,
                options(nostack, nomem)
            );
        }
        cnt
    }
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
mod timing {
    use std::sync::OnceLock;
    use std::time::Instant;

    /// Static epoch for monotonic timestamps (initialized on first use)
    static EPOCH: OnceLock<Instant> = OnceLock::new();

    fn epoch() -> &'static Instant {
        EPOCH.get_or_init(Instant::now)
    }

    // Fallback: ~20-30ns overhead via vDSO, but portable
    pub fn read_start() -> u64 {
        epoch().elapsed().as_nanos() as u64
    }

    pub fn read_end() -> u64 {
        epoch().elapsed().as_nanos() as u64
    }
}
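
Regardless of target, callers see the same two functions. A hypothetical wrapper (measure_ticks is illustrative, not part of this ADR) shows the intended call pattern:

/// Hypothetical helper: measure one closure invocation in raw timer ticks.
/// Convert to nanoseconds with the target's pre-computed ns_per_tick factor.
pub fn measure_ticks<F: FnOnce()>(f: F) -> u64 {
    let start = timing::read_start();
    f();
    timing::read_end().wrapping_sub(start)
}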

Key Performance Indicators (KPIs)

| KPI                         | Target   | Alert Threshold |
| --------------------------- | -------- | --------------- |
| Tick-to-Trade p50           | < 100 μs | > 500 μs        |
| Tick-to-Trade p99           | < 500 μs | > 2 ms          |
| Tick-to-Trade p99.99        | < 2 ms   | > 10 ms         |
| Event Loop Lag              | < 1 ms   | > 5 ms          |
| Lock Contention Rate        | < 0.1%   | > 1%            |
| CPU Cache Miss Rate         | < 5%     | > 15%           |
| Kernel Context Switches/sec | < 1000   | > 5000          |
| Missed Snapshot Intervals   | 0        | > 10/min        |

Metrics Export Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Hot Path Thread                          │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐                │
│  │ RDTSCP   │──>│Per-Thread│──>│ Export   │                │
│  │ + LFENCE │   │Histogram │   │ Snapshot │                │
│  └──────────┘   └──────────┘   └──────────┘                │
│                                      │                       │
│  (Owning thread calls maybe_export_histogram()              │
│   at batch boundaries - no cross-thread histogram access)   │
│                                      │                       │
└──────────────────────────────────────┼───────────────────────┘
                                       │ (SPSC ring buffer)
┌─────────────────────────────────────────────────────────────┐
│                   Metrics Aggregator                         │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐                │
│  │ Merged   │   │ Dropped  │   │ Gauges   │                │
│  │Histogram │   │ Counter  │   │          │                │
│  └──────────┘   └──────────┘   └──────────┘                │
│                       │                                      │
└───────────────────────┼──────────────────────────────────────┘
         ┌──────────────┼──────────────┐
         ▼              ▼              ▼
    ┌──────────┐   ┌──────────┐   ┌──────────┐
    │Prometheus│   │ Grafana  │   │ AlertMgr │
    │  Scrape  │   │Dashboard │   │          │
    └──────────┘   └──────────┘   └──────────┘
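
At the aggregator's edge, snapshots can be rendered in the Prometheus exposition format. A minimal sketch; the metric names are illustrative, not specified by this ADR:

/// Render a snapshot as Prometheus gauges (quantiles pre-computed upstream
/// by the background aggregator, never on the hot path)
fn render_prometheus(snap: &HistogramSnapshot) -> String {
    format!(
        "tick_to_trade_latency_ns{{quantile=\"0.5\"}} {}\n\
         tick_to_trade_latency_ns{{quantile=\"0.99\"}} {}\n\
         tick_to_trade_latency_ns{{quantile=\"0.999\"}} {}\n\
         tick_to_trade_latency_ns{{quantile=\"0.9999\"}} {}\n\
         tick_to_trade_samples_total {}\n",
        snap.p50, snap.p99, snap.p999, snap.p9999, snap.count,
    )
}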

Implementation Phases

| Phase | Scope             | Deliverables                                                                 |
| ----- | ----------------- | ---------------------------------------------------------------------------- |
| 1     | Hot path metrics  | TSC timing with serialization, per-thread HDR histograms, SPSC ring buffers |
| 2     | Warm path tracing | tracing integration, non-blocking subscriber                                 |
| 3     | Metrics export    | Prometheus endpoint, Grafana dashboards                                      |
| 4     | Alerting          | Alert rules for KPI thresholds                                               |

Consequences

Positive

  • Sub-microsecond measurement precision via serialized TSC timestamps
  • Full latency distribution visibility with HDR histograms (p99.99)
  • Production-safe profiling with non-blocking instrumentation
  • Actionable optimization data via causal profiling
  • No hot-path contention - histogram snapshots exported asynchronously

Negative

  • Increased complexity in metrics infrastructure
  • Platform-specific code (TSC/RDTSCP is x86-specific; ARM needs CNTVCT_EL0)
  • Memory overhead from histograms (one per thread)
  • Operational burden of maintaining dashboards and alerts

Risks

| Risk                              | Mitigation                                                                     |
| --------------------------------- | ------------------------------------------------------------------------------ |
| TSC frequency varies across cores | Calibrate at startup; pin threads to prevent migration (see ADR-013)          |
| Missed snapshot intervals         | Size snapshot interval appropriately (1 s default); alert on missed intervals |
| Histogram memory per thread       | Bound histogram range (e.g., 1 μs - 10 s); limit precision bits               |

Alternatives Considered

Alternative 1: Standard std::time::Instant

  • Pro: Cross-platform, simple API
  • Con: On Linux, uses vDSO (not syscall) with ~20-30ns overhead; still slower than TSC (~5ns)
  • Decision: Use for warm path and fallback; TSC for hot path on x86_64

Alternative 2: Full tracing Instrumentation Everywhere

  • Pro: Consistent API, excellent tooling
  • Con: Even with non-blocking subscriber, has measurable overhead (~100ns per span)
  • Decision: Use for warm path only; TSC + ring buffer for hot path

Alternative 3: Sampling-Based Profiling Only

  • Pro: Very low overhead
  • Con: Misses tail latency events, insufficient for p99.99 analysis
  • Decision: Use continuous recording with HDR histograms instead

References