Performance Monitoring

Low-overhead instrumentation for microsecond-level latency measurement.

Overview

Arbiter-Bot uses a multi-layer performance monitoring architecture optimized for low-latency trading:

| Layer | Use Case | Overhead | Tools |
|-------|----------|----------|-------|
| Hot Path | Tick-to-trade timing | ~5 ns | TSC, HDR Histogram |
| Warm Path | Position tracking, connection management | ~100 ns | tracing spans |
| System | Development profiling | Variable | perf, flamegraph |

Hot Path Instrumentation

The hot path uses platform-specific CPU instructions for nanosecond-precision timing with minimal overhead.

Platform Support

| Architecture | Instruction | Precision | Overhead |
|--------------|-------------|-----------|----------|
| x86_64 | RDTSCP + LFENCE | ~1 ns | ~5 ns |
| AArch64 (ARM) | CNTVCT_EL0 + ISB | ~40 ns | ~10 ns |
| Other | std::time::Instant | ~20 ns | ~30 ns |

Basic Usage

use arbiter_engine::metrics::timing::{read_start, read_end, elapsed_nanos};

// Start timing
let start = read_start();

// ... your hot path code ...

// End timing
let end = read_end();

// Get elapsed nanoseconds
let nanos = elapsed_nanos(start, end);
println!("Operation took {} ns", nanos);

How TSC Timing Works

On x86_64, timing uses the Time Stamp Counter (TSC) with serializing instructions:

Start timing:

// LFENCE ensures all prior instructions complete before reading TSC
_mm_lfence();
let tsc = _rdtsc();

End timing:

// RDTSCP waits for all prior instructions to complete before reading TSC
let tsc = __rdtscp(&mut aux);
// LFENCE prevents subsequent instructions from executing early
_mm_lfence();

This prevents CPU instruction reordering from affecting measurements.
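Putting both halves together, here is a self-contained sketch. It uses only the standard library; the function names mirror the arbiter_engine API described above but are local re-implementations, and the non-x86_64 fallback is purely illustrative:

```rust
/// Start-side read: LFENCE then RDTSC, so prior work retires first.
#[cfg(target_arch = "x86_64")]
fn read_start() -> u64 {
    use std::arch::x86_64::{_mm_lfence, _rdtsc};
    // SAFETY: these intrinsics are always available on x86_64
    unsafe {
        _mm_lfence();
        _rdtsc()
    }
}

/// End-side read: RDTSCP then LFENCE, fencing in both directions.
#[cfg(target_arch = "x86_64")]
fn read_end() -> u64 {
    use std::arch::x86_64::{__rdtscp, _mm_lfence};
    let mut aux = 0u32;
    // SAFETY: as above
    unsafe {
        let tsc = __rdtscp(&mut aux);
        _mm_lfence();
        tsc
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let start = read_start();
        let work: u64 = (0..1_000u64).sum(); // stand-in for hot path code
        let cycles = read_end() - start;
        println!("sum={} took {} cycles", work, cycles);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        // Portable fallback: coarser, but keeps the sketch runnable anywhere
        let start = std::time::Instant::now();
        let work: u64 = (0..1_000u64).sum();
        println!("sum={} took {:?}", work, start.elapsed());
    }
}
```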

HDR Histogram Collection

Use HDR histograms for capturing full latency distributions including tail latencies (p99.99).

Thread-Local Histogram

Each hot path thread maintains its own histogram to avoid cross-thread contention:

use arbiter_engine::metrics::histogram::ThreadLocalHistogram;

// Create histogram (returns consumer for aggregation)
let (histogram, consumer) = ThreadLocalHistogram::new();

// Record latency (hot path - no allocation)
histogram.record(elapsed_nanos);

// Export periodically (swaps buffers, O(1))
if should_export {
    histogram.export();
}

Double-Buffering

The histogram uses double-buffering for lock-free export:

  1. Active buffer: Receives hot path recordings
  2. Spare buffer: Ready for swap on export

On export:

  1. Clone the active histogram for the background thread
  2. Swap active ↔ spare (O(1) pointer swap)
  3. Reset the new active buffer
  4. Send the clone to the aggregator via an SPSC ring buffer

Hot Path Thread                    Background Thread
┌──────────────┐                   ┌───────────────────┐
│   record()   │                   │ drain_and_merge() │
│   record()   │                   │                   │
│   record()   │                   │                   │
│   export()   │──── SPSC ────────>│compute_quantiles()│
│              │     Ring          │                   │
│   record()   │     Buffer        │                   │
│   record()   │                   │                   │
└──────────────┘                   └───────────────────┘
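The swap-on-export step can be sketched with the standard library alone. This is a simplified stand-in: a HashMap of counts instead of a real HDR histogram, an mpsc channel instead of the SPSC ring buffer, and it takes the filled buffer rather than cloning it, which still keeps export O(1) on the hot path:

```rust
use std::collections::HashMap;
use std::mem;
use std::sync::mpsc;

/// Illustrative stand-in for an HDR histogram: exact counts per value.
#[derive(Clone, Default)]
struct Buckets(HashMap<u64, u64>);

struct ThreadLocalHistogram {
    active: Buckets,           // receives hot-path recordings
    spare: Buckets,            // empty buffer, ready to swap in
    tx: mpsc::Sender<Buckets>, // stand-in for the SPSC ring buffer
}

impl ThreadLocalHistogram {
    fn record(&mut self, nanos: u64) {
        // Hot path: no lock, no I/O
        *self.active.0.entry(nanos).or_insert(0) += 1;
    }

    /// Swap active <-> spare (O(1)), then ship the filled buffer
    /// to the background aggregator.
    fn export(&mut self) {
        mem::swap(&mut self.active, &mut self.spare);
        let filled = mem::take(&mut self.spare); // old active, now detached
        let _ = self.tx.send(filled);
        // self.spare is now an empty Buckets, ready for the next swap
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut h = ThreadLocalHistogram {
        active: Buckets::default(),
        spare: Buckets::default(),
        tx,
    };
    h.record(120);
    h.record(120);
    h.record(450);
    h.export();

    let exported = rx.recv().unwrap();
    assert_eq!(exported.0[&120], 2);
    assert_eq!(exported.0[&450], 1);
    assert!(h.active.0.is_empty()); // fresh buffer after export
    println!("exported {} distinct values", exported.0.len());
}
```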

Background Aggregation

The MetricsCollector merges histograms from multiple threads:

use arbiter_engine::metrics::collector::MetricsCollector;
use std::time::Duration;

// Collect consumers from all hot path threads
let consumers = vec![consumer1, consumer2, consumer3];
let mut collector = MetricsCollector::new(consumers);

// Periodically drain and merge (background thread)
loop {
    collector.drain_and_merge();

    if let Some(stats) = collector.compute_quantiles() {
        println!("p50: {} ns", stats.p50);
        println!("p99: {} ns", stats.p99);
        println!("p99.99: {} ns", stats.p99_99);
        println!("samples: {}", stats.sample_count);
    }

    std::thread::sleep(Duration::from_secs(1));
}

Quantile Statistics

The QuantileStats struct provides:

| Field | Description |
|-------|-------------|
| p50 | Median latency |
| p99 | 99th percentile |
| p99_9 | 99.9th percentile |
| p99_99 | 99.99th percentile (tail latency) |
| min | Minimum observed |
| max | Maximum observed |
| mean | Average latency |
| sample_count | Total samples |
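For intuition, the fields above can be derived from a raw sample buffer. This sketch is not the HDR histogram implementation (which approximates with log-scaled buckets rather than sorting); it uses a nearest-rank quantile over a sorted Vec, which is exactly the kind of O(N log N) work that must stay off the hot path:

```rust
/// Hypothetical mirror of the QuantileStats fields described above,
/// computed here from a plain sorted sample buffer for illustration.
#[derive(Debug)]
struct QuantileStats {
    p50: u64,
    p99: u64,
    p99_9: u64,
    p99_99: u64,
    min: u64,
    max: u64,
    mean: f64,
    sample_count: u64,
}

fn value_at_quantile(sorted: &[u64], q: f64) -> u64 {
    // Nearest-rank quantile: index = ceil(q * N) - 1, clamped to bounds
    let n = sorted.len();
    let idx = ((q * n as f64).ceil() as usize).saturating_sub(1).min(n - 1);
    sorted[idx]
}

fn compute_quantiles(mut samples: Vec<u64>) -> Option<QuantileStats> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable(); // O(N log N): background thread only
    let sum: u64 = samples.iter().sum();
    Some(QuantileStats {
        p50: value_at_quantile(&samples, 0.50),
        p99: value_at_quantile(&samples, 0.99),
        p99_9: value_at_quantile(&samples, 0.999),
        p99_99: value_at_quantile(&samples, 0.9999),
        min: samples[0],
        max: *samples.last().unwrap(),
        mean: sum as f64 / samples.len() as f64,
        sample_count: samples.len() as u64,
    })
}

fn main() {
    // Samples 1..=1000 ns: p50 = 500, p99 = 990, min = 1, max = 1000
    let samples: Vec<u64> = (1..=1000).collect();
    let stats = compute_quantiles(samples).unwrap();
    assert_eq!(stats.p50, 500);
    assert_eq!(stats.p99, 990);
    assert_eq!(stats.min, 1);
    assert_eq!(stats.max, 1000);
    assert_eq!(stats.sample_count, 1000);
    println!("{:?}", stats);
}
```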

Key Performance Indicators

Monitor these KPIs in production:

| KPI | Target | Alert Threshold |
|-----|--------|-----------------|
| Tick-to-Trade p50 | < 100 μs | > 500 μs |
| Tick-to-Trade p99 | < 500 μs | > 2 ms |
| Tick-to-Trade p99.99 | < 2 ms | > 10 ms |
| Event Loop Lag | < 1 ms | > 5 ms |
| Missed Exports | 0 | > 10/min |
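A minimal alerting check against the tick-to-trade thresholds in the table might look like the following sketch (the function name and alert strings are hypothetical, not part of arbiter_engine):

```rust
/// Compare measured tick-to-trade percentiles (in nanoseconds)
/// against the alert thresholds from the KPI table.
fn check_tick_to_trade(p50: u64, p99: u64, p99_99: u64) -> Vec<&'static str> {
    let mut alerts = Vec::new();
    if p50 > 500_000 {          // > 500 μs
        alerts.push("tick-to-trade p50 above 500 μs");
    }
    if p99 > 2_000_000 {        // > 2 ms
        alerts.push("tick-to-trade p99 above 2 ms");
    }
    if p99_99 > 10_000_000 {    // > 10 ms
        alerts.push("tick-to-trade p99.99 above 10 ms");
    }
    alerts
}

fn main() {
    // Healthy: every percentile within target
    assert!(check_tick_to_trade(80_000, 400_000, 1_500_000).is_empty());

    // Degraded: only p99 breaches its alert threshold
    let alerts = check_tick_to_trade(120_000, 3_000_000, 8_000_000);
    assert_eq!(alerts, vec!["tick-to-trade p99 above 2 ms"]);
    println!("{:?}", alerts);
}
```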

Warm Path Instrumentation

For less latency-sensitive operations, use the tracing crate:

use tracing::{instrument, info_span, Instrument};

#[instrument(skip(self), fields(market_id = %update.market_id))]
pub async fn process_market_update(&mut self, update: MarketUpdate) {
    let span = info_span!("arbitrage_detection");

    async {
        // Detection logic
    }
    .instrument(span)
    .await
}

Configure a non-blocking writer so logging I/O never blocks the event loop. The tracing-appender crate provides one; keep the returned guard alive for the lifetime of the program so buffered logs are flushed on shutdown:

use tracing_appender::non_blocking;
use tracing_subscriber::{fmt, EnvFilter};

// Log writes are handed off to a dedicated worker thread
let (writer, _guard) = non_blocking(std::io::stderr());

tracing_subscriber::fmt()
    .with_env_filter(EnvFilter::from_default_env())
    .with_writer(writer)
    .init();

System-Level Profiling

For deep performance analysis during development:

CPU Profiling with perf

# Record CPU samples
perf record -g --call-graph dwarf target/release/arbiter-engine

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

Cache Analysis

# L1 cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads target/release/arbiter-engine

# LLC (Last Level Cache) misses
perf stat -e LLC-load-misses,LLC-loads target/release/arbiter-engine

Tools Reference

| Tool | Purpose | When to Use |
|------|---------|-------------|
| perf | CPU profiling, cache analysis | Performance tuning |
| flamegraph | Visual call stack analysis | Identifying hot functions |
| coz | Causal profiling | Prioritizing optimizations |
| lttng | Kernel/user space analysis | Investigating syscalls |

Implementation Details

Histogram Configuration

The HDR histogram is configured for latency measurement:

const MAX_LATENCY_NS: u64 = 60_000_000_000; // 60 seconds max
const SIGFIGS: u8 = 3;                       // 3 significant figures

Histogram::new_with_max(MAX_LATENCY_NS, SIGFIGS)

This provides:

  • Range: 0 to 60 seconds
  • Precision: 0.1% relative error
  • Memory: ~18 KB per histogram

TSC Calibration

The TSC runs at a fixed frequency on modern CPUs. For accurate nanosecond conversion:

// Approximate: assumes ~3GHz TSC
let nanos = cycles / 3;

// For production accuracy, calibrate at startup:
// 1. Read TSC
// 2. Sleep for known duration
// 3. Read TSC again
// 4. Compute cycles_per_nanosecond
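The four calibration steps can be sketched as follows. Here read_cycles is a local helper, not the arbiter_engine API: it reads the TSC on x86_64 and falls back to Instant-based nanoseconds elsewhere (so the ratio comes out near 1.0), which keeps the sketch compiling on any target:

```rust
use std::time::{Duration, Instant};

/// Read a raw cycle/tick count: TSC on x86_64, nanoseconds elsewhere.
fn read_cycles() -> u64 {
    #[cfg(target_arch = "x86_64")]
    // SAFETY: _rdtsc is always available on x86_64
    unsafe {
        std::arch::x86_64::_rdtsc()
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        use std::sync::OnceLock;
        static START: OnceLock<Instant> = OnceLock::new();
        let start = *START.get_or_init(Instant::now);
        start.elapsed().as_nanos() as u64
    }
}

/// Steps 1-4 above: sample the counter across a known sleep and
/// derive the cycles-per-nanosecond conversion factor.
fn calibrate_cycles_per_nano() -> f64 {
    let wall_start = Instant::now();
    let c0 = read_cycles();                        // 1. read the counter
    std::thread::sleep(Duration::from_millis(50)); // 2. sleep a known duration
    let c1 = read_cycles();                        // 3. read again
    // Use the measured wall time, not the requested sleep: sleeps overshoot
    let elapsed_ns = wall_start.elapsed().as_nanos() as f64;
    (c1 - c0) as f64 / elapsed_ns                  // 4. cycles per nanosecond
}

fn main() {
    let cpn = calibrate_cycles_per_nano();
    assert!(cpn > 0.0, "counter should advance across the sleep");
    println!("~{:.3} cycles per nanosecond", cpn);
}
```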

ARM Counter Frequency

On AArch64, read the counter frequency from hardware:

let freq: u64;
unsafe { std::arch::asm!("mrs {}, cntfrq_el0", out(reg) freq) };
let ns_per_tick = 1_000_000_000.0 / freq as f64;

Best Practices

Do

  • Use read_start() before and read_end() after the measured operation
  • Export histograms at natural batch boundaries (e.g., every 1000 samples)
  • Keep the background aggregator on a separate thread
  • Pre-allocate histograms at startup
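The batch-boundary rule can be sketched with a simple counter, so the export decision never involves a wall-clock read on the hot path (BatchedRecorder is a hypothetical wrapper, not part of arbiter_engine):

```rust
/// Export every BATCH samples so export cost is amortized across
/// recordings instead of being triggered by time checks.
const BATCH: u64 = 1_000;

struct BatchedRecorder {
    recorded: u64,
    exported_batches: u64,
}

impl BatchedRecorder {
    fn record(&mut self, _nanos: u64) {
        self.recorded += 1;
        if self.recorded % BATCH == 0 {
            self.export(); // natural batch boundary: every 1000 samples
        }
    }

    fn export(&mut self) {
        // Real code would swap buffers and send to the aggregator here
        self.exported_batches += 1;
    }
}

fn main() {
    let mut r = BatchedRecorder { recorded: 0, exported_batches: 0 };
    for i in 0..2_500u64 {
        r.record(i);
    }
    assert_eq!(r.exported_batches, 2); // boundaries at samples 1000 and 2000
    println!("{} batches exported", r.exported_batches);
}
```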

Don't

  • Compute quantiles on the hot path (O(N) operation)
  • Share histograms between threads (use per-thread + aggregation)
  • Call system time functions (Instant::now()) on the hot path
  • Record every sample to a channel (use batched histogram export)