Performance Monitoring

Low-overhead instrumentation for microsecond-level latency measurement.

Overview

Arbiter-Bot uses a multi-layer performance monitoring architecture optimized for low-latency trading:

| Layer | Use Case | Overhead | Tools |
|-------|----------|----------|-------|
| Hot Path | Tick-to-trade timing | ~5 ns | TSC, HDR Histogram |
| Warm Path | Position tracking, connection management | ~100 ns | tracing spans |
| System | Development profiling | Variable | perf, flamegraph |

Hot Path Instrumentation

The hot path uses platform-specific CPU instructions for nanosecond-precision timing with minimal overhead.

Platform Support

| Architecture | Instruction | Precision | Overhead |
|--------------|-------------|-----------|----------|
| x86_64 | RDTSCP + LFENCE | ~1 ns | ~5 ns |
| AArch64 (ARM) | CNTVCT_EL0 + ISB | ~40 ns | ~10 ns |
| Other | std::time::Instant | ~20 ns | ~30 ns |

Basic Usage

use arbiter_engine::metrics::timing::{read_start, read_end, elapsed_nanos};

// Start timing
let start = read_start();

// ... your hot path code ...

// End timing
let end = read_end();

// Get elapsed nanoseconds
let nanos = elapsed_nanos(start, end);
println!("Operation took {} ns", nanos);

How TSC Timing Works

On x86_64, timing uses the Time Stamp Counter (TSC) with serializing instructions:

Start timing:

// LFENCE ensures all prior instructions complete before reading TSC
_mm_lfence();
let tsc = _rdtsc();

End timing:

// RDTSCP waits for all prior instructions to complete before reading TSC
let tsc = __rdtscp(&mut aux);
// LFENCE prevents subsequent instructions from executing early
_mm_lfence();

This prevents CPU instruction reordering from affecting measurements.
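Putting both halves together, here is a self-contained sketch. It uses only the standard library; the function names mirror the arbiter_engine API described above but are local re-implementations, and the non-x86_64 fallback is purely illustrative:

```rust
/// Start-side read: LFENCE then RDTSC, so prior work retires first.
#[cfg(target_arch = "x86_64")]
fn read_start() -> u64 {
    use std::arch::x86_64::{_mm_lfence, _rdtsc};
    // SAFETY: these intrinsics are always available on x86_64
    unsafe {
        _mm_lfence();
        _rdtsc()
    }
}

/// End-side read: RDTSCP then LFENCE, fencing in both directions.
#[cfg(target_arch = "x86_64")]
fn read_end() -> u64 {
    use std::arch::x86_64::{__rdtscp, _mm_lfence};
    let mut aux = 0u32;
    // SAFETY: as above
    unsafe {
        let tsc = __rdtscp(&mut aux);
        _mm_lfence();
        tsc
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let start = read_start();
        let work: u64 = (0..1_000u64).sum(); // stand-in for hot path code
        let cycles = read_end() - start;
        println!("sum={} took {} cycles", work, cycles);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        // Portable fallback: coarser, but keeps the sketch runnable anywhere
        let start = std::time::Instant::now();
        let work: u64 = (0..1_000u64).sum();
        println!("sum={} took {:?}", work, start.elapsed());
    }
}
```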

HDR Histogram Collection

Use HDR histograms for capturing full latency distributions including tail latencies (p99.99).

Thread-Local Histogram

Each hot path thread maintains its own histogram to avoid cross-thread contention:

use arbiter_engine::metrics::histogram::ThreadLocalHistogram;

// Create histogram (returns consumer for aggregation)
let (histogram, consumer) = ThreadLocalHistogram::new();

// Record latency (hot path - no allocation)
histogram.record(elapsed_nanos);

// Export periodically (swaps buffers, O(1))
if should_export {
    histogram.export();
}

Double-Buffering

The histogram uses double-buffering for lock-free export:

  1. Active buffer: Receives hot path recordings
  2. Spare buffer: Ready for swap on export

On export:

  1. Clone the active histogram for the background thread
  2. Swap active ↔ spare (O(1) pointer swap)
  3. Reset the new active buffer
  4. Send the clone to the aggregator via an SPSC ring buffer

Hot Path Thread                    Background Thread
┌──────────────┐                   ┌───────────────────┐
│   record()   │                   │ drain_and_merge() │
│   record()   │                   │                   │
│   record()   │                   │                   │
│   export()   │──── SPSC ────────>│compute_quantiles()│
│              │     Ring          │                   │
│   record()   │     Buffer        │                   │
│   record()   │                   │                   │
└──────────────┘                   └───────────────────┘
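The swap-on-export step can be sketched with the standard library alone. This is a simplified stand-in: a HashMap of counts instead of a real HDR histogram, an mpsc channel instead of the SPSC ring buffer, and it takes the filled buffer rather than cloning it, which still keeps export O(1) on the hot path:

```rust
use std::collections::HashMap;
use std::mem;
use std::sync::mpsc;

/// Illustrative stand-in for an HDR histogram: exact counts per value.
#[derive(Clone, Default)]
struct Buckets(HashMap<u64, u64>);

struct ThreadLocalHistogram {
    active: Buckets,           // receives hot-path recordings
    spare: Buckets,            // empty buffer, ready to swap in
    tx: mpsc::Sender<Buckets>, // stand-in for the SPSC ring buffer
}

impl ThreadLocalHistogram {
    fn record(&mut self, nanos: u64) {
        // Hot path: no lock, no I/O
        *self.active.0.entry(nanos).or_insert(0) += 1;
    }

    /// Swap active <-> spare (O(1)), then ship the filled buffer
    /// to the background aggregator.
    fn export(&mut self) {
        mem::swap(&mut self.active, &mut self.spare);
        let filled = mem::take(&mut self.spare); // old active, now detached
        let _ = self.tx.send(filled);
        // self.spare is now an empty Buckets, ready for the next swap
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut h = ThreadLocalHistogram {
        active: Buckets::default(),
        spare: Buckets::default(),
        tx,
    };
    h.record(120);
    h.record(120);
    h.record(450);
    h.export();

    let exported = rx.recv().unwrap();
    assert_eq!(exported.0[&120], 2);
    assert_eq!(exported.0[&450], 1);
    assert!(h.active.0.is_empty()); // fresh buffer after export
    println!("exported {} distinct values", exported.0.len());
}
```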

Background Aggregation

The MetricsCollector merges histograms from multiple threads:

use arbiter_engine::metrics::collector::MetricsCollector;
use std::time::Duration;

// Collect consumers from all hot path threads
let consumers = vec![consumer1, consumer2, consumer3];
let mut collector = MetricsCollector::new(consumers);

// Periodically drain and merge (background thread)
loop {
    collector.drain_and_merge();

    if let Some(stats) = collector.compute_quantiles() {
        println!("p50: {} ns", stats.p50);
        println!("p99: {} ns", stats.p99);
        println!("p99.99: {} ns", stats.p99_99);
        println!("samples: {}", stats.sample_count);
    }

    std::thread::sleep(Duration::from_secs(1));
}

Quantile Statistics

The QuantileStats struct provides:

| Field | Description |
|-------|-------------|
| p50 | Median latency |
| p99 | 99th percentile |
| p99_9 | 99.9th percentile |
| p99_99 | 99.99th percentile (tail latency) |
| min | Minimum observed |
| max | Maximum observed |
| mean | Average latency |
| sample_count | Total samples |
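For intuition, the fields above can be derived from a raw sample buffer. This sketch is not the HDR histogram implementation (which approximates with log-scaled buckets rather than sorting); it uses a nearest-rank quantile over a sorted Vec, which is exactly the kind of O(N log N) work that must stay off the hot path:

```rust
/// Hypothetical mirror of the QuantileStats fields described above,
/// computed here from a plain sorted sample buffer for illustration.
#[derive(Debug)]
struct QuantileStats {
    p50: u64,
    p99: u64,
    p99_9: u64,
    p99_99: u64,
    min: u64,
    max: u64,
    mean: f64,
    sample_count: u64,
}

fn value_at_quantile(sorted: &[u64], q: f64) -> u64 {
    // Nearest-rank quantile: index = ceil(q * N) - 1, clamped to bounds
    let n = sorted.len();
    let idx = ((q * n as f64).ceil() as usize).saturating_sub(1).min(n - 1);
    sorted[idx]
}

fn compute_quantiles(mut samples: Vec<u64>) -> Option<QuantileStats> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable(); // O(N log N): background thread only
    let sum: u64 = samples.iter().sum();
    Some(QuantileStats {
        p50: value_at_quantile(&samples, 0.50),
        p99: value_at_quantile(&samples, 0.99),
        p99_9: value_at_quantile(&samples, 0.999),
        p99_99: value_at_quantile(&samples, 0.9999),
        min: samples[0],
        max: *samples.last().unwrap(),
        mean: sum as f64 / samples.len() as f64,
        sample_count: samples.len() as u64,
    })
}

fn main() {
    // Samples 1..=1000 ns: p50 = 500, p99 = 990, min = 1, max = 1000
    let samples: Vec<u64> = (1..=1000).collect();
    let stats = compute_quantiles(samples).unwrap();
    assert_eq!(stats.p50, 500);
    assert_eq!(stats.p99, 990);
    assert_eq!(stats.min, 1);
    assert_eq!(stats.max, 1000);
    assert_eq!(stats.sample_count, 1000);
    println!("{:?}", stats);
}
```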

Key Performance Indicators

Monitor these KPIs in production:

| KPI | Target | Alert Threshold |
|-----|--------|-----------------|
| Tick-to-Trade p50 | < 100 μs | > 500 μs |
| Tick-to-Trade p99 | < 500 μs | > 2 ms |
| Tick-to-Trade p99.99 | < 2 ms | > 10 ms |
| Event Loop Lag | < 1 ms | > 5 ms |
| Missed Exports | 0 | > 10/min |
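A minimal alerting check against the tick-to-trade thresholds in the table might look like the following sketch (the function name and alert strings are hypothetical, not part of arbiter_engine):

```rust
/// Compare measured tick-to-trade percentiles (in nanoseconds)
/// against the alert thresholds from the KPI table.
fn check_tick_to_trade(p50: u64, p99: u64, p99_99: u64) -> Vec<&'static str> {
    let mut alerts = Vec::new();
    if p50 > 500_000 {          // > 500 μs
        alerts.push("tick-to-trade p50 above 500 μs");
    }
    if p99 > 2_000_000 {        // > 2 ms
        alerts.push("tick-to-trade p99 above 2 ms");
    }
    if p99_99 > 10_000_000 {    // > 10 ms
        alerts.push("tick-to-trade p99.99 above 10 ms");
    }
    alerts
}

fn main() {
    // Healthy: every percentile within target
    assert!(check_tick_to_trade(80_000, 400_000, 1_500_000).is_empty());

    // Degraded: only p99 breaches its alert threshold
    let alerts = check_tick_to_trade(120_000, 3_000_000, 8_000_000);
    assert_eq!(alerts, vec!["tick-to-trade p99 above 2 ms"]);
    println!("{:?}", alerts);
}
```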

Warm Path Instrumentation

For less latency-sensitive operations, use the tracing crate:

use tracing::{instrument, info_span, Instrument};

#[instrument(skip(self), fields(market_id = %update.market_id))]
pub async fn process_market_update(&mut self, update: MarketUpdate) {
    let span = info_span!("arbitrage_detection");

    async {
        // Detection logic
    }
    .instrument(span)
    .await
}

Configure a non-blocking writer so logging I/O never blocks the event loop. The tracing-appender crate provides one; keep the returned guard alive for the lifetime of the program so buffered logs are flushed on shutdown:

use tracing_appender::non_blocking;
use tracing_subscriber::{fmt, EnvFilter};

// Log writes are handed off to a dedicated worker thread
let (writer, _guard) = non_blocking(std::io::stderr());

tracing_subscriber::fmt()
    .with_env_filter(EnvFilter::from_default_env())
    .with_writer(writer)
    .init();

System-Level Profiling

For deep performance analysis during development:

CPU Profiling with perf

# Record CPU samples
perf record -g --call-graph dwarf target/release/arbiter-engine

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

Cache Analysis

# L1 cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads target/release/arbiter-engine

# LLC (Last Level Cache) misses
perf stat -e LLC-load-misses,LLC-loads target/release/arbiter-engine

Tools Reference

| Tool | Purpose | When to Use |
|------|---------|-------------|
| perf | CPU profiling, cache analysis | Performance tuning |
| flamegraph | Visual call stack analysis | Identifying hot functions |
| coz | Causal profiling | Prioritizing optimizations |
| lttng | Kernel/user space analysis | Investigating syscalls |

Implementation Details

Histogram Configuration

The HDR histogram is configured for latency measurement:

const MAX_LATENCY_NS: u64 = 60_000_000_000; // 60 seconds max
const SIGFIGS: u8 = 3;                       // 3 significant figures

Histogram::new_with_max(MAX_LATENCY_NS, SIGFIGS)

This provides:

  • Range: 0 to 60 seconds
  • Precision: 0.1% relative error
  • Memory: ~18 KB per histogram

TSC Calibration

The TSC runs at a fixed frequency on modern CPUs. For accurate nanosecond conversion:

// Approximate: assumes ~3GHz TSC
let nanos = cycles / 3;

// For production accuracy, calibrate at startup:
// 1. Read TSC
// 2. Sleep for known duration
// 3. Read TSC again
// 4. Compute cycles_per_nanosecond
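The four calibration steps can be sketched as follows. Here read_cycles is a local helper, not the arbiter_engine API: it reads the TSC on x86_64 and falls back to Instant-based nanoseconds elsewhere (so the ratio comes out near 1.0), which keeps the sketch compiling on any target:

```rust
use std::time::{Duration, Instant};

/// Read a raw cycle/tick count: TSC on x86_64, nanoseconds elsewhere.
fn read_cycles() -> u64 {
    #[cfg(target_arch = "x86_64")]
    // SAFETY: _rdtsc is always available on x86_64
    unsafe {
        std::arch::x86_64::_rdtsc()
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        use std::sync::OnceLock;
        static START: OnceLock<Instant> = OnceLock::new();
        let start = *START.get_or_init(Instant::now);
        start.elapsed().as_nanos() as u64
    }
}

/// Steps 1-4 above: sample the counter across a known sleep and
/// derive the cycles-per-nanosecond conversion factor.
fn calibrate_cycles_per_nano() -> f64 {
    let wall_start = Instant::now();
    let c0 = read_cycles();                        // 1. read the counter
    std::thread::sleep(Duration::from_millis(50)); // 2. sleep a known duration
    let c1 = read_cycles();                        // 3. read again
    // Use the measured wall time, not the requested sleep: sleeps overshoot
    let elapsed_ns = wall_start.elapsed().as_nanos() as f64;
    (c1 - c0) as f64 / elapsed_ns                  // 4. cycles per nanosecond
}

fn main() {
    let cpn = calibrate_cycles_per_nano();
    assert!(cpn > 0.0, "counter should advance across the sleep");
    println!("~{:.3} cycles per nanosecond", cpn);
}
```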

ARM Counter Frequency

On AArch64, read the counter frequency from hardware:

let freq: u64;
unsafe { std::arch::asm!("mrs {}, cntfrq_el0", out(reg) freq) };
let ns_per_tick = 1_000_000_000.0 / freq as f64;

Best Practices

Do

  • Use read_start() before and read_end() after the measured operation
  • Export histograms at natural batch boundaries (e.g., every 1000 samples)
  • Keep the background aggregator on a separate thread
  • Pre-allocate histograms at startup
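The batch-boundary rule can be sketched with a simple counter, so the export decision never involves a wall-clock read on the hot path (BatchedRecorder is a hypothetical wrapper, not part of arbiter_engine):

```rust
/// Export every BATCH samples so export cost is amortized across
/// recordings instead of being triggered by time checks.
const BATCH: u64 = 1_000;

struct BatchedRecorder {
    recorded: u64,
    exported_batches: u64,
}

impl BatchedRecorder {
    fn record(&mut self, _nanos: u64) {
        self.recorded += 1;
        if self.recorded % BATCH == 0 {
            self.export(); // natural batch boundary: every 1000 samples
        }
    }

    fn export(&mut self) {
        // Real code would swap buffers and send to the aggregator here
        self.exported_batches += 1;
    }
}

fn main() {
    let mut r = BatchedRecorder { recorded: 0, exported_batches: 0 };
    for i in 0..2_500u64 {
        r.record(i);
    }
    assert_eq!(r.exported_batches, 2); // boundaries at samples 1000 and 2000
    println!("{} batches exported", r.exported_batches);
}
```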

Don't

  • Compute quantiles on the hot path (O(N) operation)
  • Share histograms between threads (use per-thread + aggregation)
  • Call system time functions (Instant::now()) on the hot path
  • Record every sample to a channel (use batched histogram export)