Performance Monitoring¶
Low-overhead instrumentation for microsecond-level latency measurement.
Overview¶
Arbiter-Bot uses a multi-layer performance monitoring architecture optimized for low-latency trading:
| Layer | Use Case | Overhead | Tools |
|---|---|---|---|
| Hot Path | Tick-to-trade timing | ~5ns | TSC, HDR Histogram |
| Warm Path | Position tracking, connection management | ~100ns | tracing spans |
| System | Development profiling | Variable | perf, flamegraph |
Hot Path Instrumentation¶
The hot path uses platform-specific CPU instructions for nanosecond-scale precision timing with minimal overhead.
Platform Support¶
| Architecture | Instruction | Precision | Overhead |
|---|---|---|---|
| x86_64 | RDTSCP + LFENCE | ~1ns | ~5ns |
| AArch64 (ARM) | CNTVCT_EL0 + ISB | ~40ns | ~10ns |
| Other | std::time::Instant | ~20ns | ~30ns |
Basic Usage¶
```rust
use arbiter_engine::metrics::timing::{read_start, read_end, elapsed_nanos};

// Start timing
let start = read_start();

// ... your hot path code ...

// End timing
let end = read_end();

// Get elapsed nanoseconds
let nanos = elapsed_nanos(start, end);
println!("Operation took {} ns", nanos);
```
How TSC Timing Works¶
On x86_64, timing uses the Time Stamp Counter (TSC) with serializing instructions:
Start timing:

```rust
// LFENCE ensures all prior instructions complete before reading TSC
_mm_lfence();
let tsc = _rdtsc();
```

End timing:

```rust
// RDTSCP waits for all prior instructions to execute before reading TSC
let tsc = __rdtscp(&mut aux);
// LFENCE prevents subsequent instructions from executing early
_mm_lfence();
```
This prevents CPU instruction reordering from affecting measurements.
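The two readings above can be wrapped into a pair of helpers mirroring the `read_start()`/`read_end()` API shown earlier. This is a hedged sketch, not the engine's actual implementation: the x86_64 path uses the real `core::arch` intrinsics, while the fallback for other architectures (a plain wall-clock read) is an assumption added here for portability.

```rust
/// Fence, then read the TSC: earlier work cannot leak past the read.
#[cfg(target_arch = "x86_64")]
fn read_start() -> u64 {
    use core::arch::x86_64::{_mm_lfence, _rdtsc};
    unsafe {
        _mm_lfence(); // prior instructions retire before the TSC read
        _rdtsc()
    }
}

/// Read the TSC (RDTSCP waits for prior instructions), then fence.
#[cfg(target_arch = "x86_64")]
fn read_end() -> u64 {
    use core::arch::x86_64::{__rdtscp, _mm_lfence};
    let mut aux = 0u32;
    let tsc = unsafe { __rdtscp(&mut aux) }; // waits for prior instructions
    unsafe { _mm_lfence() }; // later instructions cannot start before the read
    tsc
}

/// Portable fallback for other architectures (coarser, same shape).
#[cfg(not(target_arch = "x86_64"))]
fn read_start() -> u64 {
    use std::time::{SystemTime, UNIX_EPOCH};
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64
}

#[cfg(not(target_arch = "x86_64"))]
fn read_end() -> u64 {
    read_start()
}
```

On CPUs with an invariant TSC, `read_end() - read_start()` around a code region yields a cycle count that can be converted to nanoseconds after calibration (see TSC Calibration below).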
HDR Histogram Collection¶
Use HDR histograms to capture full latency distributions, including tail latencies (p99.99).
Thread-Local Histogram¶
Each hot path thread maintains its own histogram to avoid cross-thread contention:
```rust
use arbiter_engine::metrics::histogram::ThreadLocalHistogram;

// Create histogram (returns consumer for aggregation)
let (histogram, consumer) = ThreadLocalHistogram::new();

// Record latency in nanoseconds (hot path - no allocation)
histogram.record(nanos);

// Export periodically (swaps buffers, O(1))
if should_export {
    histogram.export();
}
```
Double-Buffering¶
The histogram uses double-buffering for lock-free export:
- Active buffer: Receives hot path recordings
- Spare buffer: Ready for swap on export
On export:

1. Clone active histogram for background thread
2. Swap active ↔ spare (O(1) pointer swap)
3. Reset new active buffer
4. Send clone to aggregator via SPSC ring buffer
```
Hot Path Thread                       Background Thread
┌──────────────┐                      ┌───────────────────┐
│  record()    │                      │ drain_and_merge() │
│  record()    │                      │                   │
│  record()    │                      │                   │
│  export()    │──── SPSC ───────────>│compute_quantiles()│
│              │     Ring             │                   │
│  record()    │     Buffer           │                   │
│  record()    │                      │                   │
└──────────────┘                      └───────────────────┘
```
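The export steps above can be sketched with standard-library types. This is an illustrative stand-in, not the real `ThreadLocalHistogram`: the "histogram" is a log2-bucketed `Vec<u64>`, and `std::sync::mpsc` stands in for the SPSC ring buffer.

```rust
use std::mem;
use std::sync::mpsc;

/// Illustrative double-buffered recorder (names and layout are assumptions).
struct DoubleBuffered {
    active: Vec<u64>,           // receives hot path recordings
    spare: Vec<u64>,            // zeroed, ready to swap in on export
    tx: mpsc::Sender<Vec<u64>>, // stand-in for the SPSC ring to the aggregator
}

impl DoubleBuffered {
    fn new(buckets: usize) -> (Self, mpsc::Receiver<Vec<u64>>) {
        let (tx, rx) = mpsc::channel();
        (Self { active: vec![0; buckets], spare: vec![0; buckets], tx }, rx)
    }

    /// Hot path: bump one bucket, no allocation.
    fn record(&mut self, nanos: u64) {
        let idx = (64 - nanos.leading_zeros() as usize).min(self.active.len() - 1);
        self.active[idx] += 1;
    }

    /// Swap buffers (O(1)), then clone, reset, and send off the recording path.
    fn export(&mut self) {
        mem::swap(&mut self.active, &mut self.spare); // recordings now hit the fresh buffer
        let snapshot = self.spare.clone();            // old active, holding the data
        self.spare.iter_mut().for_each(|c| *c = 0);   // zero it for the next swap
        let _ = self.tx.send(snapshot);               // hand off to the aggregator
    }
}
```

The key property is that `record()` never blocks on the exporter: only the pointer swap happens on the hot path, while cloning and resetting occur against the retired buffer.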
Background Aggregation¶
The MetricsCollector merges histograms from multiple threads:
```rust
use arbiter_engine::metrics::collector::MetricsCollector;
use std::time::Duration;

// Collect consumers from all hot path threads
let consumers = vec![consumer1, consumer2, consumer3];
let mut collector = MetricsCollector::new(consumers);

// Periodically drain and merge (background thread)
loop {
    collector.drain_and_merge();
    if let Some(stats) = collector.compute_quantiles() {
        println!("p50: {} ns", stats.p50);
        println!("p99: {} ns", stats.p99);
        println!("p99.99: {} ns", stats.p99_99);
        println!("samples: {}", stats.sample_count);
    }
    std::thread::sleep(Duration::from_secs(1));
}
```
Quantile Statistics¶
The QuantileStats struct provides:
| Field | Description |
|---|---|
| `p50` | Median latency |
| `p99` | 99th percentile |
| `p99_9` | 99.9th percentile |
| `p99_99` | 99.99th percentile (tail latency) |
| `min` | Minimum observed |
| `max` | Maximum observed |
| `mean` | Average latency |
| `sample_count` | Total samples |
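To make the field semantics concrete, here is a minimal nearest-rank quantile over a sorted sample buffer. The helper name is hypothetical; the real collector computes quantiles from the merged HDR histogram, which is cheaper than sorting raw samples.

```rust
/// Nearest-rank quantile over a sorted buffer (illustrative helper).
/// q is in [0, 1]; e.g. q = 0.9999 corresponds to the p99_99 field.
fn quantile(sorted: &[u64], q: f64) -> u64 {
    assert!(!sorted.is_empty() && (0.0..=1.0).contains(&q));
    // Nearest-rank: smallest index covering at least q of the samples.
    let rank = ((q * sorted.len() as f64).ceil() as usize).max(1);
    sorted[rank - 1]
}
```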
Key Performance Indicators¶
Monitor these KPIs in production:
| KPI | Target | Alert Threshold |
|---|---|---|
| Tick-to-Trade p50 | < 100 μs | > 500 μs |
| Tick-to-Trade p99 | < 500 μs | > 2 ms |
| Tick-to-Trade p99.99 | < 2 ms | > 10 ms |
| Event Loop Lag | < 1 ms | > 5 ms |
| Missed Exports | 0 | > 10/min |
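A sketch of how the tick-to-trade alert thresholds from the table could be checked against reported stats. The struct and function names are assumptions for illustration, not part of the engine's API; values are in nanoseconds.

```rust
/// Illustrative tick-to-trade snapshot (nanoseconds).
struct TickToTrade {
    p50: u64,
    p99: u64,
    p99_99: u64,
}

/// Return the names of breached alert thresholds from the KPI table.
fn alerts(t: &TickToTrade) -> Vec<&'static str> {
    let mut out = Vec::new();
    if t.p50 > 500_000 {
        out.push("tick_to_trade_p50"); // > 500 μs
    }
    if t.p99 > 2_000_000 {
        out.push("tick_to_trade_p99"); // > 2 ms
    }
    if t.p99_99 > 10_000_000 {
        out.push("tick_to_trade_p99_99"); // > 10 ms
    }
    out
}
```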
Warm Path Instrumentation¶
For less latency-sensitive operations, use the tracing crate:
```rust
use tracing::{instrument, info_span, Instrument};

#[instrument(skip(self), fields(market_id = %update.market_id))]
pub async fn process_market_update(&mut self, update: MarketUpdate) {
    let span = info_span!("arbitrage_detection");
    async {
        // Detection logic
    }
    .instrument(span)
    .await
}
```
Configure a non-blocking subscriber to prevent I/O from blocking the event loop. Note that writing directly to `std::io::stderr` still blocks; one option is to route it through the `tracing-appender` crate's `non_blocking` wrapper (keep the returned guard alive for the life of the program):

```rust
use tracing_subscriber::EnvFilter;

let (writer, _guard) = tracing_appender::non_blocking(std::io::stderr());
tracing_subscriber::fmt()
    .with_env_filter(EnvFilter::from_default_env())
    .with_writer(writer) // log I/O happens on a background worker thread
    .init();
```
System-Level Profiling¶
For deep performance analysis during development:
CPU Profiling with perf¶
```bash
# Record CPU samples
perf record -g --call-graph dwarf target/release/arbiter-engine

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
```
Cache Analysis¶
```bash
# L1 cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads target/release/arbiter-engine

# LLC (Last Level Cache) misses
perf stat -e LLC-load-misses,LLC-loads target/release/arbiter-engine
```
Tools Reference¶
| Tool | Purpose | When to Use |
|---|---|---|
| `perf` | CPU profiling, cache analysis | Performance tuning |
| `flamegraph` | Visual call stack analysis | Identifying hot functions |
| `coz` | Causal profiling | Prioritizing optimizations |
| `lttng` | Kernel/user space analysis | Investigating syscalls |
Implementation Details¶
Histogram Configuration¶
The HDR histogram is configured for latency measurement:
```rust
const MAX_LATENCY_NS: u64 = 60_000_000_000; // 60 seconds max
const SIGFIGS: u8 = 3;                      // 3 significant figures

Histogram::new_with_max(MAX_LATENCY_NS, SIGFIGS)
```
This provides:

- Range: 0 to 60 seconds
- Precision: 0.1% relative error
- Memory: ~18 KB per histogram
TSC Calibration¶
The TSC runs at a fixed frequency on modern CPUs. For accurate nanosecond conversion:
```rust
// Approximate conversion: assumes a ~3 GHz TSC (3 cycles per nanosecond)
let nanos = cycles / 3;
```

For production accuracy, calibrate at startup:

1. Read TSC
2. Sleep for a known duration
3. Read TSC again
4. Compute cycles per nanosecond
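The startup calibration steps can be sketched as a small helper. `read_cycles` is a placeholder for an RDTSC wrapper (any monotonic counter works); the function name is an assumption, not the engine's API.

```rust
use std::time::{Duration, Instant};

/// Calibrate an arbitrary monotonic cycle counter against wall-clock time:
/// count how many ticks elapse over a known interval.
fn cycles_per_nano(read_cycles: impl Fn() -> u64, interval: Duration) -> f64 {
    let t0 = Instant::now();
    let c0 = read_cycles();
    std::thread::sleep(interval);
    let c1 = read_cycles();
    let elapsed_ns = t0.elapsed().as_nanos() as f64;
    (c1 - c0) as f64 / elapsed_ns
}
```

A longer calibration interval (or the median of several runs) reduces the error introduced by sleep jitter and the small skew between the counter reads and the `Instant` reads.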
ARM Counter Frequency¶
On AArch64, read the counter frequency from hardware:
```rust
let freq: u64;
unsafe { std::arch::asm!("mrs {}, cntfrq_el0", out(reg) freq) };
let ns_per_tick = 1_000_000_000.0 / freq as f64;
```
Best Practices¶
Do¶
- Use `read_start()` before and `read_end()` after the measured operation
- Export histograms at natural batch boundaries (e.g., every 1000 samples)
- Keep the background aggregator on a separate thread
- Pre-allocate histograms at startup
Don't¶
- Compute quantiles on the hot path (O(N) operation)
- Share histograms between threads (use per-thread + aggregation)
- Call system time functions (`Instant::now()`) on the hot path where avoidable
- Record every sample to a channel (use batched histogram export)
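The batching guidance above can be sketched as a recorder that flushes only at a sample-count boundary instead of sending each sample through a channel. The `Recorder` type is illustrative; in the real system the flush would be the histogram's O(1) buffer swap plus an SPSC send.

```rust
/// Export every BATCH samples, not per sample (illustrative).
const BATCH: u64 = 1000;

struct Recorder {
    buf: Vec<u64>, // stand-in for the pre-allocated histogram
    recorded: u64,
    exported: u64,
}

impl Recorder {
    fn new() -> Self {
        Self { buf: Vec::with_capacity(BATCH as usize), recorded: 0, exported: 0 }
    }

    fn record(&mut self, nanos: u64) {
        self.buf.push(nanos);
        self.recorded += 1;
        if self.recorded % BATCH == 0 {
            self.export(); // natural batch boundary
        }
    }

    fn export(&mut self) {
        self.exported += self.buf.len() as u64;
        self.buf.clear(); // stand-in for the O(1) buffer swap + send
    }
}
```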
Related Documentation¶
- ADR-012: Performance Monitoring - Architecture decision
- Low-Latency Tuning - Thread affinity, busy-polling, memory pools
- CLI Reference - Performance-related flags