# ADR 012: Performance Monitoring Architecture

## Status
Accepted
## Context
Arbiter-Bot is a low-latency statistical arbitrage engine where microsecond-level performance is critical. Arbitrage opportunities exist for milliseconds; slow execution means missed profits or adverse fills. We need comprehensive performance observability that:
- Captures microsecond-level latency distributions, not just averages
- Minimizes measurement overhead on the hot path
- Identifies bottlenecks via profiling and metrics
- Operates in production without degrading trading performance
This ADR focuses on how to measure and observe performance. Architectural optimizations (thread affinity, memory pools, busy-polling) are covered in ADR-013.
## Decision
Implement a multi-layer performance monitoring architecture with distinct strategies based on latency tolerance.
### Layer 1: Hot Path Instrumentation (Microsecond-Sensitive)

The hot path (tick-to-trade) uses non-blocking instrumentation that allocates nothing per sample:

| Component | Tool | Rationale |
|---|---|---|
| Timing | RDTSCP + LFENCE | Serializing reads prevent CPU reordering; sub-nanosecond tick granularity |
| Latency Recording | `hdrhistogram` (per-thread) | Captures the full distribution including p99.99; merged periodically |
| Histogram Export | SPSC ring buffer (snapshots only) | Background thread consumes histogram snapshots, not individual samples |
```rust
use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};

/// Cache-line aligned to prevent false sharing
#[repr(C, align(64))]
pub struct HotPathMetrics {
    /// Pre-computed nanoseconds per TSC tick (avoids hot-path division)
    ns_per_tick: f64,
    _pad: [u8; 56], // Pad to cache line
}

/// Per-thread histogram with double-buffering (no allocation on hot path)
pub struct ThreadLocalHistogram {
    /// Active histogram for recording (hot path writes here)
    active: hdrhistogram::Histogram<u64>,
    /// Spare histogram for swap (pre-allocated, avoids hot-path allocation)
    spare: hdrhistogram::Histogram<u64>,
    sample_count: u64,
}

impl ThreadLocalHistogram {
    pub fn new() -> Self {
        Self {
            active: hdrhistogram::Histogram::new(3).unwrap(),
            spare: hdrhistogram::Histogram::new(3).unwrap(),
            sample_count: 0,
        }
    }
}
impl HotPathMetrics {
    /// `ns_per_tick` comes from one-time TSC calibration at startup
    pub fn new(ns_per_tick: f64) -> Self {
        Self { ns_per_tick, _pad: [0; 56] }
    }

    /// Serializing timestamp read at start of measurement
    #[inline(always)]
    pub fn record_start() -> u64 {
        unsafe {
            _mm_lfence(); // Serialize: ensure prior instructions complete
            _rdtsc()
        }
    }

    /// Serializing timestamp read at end of measurement
    #[inline(always)]
    pub fn record_end_tsc() -> u64 {
        let mut aux: u32 = 0;
        unsafe {
            let tsc = __rdtscp(&mut aux); // RDTSCP waits for prior instructions to complete
            _mm_lfence(); // Prevent subsequent instructions from moving up
            tsc
        }
    }
}
impl ThreadLocalHistogram {
    /// Hot path: record to active histogram (no cross-thread ops)
    #[inline(always)]
    pub fn record(&mut self, start_tsc: u64, metrics: &HotPathMetrics) {
        let end_tsc = HotPathMetrics::record_end_tsc();
        let elapsed_tsc = end_tsc.wrapping_sub(start_tsc);
        // Multiplication (not division) for conversion
        let nanos = (elapsed_tsc as f64 * metrics.ns_per_tick) as u64;
        // Record to thread-local active histogram (no lock, no contention)
        let _ = self.active.record(nanos);
        self.sample_count += 1;
    }

    /// Called by the OWNING THREAD at natural batch boundaries.
    /// Swaps active/spare histograms (O(1), no allocation).
    /// IMPORTANT: does NOT compute quantiles on the hot path (an O(N) operation).
    #[inline]
    pub fn maybe_export_histogram(
        &mut self,
        histogram_tx: &mut rtrb::Producer<ExportedHistogram>,
        export_interval_samples: u64,
    ) {
        if self.sample_count >= export_interval_samples {
            // Reset spare histogram (fast, just clears counts)
            self.spare.reset();
            // Swap active <-> spare (shallow struct swap, no allocation)
            std::mem::swap(&mut self.active, &mut self.spare);
            let count = self.sample_count;
            self.sample_count = 0;
            // Clone histogram to send to background thread.
            // Trade-off: clone() is O(bucket_count) but happens infrequently
            // (every N samples); hot-path record() remains O(1). For N = 1000,
            // the export overhead is amortized. Alternative: triple-buffering
            // avoids the clone but adds complexity.
            //
            // NOTE: this clone is a known deviation from "zero allocation on
            // every sample". The allocation happens once per
            // export_interval_samples (e.g., every 1000 samples). For stricter
            // requirements, use triple-buffering or raw sample ring buffers.
            if histogram_tx
                .push(ExportedHistogram { histogram: self.spare.clone(), count })
                .is_err()
            {
                // Increment a dropped-export counter for observability.
                // In production this is an atomic counter exposed to metrics:
                // DROPPED_EXPORTS.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}
/// Raw histogram sent to background aggregator (quantiles computed there, not on hot path)
pub struct ExportedHistogram {
    pub histogram: hdrhistogram::Histogram<u64>,
    pub count: u64,
}

/// Background aggregator computes quantiles (O(N) is acceptable here)
pub fn aggregate_histograms(
    rx: &mut rtrb::Consumer<ExportedHistogram>,
) -> Option<HistogramSnapshot> {
    let mut merged = hdrhistogram::Histogram::<u64>::new(3).unwrap();
    let mut total_count = 0u64;
    while let Ok(exported) = rx.pop() {
        merged.add(&exported.histogram).ok();
        total_count += exported.count;
    }
    if total_count > 0 {
        Some(HistogramSnapshot {
            p50: merged.value_at_quantile(0.50),
            p99: merged.value_at_quantile(0.99),
            p999: merged.value_at_quantile(0.999),
            p9999: merged.value_at_quantile(0.9999),
            count: total_count,
        })
    } else {
        None
    }
}

/// Final snapshot with computed quantiles (computed by background thread)
pub struct HistogramSnapshot {
    pub p50: u64,
    pub p99: u64,
    pub p999: u64,
    pub p9999: u64,
    pub count: u64,
}
```
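Putting the pieces together, a hot-path call site looks roughly like the sketch below. `run_hot_loop` and `process_one_tick` are hypothetical names standing in for the measured tick-to-trade work, and the ring capacity, export interval, and `eprintln!` sink are illustrative choices rather than tuned values:

```rust
const EXPORT_INTERVAL_SAMPLES: u64 = 1_000; // illustrative, not a tuned value

fn run_hot_loop(metrics: HotPathMetrics) {
    // SPSC ring buffer: the hot path produces, a background thread consumes
    let (mut tx, mut rx) = rtrb::RingBuffer::<ExportedHistogram>::new(64);
    let mut hist = ThreadLocalHistogram::new();

    // Background thread drains exports and computes quantiles off the hot path
    std::thread::spawn(move || loop {
        if let Some(snap) = aggregate_histograms(&mut rx) {
            // Stand-in sink; production code updates Prometheus gauges instead
            eprintln!("p99 = {} ns over {} samples", snap.p99, snap.count);
        }
        std::thread::sleep(std::time::Duration::from_secs(1));
    });

    loop {
        let start = HotPathMetrics::record_start();
        process_one_tick(); // hypothetical tick-to-trade work being measured
        hist.record(start, &metrics);
        hist.maybe_export_histogram(&mut tx, EXPORT_INTERVAL_SAMPLES);
    }
}
```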
Key Design Decisions:
- Hot path records to histogram only - no ring buffer push per sample (avoids cache contention)
- Owning thread exports raw histograms - swaps histogram with fresh one (O(1)) and sends to background; no quantile computation on hot path (value_at_quantile is O(N))
- Background thread computes quantiles - expensive O(N) iteration happens off the critical path
- Pair LFENCE with RDTSC/RDTSCP so instructions cannot be reordered across the measurement boundaries
- Pre-compute `ns_per_tick` at startup - multiply instead of divide on the hot path (see the calibration sketch below)
- SPSC ring buffer (rtrb) for histogram export to background aggregator
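The `ns_per_tick` constant comes from a one-shot calibration at startup. A minimal sketch, assuming a 200 ms calibration window is acceptable (production code would also pin the calibrating thread; `calibrate_ns_per_tick` is a name introduced here):

```rust
use std::time::{Duration, Instant};

/// One-shot TSC calibration at startup: the ratio of elapsed wall-clock
/// nanoseconds to elapsed TSC ticks over a fixed window. The 200 ms window
/// is an illustrative choice; longer windows reduce relative error.
#[cfg(target_arch = "x86_64")]
pub fn calibrate_ns_per_tick() -> f64 {
    let wall_start = Instant::now();
    let tsc_start = unsafe { std::arch::x86_64::_rdtsc() };
    std::thread::sleep(Duration::from_millis(200));
    let tsc_end = unsafe { std::arch::x86_64::_rdtsc() };
    let wall_ns = wall_start.elapsed().as_nanos() as f64;
    wall_ns / tsc_end.wrapping_sub(tsc_start) as f64
}
```

With the constructor above, startup reduces to `let metrics = HotPathMetrics::new(calibrate_ns_per_tick());`.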
### Layer 2: Warm Path Instrumentation (Millisecond-Tolerant)

For operations that can tolerate slight overhead (connection management, position tracking):

| Component | Tool | Rationale |
|---|---|---|
| Structured Logging | `tracing` crate | Async-safe, span-based context |
| Subscriber | Non-blocking writer | Prevents I/O blocking |
| Metrics Export | Prometheus exposition format | Industry standard, scrape-based |
```rust
use tracing::{info_span, instrument, Instrument};

impl ArbiterActor {
    #[instrument(skip(self), fields(market_id = %update.market_id))]
    pub async fn process_market_update(&mut self, update: MarketUpdate) {
        let span = info_span!("arbitrage_detection");
        async {
            // Detection logic
        }
        .instrument(span)
        .await
    }
}
```
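The non-blocking writer from the table is typically wired up once at startup. A sketch using `tracing-subscriber` and `tracing-appender` (the log directory and hourly rotation are illustrative choices, and `init_warm_path_tracing` is a name introduced here):

```rust
use tracing_appender::non_blocking::WorkerGuard;

/// Install a non-blocking subscriber so warm-path spans never block on file I/O.
/// The returned guard must be kept alive for the process lifetime, or buffered
/// events are dropped on exit.
fn init_warm_path_tracing() -> WorkerGuard {
    let file_appender = tracing_appender::rolling::hourly("/var/log/arbiter", "warm.log");
    let (writer, guard) = tracing_appender::non_blocking(file_appender);
    tracing_subscriber::fmt()
        .with_writer(writer)
        .init();
    guard
}
```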
### Layer 3: System-Level Profiling (Development/Diagnostics)

For deep performance analysis during development and incident investigation:

| Tool | Purpose | When Used |
|---|---|---|
| `perf` | CPU profiling, cache miss analysis | Development, post-incident |
| `coz` | Causal profiling for optimization prioritization | Performance tuning sprints |
| `lttng` | Kernel/user-space transition analysis | Investigating syscall overhead |
| `flamegraph` | Visual call stack analysis | Identifying hot functions |
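Of these, only coz needs in-source annotations. A sketch using the `coz` crate's progress-point macro (the progress-point name and function are illustrative; the binary is then run under the coz profiler):

```rust
/// Causal-profiling progress point: coz estimates how much speeding up each
/// other code region would increase the rate at which this line is reached.
fn on_order_submitted() {
    coz::progress!("order_submitted"); // illustrative progress-point name
}
```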
### Platform Portability

TSC is x86_64-specific. For ARM and development environments, an equivalent backend is selected at compile time:
```rust
#[cfg(target_arch = "x86_64")]
mod timing {
    use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};

    #[inline(always)]
    pub fn read_start() -> u64 {
        unsafe {
            _mm_lfence(); // Serialize: ensure prior instructions complete
            _rdtsc()
        }
    }

    #[inline(always)]
    pub fn read_end() -> u64 {
        let mut aux: u32 = 0;
        unsafe {
            let tsc = __rdtscp(&mut aux); // Waits for prior instructions to complete
            _mm_lfence(); // Prevent subsequent instructions from moving up
            tsc
        }
    }
}
#[cfg(target_arch = "aarch64")]
mod timing {
    use std::sync::OnceLock;

    /// Timer frequency in Hz (read from CNTFRQ_EL0 at startup)
    static FREQ_HZ: OnceLock<u64> = OnceLock::new();

    /// Read the timer frequency from the hardware register
    fn get_freq_hz() -> u64 {
        *FREQ_HZ.get_or_init(|| {
            let freq: u64;
            unsafe { std::arch::asm!("mrs {}, cntfrq_el0", out(reg) freq) };
            freq
        })
    }

    /// Nanoseconds per tick (pre-computed for fast conversion)
    pub fn ns_per_tick() -> f64 {
        1_000_000_000.0 / get_freq_hz() as f64
    }

    #[inline(always)]
    pub fn read_start() -> u64 {
        // Single asm block prevents the compiler from reordering ISB and MRS
        let cnt: u64;
        unsafe {
            std::arch::asm!(
                "isb",                // Ensure prior instructions complete
                "mrs {}, cntvct_el0", // Read the virtual counter
                out(reg) cnt,
                options(nostack, nomem)
            );
        }
        cnt
    }

    #[inline(always)]
    pub fn read_end() -> u64 {
        // Single asm block prevents the compiler from reordering
        let cnt: u64;
        unsafe {
            std::arch::asm!(
                "isb",                // Ensure measured code has completed
                "mrs {}, cntvct_el0", // Read the virtual counter
                out(reg) cnt,
                options(nostack, nomem)
            );
        }
        cnt
    }
}
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
mod timing {
    use std::sync::OnceLock;
    use std::time::Instant;

    /// Static epoch for monotonic timestamps (initialized on first use)
    static EPOCH: OnceLock<Instant> = OnceLock::new();

    fn epoch() -> &'static Instant {
        EPOCH.get_or_init(Instant::now)
    }

    // Fallback: ~20-30 ns overhead via vDSO, but portable; returns ns directly
    pub fn read_start() -> u64 {
        epoch().elapsed().as_nanos() as u64
    }

    pub fn read_end() -> u64 {
        epoch().elapsed().as_nanos() as u64
    }
}
```
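Call sites are identical across backends. A minimal sketch of the shared measurement helper (`measure_ticks` is a name introduced here; on x86_64 and aarch64 the returned tick delta still needs the ns-per-tick conversion, while the fallback already returns nanoseconds):

```rust
/// Measure one invocation of `f` in backend-specific ticks.
#[inline]
fn measure_ticks<F: FnOnce()>(f: F) -> u64 {
    let start = timing::read_start();
    f();
    timing::read_end().wrapping_sub(start)
}
```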
### Key Performance Indicators (KPIs)
| KPI | Target | Alert Threshold |
|---|---|---|
| Tick-to-Trade p50 | < 100 μs | > 500 μs |
| Tick-to-Trade p99 | < 500 μs | > 2 ms |
| Tick-to-Trade p99.99 | < 2 ms | > 10 ms |
| Event Loop Lag | < 1 ms | > 5 ms |
| Lock Contention Rate | < 0.1% | > 1% |
| CPU Cache Miss Rate | < 5% | > 15% |
| Kernel Context Switches/sec | < 1000 | > 5000 |
| Missed Snapshot Intervals | 0 | > 10/min |
### Metrics Export Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                       Hot Path Thread                        │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐             │
│   │ RDTSCP   │────>│Per-Thread│────>│ Export   │             │
│   │ + LFENCE │     │Histogram │     │Histogram │             │
│   └──────────┘     └──────────┘     └──────────┘             │
│                                          │                   │
│   (Owning thread calls maybe_export_histogram() at batch     │
│    boundaries - no cross-thread histogram access)            │
│                                          │                   │
└──────────────────────────────────────────┼───────────────────┘
                                           │ (SPSC ring buffer)
                                           ▼
┌──────────────────────────────────────────────────────────────┐
│                      Metrics Aggregator                      │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐             │
│   │ Merged   │     │ Dropped  │     │ Gauges   │             │
│   │Histogram │     │ Counter  │     │          │             │
│   └──────────┘     └──────────┘     └──────────┘             │
│                                │                             │
└────────────────────────────────┼─────────────────────────────┘
                                 │
                  ┌──────────────┼──────────────┐
                  ▼              ▼              ▼
            ┌──────────┐    ┌─────────┐    ┌─────────┐
            │Prometheus│    │ Grafana │    │AlertMgr │
            │  Scrape  │    │Dashboard│    │         │
            └──────────┘    └─────────┘    └─────────┘
```
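The aggregator box above reduces to a plain loop on the background thread. A sketch using the `prometheus` crate's gauge API (the metric names, 1 s drain interval, and `aggregator_loop` name are illustrative):

```rust
use prometheus::register_gauge;

/// Background aggregator: drain histogram exports once per second, compute
/// quantiles off the hot path, and expose them as Prometheus gauges.
fn aggregator_loop(mut rx: rtrb::Consumer<ExportedHistogram>) {
    let p99 = register_gauge!("tick_to_trade_p99_ns", "p99 tick-to-trade latency (ns)")
        .expect("metric registration");
    let p9999 = register_gauge!("tick_to_trade_p9999_ns", "p99.99 tick-to-trade latency (ns)")
        .expect("metric registration");
    loop {
        if let Some(snapshot) = aggregate_histograms(&mut rx) {
            p99.set(snapshot.p99 as f64);
            p9999.set(snapshot.p9999 as f64);
        }
        std::thread::sleep(std::time::Duration::from_secs(1));
    }
}
```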
### Implementation Phases

| Phase | Scope | Deliverables |
|---|---|---|
| 1 | Hot path metrics | TSC timing with serialization, per-thread HDR histograms, SPSC ring buffers |
| 2 | Warm path tracing | `tracing` integration, non-blocking subscriber |
| 3 | Metrics export | Prometheus endpoint, Grafana dashboards |
| 4 | Alerting | Alert rules for KPI thresholds |
## Consequences

### Positive
- Sub-microsecond measurement precision via serialized TSC timestamps
- Full latency distribution visibility with HDR histograms (p99.99)
- Production-safe profiling with non-blocking instrumentation
- Actionable optimization data via causal profiling
- No hot-path contention - histogram snapshots exported asynchronously
### Negative
- Increased complexity in metrics infrastructure
- Platform-specific code (TSC/RDTSCP is x86-specific; ARM needs CNTVCT_EL0)
- Memory overhead from histograms (one per thread)
- Operational burden of maintaining dashboards and alerts
### Risks
| Risk | Mitigation |
|---|---|
| TSC frequency varies across cores | Calibrate at startup; pin threads to prevent migration (see ADR-013) |
| Missed snapshot intervals | Size snapshot interval appropriately (1s default); alert on missed intervals |
| Histogram memory per thread | Bound the histogram range (e.g., 1 μs - 10 s) and limit significant digits (see the sketch below) |
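The bounded-range mitigation corresponds to constructing the histograms with explicit bounds. A sketch (the 1 μs to 10 s range mirrors the mitigation above; two significant figures is an illustrative precision choice):

```rust
use hdrhistogram::Histogram;

/// Bound the trackable range to 1 μs - 10 s (values in ns) and limit precision
/// to 2 significant figures, capping per-thread memory at a few tens of KiB.
fn bounded_histogram() -> Histogram<u64> {
    Histogram::<u64>::new_with_bounds(1_000, 10_000_000_000, 2)
        .expect("valid histogram bounds")
}
```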
## Alternatives Considered

### Alternative 1: Standard `std::time::Instant`
- Pro: Cross-platform, simple API
- Con: On Linux it uses the vDSO (no syscall) with ~20-30 ns overhead; still slower than TSC (~5 ns)
- Decision: Use for warm path and fallback; TSC for hot path on x86_64
### Alternative 2: Full `tracing` Instrumentation Everywhere
- Pro: Consistent API, excellent tooling
- Con: Even with a non-blocking subscriber, each span costs measurable overhead (~100 ns)
- Decision: Use for warm path only; TSC + ring buffer for hot path
### Alternative 3: Sampling-Based Profiling Only
- Pro: Very low overhead
- Con: Misses tail latency events, insufficient for p99.99 analysis
- Decision: Use continuous recording with HDR histograms instead