ADR 017: Automated Market Discovery and Matching¶
Status¶
| Component | Status |
|---|---|
| Phase 1: Text Similarity | ✅ Accepted & Implemented |
| Phase 2: Fingerprint Matching | ✅ Complete (24 tests) |
| Phase 3: Embedding Matching | ✅ Complete (15 tests) |
| Phase 4: LLM Verification | ✅ Complete (15 tests) |
| Phase 5: Feedback Learning | ✅ Complete (23 tests) |
| Operations & Deployment (docs) | ✅ Accepted (NFR-DISC requirements proposed) |
Note: Code samples in Phases 2-5 are illustrative designs, not approved implementations. Final implementation may differ based on validation results.
Revision Summary¶
Post-implementation testing revealed that pure text similarity (Jaccard + Levenshtein) is insufficient for cross-platform market matching. Real-world market pairs have low lexical similarity despite semantic equivalence:
| Market Pair | Jaccard Score |
|---|---|
| Kalshi: "Will Trump buy Greenland?" vs Polymarket: "Will the US acquire part of Greenland in 2026?" | 8.3% |
| Kalshi: "Will Washington win the 2026 Pro Football Championship?" vs Polymarket: "Super Bowl Champion 2026" | 9.1% |
These scores fall far below the 60% threshold, missing obvious matches. This ADR revision proposes a fingerprint-based matching approach aligned with industry best practices.
Implementation Notes (Phase 1 - Text Similarity - Completed)¶
Completed: 2026-01-22
Phase 1 (Text Similarity) implemented in 5 sub-phases with 48 tests (377 total tests passing):
| Sub-Phase | Focus | Tests | Status |
|---|---|---|---|
| 1a | Data Types & Storage | 12 | Complete |
| 1b | Text Matching Engine | 10 | Complete |
| 1c | Discovery API Clients | 8 | Complete |
| 1d | Scanner & Approval Workflow | 10 | Complete |
| 1e | CLI Integration | 8 | Complete |
Note: These are sub-phases of Phase 1 only. Phases 2-5 (Fingerprint, Embedding, LLM, Feedback) are Proposed - see roadmap below.
Council Verified: Each sub-phase passed LLM Council review with confidence >= 0.87
GitHub Issues: #41-#48
Implementation Notes (Phase 2a - Fingerprint & Entity Extraction)¶
Completed: 2026-01-23
Phase 2a (Fingerprint Foundation) implemented with 10 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/entity_extractor.rs | 5 | Rule-based NER for persons, crypto, prices, dates, events |
| src/discovery/fingerprint.rs | 5 | MarketFingerprint extraction with event type detection |
Key Components:
- EntityExtractor: Pattern-based extraction using regex for persons (Trump, Biden), crypto (BTC→Bitcoin), price targets ($100k), dates (Q2 2026), events (Super Bowl)
- MarketFingerprint: Structured representation with entity, event_type, metric spec, resolution window, outcome type
- EventType detection: PriceTarget, Election, Acquisition, Announcement, SportingEvent, EconomicIndicator
Dependencies Added:
- regex = "1.10" for pattern matching
- lazy_static = "1.4" for compiled regex caching
GitHub Issue: #49
Implementation Notes (Phase 2b - FingerprintMatcher & EntityIndex)¶
Completed: 2026-01-23
Phase 2b (Fingerprint Matching) implemented with 8 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/entity_index.rs | 3 | Inverted index for O(1) entity lookup, alias database |
| src/discovery/fingerprint_matcher.rs | 5 | Field-weighted scoring (entity 30%, date 25%, threshold 20%, outcome 15%, source 10%) |
Key Components:
- EntityIndex: Inverted index mapping entity names to market IDs
- AliasDatabase: Canonical name resolution (BTC→Bitcoin, Pro Football Championship→Super Bowl)
- FingerprintMatcher: Weighted scoring with configurable weights and thresholds
- ScoreBreakdown: Detailed per-field score transparency
Field Weights:

| Field | Weight | Comparison Method |
|---|---|---|
| Entity | 0.30 | Exact or alias match |
| Date | 0.25 | Year/quarter/month overlap |
| Threshold | 0.20 | Numeric comparison with 5% tolerance |
| Outcome | 0.15 | Binary vs multi-choice |
| Source | 0.10 | Event type match (placeholder) |
GitHub Issue: #50
Implementation Notes (Phase 2c - Integration & Validation)¶
Completed: 2026-01-23
Phase 2c (Integration & Validation) implemented with 6 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/integration_tests.rs | 6 | Golden pair validation, evaluation metrics |
| tests/golden_pairs.json | - | Test data for known market pairs |
Golden Test Pairs:

| ID | Expected Match | Min Score |
|---|---|---|
| greenland-acquisition | true | 0.60 |
| super-bowl-2026 | true | 0.55 |
| btc-100k | true | 0.75 |
| fed-rate-cut | true | 0.65 |
| different-threshold-negative | false | - |
| different-year-negative | false | - |
Evaluation Framework:
- GoldenTestData: Load and validate against golden pairs
- EvaluationResult: Precision, recall, F1 metrics
- Automated accuracy tracking for CI/CD
GitHub Issue: #51
Phase 2 Complete:
- FR-MD-011: Fingerprint extraction ✅
- FR-MD-012: Entity-based candidates ✅
- FR-MD-013: Weighted scoring ✅
- FR-MD-014: Rule-based NER ✅
Implementation Notes (Phase 3a - Embedding Infrastructure)¶
Completed: 2026-01-23
Phase 3a (Embedding Infrastructure) implemented with 9 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/embedding.rs | 6 | Embedding generation, cosine similarity, serialization |
| src/discovery/vector_store.rs | 3 | SQLite storage, nearest neighbor search |
Key Components:
- Embedding: Vector representation with cosine similarity
- HashEmbedder: Development/testing embedder using SHA-256
- VectorStore: SQLite-based embedding storage with nearest neighbor search
- Trait-based design for future ONNX/API backends
Embedding Interface:
pub trait Embedder: Send + Sync {
fn embed(&self, text: &str) -> Embedding;
fn embed_batch(&self, texts: &[&str]) -> Vec<Embedding>;
fn dimension(&self) -> usize;
}
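Cosine similarity is the comparison underlying `Embedding`; a standalone sketch (the actual type wraps its vector internally, so this free function is illustrative only):

```rust
/// Cosine similarity between two equal-length vectors:
/// dot(a, b) / (|a| * |b|). Returns 0.0 for zero-magnitude inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}
```

Identical vectors score 1.0, orthogonal vectors 0.0, which is why scores translate directly into match confidence.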
GitHub Issue: #52
Implementation Notes (Phase 3b - Hybrid Scoring)¶
Completed: 2026-01-23
Phase 3b (Hybrid Scoring) implemented with 6 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/hybrid_scorer.rs | 6 | Combined fingerprint + embedding + text scoring |
Key Components:
- HybridScorer: Combines all matching signals with configurable weights
- HybridWeights: α=0.50 fingerprint, β=0.40 embedding, γ=0.10 text
- HybridScoreBreakdown: Full transparency of each component
- calibrate_score(): Linear calibration (placeholder for isotonic regression)
Hybrid Formula:

score = 0.50 × fingerprint + 0.40 × embedding + 0.10 × text
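With the weights above (α=0.50, β=0.40, γ=0.10), the combination is a plain weighted sum; a minimal sketch assuming component scores are already normalized to [0, 1] (the real `HybridScorer` also emits a `HybridScoreBreakdown`):

```rust
/// Hybrid score: alpha * fingerprint + beta * embedding + gamma * text.
/// Weights sum to 1.0 so the result stays in [0, 1].
fn hybrid_score(fingerprint: f64, embedding: f64, text: f64) -> f64 {
    const ALPHA: f64 = 0.50; // fingerprint weight
    const BETA: f64 = 0.40;  // embedding weight
    const GAMMA: f64 = 0.10; // text similarity weight
    ALPHA * fingerprint + BETA * embedding + GAMMA * text
}
```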
GitHub Issue: #53
Phase 3 Complete:
- FR-MD-018: Embedding infrastructure ✅
- FR-MD-019: Vector storage ✅
- FR-MD-020: Batch embeddings ✅
- FR-MD-021: Hybrid scoring ✅
- FR-MD-022: Confidence calibration ✅
Implementation Notes (Phase 4 - LLM Verification)¶
Completed: 2026-01-23
Phase 4 (LLM Verification) implemented with 15 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/prompts.rs | 2 | Structured verification prompts, escape_for_prompt() |
| src/discovery/llm_verifier.rs | 6 | LLM response parsing, cost tracking with budget enforcement |
| src/discovery/escalation.rs | 7 | Tiered escalation (None → Haiku → Sonnet → Human), configurable thresholds |
Key Components:
- FilledPrompt: Structured prompt with metadata (estimated tokens)
- LlmVerifier: Builds prompts, parses JSON responses, tracks costs
- LlmCostTracker: Per-request cost tracking with $50/day default budget
- EscalationEngine: Tiered escalation based on score uncertainty, warnings, volume
Escalation Levels:

| Level | Trigger | Cost |
|---|---|---|
| None | Score ≥ 0.85, no warnings | $0 |
| Haiku | Score 0.60-0.85, minor warnings | ~$0.001/verification |
| Sonnet | Conflicting signals, major warnings | ~$0.01/verification |
| Human | LLM uncertain, resolution differences | Manual review |
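The tier selection can be sketched as a small decision function (the real `EscalationEngine` also weighs market volume; the names and signature here are illustrative, not the actual API):

```rust
#[derive(Debug, PartialEq)]
enum EscalationLevel {
    None,   // high confidence, no LLM needed
    Haiku,  // cheap LLM screening
    Sonnet, // stronger LLM for conflicting signals
    Human,  // manual review
}

/// Map a match score and warning severity to an escalation tier,
/// following the trigger conditions listed for each level.
fn escalate(
    score: f64,
    minor_warnings: bool,
    major_warnings: bool,
    resolution_differences: bool,
) -> EscalationLevel {
    if resolution_differences {
        EscalationLevel::Human // resolution differences always need a person
    } else if major_warnings {
        EscalationLevel::Sonnet // conflicting signals warrant a stronger model
    } else if score >= 0.85 && !minor_warnings {
        EscalationLevel::None // high confidence, no verification cost
    } else {
        EscalationLevel::Haiku // uncertain zone: cheap screening first
    }
}
```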
GitHub Issues: #54, #55
Phase 4 Complete:
- FR-MD-024: LLM verification prompt engineering ✅
- FR-MD-025: Cost-optimized LLM invocation ✅
- FR-MD-026: Automated escalation rules ✅
Implementation Notes (Phase 5 - Feedback Learning)¶
Completed: 2026-01-23
Phase 5 (Feedback Learning) implemented with 23 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/decision_log.rs | 7 | Decision logging, JSONL export, stratified sampling |
| src/discovery/alias_learner.rs | 5 | Alias learning with confidence, in-memory cache |
| src/discovery/weight_optimizer.rs | 6 | Gradient-free F1 optimization with bounded weights |
| src/discovery/evaluation_pipeline.rs | 5 | Orchestrates all Phase 4-5 components |
Key Components:
- DecisionLogger: SQLite-backed decision logging with full context preservation
- MatchDecision: Captures scores, escalation level, corrections, category
- TrainingExample: Exportable format with label (0/1) for model training
- AliasLearner: Learns aliases from corrections, confidence grows with confirmations
- WeightOptimizer: Gradient-free search for optimal fingerprint/embedding/text weights
- EvaluationPipeline: Single entry point that orchestrates scoring, escalation, logging, alias learning
Training Data Export:
// Export approved/rejected pairs for model training
let logger = DecisionLogger::new_in_memory()?;
let training_data = logger.export_to_jsonl()?;
// Returns JSONL with label, scores, category
Alias Learning Flow:
// Learn alias from human correction
learner.learn_from_correction("BTC", "Bitcoin", "human_approval")?;
// Confidence increases with each confirmation: 1.0 - (1.0 / (count + 1.0))
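The confidence formula in the comment above can be checked directly; a minimal sketch:

```rust
/// Alias confidence after `confirmations` human confirmations:
/// 1.0 - 1.0 / (confirmations + 1.0). Starts at 0.5 and approaches
/// (but never reaches) 1.0, so a single correction is never fully trusted.
fn alias_confidence(confirmations: u32) -> f64 {
    1.0 - 1.0 / (confirmations as f64 + 1.0)
}
```

One confirmation yields 0.5, three yield 0.75, and confidence grows monotonically from there.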
Weight Optimization:
// Optimize weights from historical decisions
let optimizer = WeightOptimizer::new();
let result = optimizer.optimize(&examples);
// Returns OptimizationResult with improved F1 score
GitHub Issues: #56, #57
Phase 5 Complete:
- FR-MD-028: Decision logging with context ✅
- FR-MD-029: Training data export pipeline ✅
- FR-MD-030: Automatic entity alias learning ✅
- FR-MD-031: Fingerprint weight optimization ✅
Post-Implementation Learnings¶
Problem 1: Text Similarity Misses Semantic Equivalence¶
The current algorithm uses score = 0.6 × Jaccard(tokens) + 0.4 × Levenshtein_normalized. This approach fails because:
- Synonym blindness: "Super Bowl" ≠ "Pro Football Championship" lexically
- Paraphrase blindness: "Trump buy" ≠ "US acquire" semantically equivalent but zero overlap
- Dilution by stop words: "Greenland" signal diluted by "Will", "the", "in", etc.
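The failure is easy to reproduce with a minimal token-set Jaccard. This sketch uses whitespace tokenization with punctuation stripped; exact scores depend on normalization and whether rules text is included, so the numbers differ slightly from the table above, but the conclusion (far below any 60% threshold) does not:

```rust
use std::collections::HashSet;

/// Token-level Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercase,
/// punctuation-stripped word sets.
fn jaccard(a: &str, b: &str) -> f64 {
    let tokens = |s: &str| -> HashSet<String> {
        s.split_whitespace()
            .map(|w| {
                w.chars()
                    .filter(|c| c.is_alphanumeric())
                    .collect::<String>()
                    .to_lowercase()
            })
            .filter(|w| !w.is_empty())
            .collect()
    };
    let (sa, sb) = (tokens(a), tokens(b));
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}
```

For the Greenland pair, only "will" and "greenland" overlap out of eleven distinct tokens: roughly 18% similarity for semantically equivalent markets.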
Problem 2: API Sort Order Returns Different Market Types¶
- Kalshi default sort: Returns high-volume sports/weather markets first
- Polymarket default sort: Returns high-volume politics/crypto markets first
- Even scanning 2000 markets per platform yields minimal overlap in market categories
Problem 3: Multivariate Event Filtering Required¶
Kalshi's API returns sports parlays by default. Required adding mve_filter=exclude to get prediction markets.
Industry Validation¶
Research of existing cross-platform arbitrage solutions confirms these findings:
| Tool | Matching Approach |
|---|---|
| pmxt | Manual slug-based configuration |
| Dome API | Unified API with manual market mapping |
| Matchr | Curated match database (1,500+ markets) |
| EventArb | Manual market selection with arb calculator |
| Polymarket-Kalshi-Arbitrage-Bot | "Intelligent matching" via entity extraction + text similarity |
Key insight: No tool relies solely on text similarity. All use at least one of:
- Manual curation/configuration
- Entity extraction + structured field matching ("fingerprinting")
- Curated databases of known matches
Prior Council Feedback¶
This ADR directly addresses concerns raised in the LLM Council Design Reviews:
Design Review 1 (FR-MD-002 Fuzzy Matching): "DANGEROUS. Downgrade to 'Candidate Proposal' only. Require human sign-off."
Design Review 1 (FR-MD-003): "No arb should execute on a mapped pair without a signed human verification bit."
Design Review 2 (Approved): "Safety Gates (FR-MD-003): Requiring human confirmation for market mapping prevents catastrophic 'bad data' trades (e.g., mapping 'Trump' to 'Trump Jr')."
Context¶
The arbiter-bot currently requires manual identification and configuration of market pairs between Polymarket and Kalshi. Market tickers are hardcoded (e.g., KXBTC-25JAN31-B95000), creating operational overhead:
- Manual discovery burden: Operators must research markets on both platforms independently
- Missed opportunities: New markets may go undetected
- No persistent mapping store: Mappings exist only in memory
- Scaling limitation: Cannot efficiently monitor thousands of markets
Industry Context¶
Academic research from IMDEA Networks Institute documented over $40 million in arbitrage profits from Polymarket alone (April 2024 - April 2025). Existing arbitrage bots (e.g., polymarket-arbitrage) watch 10,000+ markets using automated matching. Cross-platform studies show ~6% of 102,275 events have semantic relations across venues.
Critical Constraint¶
Settlement semantics differ across platforms. The 2024 government shutdown case illustrates this:
- Polymarket: "OPM issues shutdown announcement"
- Kalshi: "Actual shutdown exceeding 24 hours"
Same event, different resolution criteria, potentially different outcomes. Human verification remains mandatory per existing requirement FR-MD-003.
Decision¶
Implement an automated market discovery and matching system using a three-stage fingerprint-based pipeline:
- Stage 1: Candidate Generation - Fast narrowing by keywords, dates, categories
- Stage 2: Fingerprint Matching - Structured field comparison with weighted scoring
- Stage 3: Human Verification - Resolution criteria review with semantic warnings
Revised Options Analysis¶
Option A: Pure Text Similarity (Current - Insufficient)¶
| Criterion | Assessment |
|---|---|
| Accuracy | Low - misses semantic matches |
| Cost | No per-match API costs (uses cached market data) |
| Latency | Sub-millisecond per comparison |
| Explainability | High (score breakdown visible) |
| Verdict | Insufficient for production use |
Evidence: Real market pairs score 8-9% similarity, far below any reasonable threshold.
Option B: Fingerprint-Based Matching (Proposed)¶
| Criterion | Assessment |
|---|---|
| Accuracy | High - matches on structured fields |
| Cost | No per-match API costs (local processing on cached data) |
| Latency | ~10ms per comparison (entity extraction) |
| Explainability | High (field-by-field comparison) |
| Complexity | Medium (requires entity extraction) |
Algorithm:
1. Extract "market fingerprint" with structured fields
2. Match on canonical fields (entity, date, threshold, resolution source)
3. Score similarity across fields with appropriate weights
4. Generate candidates for human review
Option C: Embedding-Based Semantic Matching (Future Enhancement)¶
| Criterion | Assessment |
|---|---|
| Accuracy | Highest - captures semantic meaning |
| Cost | ~$0.0001 per embedding (or local model) |
| Latency | +50-200ms per embedding |
| Complexity | High (embedding service, vector DB) |
| Verdict | Consider for Phase 3 enhancement (after fingerprint foundation) |
Option D: Hybrid Fingerprint + LLM Verification (Future Enhancement)¶
| Criterion | Assessment |
|---|---|
| Accuracy | Highest - LLM catches edge cases |
| Cost | ~$0.01-0.05 per verification |
| Latency | +200-500ms per LLM call |
| Verdict | Consider for high-value market verification |
Option E: External Service Integration (Alternative)¶
| Criterion | Assessment |
|---|---|
| Accuracy | High (curated by service provider) |
| Cost | API subscription fees |
| Dependency | External service availability |
| Candidates | Matchr (curated DB), Dome (unified API) |
| Verdict | Consider as fallback or validation source |
Rationale for Option B (Fingerprint-Based)¶
- Proven approach: Industry tools (Polymarket-Kalshi-Arbitrage-Bot) use entity extraction
- Addresses root cause: Matches on semantic fields, not surface text
- No external dependencies: Local processing, no API costs
- Extensible: Can add embeddings or LLM verification later
- Explainable: Field-by-field comparison is auditable
- Council Compliant: Still generates candidates for human review (FR-MD-003)
Revised Architecture¶
┌─────────────────┐ ┌─────────────────┐
│ Polymarket API │ │ Kalshi API │
│ (Gamma endpoint)│ │ (/v2/markets) │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
┌─────────────────────────────────────────┐
│ DiscoveryScannerActor │
│ - Market enumeration with pagination │
│ - mve_filter=exclude for Kalshi │
│ - Category/date pre-filtering │
└────────────────────┬────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ MarketFingerprintExtractor │
│ - Entity extraction (NER) │
│ - Date/threshold parsing │
│ - Resolution source identification │
│ - Outcome structure normalization │
└────────────────────┬────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ FingerprintMatcher │
│ Stage 1: Candidate generation (fast) │
│ Stage 2: Field-by-field scoring │
│ Stage 3: Semantic warning detection │
└────────────────────┬────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ CandidateMatch (SQLite) │
│ - Pending / Approved / Rejected │
│ - Fingerprint diff for review │
│ - Semantic warnings │
└────────────────────┬────────────────────┘
│
▼ (Human approval via CLI)
┌─────────────────────────────────────────┐
│ MappingManager (existing) │
│ - propose_mapping() → verify_mapping() │
│ - FR-MD-003 safety gate preserved │
└─────────────────────────────────────────┘
Market Fingerprint Schema¶
/// Canonical market fingerprint for cross-platform matching
pub struct MarketFingerprint {
/// Primary entity (e.g., "Trump", "Bitcoin", "Fed")
pub entity: String,
/// Secondary entities (e.g., "Greenland", "Denmark")
pub secondary_entities: Vec<String>,
/// Event type (e.g., "acquisition", "election", "price_target")
pub event_type: EventType,
/// Metric and direction (e.g., "price >= $100,000")
pub metric: Option<MetricSpec>,
/// Geographic scope (e.g., "US", "global")
pub scope: Option<String>,
/// Resolution date/time window
pub resolution_window: ResolutionWindow,
/// Outcome structure
pub outcome_type: OutcomeType, // Binary | MultiOutcome | Range
/// Resolution source (e.g., "BLS", "AP", "FOMC")
pub resolution_source: Option<String>,
/// Original title (for reference)
pub original_title: String,
}
pub struct MetricSpec {
pub name: String, // "price", "rate", "count"
pub direction: Direction, // Above | Below | Between | Exactly
pub threshold: Decimal, // Use rust_decimal for financial precision
pub unit: Option<String>, // "$", "%", "basis points"
}
// Note: Use rust_decimal::Decimal for threshold to avoid floating-point
// precision issues in financial comparisons (e.g., 0.1 + 0.2 != 0.3 in f64)
pub struct ResolutionWindow {
pub date: Option<NaiveDate>,
pub time: Option<NaiveTime>,
pub timezone: Option<Tz>, // Use chrono_tz::Tz for type-safe timezones
pub tolerance_days: i32, // For fuzzy date matching
}
// Note: All times should be normalized to UTC for comparison.
// Local timezone is preserved for display purposes only.
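The floating-point pitfall noted above can be made concrete: comparing dollar thresholds in integer cents (or via `rust_decimal::Decimal`, as the note recommends) sidesteps binary rounding. A minimal sketch using only integer cents:

```rust
/// Compare two dollar thresholds by converting to integer cents,
/// sidestepping binary floating-point rounding (0.1 + 0.2 != 0.3 in f64).
fn thresholds_equal(a_dollars: f64, b_dollars: f64) -> bool {
    let to_cents = |d: f64| (d * 100.0).round() as i64;
    to_cents(a_dollars) == to_cents(b_dollars)
}
```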
Revised Matching Algorithm¶
Stage 1: Candidate Generation (Fast Narrowing)¶
FOR each Kalshi market K:
1. Extract keywords from K.title + K.rules
2. Query Polymarket index by:
- Keyword overlap (BM25 or inverted index)
- Resolution date proximity (±14 days)
- Category match (if available)
3. Return top N candidates (N=50)
Complexity: O(n log n) with inverted index
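A minimal sketch of the Stage 1 keyword lookup, with a plain overlap count standing in for BM25 ranking (type and method names here are illustrative, not the actual implementation):

```rust
use std::collections::HashMap;

/// Inverted index mapping each normalized token to the market IDs
/// whose titles contain it (Stage 1 candidate generation).
struct KeywordIndex {
    postings: HashMap<String, Vec<usize>>,
}

fn normalize(token: &str) -> String {
    token
        .chars()
        .filter(|c| c.is_alphanumeric())
        .collect::<String>()
        .to_lowercase()
}

impl KeywordIndex {
    fn build(titles: &[&str]) -> Self {
        let mut postings: HashMap<String, Vec<usize>> = HashMap::new();
        for (id, title) in titles.iter().enumerate() {
            for token in title.split_whitespace().map(normalize) {
                if !token.is_empty() {
                    let ids = postings.entry(token).or_default();
                    if ids.last() != Some(&id) {
                        ids.push(id); // avoid duplicate postings per title
                    }
                }
            }
        }
        Self { postings }
    }

    /// Top-n candidate IDs ranked by count of overlapping query tokens.
    fn candidates(&self, query: &str, n: usize) -> Vec<usize> {
        let mut hits: HashMap<usize, usize> = HashMap::new();
        for token in query.split_whitespace().map(normalize) {
            if let Some(ids) = self.postings.get(&token) {
                for &id in ids {
                    *hits.entry(id).or_insert(0) += 1;
                }
            }
        }
        let mut ranked: Vec<(usize, usize)> = hits.into_iter().collect();
        // Sort by overlap count descending, then by ID for determinism
        ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
        ranked.into_iter().take(n).map(|(id, _)| id).collect()
    }
}
```

Lookups are O(1) per token, so candidate generation stays cheap even across thousands of markets.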
Stage 2: Fingerprint Matching (Weighted Scoring)¶
FOR each candidate pair (K, P):
fingerprint_K = extract_fingerprint(K)
fingerprint_P = extract_fingerprint(P)
score = weighted_sum([
(entity_match(K, P), weight=0.30), # Primary entity
(date_match(K, P), weight=0.25), # Resolution date
(threshold_match(K, P), weight=0.20), # Numeric thresholds
(outcome_match(K, P), weight=0.15), # Binary vs multi
(resolution_source_match(K, P), weight=0.10), # Data source
])
IF score >= 0.70:
create_candidate(K, P, score)
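The weighted sum in the pseudocode above can be sketched in Rust, assuming per-field matchers already produce scores in [0, 1] (the `FieldScores` struct is illustrative, not the actual type):

```rust
/// Per-field similarity scores in [0, 1] for a candidate pair.
struct FieldScores {
    entity: f64,
    date: f64,
    threshold: f64,
    outcome: f64,
    source: f64,
}

/// Stage 2 weighted sum using the field weights from the pseudocode.
fn fingerprint_score(s: &FieldScores) -> f64 {
    0.30 * s.entity + 0.25 * s.date + 0.20 * s.threshold + 0.15 * s.outcome + 0.10 * s.source
}

/// Minimum score for creating a candidate match.
const CANDIDATE_THRESHOLD: f64 = 0.70;
```

Note that entity, date, and threshold together carry 0.75 weight, so a pair matching on those three fields clears the 0.70 bar even with no outcome or source agreement.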
Weight Rationale & Validation Plan:
| Field | Weight | Rationale | Validation Method |
|---|---|---|---|
| Entity | 0.30 | Primary disambiguator (Trump vs Biden) | A/B test on historical pairs |
| Date | 0.25 | Critical for time-bound events | Precision/recall on date-similar pairs |
| Threshold | 0.20 | Important for numeric markets (price targets) | Manual review of 50 threshold markets |
| Outcome | 0.15 | Binary vs multi-outcome affects pairing | Confusion matrix analysis |
| Source | 0.10 | Settlement source differences cause disputes | Historical dispute rate analysis |
Initial Values: These weights are starting estimates based on domain analysis. They will be validated against a golden set of 100+ manually-verified market pairs before production deployment. The Phase 5 feedback loop will continuously optimize weights based on human approval decisions.
Threshold Validation:
| Threshold | Value | Purpose | Validation Criteria |
|---|---|---|---|
| Candidate creation | ≥ 0.70 | Balance precision/recall | Target: Precision ≥ 0.80, Recall ≥ 0.70 |
| Auto-approve (future) | ≥ 0.95 | High-confidence automation | Zero false positives in test set |
| Uncertain zone | 0.70-0.85 | Trigger LLM verification (Phase 4) | Review rate < 20% of candidates |
Empirical Validation Required: Before enabling auto-approval or LLM escalation, a minimum of 100 human decisions must be collected to calibrate thresholds and measure actual precision/recall.
Stage 3: Semantic Warning Detection¶
warnings = []
IF K.resolution_source != P.resolution_source:
warnings.append("Different resolution sources")
IF abs(K.resolution_date - P.resolution_date) > 1 day:
warnings.append("Resolution dates differ by {days}")
IF K.rules contains "announcement" AND P.rules contains "actual":
warnings.append("Announcement vs actual event timing")
IF K.outcome_count != P.outcome_count:
warnings.append("Different outcome structures")
# Require explicit acknowledgment before approval
Entity Extraction Approaches¶
Option 1: Rule-Based NER (Recommended for MVP)¶
/// Extract entities using pattern matching
fn extract_entities(title: &str, rules: &str) -> Vec<Entity> {
let mut entities = Vec::new();
// Known entity patterns
let patterns = [
(r"(?i)\b(Trump|Biden|Harris|Obama)\b", EntityType::Person),
(r"(?i)\b(Bitcoin|BTC|Ethereum|ETH)\b", EntityType::Crypto),
(r"(?i)\b(Fed|FOMC|CPI|GDP|NFP)\b", EntityType::Economic),
(r"(?i)\b(Super Bowl|World Series|NBA Finals)\b", EntityType::Sports),
(r"(?i)\b(Greenland|Ukraine|Taiwan)\b", EntityType::Location),
(r"\$[\d,]+(?:k|K|M|B)?", EntityType::PriceTarget),
(r"(?i)\b(20\d{2})\b", EntityType::Year),
];
for (pattern, entity_type) in patterns {
// Extract and deduplicate matches
}
entities
}
Option 2: ML-Based NER (Future Enhancement)¶
Use a lightweight NER model (e.g., spaCy, Hugging Face transformers) for more robust entity extraction:
# Example with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
def extract_entities_ml(text):
doc = nlp(text)
return [(ent.text, ent.label_) for ent in doc.ents]
Option 3: LLM-Based Extraction (Future Enhancement)¶
Use an LLM to extract structured fingerprints:
Prompt: Extract a structured fingerprint from this market:
Title: "Will Trump buy Greenland?"
Rules: "Resolves Yes if US purchases any part of Greenland from Denmark before Jan 20, 2029"
Expected output:
{
"entity": "Trump",
"secondary_entities": ["Greenland", "Denmark", "US"],
"event_type": "acquisition",
"resolution_date": "2029-01-20",
"resolution_source": null
}
Module Structure (Revised)¶
arbiter-engine/src/
├── discovery/
│ ├── mod.rs
│ ├── scanner.rs # DiscoveryScannerActor
│ ├── normalizer.rs # Text normalization
│ ├── matcher.rs # Text similarity (existing)
│ ├── fingerprint.rs # NEW: Fingerprint extraction
│ ├── fingerprint_matcher.rs # NEW: Field-based matching
│ ├── entity_extractor.rs # NEW: Entity extraction (NER)
│ ├── candidate.rs # CandidateMatch types
│ ├── storage.rs # SQLite persistence
│ └── approval.rs # Human approval workflow
└── market/
└── discovery_client/
├── mod.rs
├── polymarket_gamma.rs
└── kalshi_markets.rs # With mve_filter=exclude
CLI Interface (Enhanced)¶
# Discover markets with fingerprint matching
cargo run --features discovery -- --discover-markets --verbose
# Show fingerprint for a specific market (debugging)
cargo run --features discovery -- --show-fingerprint --ticker "KXGREENLAND-29"
# Review candidates with fingerprint diff
cargo run --features discovery -- --review-candidates
# Import from external matching service (future)
cargo run --features discovery -- --import-matches --source matchr
External Service Integration (Future)¶
For validation or as a fallback, integrate with existing matching services:
/// External matching service trait
#[async_trait]
pub trait ExternalMatchingService {
/// Query for known matches of a market
async fn find_matches(&self, market_id: &str) -> Result<Vec<ExternalMatch>, Error>;
/// Validate a proposed match
async fn validate_match(&self, pair: &MarketPair) -> Result<ValidationResult, Error>;
}
/// Implementations
pub struct MatchrClient { /* curated database */ }
pub struct DomeClient { /* unified API */ }
pub struct PmxtClient { /* open-source library */ }
Requirements Traceability¶
Existing Requirements Implemented:
| ID | Requirement | Status | Implementation |
|---|---|---|---|
| FR-MD-001 | Persistent cache of mappings | Complete | SQLite storage |
| FR-MD-002 | Fuzzy matching as suggestion engine only | Revision needed | Fingerprint matching |
| FR-MD-003 | Human Confirmation required | Complete | CLI approval workflow |
| FR-MD-004 | Auto-discover markets by expiration | Complete | Scanner with date filter |
| FR-MD-005 | Track resolution status and dates | Complete | DiscoveredMarket fields |
| FR-MD-006 | Enumerate Polymarket Gamma API | Complete | polymarket_gamma.rs |
| FR-MD-007 | Enumerate Kalshi /v2/markets | Complete | kalshi_markets.rs |
| FR-MD-008 | Semantic warning detection | Complete | Warning flags |
| FR-MD-009 | Audit logging | Complete | JSONL audit log |
New/Revised Requirements:
| ID | Requirement | Phase | Priority |
|---|---|---|---|
| FR-MD-011 | Fingerprint extraction from market titles/rules | 2a | Must |
| FR-MD-012 | Entity-based candidate generation | 2a | Must |
| FR-MD-013 | Field-weighted similarity scoring | 2b | Must |
| FR-MD-014 | Rule-based named entity recognition | 2a | Must |
| FR-MD-015 | ML-based NER integration | 2c | Could |
| FR-MD-016 | Embedding-based semantic matching | 3b | Should |
| FR-MD-017 | External service integration (Matchr/Dome) | 3c | Could |
| FR-MD-018 | Embedding model evaluation and selection | 3a | Should |
| FR-MD-019 | Vector storage integration | 3a | Should |
| FR-MD-020 | Batch embedding generation pipeline | 3a | Should |
| FR-MD-021 | Hybrid scoring algorithm | 3b | Should |
| FR-MD-022 | Confidence calibration | 3b | Should |
| FR-MD-023 | Contrastive fine-tuning on approved pairs | 3c | Could |
| FR-MD-024 | LLM verification prompt engineering | 4a | Should |
| FR-MD-025 | Cost-optimized LLM invocation | 4a | Should |
| FR-MD-026 | Automated escalation to LLM | 4b | Should |
| FR-MD-027 | Resolution criteria deep analysis | 4c | Could |
| FR-MD-028 | Decision logging with feedback | 5a | Should |
| FR-MD-029 | Training data export pipeline | 5a | Should |
| FR-MD-030 | Automatic entity alias learning | 5b | Should |
| FR-MD-031 | Fingerprint weight optimization | 5b | Could |
| FR-MD-032 | Continuous evaluation and retraining | 5c | Could |
Migration Path¶
Phase 2a: Fingerprint Foundation¶
- Implement `MarketFingerprint` struct
- Implement rule-based entity extractor
- Add fingerprint storage to SQLite schema
- Unit tests for extraction accuracy
Phase 2b: Fingerprint Matcher¶
- Implement `FingerprintMatcher` with weighted scoring
- Replace text similarity as primary matching method
- Keep text similarity as fallback/tiebreaker
- Integration tests with real market data
Phase 2c: Validation & Tuning¶
- Test against known market pairs (Greenland, Super Bowl, etc.)
- Tune field weights based on precision/recall
- Add ML-based NER if rule-based insufficient
- Council review of revised implementation
Phase 3: Embedding-Based Semantic Matching (Option C)¶
Goal: Add vector embedding similarity as a complementary matching signal that captures semantic meaning beyond structured fields.
Phase 3a: Embedding Infrastructure¶
Requirements: FR-MD-018, FR-MD-019, FR-MD-020
- Model Selection
  - Evaluate embedding models for prediction market domain:
    - `all-MiniLM-L6-v2` (384 dims, fast, local)
    - `text-embedding-3-small` (1536 dims, OpenAI API)
    - `voyage-finance-2` (1024 dims, finance-tuned)
  - Benchmark on golden set of known market pairs
  - Selection criteria: F1 score ≥ 0.85 on domain, latency ≤ 100ms
- Vector Storage
  - Option A: SQLite with `sqlite-vec` extension (simple, local)
  - Option B: PostgreSQL with `pgvector` (scalable, production)
  - Option C: FAISS index with SQLite metadata (fast ANN search)
  - Schema addition:
- Embedding Pipeline
  - Batch embedding generation during market discovery scan
  - Incremental updates for new markets
  - Cache embeddings to avoid recomputation
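The caching step can be sketched as a memoizing wrapper. The hash-derived vector below stands in for a real embedder, and all names are illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Memoizing wrapper: embeds each distinct text once, then serves the
/// cached vector. The "embedding" is a deterministic hash-derived stand-in.
struct CachedEmbedder {
    cache: HashMap<String, Vec<f32>>,
    misses: usize, // how many times the underlying embedder ran
}

impl CachedEmbedder {
    fn new() -> Self {
        Self { cache: HashMap::new(), misses: 0 }
    }

    fn embed(&mut self, text: &str) -> Vec<f32> {
        if let Some(v) = self.cache.get(text) {
            return v.clone(); // cache hit: no recomputation
        }
        self.misses += 1;
        let mut h = DefaultHasher::new();
        text.hash(&mut h);
        let seed = h.finish();
        // Deterministic pseudo-embedding derived from the hash.
        let v: Vec<f32> = (0..8)
            .map(|i| ((seed >> (i * 8)) & 0xff) as f32 / 255.0)
            .collect();
        self.cache.insert(text.to_string(), v.clone());
        v
    }
}
```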
Phase 3b: Hybrid Matching Integration¶
Requirements: FR-MD-021, FR-MD-022
- Semantic Candidate Generation

impl SemanticMatcher {
    /// Find semantically similar markets using embedding search
    pub async fn find_similar(&self, market: &DiscoveredMarket, k: usize) -> Result<Vec<SimilarMarket>, Error> {
        let embedding = self.embed(&market.title, &market.description).await?;
        self.vector_store.nearest_neighbors(&embedding, k).await
    }
}

- Hybrid Scoring Algorithm
Validation Plan: Hybrid weights will be determined empirically via grid search over the golden set. Initial values are estimates based on: (1) fingerprint provides structured matching, (2) embeddings capture semantics, (3) text similarity as fallback for simple cases. Target: F1 ≥ 0.90 on held-out test set.
- Confidence Calibration
- Track score distributions for true/false matches
- Calibrate thresholds to achieve target precision/recall
- Separate thresholds for different market categories
Phase 3c: Domain Adaptation & Fine-Tuning¶
Requirements: FR-MD-023
- Training Data Collection
  - Export approved match pairs as positive examples
  - Export rejected pairs as negative examples
  - Export human-modified entity mappings
  - Target: 500+ labeled pairs before fine-tuning
- Contrastive Fine-Tuning

Note: This Python code is for the ML training pipeline only (offline batch process). The trained model is exported to ONNX format for use in the Rust runtime via the `ort` (ONNX Runtime) crate.
# Fine-tune embedding model on prediction market pairs
from sentence_transformers import SentenceTransformer, losses
# Load pre-trained model as starting point
model = SentenceTransformer('all-MiniLM-L6-v2')
# ContrastiveLoss pulls matching pairs together in embedding space
# while pushing non-matching pairs apart. This teaches the model
# that "Super Bowl" and "Pro Football Championship" should be close.
train_loss = losses.ContrastiveLoss(model)
# Train on (anchor, positive, negative) triplets from human decisions
# - Anchor: Kalshi market title
# - Positive: Approved Polymarket match
# - Negative: Rejected Polymarket candidate (hard negative)
# (train_dataloader yields these labeled examples; construction omitted here)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,         # Few epochs to avoid overfitting on small dataset
    warmup_steps=100  # Gradual learning rate increase
)
- A/B Testing
- Deploy fine-tuned model alongside base model
- Compare precision/recall on new market pairs
- Gradual rollout based on performance
Phase 4: LLM-Based Verification (Option D)¶
Goal: Use LLM reasoning for high-confidence verification of uncertain matches and deep resolution criteria analysis.
Phase 4a: LLM Verification Pipeline¶
Requirements: FR-MD-024, FR-MD-025
- Structured Verification Prompt

```text
You are a prediction market analyst. Determine if these two markets are equivalent.

MARKET A (Kalshi):
- Title: "{kalshi_title}"
- Resolution: "{kalshi_rules}"
- Expiration: {kalshi_date}

MARKET B (Polymarket):
- Title: "{poly_title}"
- Resolution: "{poly_rules}"
- Expiration: {poly_date}

Analyze:
1. Are they about the SAME underlying event? (not just similar topics)
2. Would "Yes" on Market A correspond to "Yes" on Market B?
3. Are the resolution criteria compatible? List any differences.
4. Could different resolution timing cause different outcomes?

Output JSON:
{
  "equivalent": true|false,
  "confidence": 0.0-1.0,
  "reasoning": "...",
  "warnings": ["..."],
  "resolution_differences": ["..."]
}
```

- Cost-Optimized Invocation
- Only invoke LLM for:
- Fingerprint score between 0.70-0.85 (uncertain zone)
- High-value markets (volume > $10k)
- Markets with semantic warnings
- Use Claude Haiku for initial screening ($0.001/verification)
- Escalate to Claude Sonnet for complex cases ($0.01/verification)
- Budget cap: $50/day default, configurable
- Response Parsing & Validation
```rust
#[derive(Deserialize)]
pub struct LlmVerificationResult {
    pub equivalent: bool,
    pub confidence: f64,
    pub reasoning: String,
    pub warnings: Vec<String>,
    pub resolution_differences: Vec<String>,
}

impl LlmVerifier {
    pub async fn verify(&self, pair: &CandidateMatch) -> Result<LlmVerificationResult, Error> {
        let prompt = self.build_prompt(pair);
        let response = self.llm_client.complete(&prompt).await?;
        self.parse_response(&response)
    }
}
```
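The cost-optimization rules above can be sketched as a small policy function. Prices and thresholds come from this ADR; the function and parameter names (`choose_llm_tier`, `spent_today`) are illustrative, and the exact mapping of conditions to model tiers is an assumption:

```python
# Sketch of the tiered invocation policy: skip the LLM entirely when the
# fingerprint score is decisive, use the cheap model for routine screening,
# and reserve the expensive model for complex cases -- all under a daily cap.

HAIKU_COST = 0.001   # $/verification (initial screening)
SONNET_COST = 0.01   # $/verification (complex cases)

def choose_llm_tier(score, volume_usd, warnings, spent_today, budget=50.0):
    """Return 'skip', 'haiku', or 'sonnet' for a candidate pair."""
    uncertain = 0.70 <= score < 0.85
    high_value = volume_usd > 10_000
    if not (uncertain or high_value or warnings):
        return "skip"                      # fingerprint score is decisive
    tier, cost = ("sonnet", SONNET_COST) if (high_value or warnings) \
        else ("haiku", HAIKU_COST)
    if spent_today + cost > budget:
        return "skip"                      # daily budget cap reached
    return tier
```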
Phase 4b: Automated Escalation¶
Requirements: FR-MD-026
- Uncertainty Detection
- Fingerprint score in "uncertain zone" (0.70-0.85)
- Conflicting signals (high entity match, low date match)
- Semantic warnings present
- Resolution criteria contain complex conditions
- Escalation Rules

```rust
impl EscalationPolicy {
    pub fn should_escalate_to_llm(&self, result: &MatchResult) -> bool {
        // Uncertain fingerprint score
        if result.score >= 0.70 && result.score < 0.85 {
            return true;
        }
        // High variance in field scores
        let variance = result.field_scores.values().variance();
        if variance > 0.15 {
            return true;
        }
        // Semantic warnings present
        if !result.warnings.is_empty() {
            return true;
        }
        false
    }
}
```

- Human Review of LLM Decisions
- Initially: All LLM-verified matches require human confirmation
- After calibration (100+ decisions): Auto-approve if LLM confidence ≥ 0.95
- Always require human review if LLM identifies resolution differences
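The calibration gate described above might look like the following (a hypothetical helper, not the approved implementation):

```python
# Sketch of the human-review gate: auto-approval is allowed only after
# calibration (100+ decisions), only at high LLM confidence, and never
# when the LLM reports resolution differences.

def requires_human_review(llm_confidence, resolution_differences,
                          calibrated_decisions):
    """Return True if a human must confirm this LLM-verified match."""
    if resolution_differences:
        return True                 # always review flagged differences
    if calibrated_decisions < 100:
        return True                 # still calibrating: review everything
    return llm_confidence < 0.95    # auto-approve only at high confidence
```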
Phase 4c: Resolution Criteria Deep Analysis¶
Requirements: FR-MD-027
- Structured Resolution Comparison

```text
Analyze the resolution criteria for these markets:

Market A Resolution: "{criteria_a}"
Market B Resolution: "{criteria_b}"

Extract and compare:
1. Resolution SOURCE (who determines outcome)
2. Resolution TIMING (when is outcome determined)
3. Resolution THRESHOLD (what conditions trigger Yes/No)
4. Edge cases (what happens if ambiguous)

Output structured comparison with compatibility assessment.
```

- Semantic Difference Detection
- "Announcement" vs "actual event" timing
- Different authoritative sources (AP vs Reuters)
- Different thresholds or measurement periods
- Geographic scope differences
- Human-Readable Reports
- Generate side-by-side comparison for human reviewers
- Highlight specific text differences
- Provide recommendation with confidence level
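A side-by-side text comparison could be sketched with the Python standard library's `difflib`; the report format here is illustrative only:

```python
import difflib

# Sketch: highlight word-level differences between two resolution texts,
# e.g. a change of authoritative source. Returns only the changed tokens.

def resolution_diff_report(criteria_a, criteria_b):
    diff = difflib.ndiff(criteria_a.split(), criteria_b.split())
    return [tok for tok in diff if tok.startswith(("+ ", "- "))]

report = resolution_diff_report(
    "Resolves YES if AP calls the race by Dec 31",
    "Resolves YES if Reuters calls the race by Dec 31",
)
```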
Phase 5: Reinforcement Learning from Human Feedback¶
Goal: Create a continuous improvement loop where human approval decisions improve all matching components over time.
Phase 5a: Feedback Data Collection¶
Requirements: FR-MD-028, FR-MD-029
- Decision Logging Schema

```sql
CREATE TABLE match_decisions (
    id UUID PRIMARY KEY,
    candidate_id UUID REFERENCES candidates(id),
    decision ENUM('approved', 'rejected', 'modified'),
    reviewer_id TEXT,
    decision_timestamp TIMESTAMP,
    -- Context at decision time
    fingerprint_score REAL,
    embedding_similarity REAL,
    llm_confidence REAL,
    -- Feedback data
    rejection_reason TEXT,
    entity_corrections JSONB,  -- {"old": "BTC", "new": "Bitcoin"}
    resolution_notes TEXT,
    -- For training
    is_training_example BOOLEAN DEFAULT true
);
```

- Feedback Categories
- Entity Corrections: Human corrects entity extraction errors
- Alias Additions: Human identifies new synonyms/aliases
- False Positive Patterns: Common rejection reasons
- Edge Case Documentation: Complex matches with notes
- Export for Training

```rust
impl FeedbackExporter {
    /// Export approved pairs as positive training examples
    pub fn export_positive_pairs(&self) -> Vec<TrainingPair> {
        self.storage.query_decisions("approved")
            .map(|d| TrainingPair {
                anchor: d.kalshi_title,
                positive: d.poly_title,
                metadata: d.entity_corrections,
            })
            .collect()
    }

    /// Export rejected pairs as hard negatives
    pub fn export_negative_pairs(&self) -> Vec<TrainingPair> {
        self.storage.query_decisions("rejected")
            .filter(|d| d.fingerprint_score > 0.5) // Hard negatives only
            .map(|d| TrainingPair {
                anchor: d.kalshi_title,
                negative: d.poly_title,
                rejection_reason: d.rejection_reason,
            })
            .collect()
    }
}
```
Phase 5b: Automatic Improvements¶
Requirements: FR-MD-030, FR-MD-031
- Entity Alias Database Updates

```rust
impl AliasLearner {
    /// Learn new aliases from approved matches
    pub fn learn_from_approval(&mut self, decision: &MatchDecision) {
        if let Some(corrections) = &decision.entity_corrections {
            for (old, new) in corrections {
                self.alias_db.add_alias(new, old);
                log::info!("Learned alias: {} -> {}", old, new);
            }
        }
        // Also learn implicit aliases from matched pairs
        let kalshi_entities = extract_entities(&decision.kalshi_title);
        let poly_entities = extract_entities(&decision.poly_title);
        for (k, p) in self.align_entities(&kalshi_entities, &poly_entities) {
            if k.name != p.name && k.entity_type == p.entity_type {
                self.alias_db.add_alias(&k.name, &p.name);
            }
        }
    }
}
```

- Fingerprint Weight Optimization

```rust
impl WeightOptimizer {
    /// Optimize field weights based on historical decisions
    /// Note: f64 is appropriate here for ML model features/labels
    /// (not financial values - those use rust_decimal::Decimal)
    pub fn optimize(&self, decisions: &[MatchDecision]) -> FieldWeights {
        // Use logistic regression to find optimal weights
        let features: Vec<Vec<f64>> = decisions.iter()
            .map(|d| vec![
                d.entity_score,
                d.date_score,
                d.threshold_score,
                d.outcome_score,
                d.source_score,
            ])
            .collect();
        let labels: Vec<f64> = decisions.iter()
            .map(|d| if d.decision == "approved" { 1.0 } else { 0.0 })
            .collect();
        let model = LogisticRegression::fit(&features, &labels);
        FieldWeights {
            entity: model.coefficients[0].abs(),
            date: model.coefficients[1].abs(),
            threshold: model.coefficients[2].abs(),
            outcome: model.coefficients[3].abs(),
            source: model.coefficients[4].abs(),
        }.normalized()
    }
}
```

- Semantic Warning Pattern Learning
- Analyze rejection reasons to identify new warning patterns
- Add learned patterns to semantic warning detection
- Reduce false negatives from undetected issues
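One illustrative way to mine rejection reasons for recurring warning patterns (the reason strings and threshold are hypothetical):

```python
from collections import Counter

# Sketch: count rejection reasons and surface those that recur as
# candidate additions to the semantic warning detector.

def frequent_rejection_patterns(rejections, min_count=2):
    counts = Counter(r["rejection_reason"] for r in rejections)
    return [reason for reason, n in counts.most_common() if n >= min_count]

rejections = [
    {"rejection_reason": "announcement vs actual event timing"},
    {"rejection_reason": "announcement vs actual event timing"},
    {"rejection_reason": "different authoritative source"},
]
```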
Phase 5c: Continuous Evaluation & Retraining¶
Requirements: FR-MD-032
- Golden Set Maintenance
- Add all approved/rejected pairs to golden test set
- Stratify by market category (politics, crypto, sports, etc.)
- Target: 100+ pairs per category
- Automated Regression Testing

```bash
# Run weekly evaluation against golden set
cargo run --features discovery -- --evaluate-matching \
  --golden-set data/golden_pairs.json \
  --output reports/weekly_eval.json

# Alert if metrics degrade. Note: `[ ... < ... ]` compares strings
# (or redirects), so use awk for a numeric comparison.
f1=$(jq '.f1_score' reports/weekly_eval.json)
if awk -v f1="$f1" 'BEGIN { exit !(f1 < 0.85) }'; then
  notify "Matching quality degraded: F1 < 0.85"
fi
```

- Retraining Pipeline
- Model Versioning
- Track model versions with performance metrics
- Enable rollback if new model underperforms
- Maintain audit trail of model changes
- Weekly Improvement Cycle
```text
┌──────────────────────────────────────────────────────┐
│ Weekly Improvement Cycle                             │
├──────────────────────────────────────────────────────┤
│ Monday:    Export new decisions, update golden set   │
│ Tuesday:   Retrain embedding model, optimize weights │
│ Wednesday: Validate new models on golden set         │
│ Thursday-Saturday: A/B test (10% traffic)            │
│ Sunday:    Promote if improved, rollback if degraded │
└──────────────────────────────────────────────────────┘
```
Extended Requirements (Phase 3-5)¶
| ID | Requirement | Phase | Priority |
|---|---|---|---|
| FR-MD-018 | Embedding model evaluation and selection | 3a | Should |
| FR-MD-019 | Vector storage integration (sqlite-vec or pgvector) | 3a | Should |
| FR-MD-020 | Batch embedding generation pipeline | 3a | Should |
| FR-MD-021 | Hybrid scoring (fingerprint + embedding + text) | 3b | Should |
| FR-MD-022 | Confidence calibration for hybrid scores | 3b | Should |
| FR-MD-023 | Contrastive fine-tuning on approved pairs | 3c | Could |
| FR-MD-024 | LLM verification prompt engineering | 4a | Should |
| FR-MD-025 | Cost-optimized LLM invocation strategy | 4a | Should |
| FR-MD-026 | Automated escalation to LLM for uncertain matches | 4b | Should |
| FR-MD-027 | Resolution criteria deep analysis via LLM | 4c | Could |
| FR-MD-028 | Decision logging with feedback data | 5a | Should |
| FR-MD-029 | Training data export pipeline | 5a | Should |
| FR-MD-030 | Automatic entity alias learning | 5b | Should |
| FR-MD-031 | Fingerprint weight optimization from feedback | 5b | Could |
| FR-MD-032 | Continuous evaluation and retraining pipeline | 5c | Could |
Operations & Deployment¶
CI/CD Pipeline¶
The discovery feature uses Cargo feature flags for conditional compilation. CI/CD pipelines must explicitly enable the feature for testing.
GitHub Actions Configuration (.github/workflows/ci.yml):
```yaml
discovery-tests:
  name: Discovery Feature Tests
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Install Rust
      uses: dtolnay/rust-toolchain@stable
    - name: Run discovery unit tests
      run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery
    - name: Run discovery integration tests
      run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored
      env:
        KALSHI_DEMO_KEY_ID: ${{ secrets.KALSHI_DEMO_KEY_ID }}
        KALSHI_DEMO_PRIVATE_KEY: ${{ secrets.KALSHI_DEMO_PRIVATE_KEY }}
```
Feature Flag Management:
- Development: `cargo run --features discovery -- --discover-markets`
- Production: Enable via ECS task definition environment variable `CARGO_FEATURES=discovery`
- Gradual rollout: Percentage-based feature flag (future via LaunchDarkly or similar)
Deployment Pipeline:
1. PR → CI tests (unit + integration with mocked APIs)
2. Merge to main → Build Docker image with discovery feature
3. Deploy to staging → E2E validation against Kalshi demo API
4. Deploy to production → Gradual rollout with monitoring
Container Image Versioning:
- Tag convention: v{version}-{git-sha} (e.g., v1.2.0-abc1234)
- Latest stable: latest tag points to production-ready image
- Image promotion: dev → staging → production via re-tagging
- Rollback: Deploy previous image by digest (immutable reference)
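The tag convention can be enforced with a simple pattern check. This sketch assumes semantic versions and 7-character lowercase short SHAs, as in the example above:

```python
import re

# Sketch: validate the v{version}-{git-sha} image tag convention
# (e.g. "v1.2.0-abc1234") before promotion.

TAG_RE = re.compile(r"^v\d+\.\d+\.\d+-[0-9a-f]{7}$")

def is_valid_image_tag(tag):
    return bool(TAG_RE.match(tag))
```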
Health Check Endpoint:
```text
GET /health/discovery

Response: 200 OK
{
  "status": "healthy|degraded|unhealthy",
  "last_scan_at": "2026-01-23T10:30:00Z",
  "last_scan_duration_ms": 45000,
  "pending_candidates": 12,
  "api_status": {
    "polymarket": "healthy",
    "kalshi": "healthy"
  }
}
```
- healthy: Scanner running, APIs accessible, no recent errors
- degraded: Scanner running but one API unavailable or rate-limited
- unhealthy: Scanner not running or database unavailable
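The three health states can be derived mechanically from scanner, database, and API state; a sketch with hypothetical inputs:

```python
# Sketch: map component state to the health levels defined above.
# api_status: mapping of platform -> "healthy" | "degraded" | "unavailable".

def discovery_health(scanner_running, db_available, api_status):
    if not scanner_running or not db_available:
        return "unhealthy"   # scanner down or database unavailable
    if any(s != "healthy" for s in api_status.values()):
        return "degraded"    # scanner fine, but an API is impaired
    return "healthy"
```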
Environment Configuration¶
| Environment | API Endpoints | Credentials | Purpose |
|---|---|---|---|
| Local | Production APIs | `.env` file | Development |
| Demo | Kalshi Demo API (`--kalshi-demo`) | Demo credentials | Integration testing |
| Staging | Production APIs | AWS Secrets Manager | Pre-production validation |
| Production | Production APIs | AWS Secrets Manager | Live operation |
Environment-Specific Settings:
```bash
# Local development
DISCOVERY_SCAN_INTERVAL_SECS=3600
DISCOVERY_BATCH_SIZE=100
DISCOVERY_DB_PATH=./discovery.db

# Staging
DISCOVERY_SCAN_INTERVAL_SECS=1800
DISCOVERY_BATCH_SIZE=500
DISCOVERY_DB_PATH=/data/discovery.db

# Production
DISCOVERY_SCAN_INTERVAL_SECS=900
DISCOVERY_BATCH_SIZE=1000
DISCOVERY_DB_PATH=/data/discovery.db
```
Security Considerations¶
API Credential Management:
- Polymarket Gamma API: Public API, no authentication required
- Kalshi `/v2/markets`: Uses existing RSA-PSS authentication (ADR-009)
- Key Rotation: Leverage existing key rotation infrastructure via AWS Secrets Manager
- Secret Storage: All credentials stored in AWS Secrets Manager, never in code or config files
Rate Limiting Compliance:
| Platform | Rate Limit | Implementation | Monitoring |
|---|---|---|---|
| Polymarket Gamma | 60 req/min | Token bucket limiter | CloudWatch discovery/api/rate_limit_errors |
| Kalshi | 100 req/min | Existing client limiter | CloudWatch discovery/api/rate_limit_errors |
Audit Trail Requirements (FR-MD-009):
- All candidate approvals/rejections logged with timestamp, reviewer ID, reason
- Decision context preserved (scores, features, warnings acknowledged)
- JSONL format for compliance export: discovery_audit.jsonl
- Retention: 7 years per financial services compliance
Data Privacy:
- Market data cached locally in SQLite (no PII)
- Embeddings computed locally (Phase 3a uses local models)
- No market data exported to external services without explicit configuration
Testing Strategy¶
Test Pyramid:
```text
        /\
       /  \        E2E (Demo environment, Kalshi demo API)
      /----\
     /      \      Integration (wiremock for HTTP mocking)
    /--------\
   /          \    Unit tests (48 existing, inline in modules)
  /------------\
```
Unit Tests (48 existing):
- candidate.rs: 5 tests (CandidateMatch types, status transitions)
- storage.rs: 7 tests (SQLite CRUD, audit logging)
- normalizer.rs: 3 tests (text normalization)
- matcher.rs: 7 tests (similarity scoring, pre-filtering)
- polymarket_gamma.rs: 4 tests (API client, pagination)
- kalshi_markets.rs: 4 tests (API client, mve_filter)
- scanner.rs: 5 tests (deduplication, batch processing)
- approval.rs: 5 tests (warning acknowledgment, safety gates)
- cli.rs: 8 tests (CLI argument parsing, integration)
Run with: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery
Integration Tests (wiremock):
- Mock Gamma API responses with realistic market data
- Mock Kalshi API responses including pagination cursors
- Test rate limiting behavior under load
- Test error handling and retry logic
Run with: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored
E2E Tests (Demo environment):
- Full scan against Kalshi demo API (--kalshi-demo flag)
- Validate candidate generation produces realistic results
- Manual approval workflow verification
- Audit log generation verification
Run with: cargo run --manifest-path arbiter-engine/Cargo.toml --features discovery -- --kalshi-demo --discover-markets
Golden Set Validation:
- Known market pairs maintained in tests/golden_pairs.json
- Automated weekly evaluation via CI job
- Alert on F1 score degradation below 0.85 threshold
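The weekly golden-set evaluation reduces to precision/recall over known pairs; a minimal sketch (the matcher's output format is assumed, not specified here):

```python
# Sketch: compare predicted match pairs against the curated golden set
# and compute F1, the metric gated at 0.85 by the weekly CI job.

def f1_against_golden(predicted, golden):
    """predicted/golden: sets of (kalshi_id, poly_id) match pairs."""
    tp = len(predicted & golden)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(golden) if golden else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

golden = {("K1", "P1"), ("K2", "P2"), ("K3", "P3")}
predicted = {("K1", "P1"), ("K2", "P2"), ("K4", "P9")}
score = f1_against_golden(predicted, golden)  # precision 2/3, recall 2/3
```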
Production Readiness¶
Monitoring Metrics (CloudWatch):
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `discovery/scan/duration_ms` | Timer | Scan cycle duration | > 5 min (P2) |
| `discovery/scan/errors` | Counter | Scan failures | 3 consecutive (P1) |
| `discovery/candidates/generated` | Gauge | Candidates per scan | - |
| `discovery/candidates/approved` | Counter | Approval count | - |
| `discovery/candidates/approval_rate` | Gauge | Approval rate | < 10% (P2) |
| `discovery/api/rate_limit_errors` | Counter | Rate limit violations | > 10/hour (P2) |
| `discovery/api/latency_ms` | Timer | API response times | p99 > 1s (P3) |
Alerting Rules:
| Priority | Condition | Action |
|---|---|---|
| P1 | Scan failure (3 consecutive) | Page on-call, investigate immediately |
| P2 | Approval rate < 10% | Slack alert, review matching thresholds |
| P2 | Rate limit errors > 10/hour | Slack alert, increase scan interval |
| P3 | Scan duration > 5 minutes | Slack alert, review batch size |
Logging (tracing):
- Structured JSON output via tracing-subscriber
- Log levels: ERROR (failures), WARN (rate limits), INFO (scan results), DEBUG (matching details)
- Correlation IDs for request tracing
Scaling Considerations:
| Phase | Configuration | Capacity |
|---|---|---|
| MVP | Single scanner, SQLite, hourly scan | ~2,000 markets/platform |
| Scale | Leader election, PostgreSQL, 15-min scan | ~10,000 markets/platform |
| Embedding | Dedicated embedding service, pgvector | ~50,000 markets |
Disaster Recovery:
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Scanner crash | 5 min | 0 | ECS auto-restart |
| Database corruption | 30 min | 24 hr | Restore from S3 backup |
| API outage | N/A | N/A | Graceful degradation, retry with backoff |
| Region failure | 4 hr | 1 hr | Cross-region restore from S3 |
Backup Strategy:
- SQLite: Daily S3 backup via ECS scheduled task
- PostgreSQL (future): Aurora automated backups (7-day retention)
- Audit logs: S3 archival (90-day hot, 7-year cold storage)
Graceful Degradation:
- API failure: Retry with exponential backoff, continue with available platform
- Database failure: Read-only mode, serve cached candidates
- Embedding service down (Phase 3): Fallback to fingerprint-only matching
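The retry-with-exponential-backoff behavior might be parameterized as follows (a sketch; the base delay, cap, and jitter strategy are assumptions, not specified by this ADR):

```python
import random

# Sketch: delay (seconds) before each retry attempt, doubling from a base
# and capped, with optional full jitter to avoid thundering-herd retries.

def backoff_delays(attempts, base=1.0, cap=60.0, jitter=False):
    delays = []
    for n in range(attempts):
        d = min(cap, base * (2 ** n))
        if jitter:
            d = random.uniform(0, d)  # full jitter: pick uniformly in [0, d]
        delays.append(d)
    return delays
```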
Runbook¶
1. Discovery scan not running:
```bash
# Check feature flag is enabled
kubectl logs -l app=trading-core | grep "discovery"

# Verify environment variable
kubectl exec -it trading-core-xxx -- env | grep DISCOVERY

# Check scanner actor health
kubectl exec -it trading-core-xxx -- curl localhost:8080/health/discovery
```
2. High rate limit errors:
```bash
# Check current rate limit metrics
aws cloudwatch get-metric-statistics --namespace arbiter --metric-name "discovery/api/rate_limit_errors"

# Increase scan interval temporarily
kubectl set env deployment/trading-core DISCOVERY_SCAN_INTERVAL_SECS=7200

# Review API quotas with platform
```
3. Low approval rate (<10%):
```bash
# Review recent candidates
cargo run --features discovery -- --list-candidates --status pending --limit 20

# Check matching threshold configuration
grep "threshold" config/discovery.toml

# Review fingerprint weight configuration
grep "weight" config/discovery.toml
```
4. Database corruption:
```bash
# Stop scanner
kubectl scale deployment/trading-core --replicas=0

# Restore from S3 backup
aws s3 cp s3://arbiter-backups/discovery/latest.db /data/discovery.db

# Verify integrity
sqlite3 /data/discovery.db "PRAGMA integrity_check"

# Restart scanner
kubectl scale deployment/trading-core --replicas=1
```
Incident Response Summary:
- P1 (Scanner failure): PagerDuty alert → On-call acknowledges → Investigate logs → Escalate if not resolved in 30 min
- P2 (Quality degradation): Slack alert → Next business day review → Adjust thresholds
- P3 (Performance warning): Ticket created → Sprint backlog
Consequences¶
Positive¶
- Higher accuracy: Fingerprint matching catches semantic equivalence
- Explainable: Field-by-field comparison is auditable
- Extensible: Easy to add new entity types, fields, or ML models
- Industry-aligned: Matches approach used by successful tools
Negative¶
- Increased complexity: More code to maintain
- Entity extraction errors: NER may miss or misclassify entities
- Requires tuning: Field weights need empirical optimization
Neutral¶
- Existing code preserved: Text similarity remains as fallback
- Human review unchanged: FR-MD-003 safety gate preserved
Safety Guarantees (Unchanged)¶
- Human-in-the-loop preserved: Candidates require explicit approval
- FR-MD-003 maintained: Uses existing `MappingManager.verify_mapping()`
- Semantic warnings block quick approval: Must acknowledge settlement differences
- Audit trail: All approvals/rejections logged with timestamp and reviewer
- No automated trading on unverified pairs: Matches existing safety architecture
References¶
Industry Tools¶
- pmxt - Unified API for prediction markets
- Dome API - Developer infrastructure for prediction markets
- Matchr - Cross-platform market aggregator (1,500+ curated matches)
- EventArb - Cross-platform arbitrage calculator
- Polymarket-Kalshi-Arbitrage-Bot - Open-source bot with entity extraction
Research¶
- Awesome Prediction Market Tools
- Semantic Non-Fungibility research
- Semantic Trading research
- Prediction Market Arbitrage Guide
API Documentation¶
- Kalshi API - mve_filter parameter
- Polymarket Gamma API