ADR 017: Automated Market Discovery and Matching¶
Status¶
| Component | Status |
|---|---|
| Phase 1: Text Similarity | ✅ Accepted & Implemented |
| Phase 2: Fingerprint Matching | ✅ Complete (24 tests) |
| Phase 3: Embedding Matching | ✅ Complete (15 tests) |
| Phase 4: LLM Verification | ✅ Complete (15 tests) |
| Phase 5: Feedback Learning | ✅ Complete (23 tests) |
| Operations & Deployment (docs) | ✅ Accepted (NFR-DISC requirements proposed) |
Note: Code samples in Phases 2-5 are illustrative designs, not approved implementations. Final implementation may differ based on validation results.
Revision Summary¶
Post-implementation testing revealed that pure text similarity (Jaccard + Levenshtein) is insufficient for cross-platform market matching. Real-world market pairs have low lexical similarity despite semantic equivalence:
| Market Pair | Jaccard Score |
|---|---|
| Kalshi: "Will Trump buy Greenland?" vs Polymarket: "Will the US acquire part of Greenland in 2026?" | 8.3% |
| Kalshi: "Will Washington win the 2026 Pro Football Championship?" vs Polymarket: "Super Bowl Champion 2026" | 9.1% |
These scores fall far below the 60% threshold, missing obvious matches. This ADR revision proposes a fingerprint-based matching approach aligned with industry best practices.
Implementation Notes (Phase 1 - Text Similarity - Completed)¶
Completed: 2026-01-22
Phase 1 (Text Similarity) implemented in 5 sub-phases with 48 tests (377 total tests passing):
| Sub-Phase | Focus | Tests | Status |
|---|---|---|---|
| 1a | Data Types & Storage | 12 | Complete |
| 1b | Text Matching Engine | 10 | Complete |
| 1c | Discovery API Clients | 8 | Complete |
| 1d | Scanner & Approval Workflow | 10 | Complete |
| 1e | CLI Integration | 8 | Complete |
Note: These are sub-phases of Phase 1 only. Phases 2-5 (Fingerprint, Embedding, LLM, Feedback) are Proposed - see roadmap below.
Council Verified: Each sub-phase passed LLM Council review with confidence >= 0.87
GitHub Issues: #41-#48
Implementation Notes (Phase 2a - Fingerprint & Entity Extraction)¶
Completed: 2026-01-23
Phase 2a (Fingerprint Foundation) implemented with 10 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/entity_extractor.rs | 5 | Rule-based NER for persons, crypto, prices, dates, events |
| src/discovery/fingerprint.rs | 5 | MarketFingerprint extraction with event type detection |
Key Components:
- EntityExtractor: Pattern-based extraction using regex for persons (Trump, Biden), crypto (BTC→Bitcoin), price targets ($100k), dates (Q2 2026), events (Super Bowl)
- MarketFingerprint: Structured representation with entity, event_type, metric spec, resolution window, outcome type
- EventType detection: PriceTarget, Election, Acquisition, Announcement, SportingEvent, EconomicIndicator
Dependencies Added:
- regex = "1.10" for pattern matching
- lazy_static = "1.4" for compiled regex caching
GitHub Issue: #49
Implementation Notes (Phase 2b - FingerprintMatcher & EntityIndex)¶
Completed: 2026-01-23
Phase 2b (Fingerprint Matching) implemented with 8 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/entity_index.rs | 3 | Inverted index for O(1) entity lookup, alias database |
| src/discovery/fingerprint_matcher.rs | 5 | Field-weighted scoring (entity 30%, date 25%, threshold 20%, outcome 15%, source 10%) |
Key Components:
- EntityIndex: Inverted index mapping entity names to market IDs
- AliasDatabase: Canonical name resolution (BTC→Bitcoin, Pro Football Championship→Super Bowl)
- FingerprintMatcher: Weighted scoring with configurable weights and thresholds
- ScoreBreakdown: Detailed per-field score transparency
Field Weights:

| Field | Weight | Comparison Method |
|---|---|---|
| Entity | 0.30 | Exact or alias match |
| Date | 0.25 | Year/quarter/month overlap |
| Threshold | 0.20 | Numeric comparison with 5% tolerance |
| Outcome | 0.15 | Binary vs multi-choice |
| Source | 0.10 | Event type match (placeholder) |
GitHub Issue: #50
Implementation Notes (Phase 2c - Integration & Validation)¶
Completed: 2026-01-23
Phase 2c (Integration & Validation) implemented with 6 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/integration_tests.rs | 6 | Golden pair validation, evaluation metrics |
| tests/golden_pairs.json | - | Test data for known market pairs |
Golden Test Pairs:

| ID | Expected Match | Min Score |
|---|---|---|
| greenland-acquisition | true | 0.60 |
| super-bowl-2026 | true | 0.55 |
| btc-100k | true | 0.75 |
| fed-rate-cut | true | 0.65 |
| different-threshold-negative | false | - |
| different-year-negative | false | - |
Evaluation Framework:
- GoldenTestData: Load and validate against golden pairs
- EvaluationResult: Precision, recall, F1 metrics
- Automated accuracy tracking for CI/CD
GitHub Issue: #51
Phase 2 Complete:
- FR-MD-011: Fingerprint extraction ✅
- FR-MD-012: Entity-based candidates ✅
- FR-MD-013: Weighted scoring ✅
- FR-MD-014: Rule-based NER ✅
Implementation Notes (Phase 3a - Embedding Infrastructure)¶
Completed: 2026-01-23
Phase 3a (Embedding Infrastructure) implemented with 9 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/embedding.rs | 6 | Embedding generation, cosine similarity, serialization |
| src/discovery/vector_store.rs | 3 | SQLite storage, nearest neighbor search |
Key Components:
- Embedding: Vector representation with cosine similarity
- HashEmbedder: Development/testing embedder using SHA-256
- VectorStore: SQLite-based embedding storage with nearest neighbor search
- Trait-based design for future ONNX/API backends
Embedding Interface:
pub trait Embedder: Send + Sync {
fn embed(&self, text: &str) -> Embedding;
fn embed_batch(&self, texts: &[&str]) -> Vec<Embedding>;
fn dimension(&self) -> usize;
}
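Cosine similarity is the comparison underlying `Embedding`; a standalone sketch (the actual type wraps its vector internally, so this free function is illustrative only):

```rust
/// Cosine similarity between two equal-length vectors:
/// dot(a, b) / (|a| * |b|). Returns 0.0 for zero-magnitude inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}
```

Identical vectors score 1.0, orthogonal vectors 0.0, which is why scores translate directly into match confidence.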
GitHub Issue: #52
Implementation Notes (Phase 3b - Hybrid Scoring)¶
Completed: 2026-01-23
Phase 3b (Hybrid Scoring) implemented with 6 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/hybrid_scorer.rs | 6 | Combined fingerprint + embedding + text scoring |
Key Components:
- HybridScorer: Combines all matching signals with configurable weights
- HybridWeights: α=0.50 fingerprint, β=0.40 embedding, γ=0.10 text
- HybridScoreBreakdown: Full transparency of each component
- calibrate_score(): Linear calibration (placeholder for isotonic regression)
Hybrid Formula:

score = 0.50 × fingerprint + 0.40 × embedding + 0.10 × text
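With the weights above (α=0.50, β=0.40, γ=0.10), the combination is a plain weighted sum; a minimal sketch assuming component scores are already normalized to [0, 1] (the real `HybridScorer` also emits a `HybridScoreBreakdown`):

```rust
/// Hybrid score: alpha * fingerprint + beta * embedding + gamma * text.
/// Weights sum to 1.0 so the result stays in [0, 1].
fn hybrid_score(fingerprint: f64, embedding: f64, text: f64) -> f64 {
    const ALPHA: f64 = 0.50; // fingerprint weight
    const BETA: f64 = 0.40;  // embedding weight
    const GAMMA: f64 = 0.10; // text similarity weight
    ALPHA * fingerprint + BETA * embedding + GAMMA * text
}
```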
GitHub Issue: #53
Phase 3 Complete:
- FR-MD-018: Embedding infrastructure ✅
- FR-MD-019: Vector storage ✅
- FR-MD-020: Batch embeddings ✅
- FR-MD-021: Hybrid scoring ✅
- FR-MD-022: Confidence calibration ✅
Implementation Notes (Phase 4 - LLM Verification)¶
Completed: 2026-01-23
Phase 4 (LLM Verification) implemented with 15 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/prompts.rs | 2 | Structured verification prompts, escape_for_prompt() |
| src/discovery/llm_verifier.rs | 6 | LLM response parsing, cost tracking with budget enforcement |
| src/discovery/escalation.rs | 7 | Tiered escalation (None → Haiku → Sonnet → Human), configurable thresholds |
Key Components:
- FilledPrompt: Structured prompt with metadata (estimated tokens)
- LlmVerifier: Builds prompts, parses JSON responses, tracks costs
- LlmCostTracker: Per-request cost tracking with $50/day default budget
- EscalationEngine: Tiered escalation based on score uncertainty, warnings, volume
Escalation Levels:

| Level | Trigger | Cost |
|---|---|---|
| None | Score ≥ 0.85, no warnings | $0 |
| Haiku | Score 0.60-0.85, minor warnings | ~$0.001/verification |
| Sonnet | Conflicting signals, major warnings | ~$0.01/verification |
| Human | LLM uncertain, resolution differences | Manual review |
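The tier selection can be sketched as a small decision function (the real `EscalationEngine` also weighs market volume; the names and signature here are illustrative, not the actual API):

```rust
#[derive(Debug, PartialEq)]
enum EscalationLevel {
    None,   // high confidence, no LLM needed
    Haiku,  // cheap LLM screening
    Sonnet, // stronger LLM for conflicting signals
    Human,  // manual review
}

/// Map a match score and warning severity to an escalation tier,
/// following the trigger conditions listed for each level.
fn escalate(
    score: f64,
    minor_warnings: bool,
    major_warnings: bool,
    resolution_differences: bool,
) -> EscalationLevel {
    if resolution_differences {
        EscalationLevel::Human // resolution differences always need a person
    } else if major_warnings {
        EscalationLevel::Sonnet // conflicting signals warrant a stronger model
    } else if score >= 0.85 && !minor_warnings {
        EscalationLevel::None // high confidence, no verification cost
    } else {
        EscalationLevel::Haiku // uncertain zone: cheap screening first
    }
}
```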
GitHub Issues: #54, #55
Phase 4 Complete:
- FR-MD-024: LLM verification prompt engineering ✅
- FR-MD-025: Cost-optimized LLM invocation ✅
- FR-MD-026: Automated escalation rules ✅
Implementation Notes (Phase 5 - Feedback Learning)¶
Completed: 2026-01-23
Phase 5 (Feedback Learning) implemented with 23 tests:
| File | Tests | Purpose |
|---|---|---|
| src/discovery/decision_log.rs | 7 | Decision logging, JSONL export, stratified sampling |
| src/discovery/alias_learner.rs | 5 | Alias learning with confidence, in-memory cache |
| src/discovery/weight_optimizer.rs | 6 | Gradient-free F1 optimization with bounded weights |
| src/discovery/evaluation_pipeline.rs | 5 | Orchestrates all Phase 4-5 components |
Key Components:
- DecisionLogger: SQLite-backed decision logging with full context preservation
- MatchDecision: Captures scores, escalation level, corrections, category
- TrainingExample: Exportable format with label (0/1) for model training
- AliasLearner: Learns aliases from corrections, confidence grows with confirmations
- WeightOptimizer: Gradient-free search for optimal fingerprint/embedding/text weights
- EvaluationPipeline: Single entry point that orchestrates scoring, escalation, logging, alias learning
Training Data Export:
// Export approved/rejected pairs for model training
let logger = DecisionLogger::new_in_memory()?;
let training_data = logger.export_to_jsonl()?;
// Returns JSONL with label, scores, category
Alias Learning Flow:
// Learn alias from human correction
learner.learn_from_correction("BTC", "Bitcoin", "human_approval")?;
// Confidence increases with each confirmation: 1.0 - (1.0 / (count + 1.0))
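The confidence formula in the comment above can be checked directly; a minimal sketch:

```rust
/// Alias confidence after `confirmations` human confirmations:
/// 1.0 - 1.0 / (confirmations + 1.0). Starts at 0.5 and approaches
/// (but never reaches) 1.0, so a single correction is never fully trusted.
fn alias_confidence(confirmations: u32) -> f64 {
    1.0 - 1.0 / (confirmations as f64 + 1.0)
}
```

One confirmation yields 0.5, three yield 0.75, and confidence grows monotonically from there.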
Weight Optimization:
// Optimize weights from historical decisions
let optimizer = WeightOptimizer::new();
let result = optimizer.optimize(&examples);
// Returns OptimizationResult with improved F1 score
GitHub Issues: #56, #57
Phase 5 Complete:
- FR-MD-028: Decision logging with context ✅
- FR-MD-029: Training data export pipeline ✅
- FR-MD-030: Automatic entity alias learning ✅
- FR-MD-031: Fingerprint weight optimization ✅
Post-Implementation Learnings¶
Problem 1: Text Similarity Misses Semantic Equivalence¶
The current algorithm uses score = 0.6 × Jaccard(tokens) + 0.4 × Levenshtein_normalized. This approach fails because:
- Synonym blindness: "Super Bowl" ≠ "Pro Football Championship" lexically
- Paraphrase blindness: "Trump buy" ≠ "US acquire" semantically equivalent but zero overlap
- Dilution by stop words: "Greenland" signal diluted by "Will", "the", "in", etc.
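The failure is easy to reproduce with a minimal token-set Jaccard. This sketch uses whitespace tokenization with punctuation stripped; exact scores depend on normalization and whether rules text is included, so the numbers differ slightly from the table above, but the conclusion (far below any 60% threshold) does not:

```rust
use std::collections::HashSet;

/// Token-level Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercase,
/// punctuation-stripped word sets.
fn jaccard(a: &str, b: &str) -> f64 {
    let tokens = |s: &str| -> HashSet<String> {
        s.split_whitespace()
            .map(|w| {
                w.chars()
                    .filter(|c| c.is_alphanumeric())
                    .collect::<String>()
                    .to_lowercase()
            })
            .filter(|w| !w.is_empty())
            .collect()
    };
    let (sa, sb) = (tokens(a), tokens(b));
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}
```

For the Greenland pair, only "will" and "greenland" overlap out of eleven distinct tokens: roughly 18% similarity for semantically equivalent markets.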
Problem 2: API Sort Order Returns Different Market Types¶
- Kalshi default sort: Returns high-volume sports/weather markets first
- Polymarket default sort: Returns high-volume politics/crypto markets first
- Even scanning 2000 markets per platform yields minimal overlap in market categories
Problem 3: Multivariate Event Filtering Required¶
Kalshi's API returns sports parlays by default. Required adding mve_filter=exclude to get prediction markets.
Industry Validation¶
Research of existing cross-platform arbitrage solutions confirms these findings:
| Tool | Matching Approach |
|---|---|
| pmxt | Manual slug-based configuration |
| Dome API | Unified API with manual market mapping |
| Matchr | Curated match database (1,500+ markets) |
| EventArb | Manual market selection with arb calculator |
| Polymarket-Kalshi-Arbitrage-Bot | "Intelligent matching" via entity extraction + text similarity |
Key insight: No tool relies solely on text similarity. All use at least one of:
- Manual curation/configuration
- Entity extraction + structured field matching ("fingerprinting")
- Curated databases of known matches
Prior Council Feedback¶
This ADR directly addresses concerns raised in the LLM Council Design Reviews:
Design Review 1 (FR-MD-002 Fuzzy Matching): "DANGEROUS. Downgrade to 'Candidate Proposal' only. Require human sign-off."
Design Review 1 (FR-MD-003): "No arb should execute on a mapped pair without a signed human verification bit."
Design Review 2 (Approved): "Safety Gates (FR-MD-003): Requiring human confirmation for market mapping prevents catastrophic 'bad data' trades (e.g., mapping 'Trump' to 'Trump Jr')."
Context¶
The arbiter-bot currently requires manual identification and configuration of market pairs between Polymarket and Kalshi. Market tickers are hardcoded (e.g., KXBTC-25JAN31-B95000), creating operational overhead:
- Manual discovery burden: Operators must research markets on both platforms independently
- Missed opportunities: New markets may go undetected
- No persistent mapping store: Mappings exist only in memory
- Scaling limitation: Cannot efficiently monitor thousands of markets
Industry Context¶
Academic research from IMDEA Networks Institute documented over $40 million in arbitrage profits from Polymarket alone (April 2024 - April 2025). Existing arbitrage bots (e.g., polymarket-arbitrage) watch 10,000+ markets using automated matching. Cross-platform studies show ~6% of 102,275 events have semantic relations across venues.
Critical Constraint¶
Settlement semantics differ across platforms. The 2024 government shutdown case illustrates this:
- Polymarket: "OPM issues shutdown announcement"
- Kalshi: "Actual shutdown exceeding 24 hours"
Same event, different resolution criteria, potentially different outcomes. Human verification remains mandatory per existing requirement FR-MD-003.
Decision¶
Implement an automated market discovery and matching system using a three-stage fingerprint-based pipeline:
- Stage 1: Candidate Generation - Fast narrowing by keywords, dates, categories
- Stage 2: Fingerprint Matching - Structured field comparison with weighted scoring
- Stage 3: Human Verification - Resolution criteria review with semantic warnings
Revised Options Analysis¶
Option A: Pure Text Similarity (Current - Insufficient)¶
| Criterion | Assessment |
|---|---|
| Accuracy | Low - misses semantic matches |
| Cost | No per-match API costs (uses cached market data) |
| Latency | Sub-millisecond per comparison |
| Explainability | High (score breakdown visible) |
| Verdict | Insufficient for production use |
Evidence: Real market pairs score 8-9% similarity, far below any reasonable threshold.
Option B: Fingerprint-Based Matching (Proposed)¶
| Criterion | Assessment |
|---|---|
| Accuracy | High - matches on structured fields |
| Cost | No per-match API costs (local processing on cached data) |
| Latency | ~10ms per comparison (entity extraction) |
| Explainability | High (field-by-field comparison) |
| Complexity | Medium (requires entity extraction) |
Algorithm:
1. Extract "market fingerprint" with structured fields
2. Match on canonical fields (entity, date, threshold, resolution source)
3. Score similarity across fields with appropriate weights
4. Generate candidates for human review
Option C: Embedding-Based Semantic Matching (Future Enhancement)¶
| Criterion | Assessment |
|---|---|
| Accuracy | Highest - captures semantic meaning |
| Cost | ~$0.0001 per embedding (or local model) |
| Latency | +50-200ms per embedding |
| Complexity | High (embedding service, vector DB) |
| Verdict | Consider for Phase 3 enhancement (after fingerprint foundation) |
Option D: Hybrid Fingerprint + LLM Verification (Future Enhancement)¶
| Criterion | Assessment |
|---|---|
| Accuracy | Highest - LLM catches edge cases |
| Cost | ~$0.01-0.05 per verification |
| Latency | +200-500ms per LLM call |
| Verdict | Consider for high-value market verification |
Option E: External Service Integration (Alternative)¶
| Criterion | Assessment |
|---|---|
| Accuracy | High (curated by service provider) |
| Cost | API subscription fees |
| Dependency | External service availability |
| Candidates | Matchr (curated DB), Dome (unified API) |
| Verdict | Consider as fallback or validation source |
Rationale for Option B (Fingerprint-Based)¶
- Proven approach: Industry tools (Polymarket-Kalshi-Arbitrage-Bot) use entity extraction
- Addresses root cause: Matches on semantic fields, not surface text
- No external dependencies: Local processing, no API costs
- Extensible: Can add embeddings or LLM verification later
- Explainable: Field-by-field comparison is auditable
- Council Compliant: Still generates candidates for human review (FR-MD-003)
Revised Architecture¶
┌─────────────────┐ ┌─────────────────┐
│ Polymarket API │ │ Kalshi API │
│ (Gamma endpoint)│ │ (/v2/markets) │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
┌─────────────────────────────────────────┐
│ DiscoveryScannerActor │
│ - Market enumeration with pagination │
│ - mve_filter=exclude for Kalshi │
│ - Category/date pre-filtering │
└────────────────────┬────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ MarketFingerprintExtractor │
│ - Entity extraction (NER) │
│ - Date/threshold parsing │
│ - Resolution source identification │
│ - Outcome structure normalization │
└────────────────────┬────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ FingerprintMatcher │
│ Stage 1: Candidate generation (fast) │
│ Stage 2: Field-by-field scoring │
│ Stage 3: Semantic warning detection │
└────────────────────┬────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ CandidateMatch (SQLite) │
│ - Pending / Approved / Rejected │
│ - Fingerprint diff for review │
│ - Semantic warnings │
└────────────────────┬────────────────────┘
│
▼ (Human approval via CLI)
┌─────────────────────────────────────────┐
│ MappingManager (existing) │
│ - propose_mapping() → verify_mapping() │
│ - FR-MD-003 safety gate preserved │
└─────────────────────────────────────────┘
Market Fingerprint Schema¶
/// Canonical market fingerprint for cross-platform matching
pub struct MarketFingerprint {
/// Primary entity (e.g., "Trump", "Bitcoin", "Fed")
pub entity: String,
/// Secondary entities (e.g., "Greenland", "Denmark")
pub secondary_entities: Vec<String>,
/// Event type (e.g., "acquisition", "election", "price_target")
pub event_type: EventType,
/// Metric and direction (e.g., "price >= $100,000")
pub metric: Option<MetricSpec>,
/// Geographic scope (e.g., "US", "global")
pub scope: Option<String>,
/// Resolution date/time window
pub resolution_window: ResolutionWindow,
/// Outcome structure
pub outcome_type: OutcomeType, // Binary | MultiOutcome | Range
/// Resolution source (e.g., "BLS", "AP", "FOMC")
pub resolution_source: Option<String>,
/// Original title (for reference)
pub original_title: String,
}
pub struct MetricSpec {
pub name: String, // "price", "rate", "count"
pub direction: Direction, // Above | Below | Between | Exactly
pub threshold: Decimal, // Use rust_decimal for financial precision
pub unit: Option<String>, // "$", "%", "basis points"
}
// Note: Use rust_decimal::Decimal for threshold to avoid floating-point
// precision issues in financial comparisons (e.g., 0.1 + 0.2 != 0.3 in f64)
pub struct ResolutionWindow {
pub date: Option<NaiveDate>,
pub time: Option<NaiveTime>,
pub timezone: Option<Tz>, // Use chrono_tz::Tz for type-safe timezones
pub tolerance_days: i32, // For fuzzy date matching
}
// Note: All times should be normalized to UTC for comparison.
// Local timezone is preserved for display purposes only.
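The floating-point pitfall noted above can be made concrete: comparing dollar thresholds in integer cents (or via `rust_decimal::Decimal`, as the note recommends) sidesteps binary rounding. A minimal sketch using only integer cents:

```rust
/// Compare two dollar thresholds by converting to integer cents,
/// sidestepping binary floating-point rounding (0.1 + 0.2 != 0.3 in f64).
fn thresholds_equal(a_dollars: f64, b_dollars: f64) -> bool {
    let to_cents = |d: f64| (d * 100.0).round() as i64;
    to_cents(a_dollars) == to_cents(b_dollars)
}
```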
Revised Matching Algorithm¶
Stage 1: Candidate Generation (Fast Narrowing)¶
FOR each Kalshi market K:
1. Extract keywords from K.title + K.rules
2. Query Polymarket index by:
- Keyword overlap (BM25 or inverted index)
- Resolution date proximity (±14 days)
- Category match (if available)
3. Return top N candidates (N=50)
Complexity: O(n log n) with inverted index
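A minimal sketch of the Stage 1 keyword lookup, with a plain overlap count standing in for BM25 ranking (type and method names here are illustrative, not the actual implementation):

```rust
use std::collections::HashMap;

/// Inverted index mapping each normalized token to the market IDs
/// whose titles contain it (Stage 1 candidate generation).
struct KeywordIndex {
    postings: HashMap<String, Vec<usize>>,
}

fn normalize(token: &str) -> String {
    token
        .chars()
        .filter(|c| c.is_alphanumeric())
        .collect::<String>()
        .to_lowercase()
}

impl KeywordIndex {
    fn build(titles: &[&str]) -> Self {
        let mut postings: HashMap<String, Vec<usize>> = HashMap::new();
        for (id, title) in titles.iter().enumerate() {
            for token in title.split_whitespace().map(normalize) {
                if !token.is_empty() {
                    let ids = postings.entry(token).or_default();
                    if ids.last() != Some(&id) {
                        ids.push(id); // avoid duplicate postings per title
                    }
                }
            }
        }
        Self { postings }
    }

    /// Top-n candidate IDs ranked by count of overlapping query tokens.
    fn candidates(&self, query: &str, n: usize) -> Vec<usize> {
        let mut hits: HashMap<usize, usize> = HashMap::new();
        for token in query.split_whitespace().map(normalize) {
            if let Some(ids) = self.postings.get(&token) {
                for &id in ids {
                    *hits.entry(id).or_insert(0) += 1;
                }
            }
        }
        let mut ranked: Vec<(usize, usize)> = hits.into_iter().collect();
        // Sort by overlap count descending, then by ID for determinism
        ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
        ranked.into_iter().take(n).map(|(id, _)| id).collect()
    }
}
```

Lookups are O(1) per token, so candidate generation stays cheap even across thousands of markets.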
Stage 2: Fingerprint Matching (Weighted Scoring)¶
FOR each candidate pair (K, P):
fingerprint_K = extract_fingerprint(K)
fingerprint_P = extract_fingerprint(P)
score = weighted_sum([
(entity_match(K, P), weight=0.30), # Primary entity
(date_match(K, P), weight=0.25), # Resolution date
(threshold_match(K, P), weight=0.20), # Numeric thresholds
(outcome_match(K, P), weight=0.15), # Binary vs multi
(resolution_source_match(K, P), weight=0.10), # Data source
])
IF score >= 0.70:
create_candidate(K, P, score)
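The weighted sum in the pseudocode above can be sketched in Rust, assuming per-field matchers already produce scores in [0, 1] (the `FieldScores` struct is illustrative, not the actual type):

```rust
/// Per-field similarity scores in [0, 1] for a candidate pair.
struct FieldScores {
    entity: f64,
    date: f64,
    threshold: f64,
    outcome: f64,
    source: f64,
}

/// Stage 2 weighted sum using the field weights from the pseudocode.
fn fingerprint_score(s: &FieldScores) -> f64 {
    0.30 * s.entity + 0.25 * s.date + 0.20 * s.threshold + 0.15 * s.outcome + 0.10 * s.source
}

/// Minimum score for creating a candidate match.
const CANDIDATE_THRESHOLD: f64 = 0.70;
```

Note that entity, date, and threshold together carry 0.75 weight, so a pair matching on those three fields clears the 0.70 bar even with no outcome or source agreement.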
Weight Rationale & Validation Plan:
| Field | Weight | Rationale | Validation Method |
|---|---|---|---|
| Entity | 0.30 | Primary disambiguator (Trump vs Biden) | A/B test on historical pairs |
| Date | 0.25 | Critical for time-bound events | Precision/recall on date-similar pairs |
| Threshold | 0.20 | Important for numeric markets (price targets) | Manual review of 50 threshold markets |
| Outcome | 0.15 | Binary vs multi-outcome affects pairing | Confusion matrix analysis |
| Source | 0.10 | Settlement source differences cause disputes | Historical dispute rate analysis |
Initial Values: These weights are starting estimates based on domain analysis. They will be validated against a golden set of 100+ manually-verified market pairs before production deployment. The Phase 5 feedback loop will continuously optimize weights based on human approval decisions.
Threshold Validation:
| Threshold | Value | Purpose | Validation Criteria |
|---|---|---|---|
| Candidate creation | ≥ 0.70 | Balance precision/recall | Target: Precision ≥ 0.80, Recall ≥ 0.70 |
| Auto-approve (future) | ≥ 0.95 | High-confidence automation | Zero false positives in test set |
| Uncertain zone | 0.70-0.85 | Trigger LLM verification (Phase 4) | Review rate < 20% of candidates |
Empirical Validation Required: Before enabling auto-approval or LLM escalation, a minimum of 100 human decisions must be collected to calibrate thresholds and measure actual precision/recall.
Stage 3: Semantic Warning Detection¶
warnings = []
IF K.resolution_source != P.resolution_source:
warnings.append("Different resolution sources")
IF abs(K.resolution_date - P.resolution_date) > 1 day:
warnings.append("Resolution dates differ by {days}")
IF K.rules contains "announcement" AND P.rules contains "actual":
warnings.append("Announcement vs actual event timing")
IF K.outcome_count != P.outcome_count:
warnings.append("Different outcome structures")
# Require explicit acknowledgment before approval
Entity Extraction Approaches¶
Option 1: Rule-Based NER (Recommended for MVP)¶
/// Extract entities using pattern matching
fn extract_entities(title: &str, rules: &str) -> Vec<Entity> {
let mut entities = Vec::new();
// Known entity patterns
let patterns = [
(r"(?i)\b(Trump|Biden|Harris|Obama)\b", EntityType::Person),
(r"(?i)\b(Bitcoin|BTC|Ethereum|ETH)\b", EntityType::Crypto),
(r"(?i)\b(Fed|FOMC|CPI|GDP|NFP)\b", EntityType::Economic),
(r"(?i)\b(Super Bowl|World Series|NBA Finals)\b", EntityType::Sports),
(r"(?i)\b(Greenland|Ukraine|Taiwan)\b", EntityType::Location),
(r"\$[\d,]+(?:k|K|M|B)?", EntityType::PriceTarget),
(r"(?i)\b(20\d{2})\b", EntityType::Year),
];
for (pattern, entity_type) in patterns {
// Extract and deduplicate matches
}
entities
}
Option 2: ML-Based NER (Future Enhancement)¶
Use a lightweight NER model (e.g., spaCy, Hugging Face transformers) for more robust entity extraction:
# Example with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
def extract_entities_ml(text):
doc = nlp(text)
return [(ent.text, ent.label_) for ent in doc.ents]
Option 3: LLM-Based Extraction (Future Enhancement)¶
Use an LLM to extract structured fingerprints:
Prompt: Extract a structured fingerprint from this market:
Title: "Will Trump buy Greenland?"
Rules: "Resolves Yes if US purchases any part of Greenland from Denmark before Jan 20, 2029"
Expected output:
{
"entity": "Trump",
"secondary_entities": ["Greenland", "Denmark", "US"],
"event_type": "acquisition",
"resolution_date": "2029-01-20",
"resolution_source": null
}
Module Structure (Revised)¶
arbiter-engine/src/
├── discovery/
│ ├── mod.rs
│ ├── scanner.rs # DiscoveryScannerActor
│ ├── normalizer.rs # Text normalization
│ ├── matcher.rs # Text similarity (existing)
│ ├── fingerprint.rs # NEW: Fingerprint extraction
│ ├── fingerprint_matcher.rs # NEW: Field-based matching
│ ├── entity_extractor.rs # NEW: Entity extraction (NER)
│ ├── candidate.rs # CandidateMatch types
│ ├── storage.rs # SQLite persistence
│ └── approval.rs # Human approval workflow
└── market/
└── discovery_client/
├── mod.rs
├── polymarket_gamma.rs
└── kalshi_markets.rs # With mve_filter=exclude
CLI Interface (Enhanced)¶
# Discover markets with fingerprint matching
cargo run --features discovery -- --discover-markets --verbose
# Show fingerprint for a specific market (debugging)
cargo run --features discovery -- --show-fingerprint --ticker "KXGREENLAND-29"
# Review candidates with fingerprint diff
cargo run --features discovery -- --review-candidates
# Import from external matching service (future)
cargo run --features discovery -- --import-matches --source matchr
External Service Integration (Future)¶
For validation or as a fallback, integrate with existing matching services:
/// External matching service trait
#[async_trait]
pub trait ExternalMatchingService {
/// Query for known matches of a market
async fn find_matches(&self, market_id: &str) -> Result<Vec<ExternalMatch>, Error>;
/// Validate a proposed match
async fn validate_match(&self, pair: &MarketPair) -> Result<ValidationResult, Error>;
}
/// Implementations
pub struct MatchrClient { /* curated database */ }
pub struct DomeClient { /* unified API */ }
pub struct PmxtClient { /* open-source library */ }
Requirements Traceability¶
Existing Requirements Implemented:
| ID | Requirement | Status | Implementation |
|---|---|---|---|
| FR-MD-001 | Persistent cache of mappings | Complete | SQLite storage |
| FR-MD-002 | Fuzzy matching as suggestion engine only | Revision needed | Fingerprint matching |
| FR-MD-003 | Human Confirmation required | Complete | CLI approval workflow |
| FR-MD-004 | Auto-discover markets by expiration | Complete | Scanner with date filter |
| FR-MD-005 | Track resolution status and dates | Complete | DiscoveredMarket fields |
| FR-MD-006 | Enumerate Polymarket Gamma API | Complete | polymarket_gamma.rs |
| FR-MD-007 | Enumerate Kalshi /v2/markets | Complete | kalshi_markets.rs |
| FR-MD-008 | Semantic warning detection | Complete | Warning flags |
| FR-MD-009 | Audit logging | Complete | JSONL audit log |
New/Revised Requirements:
| ID | Requirement | Phase | Priority |
|---|---|---|---|
| FR-MD-011 | Fingerprint extraction from market titles/rules | 2a | Must |
| FR-MD-012 | Entity-based candidate generation | 2a | Must |
| FR-MD-013 | Field-weighted similarity scoring | 2b | Must |
| FR-MD-014 | Rule-based named entity recognition | 2a | Must |
| FR-MD-015 | ML-based NER integration | 2c | Could |
| FR-MD-016 | Embedding-based semantic matching | 3b | Should |
| FR-MD-017 | External service integration (Matchr/Dome) | 3c | Could |
| FR-MD-018 | Embedding model evaluation and selection | 3a | Should |
| FR-MD-019 | Vector storage integration | 3a | Should |
| FR-MD-020 | Batch embedding generation pipeline | 3a | Should |
| FR-MD-021 | Hybrid scoring algorithm | 3b | Should |
| FR-MD-022 | Confidence calibration | 3b | Should |
| FR-MD-023 | Contrastive fine-tuning on approved pairs | 3c | Could |
| FR-MD-024 | LLM verification prompt engineering | 4a | Should |
| FR-MD-025 | Cost-optimized LLM invocation | 4a | Should |
| FR-MD-026 | Automated escalation to LLM | 4b | Should |
| FR-MD-027 | Resolution criteria deep analysis | 4c | Could |
| FR-MD-028 | Decision logging with feedback | 5a | Should |
| FR-MD-029 | Training data export pipeline | 5a | Should |
| FR-MD-030 | Automatic entity alias learning | 5b | Should |
| FR-MD-031 | Fingerprint weight optimization | 5b | Could |
| FR-MD-032 | Continuous evaluation and retraining | 5c | Could |
Migration Path¶
Phase 2a: Fingerprint Foundation¶
- Implement `MarketFingerprint` struct
- Implement rule-based entity extractor
- Add fingerprint storage to SQLite schema
- Unit tests for extraction accuracy
Phase 2b: Fingerprint Matcher¶
- Implement `FingerprintMatcher` with weighted scoring
- Replace text similarity as primary matching method
- Keep text similarity as fallback/tiebreaker
- Integration tests with real market data
Phase 2c: Validation & Tuning¶
- Test against known market pairs (Greenland, Super Bowl, etc.)
- Tune field weights based on precision/recall
- Add ML-based NER if rule-based insufficient
- Council review of revised implementation
Phase 3: Embedding-Based Semantic Matching (Option C)¶
Goal: Add vector embedding similarity as a complementary matching signal that captures semantic meaning beyond structured fields.
Phase 3a: Embedding Infrastructure¶
Requirements: FR-MD-018, FR-MD-019, FR-MD-020
- Model Selection
  - Evaluate embedding models for prediction market domain:
    - `all-MiniLM-L6-v2` (384 dims, fast, local)
    - `text-embedding-3-small` (1536 dims, OpenAI API)
    - `voyage-finance-2` (1024 dims, finance-tuned)
  - Benchmark on golden set of known market pairs
  - Selection criteria: F1 score ≥ 0.85 on domain, latency ≤ 100ms
- Vector Storage
  - Option A: SQLite with `sqlite-vec` extension (simple, local)
  - Option B: PostgreSQL with `pgvector` (scalable, production)
  - Option C: FAISS index with SQLite metadata (fast ANN search)
  - Schema addition:
- Embedding Pipeline
  - Batch embedding generation during market discovery scan
  - Incremental updates for new markets
  - Cache embeddings to avoid recomputation
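The caching step can be sketched as a memoizing wrapper. The hash-derived vector below stands in for a real embedder, and all names are illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Memoizing wrapper: embeds each distinct text once, then serves the
/// cached vector. The "embedding" is a deterministic hash-derived stand-in.
struct CachedEmbedder {
    cache: HashMap<String, Vec<f32>>,
    misses: usize, // how many times the underlying embedder ran
}

impl CachedEmbedder {
    fn new() -> Self {
        Self { cache: HashMap::new(), misses: 0 }
    }

    fn embed(&mut self, text: &str) -> Vec<f32> {
        if let Some(v) = self.cache.get(text) {
            return v.clone(); // cache hit: no recomputation
        }
        self.misses += 1;
        let mut h = DefaultHasher::new();
        text.hash(&mut h);
        let seed = h.finish();
        // Deterministic pseudo-embedding derived from the hash.
        let v: Vec<f32> = (0..8)
            .map(|i| ((seed >> (i * 8)) & 0xff) as f32 / 255.0)
            .collect();
        self.cache.insert(text.to_string(), v.clone());
        v
    }
}
```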
Phase 3b: Hybrid Matching Integration¶
Requirements: FR-MD-021, FR-MD-022
- Semantic Candidate Generation

impl SemanticMatcher {
    /// Find semantically similar markets using embedding search
    pub async fn find_similar(&self, market: &DiscoveredMarket, k: usize) -> Result<Vec<SimilarMarket>, Error> {
        let embedding = self.embed(&market.title, &market.description).await?;
        self.vector_store.nearest_neighbors(&embedding, k).await
    }
}

- Hybrid Scoring Algorithm
Validation Plan: Hybrid weights will be determined empirically via grid search over the golden set. Initial values are estimates based on: (1) fingerprint provides structured matching, (2) embeddings capture semantics, (3) text similarity as fallback for simple cases. Target: F1 ≥ 0.90 on held-out test set.
- Confidence Calibration
- Track score distributions for true/false matches
- Calibrate thresholds to achieve target precision/recall
- Separate thresholds for different market categories
Phase 3c: Domain Adaptation & Fine-Tuning¶
Requirements: FR-MD-023
- Training Data Collection
  - Export approved match pairs as positive examples
  - Export rejected pairs as negative examples
  - Export human-modified entity mappings
  - Target: 500+ labeled pairs before fine-tuning
- Contrastive Fine-Tuning

Note: This Python code is for the ML training pipeline only (offline batch process). The trained model is exported to ONNX format for use in the Rust runtime via the `ort` (ONNX Runtime) crate.
# Fine-tune embedding model on prediction market pairs
from sentence_transformers import SentenceTransformer, losses
# Load pre-trained model as starting point
model = SentenceTransformer('all-MiniLM-L6-v2')
# ContrastiveLoss pulls matching pairs together in embedding space
# while pushing non-matching pairs apart. This teaches the model
# that "Super Bowl" and "Pro Football Championship" should be close.
train_loss = losses.ContrastiveLoss(model)
# Train on (anchor, positive, negative) triplets from human decisions
# - Anchor: Kalshi market title
# - Positive: Approved Polymarket match
# - Negative: Rejected Polymarket candidate (hard negative)
# (train_dataloader yields these labeled examples; construction omitted here)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,         # Few epochs to avoid overfitting on small dataset
    warmup_steps=100  # Gradual learning rate increase
)
- A/B Testing
- Deploy fine-tuned model alongside base model
- Compare precision/recall on new market pairs
- Gradual rollout based on performance
Phase 4: LLM-Based Verification (Option D)¶
Goal: Use LLM reasoning for high-confidence verification of uncertain matches and deep resolution criteria analysis.
Phase 4a: LLM Verification Pipeline¶
Requirements: FR-MD-024, FR-MD-025
- Structured Verification Prompt

```text
You are a prediction market analyst. Determine if these two markets are equivalent.

MARKET A (Kalshi):
- Title: "{kalshi_title}"
- Resolution: "{kalshi_rules}"
- Expiration: {kalshi_date}

MARKET B (Polymarket):
- Title: "{poly_title}"
- Resolution: "{poly_rules}"
- Expiration: {poly_date}

Analyze:
1. Are they about the SAME underlying event? (not just similar topics)
2. Would "Yes" on Market A correspond to "Yes" on Market B?
3. Are the resolution criteria compatible? List any differences.
4. Could different resolution timing cause different outcomes?

Output JSON:
{
  "equivalent": true|false,
  "confidence": 0.0-1.0,
  "reasoning": "...",
  "warnings": ["..."],
  "resolution_differences": ["..."]
}
```

- Cost-Optimized Invocation
- Only invoke LLM for:
- Fingerprint score between 0.70-0.85 (uncertain zone)
- High-value markets (volume > $10k)
- Markets with semantic warnings
- Use Claude Haiku for initial screening ($0.001/verification)
- Escalate to Claude Sonnet for complex cases ($0.01/verification)
- Budget cap: $50/day default, configurable
- Response Parsing & Validation
```rust
#[derive(Deserialize)]
pub struct LlmVerificationResult {
    pub equivalent: bool,
    pub confidence: f64,
    pub reasoning: String,
    pub warnings: Vec<String>,
    pub resolution_differences: Vec<String>,
}

impl LlmVerifier {
    pub async fn verify(&self, pair: &CandidateMatch) -> Result<LlmVerificationResult, Error> {
        let prompt = self.build_prompt(pair);
        let response = self.llm_client.complete(&prompt).await?;
        self.parse_response(&response)
    }
}
```
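The cost-optimization rules above can be sketched as a small policy function. Prices and thresholds come from this ADR; the function and parameter names (`choose_llm_tier`, `spent_today`) are illustrative, and the exact mapping of conditions to model tiers is an assumption:

```python
# Sketch of the tiered invocation policy: skip the LLM entirely when the
# fingerprint score is decisive, use the cheap model for routine screening,
# and reserve the expensive model for complex cases -- all under a daily cap.

HAIKU_COST = 0.001   # $/verification (initial screening)
SONNET_COST = 0.01   # $/verification (complex cases)

def choose_llm_tier(score, volume_usd, warnings, spent_today, budget=50.0):
    """Return 'skip', 'haiku', or 'sonnet' for a candidate pair."""
    uncertain = 0.70 <= score < 0.85
    high_value = volume_usd > 10_000
    if not (uncertain or high_value or warnings):
        return "skip"                      # fingerprint score is decisive
    tier, cost = ("sonnet", SONNET_COST) if (high_value or warnings) \
        else ("haiku", HAIKU_COST)
    if spent_today + cost > budget:
        return "skip"                      # daily budget cap reached
    return tier
```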
Phase 4b: Automated Escalation¶
Requirements: FR-MD-026
- Uncertainty Detection
- Fingerprint score in "uncertain zone" (0.70-0.85)
- Conflicting signals (high entity match, low date match)
- Semantic warnings present
- Resolution criteria contain complex conditions
- Escalation Rules

```rust
impl EscalationPolicy {
    pub fn should_escalate_to_llm(&self, result: &MatchResult) -> bool {
        // Uncertain fingerprint score
        if result.score >= 0.70 && result.score < 0.85 {
            return true;
        }
        // High variance in field scores
        let variance = result.field_scores.values().variance();
        if variance > 0.15 {
            return true;
        }
        // Semantic warnings present
        if !result.warnings.is_empty() {
            return true;
        }
        false
    }
}
```

- Human Review of LLM Decisions
- Initially: All LLM-verified matches require human confirmation
- After calibration (100+ decisions): Auto-approve if LLM confidence ≥ 0.95
- Always require human review if LLM identifies resolution differences
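The calibration gate described above might look like the following (a hypothetical helper, not the approved implementation):

```python
# Sketch of the human-review gate: auto-approval is allowed only after
# calibration (100+ decisions), only at high LLM confidence, and never
# when the LLM reports resolution differences.

def requires_human_review(llm_confidence, resolution_differences,
                          calibrated_decisions):
    """Return True if a human must confirm this LLM-verified match."""
    if resolution_differences:
        return True                 # always review flagged differences
    if calibrated_decisions < 100:
        return True                 # still calibrating: review everything
    return llm_confidence < 0.95    # auto-approve only at high confidence
```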
Phase 4c: Resolution Criteria Deep Analysis¶
Requirements: FR-MD-027
- Structured Resolution Comparison

```text
Analyze the resolution criteria for these markets:

Market A Resolution: "{criteria_a}"
Market B Resolution: "{criteria_b}"

Extract and compare:
1. Resolution SOURCE (who determines outcome)
2. Resolution TIMING (when is outcome determined)
3. Resolution THRESHOLD (what conditions trigger Yes/No)
4. Edge cases (what happens if ambiguous)

Output structured comparison with compatibility assessment.
```

- Semantic Difference Detection
- "Announcement" vs "actual event" timing
- Different authoritative sources (AP vs Reuters)
- Different thresholds or measurement periods
- Geographic scope differences
- Human-Readable Reports
- Generate side-by-side comparison for human reviewers
- Highlight specific text differences
- Provide recommendation with confidence level
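A side-by-side text comparison could be sketched with the Python standard library's `difflib`; the report format here is illustrative only:

```python
import difflib

# Sketch: highlight word-level differences between two resolution texts,
# e.g. a change of authoritative source. Returns only the changed tokens.

def resolution_diff_report(criteria_a, criteria_b):
    diff = difflib.ndiff(criteria_a.split(), criteria_b.split())
    return [tok for tok in diff if tok.startswith(("+ ", "- "))]

report = resolution_diff_report(
    "Resolves YES if AP calls the race by Dec 31",
    "Resolves YES if Reuters calls the race by Dec 31",
)
```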
Phase 5: Reinforcement Learning from Human Feedback¶
Goal: Create a continuous improvement loop where human approval decisions improve all matching components over time.
Phase 5a: Feedback Data Collection¶
Requirements: FR-MD-028, FR-MD-029
- Decision Logging Schema

```sql
CREATE TABLE match_decisions (
    id UUID PRIMARY KEY,
    candidate_id UUID REFERENCES candidates(id),
    decision ENUM('approved', 'rejected', 'modified'),
    reviewer_id TEXT,
    decision_timestamp TIMESTAMP,
    -- Context at decision time
    fingerprint_score REAL,
    embedding_similarity REAL,
    llm_confidence REAL,
    -- Feedback data
    rejection_reason TEXT,
    entity_corrections JSONB,  -- {"old": "BTC", "new": "Bitcoin"}
    resolution_notes TEXT,
    -- For training
    is_training_example BOOLEAN DEFAULT true
);
```

- Feedback Categories
- Entity Corrections: Human corrects entity extraction errors
- Alias Additions: Human identifies new synonyms/aliases
- False Positive Patterns: Common rejection reasons
- Edge Case Documentation: Complex matches with notes
- Export for Training

```rust
impl FeedbackExporter {
    /// Export approved pairs as positive training examples
    pub fn export_positive_pairs(&self) -> Vec<TrainingPair> {
        self.storage.query_decisions("approved")
            .map(|d| TrainingPair {
                anchor: d.kalshi_title,
                positive: d.poly_title,
                metadata: d.entity_corrections,
            })
            .collect()
    }

    /// Export rejected pairs as hard negatives
    pub fn export_negative_pairs(&self) -> Vec<TrainingPair> {
        self.storage.query_decisions("rejected")
            .filter(|d| d.fingerprint_score > 0.5) // Hard negatives only
            .map(|d| TrainingPair {
                anchor: d.kalshi_title,
                negative: d.poly_title,
                rejection_reason: d.rejection_reason,
            })
            .collect()
    }
}
```
Phase 5b: Automatic Improvements¶
Requirements: FR-MD-030, FR-MD-031
- Entity Alias Database Updates

```rust
impl AliasLearner {
    /// Learn new aliases from approved matches
    pub fn learn_from_approval(&mut self, decision: &MatchDecision) {
        if let Some(corrections) = &decision.entity_corrections {
            for (old, new) in corrections {
                self.alias_db.add_alias(new, old);
                log::info!("Learned alias: {} -> {}", old, new);
            }
        }
        // Also learn implicit aliases from matched pairs
        let kalshi_entities = extract_entities(&decision.kalshi_title);
        let poly_entities = extract_entities(&decision.poly_title);
        for (k, p) in self.align_entities(&kalshi_entities, &poly_entities) {
            if k.name != p.name && k.entity_type == p.entity_type {
                self.alias_db.add_alias(&k.name, &p.name);
            }
        }
    }
}
```

- Fingerprint Weight Optimization

```rust
impl WeightOptimizer {
    /// Optimize field weights based on historical decisions
    /// Note: f64 is appropriate here for ML model features/labels
    /// (not financial values - those use rust_decimal::Decimal)
    pub fn optimize(&self, decisions: &[MatchDecision]) -> FieldWeights {
        // Use logistic regression to find optimal weights
        let features: Vec<Vec<f64>> = decisions.iter()
            .map(|d| vec![
                d.entity_score,
                d.date_score,
                d.threshold_score,
                d.outcome_score,
                d.source_score,
            ])
            .collect();
        let labels: Vec<f64> = decisions.iter()
            .map(|d| if d.decision == "approved" { 1.0 } else { 0.0 })
            .collect();
        let model = LogisticRegression::fit(&features, &labels);
        FieldWeights {
            entity: model.coefficients[0].abs(),
            date: model.coefficients[1].abs(),
            threshold: model.coefficients[2].abs(),
            outcome: model.coefficients[3].abs(),
            source: model.coefficients[4].abs(),
        }.normalized()
    }
}
```

- Semantic Warning Pattern Learning
- Analyze rejection reasons to identify new warning patterns
- Add learned patterns to semantic warning detection
- Reduce false negatives from undetected issues
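One illustrative way to mine rejection reasons for recurring warning patterns (the reason strings and threshold are hypothetical):

```python
from collections import Counter

# Sketch: count rejection reasons and surface those that recur as
# candidate additions to the semantic warning detector.

def frequent_rejection_patterns(rejections, min_count=2):
    counts = Counter(r["rejection_reason"] for r in rejections)
    return [reason for reason, n in counts.most_common() if n >= min_count]

rejections = [
    {"rejection_reason": "announcement vs actual event timing"},
    {"rejection_reason": "announcement vs actual event timing"},
    {"rejection_reason": "different authoritative source"},
]
```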
Phase 5c: Continuous Evaluation & Retraining¶
Requirements: FR-MD-032
- Golden Set Maintenance
- Add all approved/rejected pairs to golden test set
- Stratify by market category (politics, crypto, sports, etc.)
- Target: 100+ pairs per category
- Automated Regression Testing

```bash
# Run weekly evaluation against golden set
cargo run --features discovery -- --evaluate-matching \
  --golden-set data/golden_pairs.json \
  --output reports/weekly_eval.json

# Alert if metrics degrade. Note: `[ ... < ... ]` compares strings
# (or redirects), so use awk for a numeric comparison.
f1=$(jq '.f1_score' reports/weekly_eval.json)
if awk -v f1="$f1" 'BEGIN { exit !(f1 < 0.85) }'; then
  notify "Matching quality degraded: F1 < 0.85"
fi
```

- Retraining Pipeline
- Model Versioning
- Track model versions with performance metrics
- Enable rollback if new model underperforms
- Maintain audit trail of model changes
- Weekly Improvement Cycle
```text
┌──────────────────────────────────────────────────────┐
│ Weekly Improvement Cycle                             │
├──────────────────────────────────────────────────────┤
│ Monday:    Export new decisions, update golden set   │
│ Tuesday:   Retrain embedding model, optimize weights │
│ Wednesday: Validate new models on golden set         │
│ Thursday-Saturday: A/B test (10% traffic)            │
│ Sunday:    Promote if improved, rollback if degraded │
└──────────────────────────────────────────────────────┘
```
Extended Requirements (Phase 3-5)¶
| ID | Requirement | Phase | Priority |
|---|---|---|---|
| FR-MD-018 | Embedding model evaluation and selection | 3a | Should |
| FR-MD-019 | Vector storage integration (sqlite-vec or pgvector) | 3a | Should |
| FR-MD-020 | Batch embedding generation pipeline | 3a | Should |
| FR-MD-021 | Hybrid scoring (fingerprint + embedding + text) | 3b | Should |
| FR-MD-022 | Confidence calibration for hybrid scores | 3b | Should |
| FR-MD-023 | Contrastive fine-tuning on approved pairs | 3c | Could |
| FR-MD-024 | LLM verification prompt engineering | 4a | Should |
| FR-MD-025 | Cost-optimized LLM invocation strategy | 4a | Should |
| FR-MD-026 | Automated escalation to LLM for uncertain matches | 4b | Should |
| FR-MD-027 | Resolution criteria deep analysis via LLM | 4c | Could |
| FR-MD-028 | Decision logging with feedback data | 5a | Should |
| FR-MD-029 | Training data export pipeline | 5a | Should |
| FR-MD-030 | Automatic entity alias learning | 5b | Should |
| FR-MD-031 | Fingerprint weight optimization from feedback | 5b | Could |
| FR-MD-032 | Continuous evaluation and retraining pipeline | 5c | Could |
Operations & Deployment¶
CI/CD Pipeline¶
The discovery feature uses Cargo feature flags for conditional compilation. CI/CD pipelines must explicitly enable the feature for testing.
GitHub Actions Configuration (.github/workflows/ci.yml):
```yaml
discovery-tests:
  name: Discovery Feature Tests
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Install Rust
      uses: dtolnay/rust-toolchain@stable
    - name: Run discovery unit tests
      run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery
    - name: Run discovery integration tests
      run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored
      env:
        KALSHI_DEMO_KEY_ID: ${{ secrets.KALSHI_DEMO_KEY_ID }}
        KALSHI_DEMO_PRIVATE_KEY: ${{ secrets.KALSHI_DEMO_PRIVATE_KEY }}
```
Feature Flag Management:
- Development: `cargo run --features discovery -- --discover-markets`
- Production: Enable via ECS task definition environment variable `CARGO_FEATURES=discovery`
- Gradual rollout: Percentage-based feature flag (future via LaunchDarkly or similar)
Deployment Pipeline:
1. PR → CI tests (unit + integration with mocked APIs)
2. Merge to main → Build Docker image with discovery feature
3. Deploy to staging → E2E validation against Kalshi demo API
4. Deploy to production → Gradual rollout with monitoring
Container Image Versioning:
- Tag convention: v{version}-{git-sha} (e.g., v1.2.0-abc1234)
- Latest stable: latest tag points to production-ready image
- Image promotion: dev → staging → production via re-tagging
- Rollback: Deploy previous image by digest (immutable reference)
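The tag convention can be enforced with a simple pattern check. This sketch assumes semantic versions and 7-character lowercase short SHAs, as in the example above:

```python
import re

# Sketch: validate the v{version}-{git-sha} image tag convention
# (e.g. "v1.2.0-abc1234") before promotion.

TAG_RE = re.compile(r"^v\d+\.\d+\.\d+-[0-9a-f]{7}$")

def is_valid_image_tag(tag):
    return bool(TAG_RE.match(tag))
```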
Health Check Endpoint:
```text
GET /health/discovery

Response: 200 OK
{
  "status": "healthy|degraded|unhealthy",
  "last_scan_at": "2026-01-23T10:30:00Z",
  "last_scan_duration_ms": 45000,
  "pending_candidates": 12,
  "api_status": {
    "polymarket": "healthy",
    "kalshi": "healthy"
  }
}
```
- healthy: Scanner running, APIs accessible, no recent errors
- degraded: Scanner running but one API unavailable or rate-limited
- unhealthy: Scanner not running or database unavailable
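The three health states can be derived mechanically from scanner, database, and API state; a sketch with hypothetical inputs:

```python
# Sketch: map component state to the health levels defined above.
# api_status: mapping of platform -> "healthy" | "degraded" | "unavailable".

def discovery_health(scanner_running, db_available, api_status):
    if not scanner_running or not db_available:
        return "unhealthy"   # scanner down or database unavailable
    if any(s != "healthy" for s in api_status.values()):
        return "degraded"    # scanner fine, but an API is impaired
    return "healthy"
```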
Environment Configuration¶
| Environment | API Endpoints | Credentials | Purpose |
|---|---|---|---|
| Local | Production APIs | `.env` file | Development |
| Demo | Kalshi Demo API (`--kalshi-demo`) | Demo credentials | Integration testing |
| Staging | Production APIs | AWS Secrets Manager | Pre-production validation |
| Production | Production APIs | AWS Secrets Manager | Live operation |
Environment-Specific Settings:
```bash
# Local development
DISCOVERY_SCAN_INTERVAL_SECS=3600
DISCOVERY_BATCH_SIZE=100
DISCOVERY_DB_PATH=./discovery.db

# Staging
DISCOVERY_SCAN_INTERVAL_SECS=1800
DISCOVERY_BATCH_SIZE=500
DISCOVERY_DB_PATH=/data/discovery.db

# Production
DISCOVERY_SCAN_INTERVAL_SECS=900
DISCOVERY_BATCH_SIZE=1000
DISCOVERY_DB_PATH=/data/discovery.db
```
Security Considerations¶
API Credential Management:
- Polymarket Gamma API: Public API, no authentication required
- Kalshi `/v2/markets`: Uses existing RSA-PSS authentication (ADR-009)
- Key Rotation: Leverage existing key rotation infrastructure via AWS Secrets Manager
- Secret Storage: All credentials stored in AWS Secrets Manager, never in code or config files
Rate Limiting Compliance:
| Platform | Rate Limit | Implementation | Monitoring |
|---|---|---|---|
| Polymarket Gamma | 60 req/min | Token bucket limiter | CloudWatch discovery/api/rate_limit_errors |
| Kalshi | 100 req/min | Existing client limiter | CloudWatch discovery/api/rate_limit_errors |
Audit Trail Requirements (FR-MD-009):
- All candidate approvals/rejections logged with timestamp, reviewer ID, reason
- Decision context preserved (scores, features, warnings acknowledged)
- JSONL format for compliance export: discovery_audit.jsonl
- Retention: 7 years per financial services compliance
Data Privacy:
- Market data cached locally in SQLite (no PII)
- Embeddings computed locally (Phase 3a uses local models)
- No market data exported to external services without explicit configuration
Testing Strategy¶
Test Pyramid:
```text
        /\
       /  \        E2E (Demo environment, Kalshi demo API)
      /----\
     /      \      Integration (wiremock for HTTP mocking)
    /--------\
   /          \    Unit tests (48 existing, inline in modules)
  /------------\
```
Unit Tests (48 existing):
- candidate.rs: 5 tests (CandidateMatch types, status transitions)
- storage.rs: 7 tests (SQLite CRUD, audit logging)
- normalizer.rs: 3 tests (text normalization)
- matcher.rs: 7 tests (similarity scoring, pre-filtering)
- polymarket_gamma.rs: 4 tests (API client, pagination)
- kalshi_markets.rs: 4 tests (API client, mve_filter)
- scanner.rs: 5 tests (deduplication, batch processing)
- approval.rs: 5 tests (warning acknowledgment, safety gates)
- cli.rs: 8 tests (CLI argument parsing, integration)
Run with: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery
Integration Tests (wiremock):
- Mock Gamma API responses with realistic market data
- Mock Kalshi API responses including pagination cursors
- Test rate limiting behavior under load
- Test error handling and retry logic
Run with: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored
E2E Tests (Demo environment):
- Full scan against Kalshi demo API (--kalshi-demo flag)
- Validate candidate generation produces realistic results
- Manual approval workflow verification
- Audit log generation verification
Run with: cargo run --manifest-path arbiter-engine/Cargo.toml --features discovery -- --kalshi-demo --discover-markets
Golden Set Validation:
- Known market pairs maintained in tests/golden_pairs.json
- Automated weekly evaluation via CI job
- Alert on F1 score degradation below 0.85 threshold
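The weekly golden-set evaluation reduces to precision/recall over known pairs; a minimal sketch (the matcher's output format is assumed, not specified here):

```python
# Sketch: compare predicted match pairs against the curated golden set
# and compute F1, the metric gated at 0.85 by the weekly CI job.

def f1_against_golden(predicted, golden):
    """predicted/golden: sets of (kalshi_id, poly_id) match pairs."""
    tp = len(predicted & golden)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(golden) if golden else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

golden = {("K1", "P1"), ("K2", "P2"), ("K3", "P3")}
predicted = {("K1", "P1"), ("K2", "P2"), ("K4", "P9")}
score = f1_against_golden(predicted, golden)  # precision 2/3, recall 2/3
```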
Production Readiness¶
Monitoring Metrics (CloudWatch):
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `discovery/scan/duration_ms` | Timer | Scan cycle duration | > 5 min (P2) |
| `discovery/scan/errors` | Counter | Scan failures | 3 consecutive (P1) |
| `discovery/candidates/generated` | Gauge | Candidates per scan | - |
| `discovery/candidates/approved` | Counter | Approval count | - |
| `discovery/candidates/approval_rate` | Gauge | Approval rate | < 10% (P2) |
| `discovery/api/rate_limit_errors` | Counter | Rate limit violations | > 10/hour (P2) |
| `discovery/api/latency_ms` | Timer | API response times | p99 > 1s (P3) |
Alerting Rules:
| Priority | Condition | Action |
|---|---|---|
| P1 | Scan failure (3 consecutive) | Page on-call, investigate immediately |
| P2 | Approval rate < 10% | Slack alert, review matching thresholds |
| P2 | Rate limit errors > 10/hour | Slack alert, increase scan interval |
| P3 | Scan duration > 5 minutes | Slack alert, review batch size |
Logging (tracing):
- Structured JSON output via tracing-subscriber
- Log levels: ERROR (failures), WARN (rate limits), INFO (scan results), DEBUG (matching details)
- Correlation IDs for request tracing
Scaling Considerations:
| Phase | Configuration | Capacity |
|---|---|---|
| MVP | Single scanner, SQLite, hourly scan | ~2,000 markets/platform |
| Scale | Leader election, PostgreSQL, 15-min scan | ~10,000 markets/platform |
| Embedding | Dedicated embedding service, pgvector | ~50,000 markets |
Disaster Recovery:
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Scanner crash | 5 min | 0 | ECS auto-restart |
| Database corruption | 30 min | 24 hr | Restore from S3 backup |
| API outage | N/A | N/A | Graceful degradation, retry with backoff |
| Region failure | 4 hr | 1 hr | Cross-region restore from S3 |
Backup Strategy:
- SQLite: Daily S3 backup via ECS scheduled task
- PostgreSQL (future): Aurora automated backups (7-day retention)
- Audit logs: S3 archival (90-day hot, 7-year cold storage)
Graceful Degradation:
- API failure: Retry with exponential backoff, continue with available platform
- Database failure: Read-only mode, serve cached candidates
- Embedding service down (Phase 3): Fallback to fingerprint-only matching
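The retry-with-exponential-backoff behavior might be parameterized as follows (a sketch; the base delay, cap, and jitter strategy are assumptions, not specified by this ADR):

```python
import random

# Sketch: delay (seconds) before each retry attempt, doubling from a base
# and capped, with optional full jitter to avoid thundering-herd retries.

def backoff_delays(attempts, base=1.0, cap=60.0, jitter=False):
    delays = []
    for n in range(attempts):
        d = min(cap, base * (2 ** n))
        if jitter:
            d = random.uniform(0, d)  # full jitter: pick uniformly in [0, d]
        delays.append(d)
    return delays
```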
Runbook¶
1. Discovery scan not running:
```bash
# Check feature flag is enabled
kubectl logs -l app=trading-core | grep "discovery"

# Verify environment variable
kubectl exec -it trading-core-xxx -- env | grep DISCOVERY

# Check scanner actor health
kubectl exec -it trading-core-xxx -- curl localhost:8080/health/discovery
```
2. High rate limit errors:
```bash
# Check current rate limit metrics
aws cloudwatch get-metric-statistics --namespace arbiter --metric-name "discovery/api/rate_limit_errors"

# Increase scan interval temporarily
kubectl set env deployment/trading-core DISCOVERY_SCAN_INTERVAL_SECS=7200

# Review API quotas with platform
```
3. Low approval rate (<10%):
```bash
# Review recent candidates
cargo run --features discovery -- --list-candidates --status pending --limit 20

# Check matching threshold configuration
grep "threshold" config/discovery.toml

# Review fingerprint weight configuration
grep "weight" config/discovery.toml
```
4. Database corruption:
```bash
# Stop scanner
kubectl scale deployment/trading-core --replicas=0

# Restore from S3 backup
aws s3 cp s3://arbiter-backups/discovery/latest.db /data/discovery.db

# Verify integrity
sqlite3 /data/discovery.db "PRAGMA integrity_check"

# Restart scanner
kubectl scale deployment/trading-core --replicas=1
```
Incident Response Summary:
- P1 (Scanner failure): PagerDuty alert → On-call acknowledges → Investigate logs → Escalate if not resolved in 30 min
- P2 (Quality degradation): Slack alert → Next business day review → Adjust thresholds
- P3 (Performance warning): Ticket created → Sprint backlog
Consequences¶
Positive¶
- Higher accuracy: Fingerprint matching catches semantic equivalence
- Explainable: Field-by-field comparison is auditable
- Extensible: Easy to add new entity types, fields, or ML models
- Industry-aligned: Matches approach used by successful tools
Negative¶
- Increased complexity: More code to maintain
- Entity extraction errors: NER may miss or misclassify entities
- Requires tuning: Field weights need empirical optimization
Neutral¶
- Existing code preserved: Text similarity remains as fallback
- Human review unchanged: FR-MD-003 safety gate preserved
Safety Guarantees (Unchanged)¶
- Human-in-the-loop preserved: Candidates require explicit approval
- FR-MD-003 maintained: Uses existing `MappingManager.verify_mapping()`
- Semantic warnings block quick approval: Must acknowledge settlement differences
- Audit trail: All approvals/rejections logged with timestamp and reviewer
- No automated trading on unverified pairs: Matches existing safety architecture
References¶
Industry Tools¶
- pmxt - Unified API for prediction markets
- Dome API - Developer infrastructure for prediction markets
- Matchr - Cross-platform market aggregator (1,500+ curated matches)
- EventArb - Cross-platform arbitrage calculator
- Polymarket-Kalshi-Arbitrage-Bot - Open-source bot with entity extraction
Research¶
- Awesome Prediction Market Tools
- Semantic Non-Fungibility research
- Semantic Trading research
- Prediction Market Arbitrage Guide
API Documentation¶
- Kalshi API - mve_filter parameter
- Polymarket Gamma API