
ADR 017: Automated Market Discovery and Matching

Status

| Component | Status |
|-----------|--------|
| Phase 1: Text Similarity | ✅ Accepted & Implemented |
| Phase 2: Fingerprint Matching | ✅ Complete (24 tests) |
| Phase 3: Embedding Matching | ✅ Complete (15 tests) |
| Phase 4: LLM Verification | ✅ Complete (15 tests) |
| Phase 5: Feedback Learning | ✅ Complete (23 tests) |
| Operations & Deployment (docs) | ✅ Accepted (NFR-DISC requirements proposed) |

Note: Code samples in Phases 2-5 are illustrative designs, not approved implementations. Final implementation may differ based on validation results.

Revision Summary

Post-implementation testing revealed that pure text similarity (Jaccard + Levenshtein) is insufficient for cross-platform market matching. Real-world market pairs have low lexical similarity despite semantic equivalence:

| Market Pair | Jaccard Score |
|-------------|---------------|
| Kalshi: "Will Trump buy Greenland?" vs Polymarket: "Will the US acquire part of Greenland in 2026?" | 8.3% |
| Kalshi: "Will Washington win the 2026 Pro Football Championship?" vs Polymarket: "Super Bowl Champion 2026" | 9.1% |

These scores fall far below the 60% threshold, missing obvious matches. This ADR revision proposes a fingerprint-based matching approach aligned with industry best practices.
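A minimal token-level Jaccard sketch (std-only; the tokenizer here is a simplification of the Phase 1 normalizer) reproduces the problem: the Greenland pair scores far below any usable threshold.

```rust
use std::collections::HashSet;

/// Token-level Jaccard similarity, as used by the Phase 1 text matcher.
fn jaccard(a: &str, b: &str) -> f64 {
    let tokens = |s: &str| -> HashSet<String> {
        s.split_whitespace()
            .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
            .filter(|t| !t.is_empty())
            .collect()
    };
    let (ta, tb) = (tokens(a), tokens(b));
    let inter = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}
```

With this tokenizer the Greenland pair shares only "will" and "greenland", so the score stays well under 0.20 regardless of how the full matcher weights Levenshtein distance.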


Implementation Notes (Phase 1 - Text Similarity - Completed)

Completed: 2026-01-22

Phase 1 (Text Similarity) implemented in 5 sub-phases with 48 tests (377 total tests passing):

| Sub-Phase | Focus | Tests | Status |
|-----------|-------|-------|--------|
| 1a | Data Types & Storage | 12 | Complete |
| 1b | Text Matching Engine | 10 | Complete |
| 1c | Discovery API Clients | 8 | Complete |
| 1d | Scanner & Approval Workflow | 10 | Complete |
| 1e | CLI Integration | 8 | Complete |

Note: These are sub-phases of Phase 1 only. Phases 2-5 (Fingerprint, Embedding, LLM, Feedback) are covered by their own implementation notes and the roadmap below.

Council Verified: Each sub-phase passed LLM Council review with confidence >= 0.87

GitHub Issues: #41-#48


Implementation Notes (Phase 2a - Fingerprint & Entity Extraction)

Completed: 2026-01-23

Phase 2a (Fingerprint Foundation) implemented with 10 tests:

| File | Tests | Purpose |
|------|-------|---------|
| src/discovery/entity_extractor.rs | 5 | Rule-based NER for persons, crypto, prices, dates, events |
| src/discovery/fingerprint.rs | 5 | MarketFingerprint extraction with event type detection |

Key Components:
- EntityExtractor: Pattern-based extraction using regex for persons (Trump, Biden), crypto (BTC→Bitcoin), price targets ($100k), dates (Q2 2026), events (Super Bowl)
- MarketFingerprint: Structured representation with entity, event_type, metric spec, resolution window, outcome type
- EventType detection: PriceTarget, Election, Acquisition, Announcement, SportingEvent, EconomicIndicator

Dependencies Added:
- regex = "1.10" for pattern matching
- lazy_static = "1.4" for compiled regex caching

GitHub Issue: #49


Implementation Notes (Phase 2b - FingerprintMatcher & EntityIndex)

Completed: 2026-01-23

Phase 2b (Fingerprint Matching) implemented with 8 tests:

| File | Tests | Purpose |
|------|-------|---------|
| src/discovery/entity_index.rs | 3 | Inverted index for O(1) entity lookup, alias database |
| src/discovery/fingerprint_matcher.rs | 5 | Field-weighted scoring (entity 30%, date 25%, threshold 20%, outcome 15%, source 10%) |

Key Components:
- EntityIndex: Inverted index mapping entity names to market IDs
- AliasDatabase: Canonical name resolution (BTC→Bitcoin, Pro Football Championship→Super Bowl)
- FingerprintMatcher: Weighted scoring with configurable weights and thresholds
- ScoreBreakdown: Detailed per-field score transparency

Field Weights:

| Field | Weight | Comparison Method |
|-------|--------|-------------------|
| Entity | 0.30 | Exact or alias match |
| Date | 0.25 | Year/quarter/month overlap |
| Threshold | 0.20 | Numeric comparison with 5% tolerance |
| Outcome | 0.15 | Binary vs multi-choice |
| Source | 0.10 | Event type match (placeholder) |
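The weighted combination reduces to a plain weighted sum. A sketch (FieldScores and fingerprint_score are hypothetical names; the real FingerprintMatcher returns a full ScoreBreakdown):

```rust
/// Per-field similarity scores in [0.0, 1.0].
struct FieldScores {
    entity: f64,
    date: f64,
    threshold: f64,
    outcome: f64,
    source: f64,
}

/// Field-weighted fingerprint score; weights mirror the table above.
fn fingerprint_score(s: &FieldScores) -> f64 {
    0.30 * s.entity + 0.25 * s.date + 0.20 * s.threshold + 0.15 * s.outcome + 0.10 * s.source
}
```

For example, a pair matching on everything except the primary entity tops out at 0.70, exactly the candidate-creation threshold used later in this ADR.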

GitHub Issue: #50


Implementation Notes (Phase 2c - Integration & Validation)

Completed: 2026-01-23

Phase 2c (Integration & Validation) implemented with 6 tests:

| File | Tests | Purpose |
|------|-------|---------|
| src/discovery/integration_tests.rs | 6 | Golden pair validation, evaluation metrics |
| tests/golden_pairs.json | - | Test data for known market pairs |

Golden Test Pairs:

| ID | Expected Match | Min Score |
|----|----------------|-----------|
| greenland-acquisition | true | 0.60 |
| super-bowl-2026 | true | 0.55 |
| btc-100k | true | 0.75 |
| fed-rate-cut | true | 0.65 |
| different-threshold-negative | false | - |
| different-year-negative | false | - |

Evaluation Framework:
- GoldenTestData: Load and validate against golden pairs
- EvaluationResult: Precision, recall, F1 metrics
- Automated accuracy tracking for CI/CD
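The precision/recall/F1 computation over golden-pair predictions can be sketched as follows (evaluate is a hypothetical helper; EvaluationResult in the codebase plays this role):

```rust
/// Precision, recall, and F1 over (predicted, expected) match labels.
fn evaluate(pairs: &[(bool, bool)]) -> (f64, f64, f64) {
    let tp = pairs.iter().filter(|(p, e)| *p && *e).count() as f64;
    let fp = pairs.iter().filter(|(p, e)| *p && !*e).count() as f64;
    let fn_ = pairs.iter().filter(|(p, e)| !*p && *e).count() as f64;

    let precision = if tp + fp == 0.0 { 0.0 } else { tp / (tp + fp) };
    let recall = if tp + fn_ == 0.0 { 0.0 } else { tp / (tp + fn_) };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```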

GitHub Issue: #51

Phase 2 Complete:
- FR-MD-011: Fingerprint extraction ✅
- FR-MD-012: Entity-based candidates ✅
- FR-MD-013: Weighted scoring ✅
- FR-MD-014: Rule-based NER ✅


Implementation Notes (Phase 3a - Embedding Infrastructure)

Completed: 2026-01-23

Phase 3a (Embedding Infrastructure) implemented with 9 tests:

| File | Tests | Purpose |
|------|-------|---------|
| src/discovery/embedding.rs | 6 | Embedding generation, cosine similarity, serialization |
| src/discovery/vector_store.rs | 3 | SQLite storage, nearest neighbor search |

Key Components:
- Embedding: Vector representation with cosine similarity
- HashEmbedder: Development/testing embedder using SHA-256
- VectorStore: SQLite-based embedding storage with nearest neighbor search
- Trait-based design for future ONNX/API backends

Embedding Interface:

pub trait Embedder: Send + Sync {
    fn embed(&self, text: &str) -> Embedding;
    fn embed_batch(&self, texts: &[&str]) -> Vec<Embedding>;
    fn dimension(&self) -> usize;
}
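Comparison of the returned vectors comes down to cosine similarity, which might look like this std-only sketch (the actual Embedding type encapsulates this):

```rust
/// Cosine similarity between two embedding vectors of equal dimension.
/// Returns 0.0 for a zero vector to avoid division by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}
```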

GitHub Issue: #52


Implementation Notes (Phase 3b - Hybrid Scoring)

Completed: 2026-01-23

Phase 3b (Hybrid Scoring) implemented with 6 tests:

| File | Tests | Purpose |
|------|-------|---------|
| src/discovery/hybrid_scorer.rs | 6 | Combined fingerprint + embedding + text scoring |

Key Components:
- HybridScorer: Combines all matching signals with configurable weights
- HybridWeights: α=0.50 fingerprint, β=0.40 embedding, γ=0.10 text
- HybridScoreBreakdown: Full transparency of each component
- calibrate_score(): Linear calibration (placeholder for isotonic regression)

Hybrid Formula:

score = 0.50 × fingerprint + 0.40 × embedding + 0.10 × text
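A sketch of the formula with the default weights (hybrid_score is a hypothetical free function; the real HybridScorer also emits a HybridScoreBreakdown):

```rust
/// Default hybrid weights; alpha, beta, gamma sum to 1.0.
struct HybridWeights {
    alpha: f64, // fingerprint
    beta: f64,  // embedding
    gamma: f64, // text similarity
}

impl Default for HybridWeights {
    fn default() -> Self {
        Self { alpha: 0.50, beta: 0.40, gamma: 0.10 }
    }
}

/// Combine the three signals; clamp guards against out-of-range inputs.
fn hybrid_score(w: &HybridWeights, fingerprint: f64, embedding: f64, text: f64) -> f64 {
    (w.alpha * fingerprint + w.beta * embedding + w.gamma * text).clamp(0.0, 1.0)
}
```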

GitHub Issue: #53

Phase 3 Complete:
- FR-MD-018: Embedding infrastructure ✅
- FR-MD-019: Vector storage ✅
- FR-MD-020: Batch embeddings ✅
- FR-MD-021: Hybrid scoring ✅
- FR-MD-022: Confidence calibration ✅


Implementation Notes (Phase 4 - LLM Verification)

Completed: 2026-01-23

Phase 4 (LLM Verification) implemented with 15 tests:

| File | Tests | Purpose |
|------|-------|---------|
| src/discovery/prompts.rs | 2 | Structured verification prompts, escape_for_prompt() |
| src/discovery/llm_verifier.rs | 6 | LLM response parsing, cost tracking with budget enforcement |
| src/discovery/escalation.rs | 7 | Tiered escalation (None → Haiku → Sonnet → Human), configurable thresholds |

Key Components:
- FilledPrompt: Structured prompt with metadata (estimated tokens)
- LlmVerifier: Builds prompts, parses JSON responses, tracks costs
- LlmCostTracker: Per-request cost tracking with $50/day default budget
- EscalationEngine: Tiered escalation based on score uncertainty, warnings, volume

Escalation Levels:

| Level | Trigger | Cost |
|-------|---------|------|
| None | Score ≥ 0.85, no warnings | $0 |
| Haiku | Score 0.60-0.85, minor warnings | ~$0.001/verification |
| Sonnet | Conflicting signals, major warnings | ~$0.01/verification |
| Human | LLM uncertain, resolution differences | Manual review |
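Tier selection can be sketched as a small decision function (escalation_level is hypothetical; the real EscalationEngine also weighs market volume):

```rust
#[derive(Debug, PartialEq)]
enum EscalationLevel {
    None,
    Haiku,
    Sonnet,
    Human,
}

/// Pick an escalation tier, roughly following the table above:
/// warnings and uncertainty push the decision toward costlier review.
fn escalation_level(score: f64, warnings: usize, llm_uncertain: bool) -> EscalationLevel {
    if llm_uncertain {
        EscalationLevel::Human
    } else if score >= 0.85 && warnings == 0 {
        EscalationLevel::None
    } else if score >= 0.60 && warnings <= 1 {
        EscalationLevel::Haiku
    } else {
        EscalationLevel::Sonnet
    }
}
```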

GitHub Issues: #54, #55

Phase 4 Complete:
- FR-MD-024: LLM verification prompt engineering ✅
- FR-MD-025: Cost-optimized LLM invocation ✅
- FR-MD-026: Automated escalation rules ✅


Implementation Notes (Phase 5 - Feedback Learning)

Completed: 2026-01-23

Phase 5 (Feedback Learning) implemented with 23 tests:

| File | Tests | Purpose |
|------|-------|---------|
| src/discovery/decision_log.rs | 7 | Decision logging, JSONL export, stratified sampling |
| src/discovery/alias_learner.rs | 5 | Alias learning with confidence, in-memory cache |
| src/discovery/weight_optimizer.rs | 6 | Gradient-free F1 optimization with bounded weights |
| src/discovery/evaluation_pipeline.rs | 5 | Orchestrates all Phase 4-5 components |

Key Components:
- DecisionLogger: SQLite-backed decision logging with full context preservation
- MatchDecision: Captures scores, escalation level, corrections, category
- TrainingExample: Exportable format with label (0/1) for model training
- AliasLearner: Learns aliases from corrections, confidence grows with confirmations
- WeightOptimizer: Gradient-free search for optimal fingerprint/embedding/text weights
- EvaluationPipeline: Single entry point that orchestrates scoring, escalation, logging, alias learning

Training Data Export:

// Export approved/rejected pairs for model training
let logger = DecisionLogger::new_in_memory()?;
let training_data = logger.export_to_jsonl()?;
// Returns JSONL with label, scores, category

Alias Learning Flow:

// Learn alias from human correction
learner.learn_from_correction("BTC", "Bitcoin", "human_approval")?;
// Confidence increases with each confirmation: 1.0 - (1.0 / (count + 1.0))
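The confidence update is a pure function of the confirmation count (a sketch; the real AliasLearner persists counts in SQLite):

```rust
/// Alias confidence after `confirmations` human confirmations:
/// 1.0 - 1.0 / (count + 1.0). Starts at 0.5 and approaches 1.0 asymptotically,
/// so a single correction never yields full confidence.
fn alias_confidence(confirmations: u32) -> f64 {
    1.0 - 1.0 / (confirmations as f64 + 1.0)
}
```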

Weight Optimization:

// Optimize weights from historical decisions
let optimizer = WeightOptimizer::new();
let result = optimizer.optimize(&examples);
// Returns OptimizationResult with improved F1 score
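One gradient-free approach is a coarse grid search over weight triples summing to 1.0 (a sketch; best_weights and its f1 callback are hypothetical stand-ins for evaluation against logged decisions):

```rust
/// Grid-search (fingerprint, embedding, text) weights that sum to 1.0,
/// keeping whichever triple maximizes the supplied F1 evaluation.
fn best_weights<F: Fn(f64, f64, f64) -> f64>(f1: F) -> (f64, f64, f64) {
    let mut best = (0.50, 0.40, 0.10); // current defaults as the baseline
    let mut best_f1 = f1(best.0, best.1, best.2);
    let steps = 10; // 0.1 granularity

    for i in 0..=steps {
        for j in 0..=(steps - i) {
            let (a, b) = (i as f64 / steps as f64, j as f64 / steps as f64);
            let g = 1.0 - a - b; // remaining mass goes to text similarity
            let score = f1(a, b, g);
            if score > best_f1 {
                best_f1 = score;
                best = (a, b, g);
            }
        }
    }
    best
}
```

The real WeightOptimizer would additionally bound each weight away from zero so no signal is silently dropped.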

GitHub Issues: #56, #57

Phase 5 Complete:
- FR-MD-028: Decision logging with context ✅
- FR-MD-029: Training data export pipeline ✅
- FR-MD-030: Automatic entity alias learning ✅
- FR-MD-031: Fingerprint weight optimization ✅


Post-Implementation Learnings

Problem 1: Text Similarity Misses Semantic Equivalence

The current algorithm uses score = 0.6 × Jaccard(tokens) + 0.4 × Levenshtein_normalized. This approach fails because:

  1. Synonym blindness: "Super Bowl" ≠ "Pro Football Championship" lexically
  2. Paraphrase blindness: "Trump buy" and "US acquire" are semantically equivalent yet share zero tokens
  3. Dilution by stop words: "Greenland" signal diluted by "Will", "the", "in", etc.

Problem 2: API Sort Order Returns Different Market Types

  • Kalshi default sort: Returns high-volume sports/weather markets first
  • Polymarket default sort: Returns high-volume politics/crypto markets first
  • Even scanning 2000 markets per platform yields minimal overlap in market categories

Problem 3: Multivariate Event Filtering Required

Kalshi's API returns sports parlays by default. Required adding mve_filter=exclude to get prediction markets.

Industry Validation

Research of existing cross-platform arbitrage solutions confirms these findings:

| Tool | Matching Approach |
|------|-------------------|
| pmxt | Manual slug-based configuration |
| Dome API | Unified API with manual market mapping |
| Matchr | Curated match database (1,500+ markets) |
| EventArb | Manual market selection with arb calculator |
| Polymarket-Kalshi-Arbitrage-Bot | "Intelligent matching" via entity extraction + text similarity |

Key insight: No tool relies solely on text similarity. All use either:
- Manual curation/configuration
- Entity extraction + structured field matching ("fingerprinting")
- Curated databases of known matches


Prior Council Feedback

This ADR directly addresses concerns raised in the LLM Council Design Reviews:

Design Review 1 (FR-MD-002 Fuzzy Matching): "DANGEROUS. Downgrade to 'Candidate Proposal' only. Require human sign-off."

Design Review 1 (FR-MD-003): "No arb should execute on a mapped pair without a signed human verification bit."

Design Review 2 (Approved): "Safety Gates (FR-MD-003): Requiring human confirmation for market mapping prevents catastrophic 'bad data' trades (e.g., mapping 'Trump' to 'Trump Jr')."


Context

The arbiter-bot currently requires manual identification and configuration of market pairs between Polymarket and Kalshi. Market tickers are hardcoded (e.g., KXBTC-25JAN31-B95000), creating operational overhead:

  1. Manual discovery burden: Operators must research markets on both platforms independently
  2. Missed opportunities: New markets may go undetected
  3. No persistent mapping store: Mappings exist only in memory
  4. Scaling limitation: Cannot efficiently monitor thousands of markets

Industry Context

Academic research from IMDEA Networks Institute documented over $40 million in arbitrage profits from Polymarket alone (April 2024 - April 2025). Existing arbitrage bots (e.g., polymarket-arbitrage) watch 10,000+ markets using automated matching. Cross-platform studies show ~6% of 102,275 events have semantic relations across venues.

Critical Constraint

Settlement semantics differ across platforms. The 2024 government shutdown case illustrates:
- Polymarket: "OPM issues shutdown announcement"
- Kalshi: "Actual shutdown exceeding 24 hours"

Same event, different resolution criteria, potentially different outcomes. Human verification remains mandatory per existing requirement FR-MD-003.


Decision

Implement an automated market discovery and matching system using a three-stage fingerprint-based pipeline:

  1. Stage 1: Candidate Generation - Fast narrowing by keywords, dates, categories
  2. Stage 2: Fingerprint Matching - Structured field comparison with weighted scoring
  3. Stage 3: Human Verification - Resolution criteria review with semantic warnings

Revised Options Analysis

Option A: Pure Text Similarity (Current - Insufficient)

| Criterion | Assessment |
|-----------|------------|
| Accuracy | Low - misses semantic matches |
| Cost | No per-match API costs (uses cached market data) |
| Latency | Sub-millisecond per comparison |
| Explainability | High (score breakdown visible) |
| Verdict | Insufficient for production use |

Evidence: Real market pairs score 8-9% similarity, far below any reasonable threshold.

Option B: Fingerprint-Based Matching (Proposed)

| Criterion | Assessment |
|-----------|------------|
| Accuracy | High - matches on structured fields |
| Cost | No per-match API costs (local processing on cached data) |
| Latency | ~10ms per comparison (entity extraction) |
| Explainability | High (field-by-field comparison) |
| Complexity | Medium (requires entity extraction) |

Algorithm:
1. Extract "market fingerprint" with structured fields
2. Match on canonical fields (entity, date, threshold, resolution source)
3. Score similarity across fields with appropriate weights
4. Generate candidates for human review

Option C: Embedding-Based Semantic Matching (Future Enhancement)

| Criterion | Assessment |
|-----------|------------|
| Accuracy | Highest - captures semantic meaning |
| Cost | ~$0.0001 per embedding (or local model) |
| Latency | +50-200ms per embedding |
| Complexity | High (embedding service, vector DB) |
| Verdict | Consider for Phase 3 enhancement (after fingerprint foundation) |

Option D: Hybrid Fingerprint + LLM Verification (Future Enhancement)

| Criterion | Assessment |
|-----------|------------|
| Accuracy | Highest - LLM catches edge cases |
| Cost | ~$0.01-0.05 per verification |
| Latency | +200-500ms per LLM call |
| Verdict | Consider for high-value market verification |

Option E: External Service Integration (Alternative)

| Criterion | Assessment |
|-----------|------------|
| Accuracy | High (curated by service provider) |
| Cost | API subscription fees |
| Dependency | External service availability |
| Candidates | Matchr (curated DB), Dome (unified API) |
| Verdict | Consider as fallback or validation source |

Rationale for Option B (Fingerprint-Based)

  1. Proven approach: Industry tools (Polymarket-Kalshi-Arbitrage-Bot) use entity extraction
  2. Addresses root cause: Matches on semantic fields, not surface text
  3. No external dependencies: Local processing, no API costs
  4. Extensible: Can add embeddings or LLM verification later
  5. Explainable: Field-by-field comparison is auditable
  6. Council Compliant: Still generates candidates for human review (FR-MD-003)

Revised Architecture

┌─────────────────┐     ┌─────────────────┐
│ Polymarket API  │     │   Kalshi API    │
│ (Gamma endpoint)│     │ (/v2/markets)   │
└────────┬────────┘     └────────┬────────┘
         │                       │
         ▼                       ▼
┌─────────────────────────────────────────┐
│         DiscoveryScannerActor           │
│  - Market enumeration with pagination   │
│  - mve_filter=exclude for Kalshi        │
│  - Category/date pre-filtering          │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│       MarketFingerprintExtractor        │
│  - Entity extraction (NER)              │
│  - Date/threshold parsing               │
│  - Resolution source identification     │
│  - Outcome structure normalization      │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│          FingerprintMatcher             │
│  Stage 1: Candidate generation (fast)   │
│  Stage 2: Field-by-field scoring        │
│  Stage 3: Semantic warning detection    │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│      CandidateMatch (SQLite)            │
│  - Pending / Approved / Rejected        │
│  - Fingerprint diff for review          │
│  - Semantic warnings                    │
└────────────────────┬────────────────────┘
                     ▼  (Human approval via CLI)
┌─────────────────────────────────────────┐
│      MappingManager (existing)          │
│  - propose_mapping() → verify_mapping() │
│  - FR-MD-003 safety gate preserved      │
└─────────────────────────────────────────┘

Market Fingerprint Schema

/// Canonical market fingerprint for cross-platform matching
pub struct MarketFingerprint {
    /// Primary entity (e.g., "Trump", "Bitcoin", "Fed")
    pub entity: String,

    /// Secondary entities (e.g., "Greenland", "Denmark")
    pub secondary_entities: Vec<String>,

    /// Event type (e.g., "acquisition", "election", "price_target")
    pub event_type: EventType,

    /// Metric and direction (e.g., "price >= $100,000")
    pub metric: Option<MetricSpec>,

    /// Geographic scope (e.g., "US", "global")
    pub scope: Option<String>,

    /// Resolution date/time window
    pub resolution_window: ResolutionWindow,

    /// Outcome structure
    pub outcome_type: OutcomeType,  // Binary | MultiOutcome | Range

    /// Resolution source (e.g., "BLS", "AP", "FOMC")
    pub resolution_source: Option<String>,

    /// Original title (for reference)
    pub original_title: String,
}

pub struct MetricSpec {
    pub name: String,           // "price", "rate", "count"
    pub direction: Direction,   // Above | Below | Between | Exactly
    pub threshold: Decimal,     // Use rust_decimal for financial precision
    pub unit: Option<String>,   // "$", "%", "basis points"
}
// Note: Use rust_decimal::Decimal for threshold to avoid floating-point
// precision issues in financial comparisons (e.g., 0.1 + 0.2 != 0.3 in f64)
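For illustration, the 5% tolerance used in threshold matching can be computed on integer cents, sidestepping floats entirely (thresholds_match is a hypothetical helper; the design above uses rust_decimal::Decimal instead):

```rust
/// True if two price thresholds (in integer cents) agree within 5%
/// relative tolerance: diff / max(|a|, |b|) <= 0.05, written without floats.
fn thresholds_match(a_cents: i64, b_cents: i64) -> bool {
    let diff = (a_cents - b_cents).abs();
    let base = a_cents.abs().max(b_cents.abs());
    // diff / base <= 1/20  <=>  diff * 20 <= base
    diff * 20 <= base
}
```

For example, $100,000 vs $104,000 matches (4% apart) while $100,000 vs $110,000 does not.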

pub struct ResolutionWindow {
    pub date: Option<NaiveDate>,
    pub time: Option<NaiveTime>,
    pub timezone: Option<Tz>,   // Use chrono_tz::Tz for type-safe timezones
    pub tolerance_days: i32,    // For fuzzy date matching
}
// Note: All times should be normalized to UTC for comparison.
// Local timezone is preserved for display purposes only.

Revised Matching Algorithm

Stage 1: Candidate Generation (Fast Narrowing)

FOR each Kalshi market K:
    1. Extract keywords from K.title + K.rules
    2. Query Polymarket index by:
       - Keyword overlap (BM25 or inverted index)
       - Resolution date proximity (±14 days)
       - Category match (if available)
    3. Return top N candidates (N=50)

Complexity: O(n log n) with inverted index
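Stage 1's keyword narrowing can be sketched with std maps (KeywordIndex is hypothetical; production would add BM25 ranking plus the date and category filters above):

```rust
use std::collections::{HashMap, HashSet};

/// Minimal inverted index: lowercase keyword -> set of market ids.
#[derive(Default)]
struct KeywordIndex {
    postings: HashMap<String, HashSet<u64>>,
}

impl KeywordIndex {
    /// Index every whitespace-separated token of a market title.
    fn insert(&mut self, market_id: u64, title: &str) {
        for token in title.split_whitespace() {
            self.postings
                .entry(token.to_lowercase())
                .or_default()
                .insert(market_id);
        }
    }

    /// Markets sharing at least one keyword with the query title.
    fn candidates(&self, title: &str) -> HashSet<u64> {
        title
            .split_whitespace()
            .filter_map(|t| self.postings.get(&t.to_lowercase()))
            .flatten()
            .copied()
            .collect()
    }
}
```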

Stage 2: Fingerprint Matching (Weighted Scoring)

FOR each candidate pair (K, P):
    fingerprint_K = extract_fingerprint(K)
    fingerprint_P = extract_fingerprint(P)

    score = weighted_sum([
        (entity_match(K, P),           weight=0.30),  # Primary entity
        (date_match(K, P),             weight=0.25),  # Resolution date
        (threshold_match(K, P),        weight=0.20),  # Numeric thresholds
        (outcome_match(K, P),          weight=0.15),  # Binary vs multi
        (resolution_source_match(K, P), weight=0.10), # Data source
    ])

    IF score >= 0.70:
        create_candidate(K, P, score)

Weight Rationale & Validation Plan:

| Field | Weight | Rationale | Validation Method |
|-------|--------|-----------|-------------------|
| Entity | 0.30 | Primary disambiguator (Trump vs Biden) | A/B test on historical pairs |
| Date | 0.25 | Critical for time-bound events | Precision/recall on date-similar pairs |
| Threshold | 0.20 | Important for numeric markets (price targets) | Manual review of 50 threshold markets |
| Outcome | 0.15 | Binary vs multi-outcome affects pairing | Confusion matrix analysis |
| Source | 0.10 | Settlement source differences cause disputes | Historical dispute rate analysis |

Initial Values: These weights are starting estimates based on domain analysis. They will be validated against a golden set of 100+ manually-verified market pairs before production deployment. The Phase 5 feedback loop will continuously optimize weights based on human approval decisions.

Threshold Validation:

| Threshold | Value | Purpose | Validation Criteria |
|-----------|-------|---------|---------------------|
| Candidate creation | ≥ 0.70 | Balance precision/recall | Target: Precision ≥ 0.80, Recall ≥ 0.70 |
| Auto-approve (future) | ≥ 0.95 | High-confidence automation | Zero false positives in test set |
| Uncertain zone | 0.70-0.85 | Trigger LLM verification (Phase 4) | Review rate < 20% of candidates |

Empirical Validation Required: Before enabling auto-approval or LLM escalation, a minimum of 100 human decisions must be collected to calibrate thresholds and measure actual precision/recall.

Stage 3: Semantic Warning Detection

warnings = []

IF K.resolution_source != P.resolution_source:
    warnings.append("Different resolution sources")

IF abs(K.resolution_date - P.resolution_date) > 1 day:
    warnings.append("Resolution dates differ by {days}")

IF K.rules contains "announcement" AND P.rules contains "actual":
    warnings.append("Announcement vs actual event timing")

IF K.outcome_count != P.outcome_count:
    warnings.append("Different outcome structures")

# Require explicit acknowledgment before approval
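A Rust rendering of the warning rules (semantic_warnings and its flattened parameters are hypothetical; the real checks operate on MarketFingerprint fields):

```rust
/// Collect semantic warnings for a candidate pair, mirroring the
/// pseudocode above. Every warning must be acknowledged before approval.
fn semantic_warnings(
    source_a: Option<&str>,
    source_b: Option<&str>,
    date_gap_days: i64,
    outcomes_a: usize,
    outcomes_b: usize,
) -> Vec<String> {
    let mut warnings = Vec::new();
    if source_a != source_b {
        warnings.push("Different resolution sources".to_string());
    }
    if date_gap_days.abs() > 1 {
        warnings.push(format!("Resolution dates differ by {} days", date_gap_days.abs()));
    }
    if outcomes_a != outcomes_b {
        warnings.push("Different outcome structures".to_string());
    }
    warnings
}
```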

Entity Extraction Approaches

Option 1: Rule-Based Extraction (Implemented - Phase 2a)

/// Extract entities using pattern matching
fn extract_entities(title: &str, rules: &str) -> Vec<Entity> {
    use regex::Regex;

    let text = format!("{} {}", title, rules);
    let mut entities: Vec<Entity> = Vec::new();

    // Known entity patterns
    let patterns = [
        (r"(?i)\b(Trump|Biden|Harris|Obama)\b", EntityType::Person),
        (r"(?i)\b(Bitcoin|BTC|Ethereum|ETH)\b", EntityType::Crypto),
        (r"(?i)\b(Fed|FOMC|CPI|GDP|NFP)\b", EntityType::Economic),
        (r"(?i)\b(Super Bowl|World Series|NBA Finals)\b", EntityType::Sports),
        (r"(?i)\b(Greenland|Ukraine|Taiwan)\b", EntityType::Location),
        (r"\$[\d,]+(?:k|K|M|B)?", EntityType::PriceTarget),
        (r"(?i)\b(20\d{2})\b", EntityType::Year),
    ];

    for (pattern, entity_type) in patterns {
        // Production code caches the compiled regexes via lazy_static
        let re = Regex::new(pattern).expect("static patterns are valid");
        for m in re.find_iter(&text) {
            let value = m.as_str().to_lowercase();
            // Deduplicate on the normalized value
            if !entities.iter().any(|e| e.value == value) {
                entities.push(Entity { value, entity_type });
            }
        }
    }

    entities
}

Option 2: ML-Based NER (Future Enhancement)

Use a lightweight NER model (e.g., spaCy, Hugging Face transformers) for more robust entity extraction:

# Example with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")

def extract_entities_ml(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

Option 3: LLM-Based Extraction (Future Enhancement)

Use an LLM to extract structured fingerprints:

Prompt: Extract a structured fingerprint from this market:
Title: "Will Trump buy Greenland?"
Rules: "Resolves Yes if US purchases any part of Greenland from Denmark before Jan 20, 2029"

Expected output:
{
  "entity": "Trump",
  "secondary_entities": ["Greenland", "Denmark", "US"],
  "event_type": "acquisition",
  "resolution_date": "2029-01-20",
  "resolution_source": null
}

Module Structure (Revised)

arbiter-engine/src/
├── discovery/
│   ├── mod.rs
│   ├── scanner.rs                  # DiscoveryScannerActor
│   ├── normalizer.rs               # Text normalization
│   ├── matcher.rs                  # Text similarity (existing)
│   ├── fingerprint.rs              # NEW: Fingerprint extraction
│   ├── fingerprint_matcher.rs      # NEW: Field-based matching
│   ├── entity_extractor.rs         # NEW: Entity extraction (NER)
│   ├── candidate.rs                # CandidateMatch types
│   ├── storage.rs                  # SQLite persistence
│   └── approval.rs                 # Human approval workflow
└── market/
    └── discovery_client/
        ├── mod.rs
        ├── polymarket_gamma.rs
        └── kalshi_markets.rs       # With mve_filter=exclude

CLI Interface (Enhanced)

# Discover markets with fingerprint matching
cargo run --features discovery -- --discover-markets --verbose

# Show fingerprint for a specific market (debugging)
cargo run --features discovery -- --show-fingerprint --ticker "KXGREENLAND-29"

# Review candidates with fingerprint diff
cargo run --features discovery -- --review-candidates

# Import from external matching service (future)
cargo run --features discovery -- --import-matches --source matchr

External Service Integration (Future)

For validation or as a fallback, integrate with existing matching services:

/// External matching service trait
#[async_trait]
pub trait ExternalMatchingService {
    /// Query for known matches of a market
    async fn find_matches(&self, market_id: &str) -> Result<Vec<ExternalMatch>, Error>;

    /// Validate a proposed match
    async fn validate_match(&self, pair: &MarketPair) -> Result<ValidationResult, Error>;
}

/// Implementations
pub struct MatchrClient { /* curated database */ }
pub struct DomeClient { /* unified API */ }
pub struct PmxtClient { /* open-source library */ }

Requirements Traceability

Existing Requirements Implemented:

| ID | Requirement | Status | Implementation |
|----|-------------|--------|----------------|
| FR-MD-001 | Persistent cache of mappings | Complete | SQLite storage |
| FR-MD-002 | Fuzzy matching as suggestion engine only | Revision needed | Fingerprint matching |
| FR-MD-003 | Human Confirmation required | Complete | CLI approval workflow |
| FR-MD-004 | Auto-discover markets by expiration | Complete | Scanner with date filter |
| FR-MD-005 | Track resolution status and dates | Complete | DiscoveredMarket fields |
| FR-MD-006 | Enumerate Polymarket Gamma API | Complete | polymarket_gamma.rs |
| FR-MD-007 | Enumerate Kalshi /v2/markets | Complete | kalshi_markets.rs |
| FR-MD-008 | Semantic warning detection | Complete | Warning flags |
| FR-MD-009 | Audit logging | Complete | JSONL audit log |

New/Revised Requirements:

| ID | Requirement | Phase | Priority |
|----|-------------|-------|----------|
| FR-MD-011 | Fingerprint extraction from market titles/rules | 2a | Must |
| FR-MD-012 | Entity-based candidate generation | 2a | Must |
| FR-MD-013 | Field-weighted similarity scoring | 2b | Must |
| FR-MD-014 | Rule-based named entity recognition | 2a | Must |
| FR-MD-015 | ML-based NER integration | 2c | Could |
| FR-MD-016 | Embedding-based semantic matching | 3b | Should |
| FR-MD-017 | External service integration (Matchr/Dome) | 3c | Could |
| FR-MD-018 | Embedding model evaluation and selection | 3a | Should |
| FR-MD-019 | Vector storage integration | 3a | Should |
| FR-MD-020 | Batch embedding generation pipeline | 3a | Should |
| FR-MD-021 | Hybrid scoring algorithm | 3b | Should |
| FR-MD-022 | Confidence calibration | 3b | Should |
| FR-MD-023 | Contrastive fine-tuning on approved pairs | 3c | Could |
| FR-MD-024 | LLM verification prompt engineering | 4a | Should |
| FR-MD-025 | Cost-optimized LLM invocation | 4a | Should |
| FR-MD-026 | Automated escalation to LLM | 4b | Should |
| FR-MD-027 | Resolution criteria deep analysis | 4c | Could |
| FR-MD-028 | Decision logging with feedback | 5a | Should |
| FR-MD-029 | Training data export pipeline | 5a | Should |
| FR-MD-030 | Automatic entity alias learning | 5b | Should |
| FR-MD-031 | Fingerprint weight optimization | 5b | Could |
| FR-MD-032 | Continuous evaluation and retraining | 5c | Could |

Migration Path

Phase 2a: Fingerprint Foundation

  1. Implement MarketFingerprint struct
  2. Implement rule-based entity extractor
  3. Add fingerprint storage to SQLite schema
  4. Unit tests for extraction accuracy

Phase 2b: Fingerprint Matcher

  1. Implement FingerprintMatcher with weighted scoring
  2. Replace text similarity as primary matching method
  3. Keep text similarity as fallback/tiebreaker
  4. Integration tests with real market data

Phase 2c: Validation & Tuning

  1. Test against known market pairs (Greenland, Super Bowl, etc.)
  2. Tune field weights based on precision/recall
  3. Add ML-based NER if rule-based insufficient
  4. Council review of revised implementation

Phase 3: Embedding-Based Semantic Matching (Option C)

Goal: Add vector embedding similarity as a complementary matching signal that captures semantic meaning beyond structured fields.

Phase 3a: Embedding Infrastructure

Requirements: FR-MD-018, FR-MD-019, FR-MD-020

  1. Model Selection
     • Evaluate embedding models for prediction market domain:
       - all-MiniLM-L6-v2 (384 dims, fast, local)
       - text-embedding-3-small (1536 dims, OpenAI API)
       - voyage-finance-2 (1024 dims, finance-tuned)
     • Benchmark on golden set of known market pairs
     • Selection criteria: F1 score ≥ 0.85 on domain, latency ≤ 100ms

  2. Vector Storage
     • Option A: SQLite with sqlite-vec extension (simple, local)
     • Option B: PostgreSQL with pgvector (scalable, production)
     • Option C: FAISS index with SQLite metadata (fast ANN search)
     • Schema addition (note: the ivfflat index syntax requires pgvector, Option B):

    ALTER TABLE discovered_markets ADD COLUMN embedding BLOB;
    CREATE INDEX idx_embedding ON discovered_markets USING ivfflat (embedding vector_cosine_ops);

  3. Embedding Pipeline
     • Batch embedding generation during market discovery scan
     • Incremental updates for new markets
     • Cache embeddings to avoid recomputation

Phase 3b: Hybrid Matching Integration

Requirements: FR-MD-021, FR-MD-022

  1. Semantic Candidate Generation

    impl SemanticMatcher {
        /// Find semantically similar markets using embedding search
        pub async fn find_similar(&self, market: &DiscoveredMarket, k: usize) -> Result<Vec<SimilarMarket>, Error> {
            let embedding = self.embed(&market.title, &market.description).await?;
            self.vector_store.nearest_neighbors(&embedding, k).await
        }
    }
    

  2. Hybrid Scoring Algorithm

    final_score = α × fingerprint_score + β × embedding_similarity + γ × text_similarity
    
    Where (configurable, default values):
    - α = 0.50 (fingerprint weight)
    - β = 0.40 (embedding weight)
    - γ = 0.10 (text similarity fallback)
    

Validation Plan: Hybrid weights will be determined empirically via grid search over the golden set. Initial values are estimates based on: (1) fingerprint provides structured matching, (2) embeddings capture semantics, (3) text similarity as fallback for simple cases. Target: F1 ≥ 0.90 on held-out test set.

  3. Confidence Calibration
     • Track score distributions for true/false matches
     • Calibrate thresholds to achieve target precision/recall
     • Separate thresholds for different market categories

Phase 3c: Domain Adaptation & Fine-Tuning

Requirements: FR-MD-023

  1. Training Data Collection
     • Export approved match pairs as positive examples
     • Export rejected pairs as negative examples
     • Export human-modified entity mappings
     • Target: 500+ labeled pairs before fine-tuning

  2. Contrastive Fine-Tuning

Note: This Python code is for the ML training pipeline only (offline batch process). The trained model is exported to ONNX format for use in the Rust runtime via the ort (ONNX Runtime) crate.

# Fine-tune embedding model on prediction market pairs
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load pre-trained model as starting point
model = SentenceTransformer('all-MiniLM-L6-v2')

# Labeled pairs from human decisions:
# label 1.0 = approved match, label 0.0 = rejected candidate (hard negative)
train_examples = [
    InputExample(texts=[kalshi_title, poly_title], label=label)
    for kalshi_title, poly_title, label in exported_decisions  # from DecisionLogger export
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# ContrastiveLoss pulls matching pairs together in embedding space
# while pushing non-matching pairs apart. This teaches the model
# that "Super Bowl" and "Pro Football Championship" should be close.
train_loss = losses.ContrastiveLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,        # Few epochs to avoid overfitting on small dataset
    warmup_steps=100 # Gradual learning rate increase
)
  3. A/B Testing
     • Deploy fine-tuned model alongside base model
     • Compare precision/recall on new market pairs
     • Gradual rollout based on performance

Phase 4: LLM-Based Verification (Option D)

Goal: Use LLM reasoning for high-confidence verification of uncertain matches and deep resolution criteria analysis.

Phase 4a: LLM Verification Pipeline

Requirements: FR-MD-024, FR-MD-025

  1. Structured Verification Prompt

    You are a prediction market analyst. Determine if these two markets are equivalent.
    
    MARKET A (Kalshi):
    - Title: "{kalshi_title}"
    - Resolution: "{kalshi_rules}"
    - Expiration: {kalshi_date}
    
    MARKET B (Polymarket):
    - Title: "{poly_title}"
    - Resolution: "{poly_rules}"
    - Expiration: {poly_date}
    
    Analyze:
    1. Are they about the SAME underlying event? (not just similar topics)
    2. Would "Yes" on Market A correspond to "Yes" on Market B?
    3. Are the resolution criteria compatible? List any differences.
    4. Could different resolution timing cause different outcomes?
    
    Output JSON:
    {
      "equivalent": true|false,
      "confidence": 0.0-1.0,
      "reasoning": "...",
      "warnings": ["..."],
      "resolution_differences": ["..."]
    }
    

  2. Cost-Optimized Invocation
    • Only invoke LLM for:
      • Fingerprint score between 0.70-0.85 (uncertain zone)
      • High-value markets (volume > $10k)
      • Markets with semantic warnings
    • Use Claude Haiku for initial screening ($0.001/verification)
    • Escalate to Claude Sonnet for complex cases ($0.01/verification)
    • Budget cap: $50/day default, configurable

  3. Response Parsing & Validation

    #[derive(Deserialize)]
    pub struct LlmVerificationResult {
        pub equivalent: bool,
        pub confidence: f64,
        pub reasoning: String,
        pub warnings: Vec<String>,
        pub resolution_differences: Vec<String>,
    }
    
    impl LlmVerifier {
        pub async fn verify(&self, pair: &CandidateMatch) -> Result<LlmVerificationResult, Error> {
            let prompt = self.build_prompt(pair);
            let response = self.llm_client.complete(&prompt).await?;
            self.parse_response(&response)
        }
    }
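The cost-optimized invocation policy above can be sketched as a routing function. This is an illustrative Python sketch; the function name, model labels, and parameter names are assumptions, and only the thresholds ($10k volume, 0.70-0.85 zone, $50/day cap) come from the policy text:

```python
def route_verification(fingerprint_score, volume_usd, has_warnings,
                       spent_today_usd, daily_budget_usd=50.0):
    """Decide which model (if any) verifies a candidate match.

    Returns 'skip', 'haiku', or 'sonnet'. Mirrors the invocation policy:
    LLM calls only for uncertain, high-value, or warning-flagged candidates,
    and never past the daily budget cap.
    """
    uncertain = 0.70 <= fingerprint_score < 0.85
    high_value = volume_usd > 10_000
    if not (uncertain or high_value or has_warnings):
        return 'skip'   # confident score, low stakes: save the LLM cost
    if spent_today_usd >= daily_budget_usd:
        return 'skip'   # budget cap reached: defer to the human review queue
    # Cheap screening first; escalate complex, high-stakes cases
    return 'sonnet' if has_warnings and high_value else 'haiku'
```

In practice the budget counter would live in shared state updated after each billed call; here it is passed in for clarity.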
    

Phase 4b: Automated Escalation

Requirements: FR-MD-026

  1. Uncertainty Detection
    • Fingerprint score in "uncertain zone" (0.70-0.85)
    • Conflicting signals (high entity match, low date match)
    • Semantic warnings present
    • Resolution criteria contain complex conditions

  2. Escalation Rules

    impl EscalationPolicy {
        pub fn should_escalate_to_llm(&self, result: &MatchResult) -> bool {
            // Uncertain fingerprint score
            if result.score >= 0.70 && result.score < 0.85 {
                return true;
            }

            // High variance in field scores (variance() is assumed to come
            // from a statistics helper trait, e.g. the statrs crate)
            let variance = result.field_scores.values().variance();
            if variance > 0.15 {
                return true;
            }

            // Semantic warnings present
            if !result.warnings.is_empty() {
                return true;
            }

            false
        }
    }

  3. Human Review of LLM Decisions
    • Initially: All LLM-verified matches require human confirmation
    • After calibration (100+ decisions): Auto-approve if LLM confidence ≥ 0.95
    • Always require human review if LLM identifies resolution differences
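The staged review policy can be sketched as a small decision function. Python is used here for illustration; the function and field names are assumptions, while the thresholds (100 decisions, 0.95 confidence, mandatory review on resolution differences) come from the policy above:

```python
def review_action(llm_confidence, resolution_differences, decisions_so_far):
    """Apply the staged review policy for LLM-verified matches.

    Returns 'auto_approve' or 'human_review'.
    """
    if resolution_differences:
        return 'human_review'   # flagged differences always need a human
    if decisions_so_far < 100:
        return 'human_review'   # pre-calibration: everything is reviewed
    if llm_confidence >= 0.95:
        return 'auto_approve'
    return 'human_review'
```

Keeping the policy in one pure function makes it trivial to unit-test and to audit against the safety requirements (FR-MD-003).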

Phase 4c: Resolution Criteria Deep Analysis

Requirements: FR-MD-027

  1. Structured Resolution Comparison

    Analyze the resolution criteria for these markets:
    
    Market A Resolution: "{criteria_a}"
    Market B Resolution: "{criteria_b}"
    
    Extract and compare:
    1. Resolution SOURCE (who determines outcome)
    2. Resolution TIMING (when is outcome determined)
    3. Resolution THRESHOLD (what conditions trigger Yes/No)
    4. Edge cases (what happens if ambiguous)
    
    Output structured comparison with compatibility assessment.
    

  2. Semantic Difference Detection
    • "Announcement" vs "actual event" timing
    • Different authoritative sources (AP vs Reuters)
    • Different thresholds or measurement periods
    • Geographic scope differences

  3. Human-Readable Reports
    • Generate side-by-side comparison for human reviewers
    • Highlight specific text differences
    • Provide recommendation with confidence level

Phase 5: Reinforcement Learning from Human Feedback

Goal: Create a continuous improvement loop where human approval decisions improve all matching components over time.

Phase 5a: Feedback Data Collection

Requirements: FR-MD-028, FR-MD-029

  1. Decision Logging Schema

    CREATE TABLE match_decisions (
        id UUID PRIMARY KEY,
        candidate_id UUID REFERENCES candidates(id),
        decision TEXT CHECK (decision IN ('approved', 'rejected', 'modified')),
        reviewer_id TEXT,
        decision_timestamp TIMESTAMP,
    
        -- Context at decision time
        fingerprint_score REAL,
        embedding_similarity REAL,
        llm_confidence REAL,
    
        -- Feedback data
        rejection_reason TEXT,
        entity_corrections JSONB,  -- {"old": "BTC", "new": "Bitcoin"}
        resolution_notes TEXT,
    
        -- For training
        is_training_example BOOLEAN DEFAULT true
    );
    

  2. Feedback Categories
    • Entity Corrections: Human corrects entity extraction errors
    • Alias Additions: Human identifies new synonyms/aliases
    • False Positive Patterns: Common rejection reasons
    • Edge Case Documentation: Complex matches with notes

  3. Export for Training

    impl FeedbackExporter {
        /// Export approved pairs as positive training examples
        pub fn export_positive_pairs(&self) -> Vec<TrainingPair> {
            self.storage.query_decisions("approved")
                .map(|d| TrainingPair {
                    anchor: d.kalshi_title,
                    positive: d.poly_title,
                    metadata: d.entity_corrections,
                })
                .collect()
        }
    
        /// Export rejected pairs as hard negatives
        pub fn export_negative_pairs(&self) -> Vec<TrainingPair> {
            self.storage.query_decisions("rejected")
                .filter(|d| d.fingerprint_score > 0.5)  // Hard negatives only
                .map(|d| TrainingPair {
                    anchor: d.kalshi_title,
                    negative: d.poly_title,
                    rejection_reason: d.rejection_reason,
                })
                .collect()
        }
    }
    

Phase 5b: Automatic Improvements

Requirements: FR-MD-030, FR-MD-031

  1. Entity Alias Database Updates

    impl AliasLearner {
        /// Learn new aliases from approved matches
        pub fn learn_from_approval(&mut self, decision: &MatchDecision) {
            if let Some(corrections) = &decision.entity_corrections {
                for (old, new) in corrections {
                    // Assumed signature: add_alias(canonical, alias).
                    // "new" is the canonical name, "old" the learned alias.
                    self.alias_db.add_alias(new, old);
                    log::info!("Learned alias: {} -> {}", old, new);
                }
            }
    
            // Also learn implicit aliases from matched pairs
            let kalshi_entities = extract_entities(&decision.kalshi_title);
            let poly_entities = extract_entities(&decision.poly_title);
    
            for (k, p) in self.align_entities(&kalshi_entities, &poly_entities) {
                if k.name != p.name && k.entity_type == p.entity_type {
                    self.alias_db.add_alias(&k.name, &p.name);
                }
            }
        }
    }
    

  2. Fingerprint Weight Optimization

    impl WeightOptimizer {
        /// Optimize field weights based on historical decisions
        /// Note: f64 is appropriate here for ML model features/labels
        /// (not financial values - those use rust_decimal::Decimal)
        pub fn optimize(&self, decisions: &[MatchDecision]) -> FieldWeights {
            // Use logistic regression to find optimal weights
            let features: Vec<Vec<f64>> = decisions.iter()
                .map(|d| vec![
                    d.entity_score,
                    d.date_score,
                    d.threshold_score,
                    d.outcome_score,
                    d.source_score,
                ])
                .collect();
    
            let labels: Vec<f64> = decisions.iter()
                .map(|d| if d.decision == "approved" { 1.0 } else { 0.0 })
                .collect();
    
            let model = LogisticRegression::fit(&features, &labels);
    
            FieldWeights {
                entity: model.coefficients[0].abs(),
                date: model.coefficients[1].abs(),
                threshold: model.coefficients[2].abs(),
                outcome: model.coefficients[3].abs(),
                source: model.coefficients[4].abs(),
            }.normalized()
        }
    }
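The final `.abs()` + `normalized()` step above can be sketched in a few lines; this Python version is illustrative (the function name is an assumption) and simply scales coefficient magnitudes to sum to 1.0:

```python
def normalized_weights(coefficients):
    """Turn raw logistic-regression coefficients into field weights:
    take absolute values, then scale so the weights sum to 1.0.
    """
    mags = [abs(c) for c in coefficients]
    total = sum(mags)
    if total == 0:
        # Degenerate model (all-zero coefficients): fall back to uniform
        return [1.0 / len(mags)] * len(mags)
    return [m / total for m in mags]
```

Normalizing keeps the hybrid score in [0, 1] regardless of how the regression scales its coefficients.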
    

  3. Semantic Warning Pattern Learning
    • Analyze rejection reasons to identify new warning patterns
    • Add learned patterns to semantic warning detection
    • Reduce false negatives from undetected issues
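Pattern mining over rejection reasons can start as simply as counting recurring terms. This is a naive bag-of-words sketch (function name, stopword list, and support threshold are assumptions); a production version might cluster full phrases instead:

```python
import re
from collections import Counter

def mine_warning_patterns(rejection_reasons, min_support=3):
    """Surface frequently recurring terms in rejection reasons as
    candidate warning patterns for human review.
    """
    stopwords = {'the', 'a', 'is', 'are', 'and', 'of', 'to', 'not'}
    counts = Counter()
    for reason in rejection_reasons:
        # One count per reason, so a repeated word in one note isn't inflated
        words = set(re.findall(r'[a-z]+', reason.lower())) - stopwords
        counts.update(words)
    return [w for w, n in counts.most_common() if n >= min_support]
```

Mined terms are only candidates; a human still decides which become actual semantic-warning rules.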

Phase 5c: Continuous Evaluation & Retraining

Requirements: FR-MD-032

  1. Golden Set Maintenance
    • Add all approved/rejected pairs to golden test set
    • Stratify by market category (politics, crypto, sports, etc.)
    • Target: 100+ pairs per category

  2. Automated Regression Testing

    # Run weekly evaluation against golden set
    cargo run --features discovery -- --evaluate-matching \
        --golden-set data/golden_pairs.json \
        --output reports/weekly_eval.json

    # Alert if metrics degrade (jq -e exits non-zero when the filter
    # yields false, giving a float-safe comparison in shell)
    if jq -e '.f1_score < 0.85' reports/weekly_eval.json > /dev/null; then
        notify "Matching quality degraded: F1 < 0.85"
    fi

  3. Retraining Pipeline

    Weekly Cycle:
    1. Export new training data from decisions (Mon)
    2. Fine-tune embedding model (Tue)
    3. Optimize fingerprint weights (Wed)
    4. A/B test new models (Thu-Sat)
    5. Promote if metrics improve (Sun)

  4. Model Versioning
    • Track model versions with performance metrics
    • Enable rollback if new model underperforms
    • Maintain audit trail of model changes

  5. Weekly Improvement Cycle

┌─────────────────────────────────────────────────────────────┐
│                  Weekly Improvement Cycle                    │
├─────────────────────────────────────────────────────────────┤
│  Monday:    Export new decisions, update golden set         │
│  Tuesday:   Retrain embedding model, optimize weights       │
│  Wednesday: Validate new models on golden set               │
│  Thursday-Saturday: A/B test (10% traffic)                  │
│  Sunday:    Promote if improved, rollback if degraded       │
└─────────────────────────────────────────────────────────────┘

Extended Requirements (Phase 3-5)

ID Requirement Phase Priority
FR-MD-018 Embedding model evaluation and selection 3a Should
FR-MD-019 Vector storage integration (sqlite-vec or pgvector) 3a Should
FR-MD-020 Batch embedding generation pipeline 3a Should
FR-MD-021 Hybrid scoring (fingerprint + embedding + text) 3b Should
FR-MD-022 Confidence calibration for hybrid scores 3b Should
FR-MD-023 Contrastive fine-tuning on approved pairs 3c Could
FR-MD-024 LLM verification prompt engineering 4a Should
FR-MD-025 Cost-optimized LLM invocation strategy 4a Should
FR-MD-026 Automated escalation to LLM for uncertain matches 4b Should
FR-MD-027 Resolution criteria deep analysis via LLM 4c Could
FR-MD-028 Decision logging with feedback data 5a Should
FR-MD-029 Training data export pipeline 5a Should
FR-MD-030 Automatic entity alias learning 5b Should
FR-MD-031 Fingerprint weight optimization from feedback 5b Could
FR-MD-032 Continuous evaluation and retraining pipeline 5c Could

Operations & Deployment

CI/CD Pipeline

The discovery feature uses Cargo feature flags for conditional compilation. CI/CD pipelines must explicitly enable the feature for testing.

GitHub Actions Configuration (.github/workflows/ci.yml):

discovery-tests:
  name: Discovery Feature Tests
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Install Rust
      uses: dtolnay/rust-toolchain@stable
    - name: Run discovery unit tests
      run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery
    - name: Run discovery integration tests
      run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored
      env:
        KALSHI_DEMO_KEY_ID: ${{ secrets.KALSHI_DEMO_KEY_ID }}
        KALSHI_DEMO_PRIVATE_KEY: ${{ secrets.KALSHI_DEMO_PRIVATE_KEY }}

Feature Flag Management: - Development: cargo run --features discovery -- --discover-markets - Production: Enable via ECS task definition environment variable CARGO_FEATURES=discovery - Gradual rollout: Percentage-based feature flag (future via LaunchDarkly or similar)

Deployment Pipeline: 1. PR → CI tests (unit + integration with mocked APIs) 2. Merge to main → Build Docker image with discovery feature 3. Deploy to staging → E2E validation against Kalshi demo API 4. Deploy to production → Gradual rollout with monitoring

Container Image Versioning: - Tag convention: v{version}-{git-sha} (e.g., v1.2.0-abc1234) - Latest stable: latest tag points to production-ready image - Image promotion: dev → staging → production via re-tagging - Rollback: Deploy previous image by digest (immutable reference)

Health Check Endpoint:

GET /health/discovery
Response: 200 OK
{
  "status": "healthy|degraded|unhealthy",
  "last_scan_at": "2026-01-23T10:30:00Z",
  "last_scan_duration_ms": 45000,
  "pending_candidates": 12,
  "api_status": {
    "polymarket": "healthy",
    "kalshi": "healthy"
  }
}
  • healthy: Scanner running, APIs accessible, no recent errors
  • degraded: Scanner running but one API unavailable or rate-limited
  • unhealthy: Scanner not running or database unavailable
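The status field can be derived from the component checks with a simple precedence rule (unhealthy > degraded > healthy). This Python sketch is illustrative; the parameter names are assumptions matching the example response:

```python
def discovery_health(scanner_running, db_ok, api_status):
    """Derive the /health/discovery status from component checks.

    `api_status` maps platform name -> bool (reachable and not rate-limited).
    """
    if not scanner_running or not db_ok:
        return 'unhealthy'   # core loop or storage is down
    if not all(api_status.values()):
        return 'degraded'    # scanner fine, but a platform API is impaired
    return 'healthy'
```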

Environment Configuration

Environment API Endpoints Credentials Purpose
Local Production APIs .env file Development
Demo Kalshi Demo API (--kalshi-demo) Demo credentials Integration testing
Staging Production APIs AWS Secrets Manager Pre-production validation
Production Production APIs AWS Secrets Manager Live operation

Environment-Specific Settings:

# Local development
DISCOVERY_SCAN_INTERVAL_SECS=3600
DISCOVERY_BATCH_SIZE=100
DISCOVERY_DB_PATH=./discovery.db

# Staging
DISCOVERY_SCAN_INTERVAL_SECS=1800
DISCOVERY_BATCH_SIZE=500
DISCOVERY_DB_PATH=/data/discovery.db

# Production
DISCOVERY_SCAN_INTERVAL_SECS=900
DISCOVERY_BATCH_SIZE=1000
DISCOVERY_DB_PATH=/data/discovery.db

Security Considerations

API Credential Management: - Polymarket Gamma API: Public API, no authentication required - Kalshi /v2/markets: Uses existing RSA-PSS authentication (ADR-009) - Key Rotation: Leverage existing key rotation infrastructure via AWS Secrets Manager - Secret Storage: All credentials stored in AWS Secrets Manager, never in code or config files

Rate Limiting Compliance:

Platform Rate Limit Implementation Monitoring
Polymarket Gamma 60 req/min Token bucket limiter CloudWatch discovery/api/rate_limit_errors
Kalshi 100 req/min Existing client limiter CloudWatch discovery/api/rate_limit_errors
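A token bucket (the limiter named in the table) refills continuously at the permitted rate and spends one token per request, allowing short bursts up to the bucket capacity. This is a minimal illustrative sketch, not the Rust client's implementation; the time parameter is injected to keep it testable:

```python
class TokenBucket:
    """Minimal token-bucket limiter, e.g. 60 req/min for the Gamma API."""

    def __init__(self, rate_per_min, capacity=None):
        self.rate = rate_per_min / 60.0        # tokens added per second
        self.capacity = capacity or rate_per_min
        self.tokens = float(self.capacity)     # start full: allow a burst
        self.last = 0.0

    def allow(self, now):
        """Return True if a request may proceed at time `now` (seconds)."""
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Capping capacity below the per-minute rate smooths bursts further, at the cost of lower peak throughput.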

Audit Trail Requirements (FR-MD-009): - All candidate approvals/rejections logged with timestamp, reviewer ID, reason - Decision context preserved (scores, features, warnings acknowledged) - JSONL format for compliance export: discovery_audit.jsonl - Retention: 7 years per financial services compliance

Data Privacy: - Market data cached locally in SQLite (no PII) - Embeddings computed locally (Phase 3a uses local models) - No market data exported to external services without explicit configuration

Testing Strategy

Test Pyramid:

          /\
         /  \    E2E (Demo environment, Kalshi demo API)
        /----\
       /      \   Integration (wiremock for HTTP mocking)
      /--------\
     /          \  Unit tests (48 existing, inline in modules)
    /------------\

Unit Tests (48 existing): - candidate.rs: 5 tests (CandidateMatch types, status transitions) - storage.rs: 7 tests (SQLite CRUD, audit logging) - normalizer.rs: 3 tests (text normalization) - matcher.rs: 7 tests (similarity scoring, pre-filtering) - polymarket_gamma.rs: 4 tests (API client, pagination) - kalshi_markets.rs: 4 tests (API client, mve_filter) - scanner.rs: 5 tests (deduplication, batch processing) - approval.rs: 5 tests (warning acknowledgment, safety gates) - cli.rs: 8 tests (CLI argument parsing, integration)

Run with: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery

Integration Tests (wiremock): - Mock Gamma API responses with realistic market data - Mock Kalshi API responses including pagination cursors - Test rate limiting behavior under load - Test error handling and retry logic

Run with: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored

E2E Tests (Demo environment): - Full scan against Kalshi demo API (--kalshi-demo flag) - Validate candidate generation produces realistic results - Manual approval workflow verification - Audit log generation verification

Run with: cargo run --manifest-path arbiter-engine/Cargo.toml --features discovery -- --kalshi-demo --discover-markets

Golden Set Validation: - Known market pairs maintained in tests/golden_pairs.json - Automated weekly evaluation via CI job - Alert on F1 score degradation below 0.85 threshold

Production Readiness

Monitoring Metrics (CloudWatch):

Metric Type Description Alert Threshold
discovery/scan/duration_ms Timer Scan cycle duration > 5 min (P2)
discovery/scan/errors Counter Scan failures 3 consecutive (P1)
discovery/candidates/generated Gauge Candidates per scan -
discovery/candidates/approved Counter Approval count -
discovery/candidates/approval_rate Gauge Approval rate < 10% (P2)
discovery/api/rate_limit_errors Counter Rate limit violations > 10/hour (P2)
discovery/api/latency_ms Timer API response times p99 > 1s (P3)

Alerting Rules:

Priority Condition Action
P1 Scan failure (3 consecutive) Page on-call, investigate immediately
P2 Approval rate < 10% Slack alert, review matching thresholds
P2 Rate limit errors > 10/hour Slack alert, increase scan interval
P3 Scan duration > 5 minutes Slack alert, review batch size

Logging (tracing): - Structured JSON output via tracing-subscriber - Log levels: ERROR (failures), WARN (rate limits), INFO (scan results), DEBUG (matching details) - Correlation IDs for request tracing

Scaling Considerations:

Phase Configuration Capacity
MVP Single scanner, SQLite, hourly scan ~2,000 markets/platform
Scale Leader election, PostgreSQL, 15-min scan ~10,000 markets/platform
Embedding Dedicated embedding service, pgvector ~50,000 markets

Disaster Recovery:

Scenario RTO RPO Procedure
Scanner crash 5 min 0 ECS auto-restart
Database corruption 30 min 24 hr Restore from S3 backup
API outage N/A N/A Graceful degradation, retry with backoff
Region failure 4 hr 1 hr Cross-region restore from S3

Backup Strategy: - SQLite: Daily S3 backup via ECS scheduled task - PostgreSQL (future): Aurora automated backups (7-day retention) - Audit logs: S3 archival (90-day hot, 7-year cold storage)

Graceful Degradation: - API failure: Retry with exponential backoff, continue with available platform - Database failure: Read-only mode, serve cached candidates - Embedding service down (Phase 3): Fallback to fingerprint-only matching
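The "retry with exponential backoff" behavior amounts to a capped geometric delay schedule. A deterministic sketch (function name and defaults are assumptions); a production version would add jitter to avoid synchronized retries against the platform APIs:

```python
def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Retry schedule in seconds for transient API failures:
    exponential growth (base * 2^attempt) capped at `cap`.
    """
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]
```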

Runbook

1. Discovery scan not running:

# Check feature flag is enabled
kubectl logs -l app=trading-core | grep "discovery"

# Verify environment variable
kubectl exec -it trading-core-xxx -- env | grep DISCOVERY

# Check scanner actor health
kubectl exec -it trading-core-xxx -- curl localhost:8080/health/discovery

2. High rate limit errors:

# Check current rate limit metrics
aws cloudwatch get-metric-statistics --namespace arbiter --metric-name "discovery/api/rate_limit_errors"

# Increase scan interval temporarily
kubectl set env deployment/trading-core DISCOVERY_SCAN_INTERVAL_SECS=7200

# Review API quotas with platform

3. Low approval rate (<10%):

# Review recent candidates
cargo run --features discovery -- --list-candidates --status pending --limit 20

# Check matching threshold configuration
grep "threshold" config/discovery.toml

# Review fingerprint weight configuration
grep "weight" config/discovery.toml

4. Database corruption:

# Stop scanner
kubectl scale deployment/trading-core --replicas=0

# Restore from S3 backup
aws s3 cp s3://arbiter-backups/discovery/latest.db /data/discovery.db

# Verify integrity
sqlite3 /data/discovery.db "PRAGMA integrity_check"

# Restart scanner
kubectl scale deployment/trading-core --replicas=1

Incident Response Summary: - P1 (Scanner failure): PagerDuty alert → On-call acknowledges → Investigate logs → Escalate if not resolved in 30 min - P2 (Quality degradation): Slack alert → Next business day review → Adjust thresholds - P3 (Performance warning): Ticket created → Sprint backlog


Consequences

Positive

  • Higher accuracy: Fingerprint matching catches semantic equivalence
  • Explainable: Field-by-field comparison is auditable
  • Extensible: Easy to add new entity types, fields, or ML models
  • Industry-aligned: Matches approach used by successful tools

Negative

  • Increased complexity: More code to maintain
  • Entity extraction errors: NER may miss or misclassify entities
  • Requires tuning: Field weights need empirical optimization

Neutral

  • Existing code preserved: Text similarity remains as fallback
  • Human review unchanged: FR-MD-003 safety gate preserved

Safety Guarantees (Unchanged)

  1. Human-in-the-loop preserved: Candidates require explicit approval
  2. FR-MD-003 maintained: Uses existing MappingManager.verify_mapping()
  3. Semantic warnings block quick approval: Must acknowledge settlement differences
  4. Audit trail: All approvals/rejections logged with timestamp and reviewer
  5. No automated trading on unverified pairs: Matches existing safety architecture

References

Industry Tools

  • pmxt - Unified API for prediction markets
  • Dome API - Developer infrastructure for prediction markets
  • Matchr - Cross-platform market aggregator (1,500+ curated matches)
  • EventArb - Cross-platform arbitrage calculator
  • Polymarket-Kalshi-Arbitrage-Bot - Open-source bot with entity extraction

Research

API Documentation