Skip to content

Cross-Platform Prediction Market Matching: A Technical Analysis

Version: 1.0 Date: 2026-01-23 Authors: Arbiter-Bot Engineering Team Status: Draft for Council Review


Executive Summary

This white paper analyzes approaches for automatically matching equivalent prediction markets across Polymarket and Kalshi. We evaluate five solution approaches based on accuracy, cost, complexity, and production readiness. Our analysis, informed by industry research and empirical testing, recommends a fingerprint-based matching pipeline as the optimal approach for production deployment.

Key findings: - Pure text similarity (Jaccard/Levenshtein) achieves only 8-9% similarity on semantically equivalent markets - Industry tools universally use entity extraction, manual curation, or hybrid approaches - A three-stage pipeline (candidate generation → fingerprint matching → human verification) balances accuracy, explainability, and safety


Table of Contents

  1. Problem Statement
  2. Industry Landscape
  3. Solution Options Analysis
  4. Recommended Architecture
  5. Implementation Considerations
  6. Evaluation Framework
  7. Phase 3: Embedding-Based Semantic Matching
  8. Phase 4: LLM-Based Verification
  9. Phase 5: Reinforcement Learning from Human Feedback
  10. Operational Excellence
  11. Risk Analysis
  12. Conclusion
  13. References

1. Problem Statement

1.1 The Matching Challenge

Prediction markets on Polymarket and Kalshi frequently cover the same real-world events but use different: - Market titles: "Will Trump buy Greenland?" vs "Will the US acquire part of Greenland in 2026?" - Resolution criteria: "OPM shutdown announcement" vs "actual shutdown exceeding 24 hours" - Outcome structures: Binary (Yes/No) vs multi-outcome ranges - Identifiers: No shared ID system across platforms

1.2 Why Matching Matters

Use Case Impact
Arbitrage detection Price discrepancies between equivalent markets create profit opportunities
Portfolio hedging Cross-platform positions require matched market identification
Market analysis Aggregated data across platforms improves price discovery research
Liquidity routing Smart order routing requires knowing equivalent markets

1.3 Empirical Evidence: Text Similarity Fails

We tested text similarity (Jaccard + Levenshtein) on known market pairs:

Kalshi Title Polymarket Title Jaccard Combined Score
"Will Trump buy Greenland?" "Will the US acquire part of Greenland in 2026?" 8.3% 22.1%
"Will Washington win the 2026 Pro Football Championship?" "Super Bowl Champion 2026" 9.1% 18.5%
"Fed rate cut before June 2026?" "FOMC to lower rates in Q2 2026?" 12.4% 24.8%

Conclusion: Text similarity algorithms cannot reliably identify semantically equivalent markets. A 60% threshold would miss all valid matches; a 10% threshold would generate thousands of false positives.


2. Industry Landscape

2.1 Commercial Solutions

Tool Approach Matching Method Limitations
Matchr Curated aggregator Human-curated database of 1,500+ matched markets Not programmable, no API
Dome API Unified API Manual market mapping by Dome team Subscription cost, external dependency
EventArb Arbitrage calculator Manual market selection by user No automated discovery
Verso Terminal UI Internal normalization layer Closed source, no matching API

2.2 Open Source Solutions

Tool Approach Matching Method Limitations
pmxt Unified library Slug-based configuration Manual matching required
Polymarket-Kalshi-Arbitrage-Bot Arbitrage bot Entity extraction + text similarity Limited documentation
Various GitHub bots Custom implementations Heuristic string/date matching Brittle, unmaintained

2.3 Key Industry Insight

No production tool relies solely on text similarity. All successful implementations use one or more of:

  1. Manual curation: Human-verified match databases (Matchr, Dome)
  2. Slug/ID configuration: User specifies which markets to compare (pmxt)
  3. Entity extraction: Extract structured fields and match on semantics
  4. Hybrid approaches: Combine multiple signals with human verification

3. Solution Options Analysis

3.1 Option A: Pure Text Similarity

Approach: Compute string similarity (Jaccard, Levenshtein, cosine) between market titles.

score = 0.6 × Jaccard(tokens) + 0.4 × Levenshtein_normalized
Criterion Assessment
Accuracy Low (8-9% on real pairs)
Cost No per-match API costs
Latency < 1ms
Complexity Low
Explainability High
Production Ready No

Why it fails: - Synonym blindness: "Super Bowl" ≠ "Pro Football Championship" - Paraphrase blindness: "buy" ≠ "acquire" - Stop word dilution: Signal words overwhelmed by common words

Verdict: Insufficient for production use.


3.2 Option B: Fingerprint-Based Matching

Approach: Extract structured "fingerprints" from markets and match on canonical fields.

struct MarketFingerprint {
    entity: String,              // "Trump", "Bitcoin", "Fed"
    secondary_entities: Vec<String>,
    event_type: EventType,       // Election, PriceTarget, Economic
    metric: Option<MetricSpec>,  // "price >= $100,000"
    resolution_date: Option<Date>,
    resolution_source: Option<String>,
    outcome_type: OutcomeType,   // Binary, Multi, Range
}
Criterion Assessment
Accuracy High (matches on semantics)
Cost No per-match API costs (local on cached data)
Latency ~10ms (entity extraction)
Complexity Medium
Explainability High (field-by-field)
Production Ready Yes

Algorithm: 1. Extract fingerprint from each market title + rules 2. Generate candidates by keyword/date overlap (fast) 3. Score fingerprint similarity with weighted fields 4. Create candidate for human review if score ≥ 0.70

Field weights (empirically tuned): - Entity match: 30% - Date match: 25% - Threshold match: 20% - Outcome structure: 15% - Resolution source: 10%

Verdict: Recommended for production.


3.3 Option C: Embedding-Based Semantic Matching

Approach: Generate dense vector embeddings of market titles and compute cosine similarity.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

emb_kalshi = model.encode("Will Trump buy Greenland?")
emb_poly = model.encode("Will the US acquire part of Greenland in 2026?")
similarity = cosine_similarity(emb_kalshi, emb_poly)
# Expected: ~0.75-0.85 (much better than Jaccard)
Criterion Assessment
Accuracy Highest (captures semantic meaning)
Cost ~$0.0001/embedding (API) or free (local)
Latency 50-200ms per embedding
Complexity High (embedding service, vector DB)
Explainability Low (black box similarity)
Production Ready Yes, but overkill for MVP

Advantages: - Captures semantic equivalence automatically - No manual entity pattern maintenance - Works on novel market types

Disadvantages: - Black box: hard to explain why two markets match - Requires embedding infrastructure - May match semantically similar but not identical markets

Verdict: Recommended for Phase 3 enhancement (after fingerprint foundation in Phase 2).


3.4 Option D: LLM-Based Verification

Approach: Use an LLM to verify whether two markets are equivalent.

Prompt: Are these two markets about the same event?
Market A: "Will Trump buy Greenland?"
Market B: "Will the US acquire part of Greenland in 2026?"

Consider:
1. Are they about the same underlying event?
2. Do they have compatible resolution criteria?
3. Would a "Yes" on one correspond to "Yes" on the other?

Output: { "match": true, "confidence": 0.92, "warnings": ["Different date scopes"] }
Criterion Assessment
Accuracy Highest (human-level reasoning)
Cost ~$0.01-0.05 per verification
Latency 200-500ms per call
Complexity Low (API call)
Explainability High (LLM provides reasoning)
Production Ready Yes, for high-value verification

Use cases: - Final verification before approving high-value matches - Edge cases where fingerprint matching is uncertain - Resolution criteria comparison

Verdict: Recommended for high-confidence final verification.


3.5 Option E: External Service Integration

Approach: Use existing matching services (Matchr, Dome) as data sources.

Service Integration Method Data Quality Dependency Risk
Matchr Scrape or unofficial API High (curated) Medium (no official API)
Dome Official SDK High High (paid, availability)
pmxt NPM library Medium Low (open source)
Criterion Assessment
Accuracy High (curated by experts)
Cost $0-$500/month depending on service
Latency 100-500ms per query
Complexity Low (API integration)
Explainability Medium (external black box)
Production Ready Yes, as validation source

Advantages: - Immediate access to curated match database - No matching logic maintenance - Validation source for our own matching

Disadvantages: - External dependency (availability, pricing changes) - May not cover all markets we care about - No customization of matching logic

Verdict: Recommended as validation/fallback source.


4.1 Three-Stage Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    Stage 1: Discovery                        │
│  ┌──────────────┐        ┌──────────────┐                   │
│  │ Polymarket   │        │   Kalshi     │                   │
│  │ Gamma API    │        │ /v2/markets  │                   │
│  └──────┬───────┘        └──────┬───────┘                   │
│         │                       │                            │
│         └───────────┬───────────┘                            │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │  Market Enumeration   │                           │
│         │  - Pagination         │                           │
│         │  - mve_filter=exclude │                           │
│         │  - Category filtering │                           │
│         └───────────┬───────────┘                           │
└─────────────────────┼───────────────────────────────────────┘
┌─────────────────────┼───────────────────────────────────────┐
│                     ▼         Stage 2: Matching             │
│         ┌───────────────────────┐                           │
│         │ Fingerprint Extractor │                           │
│         │  - Entity NER         │                           │
│         │  - Date parsing       │                           │
│         │  - Threshold parsing  │                           │
│         └───────────┬───────────┘                           │
│                     │                                        │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │ Candidate Generation  │                           │
│         │  - Keyword index      │                           │
│         │  - Date proximity     │                           │
│         │  - Top N candidates   │                           │
│         └───────────┬───────────┘                           │
│                     │                                        │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │ Fingerprint Matching  │                           │
│         │  - Field-by-field     │                           │
│         │  - Weighted scoring   │                           │
│         │  - Threshold ≥ 0.70   │                           │
│         └───────────┬───────────┘                           │
└─────────────────────┼───────────────────────────────────────┘
┌─────────────────────┼───────────────────────────────────────┐
│                     ▼       Stage 3: Verification           │
│         ┌───────────────────────┐                           │
│         │  Semantic Warnings    │                           │
│         │  - Resolution diff    │                           │
│         │  - Date diff          │                           │
│         │  - Source diff        │                           │
│         └───────────┬───────────┘                           │
│                     │                                        │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │   Human Review CLI    │                           │
│         │  - Fingerprint diff   │                           │
│         │  - Warning ack        │                           │
│         │  - Approve/Reject     │                           │
│         └───────────┬───────────┘                           │
│                     │                                        │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │  Verified Mapping     │                           │
│         │  - MappingManager     │                           │
│         │  - Audit log          │                           │
│         └───────────────────────┘                           │
└─────────────────────────────────────────────────────────────┘

4.2 Fingerprint Schema

/// Canonical market fingerprint for cross-platform matching
pub struct MarketFingerprint {
    /// Primary entity (person, asset, institution)
    pub entity: Entity,

    /// Secondary entities (locations, counterparties)
    pub secondary_entities: Vec<Entity>,

    /// Event classification
    pub event_type: EventType,

    /// Numeric metric and threshold
    pub metric: Option<MetricSpec>,

    /// Geographic or jurisdictional scope
    pub scope: Option<Scope>,

    /// Resolution timing
    pub resolution: ResolutionSpec,

    /// Outcome structure
    pub outcomes: OutcomeSpec,

    /// Original market data (for reference)
    pub source: SourceData,
}

pub struct Entity {
    pub name: String,
    pub entity_type: EntityType,
    pub aliases: Vec<String>,
}

pub enum EntityType {
    Person,      // Trump, Biden, Musk
    Asset,       // Bitcoin, ETH, Gold
    Institution, // Fed, FOMC, ECB
    Team,        // Chiefs, Eagles, Lakers
    Location,    // Greenland, Ukraine, Taiwan
    Event,       // Super Bowl, World Series
}

pub enum EventType {
    Election,
    PriceTarget,
    EconomicIndicator,
    SportOutcome,
    Acquisition,
    PolicyDecision,
    WeatherEvent,
    Other(String),
}

pub struct MetricSpec {
    pub name: String,
    pub direction: Direction,
    pub threshold: f64,
    pub unit: Option<String>,
}

pub enum Direction {
    Above,    // >= threshold
    Below,    // <= threshold
    Between,  // within range
    Exactly,  // == threshold
}

pub struct ResolutionSpec {
    pub date: Option<NaiveDate>,
    pub time: Option<NaiveTime>,
    pub timezone: Option<Tz>,
    pub source: Option<String>,
    pub criteria: Option<String>,
}

pub struct OutcomeSpec {
    pub outcome_type: OutcomeType,
    pub outcomes: Vec<String>,
}

pub enum OutcomeType {
    Binary,       // Yes/No
    MultiOutcome, // Multiple options
    Range,        // Numeric ranges
}

4.3 Matching Algorithm

impl FingerprintMatcher {
    pub fn match_score(&self, fp1: &MarketFingerprint, fp2: &MarketFingerprint) -> MatchResult {
        let mut score = 0.0;
        let mut field_scores = HashMap::new();

        // Primary entity match (30%)
        let entity_score = self.entity_similarity(&fp1.entity, &fp2.entity);
        field_scores.insert("entity", entity_score);
        score += 0.30 * entity_score;

        // Resolution date match (25%)
        let date_score = self.date_similarity(&fp1.resolution, &fp2.resolution);
        field_scores.insert("date", date_score);
        score += 0.25 * date_score;

        // Metric/threshold match (20%)
        let metric_score = self.metric_similarity(&fp1.metric, &fp2.metric);
        field_scores.insert("metric", metric_score);
        score += 0.20 * metric_score;

        // Outcome structure match (15%)
        let outcome_score = self.outcome_similarity(&fp1.outcomes, &fp2.outcomes);
        field_scores.insert("outcome", outcome_score);
        score += 0.15 * outcome_score;

        // Resolution source match (10%)
        let source_score = self.source_similarity(&fp1.resolution, &fp2.resolution);
        field_scores.insert("source", source_score);
        score += 0.10 * source_score;

        // Generate warnings
        let warnings = self.generate_warnings(fp1, fp2);

        MatchResult {
            score,
            field_scores,
            warnings,
            is_candidate: score >= self.threshold,
        }
    }

    fn entity_similarity(&self, e1: &Entity, e2: &Entity) -> f64 {
        // Exact match
        if e1.name.to_lowercase() == e2.name.to_lowercase() {
            return 1.0;
        }

        // Alias match
        for alias in &e1.aliases {
            if alias.to_lowercase() == e2.name.to_lowercase() {
                return 0.95;
            }
        }
        for alias in &e2.aliases {
            if alias.to_lowercase() == e1.name.to_lowercase() {
                return 0.95;
            }
        }

        // Same type, different entity
        if e1.entity_type == e2.entity_type {
            // Could add fuzzy string matching here
            return 0.0;
        }

        0.0
    }

    fn date_similarity(&self, r1: &ResolutionSpec, r2: &ResolutionSpec) -> f64 {
        match (&r1.date, &r2.date) {
            (Some(d1), Some(d2)) => {
                let diff = (*d1 - *d2).num_days().abs();
                match diff {
                    0 => 1.0,
                    1..=7 => 0.8,
                    8..=14 => 0.6,
                    15..=30 => 0.4,
                    _ => 0.0,
                }
            }
            (None, None) => 0.5, // Both unspecified
            _ => 0.2, // One specified, one not
        }
    }
}

5. Implementation Considerations

5.1 Entity Extraction Strategy

Phase 1: Rule-Based NER

const ENTITY_PATTERNS: &[(&str, EntityType)] = &[
    // Persons
    (r"(?i)\b(Trump|Biden|Harris|Obama|Musk|Zuckerberg)\b", EntityType::Person),

    // Assets
    (r"(?i)\b(Bitcoin|BTC|Ethereum|ETH|Gold|S&P|SPX)\b", EntityType::Asset),

    // Institutions
    (r"(?i)\b(Fed|FOMC|ECB|BoE|SEC|FTC)\b", EntityType::Institution),

    // Economic indicators
    (r"(?i)\b(CPI|GDP|NFP|unemployment|inflation)\b", EntityType::EconomicIndicator),

    // Sports
    (r"(?i)\b(Super Bowl|World Series|NBA Finals|Stanley Cup)\b", EntityType::Event),

    // Price targets
    (r"\$[\d,]+(?:\.\d+)?(?:k|K|M|B)?", EntityType::PriceTarget),

    // Dates/years
    (r"(?i)\b(20\d{2}|Q[1-4]|January|February|...)\b", EntityType::Date),
];

Phase 2: ML-Based NER (if rule-based insufficient)

# Using spaCy with custom training
import spacy
nlp = spacy.load("en_core_web_lg")

# Add custom patterns for prediction market entities
ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [
    {"label": "CRYPTO", "pattern": [{"LOWER": {"IN": ["bitcoin", "btc", "ethereum", "eth"]}}]},
    {"label": "INDICATOR", "pattern": [{"LOWER": {"IN": ["cpi", "gdp", "nfp", "fomc"]}}]},
]
ruler.add_patterns(patterns)

5.2 Synonym and Alias Handling

Build an entity alias database:

lazy_static! {
    static ref ENTITY_ALIASES: HashMap<&'static str, Vec<&'static str>> = {
        let mut m = HashMap::new();
        m.insert("Bitcoin", vec!["BTC", "bitcoin", "₿"]);
        m.insert("Ethereum", vec!["ETH", "ethereum", "Ether"]);
        m.insert("Super Bowl", vec!["Pro Football Championship", "NFL Championship"]);
        m.insert("Trump", vec!["Donald Trump", "President Trump", "DJT"]);
        m.insert("Fed", vec!["Federal Reserve", "FOMC", "Jerome Powell"]);
        m
    };
}

5.3 Date Parsing

Handle various date formats:

fn parse_resolution_date(text: &str) -> Option<NaiveDate> {
    let patterns = [
        // ISO format
        r"(\d{4}-\d{2}-\d{2})",
        // US format
        r"(January|February|...) (\d{1,2}),? (\d{4})",
        // Quarter
        r"Q([1-4]) (\d{4})",
        // End of year
        r"end of (\d{4})",
        // Before date
        r"before (January|February|...) (\d{1,2}),? (\d{4})",
    ];

    for pattern in patterns {
        if let Some(caps) = Regex::new(pattern).unwrap().captures(text) {
            return parse_captures(&caps);
        }
    }

    None
}

5.4 Performance Optimization

Candidate Generation with Inverted Index

pub struct MarketIndex {
    // Inverted index: keyword -> market IDs
    keyword_index: HashMap<String, Vec<MarketId>>,

    // Date index: date -> market IDs
    date_index: BTreeMap<NaiveDate, Vec<MarketId>>,

    // Entity index: entity -> market IDs
    entity_index: HashMap<String, Vec<MarketId>>,
}

impl MarketIndex {
    pub fn find_candidates(&self, market: &DiscoveredMarket, limit: usize) -> Vec<MarketId> {
        let mut scores: HashMap<MarketId, f32> = HashMap::new();

        // Keyword overlap
        for keyword in extract_keywords(&market.title) {
            if let Some(ids) = self.keyword_index.get(&keyword) {
                for id in ids {
                    *scores.entry(*id).or_default() += 1.0;
                }
            }
        }

        // Date proximity boost
        if let Some(date) = market.resolution_date {
            for (d, ids) in self.date_index.range(date - 14..=date + 14) {
                let proximity = 1.0 - ((*d - date).num_days().abs() as f32 / 14.0);
                for id in ids {
                    *scores.entry(*id).or_default() += proximity;
                }
            }
        }

        // Return top N by score
        let mut candidates: Vec<_> = scores.into_iter().collect();
        candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        candidates.into_iter().take(limit).map(|(id, _)| id).collect()
    }
}

6. Evaluation Framework

6.1 Test Data Set

Create a "golden set" of known market pairs for validation:

ID Kalshi Market Polymarket Market Expected Match
1 KXGREENLAND-29 greenland-2026 Yes
2 KXSB-26-KC super-bowl-2026-chiefs Yes
3 KXBTC-100K btc-100k-2026 Yes
4 KXFOMC-JAN26 fed-rate-cut-jan-2026 Yes
5 KXGREENLAND-29 trump-second-term No (different event)
6 KXSB-26-KC nba-finals-2026 No (different sport)

6.2 Metrics

Metric Definition Target
Precision True matches / All proposed matches ≥ 95%
Recall True matches / All actual matches ≥ 80%
F1 Score 2 × (P × R) / (P + R) ≥ 0.87
False Positive Rate False matches / All proposed ≤ 5%
Latency Time per market pair comparison ≤ 50ms

6.3 Evaluation Protocol

# Run evaluation against golden set
cargo run --features discovery -- --evaluate-matching --golden-set data/golden_pairs.json

# Output:
# Precision: 96.2%
# Recall: 82.4%
# F1 Score: 0.888
# False Positive Rate: 3.8%
# Average Latency: 12.3ms

7. Phase 3: Embedding-Based Semantic Matching

7.1 Overview

Embedding-based matching captures semantic similarity that fingerprint matching may miss. By representing market titles as dense vectors in a semantic space, we can identify matches even when there's no lexical overlap.

Key Insight: Embeddings trained on general text understand that "Super Bowl" and "Pro Football Championship" are semantically related, even though they share no words.

7.2 Model Selection

Candidate Models

Model Dimensions Latency Domain Fit Cost
all-MiniLM-L6-v2 384 15ms Medium Free (local)
all-mpnet-base-v2 768 40ms High Free (local)
text-embedding-3-small 1536 50ms High $0.00002/1K tokens
voyage-finance-2 1024 60ms High (finance) $0.00012/1K tokens
e5-large-v2 1024 35ms High Free (local)

Selection Criteria

def evaluate_model(model_name: str, golden_pairs: list) -> ModelMetrics:
    """Evaluate embedding model on prediction market pairs."""
    model = load_model(model_name)

    # Compute embeddings
    embeddings = {}
    for pair in golden_pairs:
        embeddings[pair.kalshi_id] = model.encode(pair.kalshi_title)
        embeddings[pair.poly_id] = model.encode(pair.poly_title)

    # Calculate metrics
    true_positives = 0
    false_positives = 0

    for pair in golden_pairs:
        sim = cosine_similarity(
            embeddings[pair.kalshi_id],
            embeddings[pair.poly_id]
        )
        if pair.is_match:
            if sim >= 0.70:
                true_positives += 1
        else:
            if sim >= 0.70:
                false_positives += 1

    return ModelMetrics(
        precision=true_positives / (true_positives + false_positives),
        recall=true_positives / sum(p.is_match for p in golden_pairs),
        avg_latency=measure_latency(model)
    )

Primary: all-mpnet-base-v2 for local deployment (best accuracy/latency tradeoff) Alternative: text-embedding-3-small if API latency is acceptable

7.3 Vector Storage Architecture

Option A: SQLite with sqlite-vec (Simple)

-- Schema extension for embeddings
CREATE VIRTUAL TABLE market_embeddings USING vec0(
    market_id TEXT PRIMARY KEY,
    embedding FLOAT[768]  -- Match model dimensions
);

-- Fast ANN search
SELECT market_id, distance
FROM market_embeddings
WHERE embedding MATCH ?
  AND k = 50  -- Top 50 candidates
ORDER BY distance;

Pros: Simple, single-file database, no additional infrastructure Cons: In-memory index, limited scalability

Option B: PostgreSQL with pgvector (Production)

-- Enable extension
CREATE EXTENSION vector;

-- Add embedding column
ALTER TABLE discovered_markets
ADD COLUMN embedding vector(768);

-- Create IVFFlat index for ANN search
CREATE INDEX ON discovered_markets
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Query similar markets
SELECT platform_id, title,
       1 - (embedding <=> query_embedding) AS similarity
FROM discovered_markets
WHERE platform = 'polymarket'
ORDER BY embedding <=> query_embedding
LIMIT 50;

Pros: Scalable, mature, supports filtering during search Cons: Requires PostgreSQL infrastructure

7.4 Hybrid Scoring Algorithm

/// Combine fingerprint, embedding, and text similarity scores
pub struct HybridMatcher {
    fingerprint_matcher: FingerprintMatcher,
    embedding_matcher: EmbeddingMatcher,
    text_matcher: TextSimilarityMatcher,

    // Configurable weights (tuned via feedback)
    weights: HybridWeights,
}

#[derive(Clone)]
pub struct HybridWeights {
    pub fingerprint: f64,  // Default: 0.50
    pub embedding: f64,    // Default: 0.40
    pub text: f64,         // Default: 0.10
}

impl HybridMatcher {
    pub async fn score(&self, kalshi: &Market, poly: &Market) -> HybridScore {
        // Run all matchers in parallel
        let (fp_score, emb_score, text_score) = tokio::join!(
            self.fingerprint_matcher.score(kalshi, poly),
            self.embedding_matcher.score(kalshi, poly),
            self.text_matcher.score(kalshi, poly),
        );

        let combined = self.weights.fingerprint * fp_score.score
                     + self.weights.embedding * emb_score.similarity
                     + self.weights.text * text_score.combined;

        HybridScore {
            combined,
            fingerprint: fp_score,
            embedding: emb_score,
            text: text_score,
            is_candidate: combined >= 0.70,
        }
    }
}

7.5 Confidence Calibration

Raw similarity scores need calibration to meaningful confidence levels:

from sklearn.isotonic import IsotonicRegression

class ConfidenceCalibrator:
    def __init__(self):
        self.calibrator = IsotonicRegression(out_of_bounds='clip')

    def fit(self, scores: list[float], labels: list[bool]):
        """Fit calibrator on historical match decisions."""
        self.calibrator.fit(scores, [1.0 if l else 0.0 for l in labels])

    def calibrate(self, score: float) -> float:
        """Convert raw score to calibrated probability."""
        return self.calibrator.predict([score])[0]

Target Calibration: A score of 0.80 should mean "80% of pairs with this score are true matches"

7.6 Fine-Tuning Pipeline

When sufficient training data is available (500+ pairs), fine-tune the embedding model:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def fine_tune_on_matches(
    base_model: str,
    positive_pairs: list[tuple[str, str]],
    negative_pairs: list[tuple[str, str]],
    output_path: str
):
    """Fine-tune embedding model on prediction market pairs."""
    model = SentenceTransformer(base_model)

    # Create training examples
    train_examples = []
    for k_title, p_title in positive_pairs:
        train_examples.append(InputExample(texts=[k_title, p_title], label=1.0))
    for k_title, p_title in negative_pairs:
        train_examples.append(InputExample(texts=[k_title, p_title], label=0.0))

    # Use contrastive loss
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.ContrastiveLoss(model)

    # Fine-tune
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100,
        output_path=output_path
    )

    return model

Expected Improvement: +5-10% F1 score on domain-specific pairs after fine-tuning


8. Phase 4: LLM-Based Verification

8.1 Overview

LLM verification provides human-level reasoning for complex cases where algorithmic matching is uncertain. It excels at: - Understanding paraphrased questions - Comparing resolution criteria semantically - Identifying subtle differences in scope or timing

8.2 Verification Prompt Engineering

Primary Verification Prompt

<system>
You are an expert analyst for prediction markets. Your task is to determine if two markets from different platforms are semantically equivalent - meaning they will resolve the same way for the same real-world outcome.
</system>

<user>
Compare these two prediction markets:

**MARKET A (Kalshi)**
- Title: {kalshi_title}
- Resolution Criteria: {kalshi_rules}
- Expiration: {kalshi_expiration}

**MARKET B (Polymarket)**
- Title: {poly_title}
- Resolution Criteria: {poly_rules}
- Expiration: {poly_expiration}

Analyze the following:
1. **Same Event?** Are both markets about the identical real-world event (not just similar topics)?
2. **Outcome Alignment?** Would "Yes" on Market A always correspond to "Yes" on Market B?
3. **Resolution Compatibility?** Are the resolution criteria functionally equivalent?
4. **Timing Differences?** Could different resolution timing cause different outcomes?
5. **Scope Differences?** Do they cover the same geographic/temporal/jurisdictional scope?

Respond in JSON format:
{
  "equivalent": true|false,
  "confidence": 0.0-1.0,
  "same_event": true|false,
  "outcome_aligned": true|false,
  "resolution_compatible": true|false,
  "reasoning": "Brief explanation",
  "warnings": ["List of potential issues"],
  "resolution_differences": ["Specific criteria differences if any"]
}
</user>

Resolution Deep-Dive Prompt (for complex cases)

<system>
You are a legal analyst specializing in prediction market resolution criteria. Analyze resolution clauses for semantic equivalence.
</system>

<user>
Compare these resolution criteria in detail:

**Criteria A:**
{criteria_a}

**Criteria B:**
{criteria_b}

Analyze:
1. **Resolution Source**: Who/what determines the outcome? Same authority?
2. **Resolution Timing**: When is the outcome determined? Same timeframe?
3. **Threshold Definition**: What constitutes Yes vs No? Same threshold?
4. **Edge Cases**: How are ambiguous situations handled? Compatible?
5. **Invalidation Conditions**: What causes market cancellation? Same conditions?

Provide structured comparison with compatibility score (0-100).
</user>

8.3 Cost-Optimized Invocation Strategy

Tiered Model Selection

pub struct LlmVerifier {
    haiku_client: AnthropicClient,   // ~$0.001/verification
    sonnet_client: AnthropicClient,  // ~$0.01/verification
    daily_budget: AtomicU64,
    daily_spend: AtomicU64,
}

impl LlmVerifier {
    pub async fn verify(&self, pair: &CandidateMatch) -> Result<LlmResult, Error> {
        // Check budget
        if self.daily_spend.load(Ordering::SeqCst) >= self.daily_budget.load(Ordering::SeqCst) {
            return Err(Error::BudgetExceeded);
        }

        // Use Haiku for initial screening
        let haiku_result = self.verify_with_haiku(pair).await?;

        // Escalate to Sonnet if uncertain or high-value
        if haiku_result.confidence < 0.85 || pair.estimated_volume > 10_000.0 {
            let sonnet_result = self.verify_with_sonnet(pair).await?;
            return Ok(sonnet_result);
        }

        Ok(haiku_result)
    }
}

Invocation Rules

Condition Action Estimated Cost
Fingerprint score < 0.60 Skip LLM (reject) $0
Fingerprint score 0.60-0.85 Invoke Haiku $0.001
Fingerprint score > 0.85 Skip LLM (approve) $0
Haiku uncertain (<0.85) Escalate to Sonnet $0.01
High-value market (>$10k volume) Always use Sonnet $0.01
Semantic warnings present Always use Sonnet $0.01

Budget Management

pub struct BudgetManager {
    daily_limit_cents: u64,
    current_spend_cents: AtomicU64,
    last_reset: AtomicU64,
}

impl BudgetManager {
    pub fn can_spend(&self, amount_cents: u64) -> bool {
        self.maybe_reset_daily();
        let current = self.current_spend_cents.load(Ordering::SeqCst);
        current + amount_cents <= self.daily_limit_cents
    }

    pub fn record_spend(&self, amount_cents: u64) {
        self.current_spend_cents.fetch_add(amount_cents, Ordering::SeqCst);
    }
}

Default Budget: $50/day (~5,000 Haiku calls or ~500 Sonnet calls)

8.4 Response Parsing and Validation

#[derive(Deserialize, Debug)]
pub struct LlmVerificationResult {
    pub equivalent: bool,
    pub confidence: f64,
    pub same_event: bool,
    pub outcome_aligned: bool,
    pub resolution_compatible: bool,
    pub reasoning: String,
    pub warnings: Vec<String>,
    pub resolution_differences: Vec<String>,
}

impl LlmVerificationResult {
    /// Validate LLM response for consistency
    pub fn validate(&self) -> Result<(), ValidationError> {
        // Confidence must be between 0 and 1
        if self.confidence < 0.0 || self.confidence > 1.0 {
            return Err(ValidationError::InvalidConfidence);
        }

        // If equivalent is true, all sub-checks should be true
        if self.equivalent && (!self.same_event || !self.outcome_aligned) {
            return Err(ValidationError::InconsistentFlags);
        }

        // Must have reasoning
        if self.reasoning.is_empty() {
            return Err(ValidationError::MissingReasoning);
        }

        Ok(())
    }

    /// Convert to human-readable report
    pub fn to_report(&self) -> String {
        format!(
            "Equivalent: {} (confidence: {:.0}%)\n\
             Reasoning: {}\n\
             Warnings: {}\n\
             Resolution Differences: {}",
            if self.equivalent { "Yes" } else { "No" },
            self.confidence * 100.0,
            self.reasoning,
            self.warnings.join(", "),
            self.resolution_differences.join("; ")
        )
    }
}

8.5 Human Review of LLM Decisions

Initially, all LLM-verified matches require human confirmation:

┌─────────────────────────────────────────────────────────────┐
│  LLM Verification Result                                    │
├─────────────────────────────────────────────────────────────┤
│  Kalshi: "Will Trump buy Greenland?"                       │
│  Polymarket: "Will the US acquire part of Greenland?"      │
│                                                             │
│  LLM Says: EQUIVALENT (confidence: 92%)                    │
│                                                             │
│  Reasoning: Both markets resolve on US acquisition of      │
│  Greenland territory. Kalshi frames as "Trump" action,     │
│  Polymarket as "US" action, but resolution criteria        │
│  both require actual transfer of territory.                │
│                                                             │
│  Warnings:                                                 │
│  - Different expiration dates (2029 vs 2026)               │
│                                                             │
│  [✓ Approve]  [✗ Reject]  [🔍 View Details]                │
└─────────────────────────────────────────────────────────────┘

Auto-Approval Criteria (after calibration)

After 100+ LLM decisions have been human-reviewed:

impl AutoApprovalPolicy {
    pub fn can_auto_approve(&self, result: &LlmVerificationResult) -> bool {
        // High confidence
        if result.confidence < 0.95 {
            return false;
        }

        // No warnings
        if !result.warnings.is_empty() {
            return false;
        }

        // No resolution differences
        if !result.resolution_differences.is_empty() {
            return false;
        }

        // Historical accuracy check
        if self.llm_historical_accuracy() < 0.98 {
            return false;
        }

        true
    }
}

9. Phase 5: Reinforcement Learning from Human Feedback

9.1 Overview

Human approval decisions are a rich source of training data. By systematically capturing and learning from these decisions, we can continuously improve all matching components.

Key Insight: Every human approval/rejection is a labeled training example that improves future matching accuracy.

9.2 Feedback Data Schema

CREATE TABLE match_decisions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    candidate_id UUID REFERENCES candidates(id),
    created_at TIMESTAMP DEFAULT now(),

    -- Decision
    decision TEXT CHECK (decision IN ('approved', 'rejected', 'modified')),
    reviewer_id TEXT NOT NULL,

    -- Context at decision time (for ML features)
    fingerprint_score REAL,
    embedding_similarity REAL,
    text_similarity REAL,
    llm_confidence REAL,
    llm_equivalent BOOLEAN,

    -- Human feedback
    rejection_reason TEXT,
    modification_notes TEXT,

    -- Entity corrections (for alias learning)
    entity_corrections JSONB,
    -- Example: {"kalshi_entity": "BTC", "poly_entity": "Bitcoin", "canonical": "Bitcoin"}

    -- Resolution analysis
    resolution_compatible BOOLEAN,
    resolution_notes TEXT,

    -- Training flags
    include_in_training BOOLEAN DEFAULT true,
    training_weight REAL DEFAULT 1.0  -- Higher for difficult cases
);

CREATE INDEX idx_decisions_training ON match_decisions(include_in_training)
    WHERE include_in_training = true;

9.3 Feedback Collection Pipeline

pub struct FeedbackCollector {
    storage: Arc<Storage>,
}

impl FeedbackCollector {
    /// Record a human decision with all context
    pub async fn record_decision(
        &self,
        candidate: &CandidateMatch,
        decision: Decision,
        reviewer: &str,
    ) -> Result<Uuid, Error> {
        let feedback = MatchDecision {
            id: Uuid::new_v4(),
            candidate_id: candidate.id,
            decision: decision.verdict,
            reviewer_id: reviewer.to_string(),

            // Capture all scores for feature analysis
            fingerprint_score: candidate.fingerprint_score,
            embedding_similarity: candidate.embedding_similarity,
            text_similarity: candidate.text_score,
            llm_confidence: candidate.llm_result.as_ref().map(|r| r.confidence),
            llm_equivalent: candidate.llm_result.as_ref().map(|r| r.equivalent),

            // Human feedback
            rejection_reason: decision.rejection_reason,
            entity_corrections: decision.entity_corrections,
            resolution_notes: decision.resolution_notes,

            include_in_training: true,
            training_weight: self.compute_training_weight(&candidate, &decision),
        };

        self.storage.insert_decision(&feedback).await?;

        // Trigger incremental learning if enabled
        if self.incremental_learning_enabled {
            self.trigger_incremental_update(&feedback).await?;
        }

        Ok(feedback.id)
    }

    /// Compute training weight (prioritize difficult/educational cases)
    fn compute_training_weight(&self, candidate: &CandidateMatch, decision: &Decision) -> f64 {
        let mut weight = 1.0;

        // Hard negatives (high score but rejected) are valuable
        if decision.verdict == "rejected" && candidate.fingerprint_score > 0.6 {
            weight *= 2.0;
        }

        // Cases with entity corrections are valuable
        if decision.entity_corrections.is_some() {
            weight *= 1.5;
        }

        // Edge cases near threshold are valuable
        if (candidate.fingerprint_score - 0.70).abs() < 0.1 {
            weight *= 1.5;
        }

        weight.min(5.0)  // Cap at 5x
    }
}

9.4 Automatic Alias Learning

pub struct AliasLearner {
    alias_db: Arc<RwLock<AliasDatabase>>,
}

impl AliasLearner {
    /// Learn from approved matches with entity differences
    pub async fn learn_from_approval(&self, decision: &MatchDecision) {
        // Learn from explicit corrections
        if let Some(corrections) = &decision.entity_corrections {
            for correction in corrections {
                self.add_alias(
                    &correction.canonical,
                    &correction.kalshi_entity,
                ).await;
                self.add_alias(
                    &correction.canonical,
                    &correction.poly_entity,
                ).await;
            }
        }

        // Learn implicit aliases from matched entity pairs
        let kalshi_entities = self.extract_entities(&decision.kalshi_title);
        let poly_entities = self.extract_entities(&decision.poly_title);

        for (k_ent, p_ent) in self.align_entities(&kalshi_entities, &poly_entities) {
            if k_ent.name != p_ent.name
                && k_ent.entity_type == p_ent.entity_type
                && self.string_similarity(&k_ent.name, &p_ent.name) < 0.5
            {
                // Different strings, same type, low similarity = alias
                log::info!("Learned alias: {} <-> {}", k_ent.name, p_ent.name);
                self.add_bidirectional_alias(&k_ent.name, &p_ent.name).await;
            }
        }
    }

    async fn add_alias(&self, canonical: &str, alias: &str) {
        let mut db = self.alias_db.write().await;
        db.add(canonical, alias);
    }
}

9.5 Fingerprint Weight Optimization

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def optimize_weights(decisions: list[MatchDecision]) -> dict[str, float]:
    """
    Use logistic regression to find optimal field weights.

    Each decision provides features (field scores) and label (approved/rejected).
    The learned coefficients indicate optimal weights.
    """
    # Extract features
    X = np.array([
        [d.entity_score, d.date_score, d.threshold_score,
         d.outcome_score, d.source_score]
        for d in decisions
    ])

    # Labels: 1 for approved, 0 for rejected
    y = np.array([1 if d.decision == 'approved' else 0 for d in decisions])

    # Fit logistic regression
    model = LogisticRegression(penalty='l2', C=1.0)
    model.fit(X, y)

    # Cross-validation
    scores = cross_val_score(model, X, y, cv=5)
    print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")

    # Extract and normalize weights
    raw_weights = np.abs(model.coef_[0])
    normalized = raw_weights / raw_weights.sum()

    return {
        'entity': normalized[0],
        'date': normalized[1],
        'threshold': normalized[2],
        'outcome': normalized[3],
        'source': normalized[4],
    }

9.6 Embedding Model Retraining

def retrain_embedding_model(
    base_model_path: str,
    decisions: list[MatchDecision],
    output_path: str
) -> EvaluationMetrics:
    """
    Retrain embedding model on accumulated human decisions.
    """
    # Prepare training data
    positive_pairs = [
        (d.kalshi_title, d.poly_title)
        for d in decisions if d.decision == 'approved'
    ]
    negative_pairs = [
        (d.kalshi_title, d.poly_title)
        for d in decisions if d.decision == 'rejected'
    ]

    # Load base model
    model = SentenceTransformer(base_model_path)

    # Create training examples with weights
    train_examples = []
    for d in decisions:
        if d.decision == 'approved':
            train_examples.append(
                InputExample(
                    texts=[d.kalshi_title, d.poly_title],
                    label=1.0
                )
            )
        elif d.decision == 'rejected' and d.fingerprint_score > 0.4:
            # Include hard negatives only
            train_examples.append(
                InputExample(
                    texts=[d.kalshi_title, d.poly_title],
                    label=0.0
                )
            )

    # Fine-tune
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.ContrastiveLoss(model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=2,
        warmup_steps=50
    )

    # Save
    model.save(output_path)

    # Evaluate on held-out test set
    return evaluate_model(output_path, test_pairs)

9.7 Continuous Improvement Pipeline

┌─────────────────────────────────────────────────────────────┐
│                  Weekly Improvement Cycle                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Monday: Data Export                                        │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ Export new decisions from past week                    │ │
│  │ Update golden set with new test cases                  │ │
│  │ Calculate current metrics baseline                     │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
│  Tuesday: Model Training                                    │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ Retrain embedding model on accumulated decisions       │ │
│  │ Optimize fingerprint weights via logistic regression   │ │
│  │ Update alias database with learned aliases             │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
│  Wednesday: Validation                                      │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ Evaluate new models on golden set                      │ │
│  │ Compare metrics to baseline                            │ │
│  │ Flag any regressions                                   │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
│  Thursday-Saturday: A/B Testing                             │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ Deploy new model to 10% of traffic                     │ │
│  │ Monitor precision/recall in production                 │ │
│  │ Collect additional feedback                            │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
│  Sunday: Promotion Decision                                 │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ If A/B metrics improve: promote new model to 100%      │ │
│  │ If metrics regress: rollback to previous version       │ │
│  │ Update model registry with results                     │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
└─────────────────────────────────────────────────────────────┘

9.8 Model Versioning and Rollback

pub struct ModelRegistry {
    storage: Arc<Storage>,
    current_version: AtomicU64,
}

impl ModelRegistry {
    /// Register a new model version
    pub async fn register_version(
        &self,
        model_type: ModelType,
        artifact_path: &str,
        metrics: &EvaluationMetrics,
    ) -> Result<ModelVersion, Error> {
        let version = ModelVersion {
            id: Uuid::new_v4(),
            model_type,
            artifact_path: artifact_path.to_string(),
            precision: metrics.precision,
            recall: metrics.recall,
            f1_score: metrics.f1_score,
            created_at: Utc::now(),
            is_active: false,
            training_decisions_count: metrics.training_size,
        };

        self.storage.insert_model_version(&version).await?;
        Ok(version)
    }

    /// Promote a version to active (with automatic rollback on failure)
    pub async fn promote(&self, version_id: Uuid) -> Result<(), Error> {
        let previous = self.get_active_version().await?;

        // Activate new version
        self.storage.set_active_version(version_id).await?;

        // Monitor for 1 hour
        tokio::time::sleep(Duration::from_secs(3600)).await;

        // Check if metrics degraded
        let live_metrics = self.collect_live_metrics().await?;
        if live_metrics.f1_score < previous.f1_score - 0.02 {
            log::warn!("New model degraded metrics, rolling back");
            self.storage.set_active_version(previous.id).await?;
            return Err(Error::RollbackTriggered);
        }

        Ok(())
    }
}

10. Operational Excellence

This section covers deployment, monitoring, security, and operational considerations for production deployment of the market discovery system.

Availability Target: 99.9% uptime (43.8 minutes/month downtime allowed)

10.1 Deployment Architecture

The discovery feature deploys as part of the Trading Core ECS service with optional feature flag enablement:

┌─────────────────────────────────────────────────────────────┐
│                      AWS Region                              │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐   │
│  │                  ECS Cluster                         │   │
│  │  ┌───────────────────┐  ┌───────────────────┐       │   │
│  │  │   Trading Core    │  │   Trading Core    │       │   │
│  │  │  (--features      │  │  (--features      │       │   │
│  │  │   discovery)      │  │   discovery)      │       │   │
│  │  │                   │  │                   │       │   │
│  │  │  ┌─────────────┐  │  │  ┌─────────────┐  │       │   │
│  │  │  │  Scanner    │  │  │  │  Scanner    │  │       │   │
│  │  │  │  Actor      │  │  │  │  Actor      │  │       │   │
│  │  │  └─────────────┘  │  │  └─────────────┘  │       │   │
│  │  │                   │  │                   │       │   │
│  │  │  ┌─────────────┐  │  │  ┌─────────────┐  │       │   │
│  │  │  │  SQLite     │  │  │  │  SQLite     │  │       │   │
│  │  │  │  (local)    │  │  │  │  (local)    │  │       │   │
│  │  │  └─────────────┘  │  │  └─────────────┘  │       │   │
│  │  └─────────┬─────────┘  └─────────┬─────────┘       │   │
│  │            │                      │                  │   │
│  └────────────┼──────────────────────┼──────────────────┘   │
│               │                      │                       │
│               ▼                      ▼                       │
│  ┌─────────────────────────────────────────────────────┐   │
│  │           Aurora PostgreSQL (future)                 │   │
│  │     (shared state for multi-instance scaling)        │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                 AWS Secrets Manager                  │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │   │
│  │  │ Kalshi Keys │  │ Poly Keys   │  │ LLM API Keys│  │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Single-Instance MVP: - One scanner active at a time (ECS desired count = 1) - SQLite local storage sufficient for ~10,000 markets - No coordination needed between instances

Multi-Instance (Future Scaling): - PostgreSQL for shared candidate storage - Distributed locking for scan coordination (Redis) - Leader election for single-scanner pattern

10.2 CI/CD Pipeline

Feature-Gated Testing:

# .github/workflows/ci.yml
name: CI

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  base-tests:
    name: Base Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-action@stable
      - run: cargo test --manifest-path arbiter-engine/Cargo.toml

  discovery-tests:
    name: Discovery Feature Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-action@stable

      - name: Run unit tests
        run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery

      - name: Run integration tests
        run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored
        env:
          KALSHI_DEMO_KEY_ID: ${{ secrets.KALSHI_DEMO_KEY_ID }}
          KALSHI_DEMO_PRIVATE_KEY: ${{ secrets.KALSHI_DEMO_PRIVATE_KEY }}

  security-audit:
    name: Security Audit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo install cargo-audit
      - run: cargo audit --manifest-path arbiter-engine/Cargo.toml

Deployment Pipeline:

┌─────────────────────────────────────────────────────────────┐
│                    Deployment Pipeline                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐         │
│  │    PR      │───►│   CI/CD    │───►│   Review   │         │
│  │  Created   │    │   Tests    │    │  Required  │         │
│  └────────────┘    └────────────┘    └────────────┘         │
│                           │                  │               │
│                           ▼                  ▼               │
│                    ┌────────────┐    ┌────────────┐         │
│                    │   Merge    │◄───│  Approval  │         │
│                    │  to main   │    │            │         │
│                    └────────────┘    └────────────┘         │
│                           │                                  │
│                           ▼                                  │
│                    ┌────────────┐                            │
│                    │   Build    │                            │
│                    │   Docker   │                            │
│                    └────────────┘                            │
│                           │                                  │
│              ┌────────────┼────────────┐                    │
│              ▼            ▼            ▼                    │
│       ┌──────────┐ ┌──────────┐ ┌──────────┐               │
│       │  Staging │ │    E2E   │ │  Council │               │
│       │  Deploy  │ │  Tests   │ │  Review  │               │
│       └──────────┘ └──────────┘ └──────────┘               │
│              │            │            │                    │
│              └────────────┼────────────┘                    │
│                           ▼                                  │
│                    ┌────────────┐                            │
│                    │ Production │                            │
│                    │   Deploy   │                            │
│                    └────────────┘                            │
│                                                              │
└─────────────────────────────────────────────────────────────┘

10.3 Monitoring & Observability

CloudWatch Metrics:

Metric Type Description Dashboard
discovery/scan/duration Timer Scan cycle duration Discovery Health
discovery/scan/errors Counter Scan failures Discovery Health
discovery/candidates/count Gauge Candidates generated Candidate Funnel
discovery/candidates/pending Gauge Awaiting review Candidate Funnel
discovery/candidates/approved Counter Approved matches Candidate Funnel
discovery/candidates/rejected Counter Rejected candidates Candidate Funnel
discovery/api/rate_limits Counter Rate limit errors API Performance
discovery/api/latency Timer API response time API Performance
discovery/approvals/rate Gauge Approval percentage Quality Metrics

Structured Logging:

// Example structured log output
tracing::info!(
    scan_id = %scan_id,
    platform = "polymarket",
    markets_fetched = markets.len(),
    candidates_generated = candidates.len(),
    duration_ms = elapsed.as_millis(),
    "Discovery scan completed"
);

Log Levels: - ERROR: Scan failures, API errors, database corruption - WARN: Rate limit warnings, retry attempts, degraded mode - INFO: Scan completions, candidate counts, approval decisions - DEBUG: Individual market processing, matching scores

Dashboard Panels:

  1. Discovery Health Overview
  2. Scan success rate (24h rolling)
  3. Average scan duration
  4. Error count by type

  5. Candidate Funnel

  6. Generated → Pending → Approved/Rejected
  7. Conversion rates
  8. Time in pending state

  9. API Performance

  10. Latency p50/p95/p99 by platform
  11. Rate limit error rate
  12. Request volume

10.4 Security Hardening

API Security:

Platform Authentication Credential Storage Rotation
Polymarket Gamma None (public) N/A N/A
Kalshi RSA-PSS AWS Secrets Manager 90 days
LLM APIs (Phase 4) API Key AWS Secrets Manager 30 days

Rate Limiting: - Token bucket implementation prevents API abuse - Configurable limits per platform - Automatic backoff on 429 responses

Audit Trail (FR-MD-009):

{
  "timestamp": "2026-01-23T10:30:00Z",
  "event_type": "candidate_approved",
  "candidate_id": "uuid-here",
  "reviewer_id": "operator@example.com",
  "kalshi_market": "KXGREENLAND-29",
  "poly_market": "greenland-2026",
  "fingerprint_score": 0.82,
  "warnings_acknowledged": ["Different expiration dates"],
  "decision_notes": "Verified resolution criteria compatible"
}

Access Control (Future): - Discovery CLI requires shell access (current) - RBAC for approval workflow (Phase 2) - Audit trail for all decisions

10.5 Scaling Strategy

Phase 1 (MVP - Current):

Parameter Value
Scanner instances 1
Markets per platform ~2,000
Scan interval 1 hour
Storage SQLite (local)
Estimated cost ~$50/month (ECS)

Phase 2 (Scale):

Parameter Value
Scanner instances 2-3 (leader election)
Markets per platform ~10,000
Scan interval 15 minutes
Storage PostgreSQL (Aurora)
Estimated cost ~$200/month

Phase 3 (Embedding):

Parameter Value
Embedding service Separate container
Vector database pgvector extension
GPU acceleration Optional (batch jobs)
Estimated cost ~$50/month additional

Phase 4 (LLM):

Parameter Value
LLM service Claude API
Budget controls $50/day default
Caching Response cache (24h)
Estimated cost ~$50-150/month

10.6 Disaster Recovery

Backup Strategy:

Data Backup Frequency Retention Storage
SQLite DB Daily 30 days S3 Standard
Audit logs Hourly 90 days hot, 7 years cold S3 + Glacier
Configuration On change 1 year S3 + Git

Recovery Procedures:

Scenario RTO RPO Procedure
Scanner crash 5 min 0 ECS auto-restart, health checks
Database corruption 30 min 24 hr Restore from S3, verify integrity
API outage (external) N/A N/A Graceful degradation, alerts
Region failure 4 hr 1 hr Cross-region restore, DNS failover

Graceful Degradation Modes:

  1. Polymarket API Down:
  2. Continue scanning Kalshi only
  3. Alert on-call
  4. Retry with exponential backoff

  5. Kalshi API Down:

  6. Continue scanning Polymarket only
  7. Alert on-call
  8. Retry with exponential backoff

  9. Database Unavailable:

  10. Enter read-only mode
  11. Serve cached candidates
  12. Alert P1

  13. Embedding Service Down (Phase 3):

  14. Fallback to fingerprint-only matching
  15. Log degraded mode
  16. No data loss

11. Risk Analysis

11.1 Technical Risks

Risk Probability Impact Mitigation
Entity extraction misses novel entities Medium Medium Extensible pattern system, ML fallback
False positive matches lead to bad trades Low High Human verification required (FR-MD-003)
API rate limiting blocks discovery Medium Low Configurable backoff, caching
Resolution criteria differ subtly High High Semantic warning system

11.2 Operational Risks

Risk Probability Impact Mitigation
External service (Matchr/Dome) becomes unavailable Medium Low Local matching is primary
Market format changes break parsing Low Medium Robust error handling, alerts
High volume of candidates overwhelms reviewers Medium Medium Confidence thresholds, batching

11.3 Business Risks

Risk Probability Impact Mitigation
Competitors adopt better matching Medium Medium Modular architecture allows upgrades
Platform TOS prohibit cross-platform arbitrage Low High Legal review, compliance monitoring

12. Conclusion

Based on our analysis, we recommend a five-phase approach with progressively sophisticated matching:

  1. Phase 1: Text similarity matching ✅ (Implemented)
  2. Phase 2: Fingerprint-based matching with rule-based NER (Proposed)
  3. Phase 3: Embedding-based semantic matching (hybrid scoring) (Proposed)
  4. Phase 4: LLM verification for uncertain/high-value matches (Proposed)
  5. Phase 5: Continuous improvement via human feedback learning (Proposed)

This approach: - Addresses the fundamental failure of text similarity - Aligns with industry best practices (entity extraction) - Preserves human-in-the-loop safety requirements - Creates a virtuous cycle where human decisions improve future matching - Provides graceful degradation (each phase works independently)

12.2 Implementation Priority

Phase Scope Priority Effort
2a Fingerprint schema + rule-based NER Must Medium
2b Fingerprint matcher + weighted scoring Must Medium
2c Golden set validation + tuning Must Low
3a Embedding infrastructure + model selection Should Medium
3b Hybrid scoring integration Should Medium
3c Embedding fine-tuning pipeline Could High
4a LLM verification prompts + integration Should Medium
4b Automated escalation rules Should Low
4c Resolution deep analysis Could Medium
5a Feedback data collection Should Low
5b Automatic alias/weight learning Should Medium
5c Continuous retraining pipeline Could High

12.3 Success Criteria

Metric Phase 2 Target Phase 3+ Target
Recall ≥ 70% ≥ 85%
Precision ≥ 90% ≥ 95%
F1 Score ≥ 0.78 ≥ 0.90
Latency (p99) ≤ 50ms ≤ 200ms
Human verification 100% 100% (safety preserved)

12.4 Key Innovation: Learning from Human Decisions

The most significant architectural decision is treating human approvals/rejections as training data:

                    ┌─────────────────┐
                    │  Human Reviews  │
                    └────────┬────────┘
        ┌────────────────────┼────────────────────┐
        ▼                    ▼                    ▼
┌───────────────┐  ┌─────────────────┐  ┌───────────────┐
│ Alias Updates │  │  Weight Tuning  │  │ Model Retrain │
└───────────────┘  └─────────────────┘  └───────────────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                    ┌─────────────────┐
                    │ Improved Models │
                    └────────┬────────┘
                    ┌─────────────────┐
                    │ Better Matches  │
                    └────────┬────────┘
                    ┌─────────────────┐
                    │  Human Reviews  │ ← (cycle continues)
                    └─────────────────┘

This creates a data flywheel where each human decision makes the system smarter, reducing future human workload while maintaining safety.


13. References

Industry Tools

Research

API Documentation


Appendix A: Entity Pattern Reference

// Full entity pattern list for rule-based NER
const ENTITY_PATTERNS: &[(&str, EntityType)] = &[
    // Politicians
    (r"(?i)\bTrump\b", EntityType::Person),
    (r"(?i)\bBiden\b", EntityType::Person),
    (r"(?i)\bHarris\b", EntityType::Person),
    (r"(?i)\bObama\b", EntityType::Person),
    (r"(?i)\bDeSantis\b", EntityType::Person),
    (r"(?i)\bNewsom\b", EntityType::Person),

    // Tech figures
    (r"(?i)\bMusk\b", EntityType::Person),
    (r"(?i)\bZuckerberg\b", EntityType::Person),
    (r"(?i)\bAltman\b", EntityType::Person),

    // Cryptocurrencies
    (r"(?i)\b(Bitcoin|BTC)\b", EntityType::Asset),
    (r"(?i)\b(Ethereum|ETH)\b", EntityType::Asset),
    (r"(?i)\b(Solana|SOL)\b", EntityType::Asset),
    (r"(?i)\b(XRP|Ripple)\b", EntityType::Asset),

    // Stocks/Indices
    (r"(?i)\b(S&P|SPX|SPY)\b", EntityType::Asset),
    (r"(?i)\b(Nasdaq|QQQ)\b", EntityType::Asset),
    (r"(?i)\b(Tesla|TSLA)\b", EntityType::Asset),
    (r"(?i)\b(Nvidia|NVDA)\b", EntityType::Asset),

    // Central banks
    (r"(?i)\b(Fed|Federal Reserve|FOMC)\b", EntityType::Institution),
    (r"(?i)\b(ECB)\b", EntityType::Institution),
    (r"(?i)\b(BoE|Bank of England)\b", EntityType::Institution),
    (r"(?i)\b(BoJ|Bank of Japan)\b", EntityType::Institution),

    // Economic indicators
    (r"(?i)\bCPI\b", EntityType::EconomicIndicator),
    (r"(?i)\bGDP\b", EntityType::EconomicIndicator),
    (r"(?i)\bNFP\b", EntityType::EconomicIndicator),
    (r"(?i)\b(unemployment|jobless)\b", EntityType::EconomicIndicator),
    (r"(?i)\binflation\b", EntityType::EconomicIndicator),

    // Sports events
    (r"(?i)\bSuper Bowl\b", EntityType::Event),
    (r"(?i)\bWorld Series\b", EntityType::Event),
    (r"(?i)\bNBA Finals\b", EntityType::Event),
    (r"(?i)\bStanley Cup\b", EntityType::Event),
    (r"(?i)\bWorld Cup\b", EntityType::Event),

    // Locations
    (r"(?i)\bGreenland\b", EntityType::Location),
    (r"(?i)\bUkraine\b", EntityType::Location),
    (r"(?i)\bTaiwan\b", EntityType::Location),
    (r"(?i)\bPanama\b", EntityType::Location),

    // Price targets
    (r"\$[\d,]+(?:\.\d+)?(?:k|K|M|B)?", EntityType::PriceTarget),

    // Dates
    (r"(?i)\b(20\d{2})\b", EntityType::Year),
    (r"(?i)\bQ[1-4]\b", EntityType::Quarter),
];

Appendix B: Sample Fingerprint Extraction

Input (Kalshi):

Title: "Will Trump buy Greenland?"
Rules: "Resolves Yes if US purchases at least part of Greenland from Denmark before January 20, 2029"

Output:

{
  "entity": {
    "name": "Trump",
    "entity_type": "Person",
    "aliases": ["Donald Trump", "DJT"]
  },
  "secondary_entities": [
    { "name": "Greenland", "entity_type": "Location" },
    { "name": "Denmark", "entity_type": "Location" },
    { "name": "US", "entity_type": "Location" }
  ],
  "event_type": "Acquisition",
  "metric": null,
  "scope": { "region": "US", "jurisdiction": "Federal" },
  "resolution": {
    "date": "2029-01-20",
    "timezone": null,
    "source": null,
    "criteria": "US purchases at least part of Greenland from Denmark"
  },
  "outcomes": {
    "outcome_type": "Binary",
    "outcomes": ["Yes", "No"]
  }
}


End of White Paper