Cross-Platform Prediction Market Matching: A Technical Analysis¶

Version: 1.0 Date: 2026-01-23 Authors: Arbiter-Bot Engineering Team Status: Draft for Council Review

Executive Summary¶

This white paper analyzes approaches for automatically matching equivalent prediction markets across Polymarket and Kalshi. We evaluate five solution approaches based on accuracy, cost, complexity, and production readiness. Our analysis, informed by industry research and empirical testing, recommends a fingerprint-based matching pipeline as the optimal approach for production deployment.

Key findings: - Pure text similarity (Jaccard/Levenshtein) achieves only 8-9% similarity on semantically equivalent markets - Industry tools universally use entity extraction, manual curation, or hybrid approaches - A three-stage pipeline (candidate generation → fingerprint matching → human verification) balances accuracy, explainability, and safety

Table of Contents¶

Problem Statement
Industry Landscape
Solution Options Analysis
Recommended Architecture
Implementation Considerations
Evaluation Framework
Phase 3: Embedding-Based Semantic Matching
Phase 4: LLM-Based Verification
Phase 5: Reinforcement Learning from Human Feedback
Operational Excellence
Risk Analysis
Conclusion
References

1. Problem Statement¶

1.1 The Matching Challenge¶

Prediction markets on Polymarket and Kalshi frequently cover the same real-world events but use different: - Market titles: "Will Trump buy Greenland?" vs "Will the US acquire part of Greenland in 2026?" - Resolution criteria: "OPM shutdown announcement" vs "actual shutdown exceeding 24 hours" - Outcome structures: Binary (Yes/No) vs multi-outcome ranges - Identifiers: No shared ID system across platforms

1.2 Why Matching Matters¶

Use Case	Impact
Arbitrage detection	Price discrepancies between equivalent markets create profit opportunities
Portfolio hedging	Cross-platform positions require matched market identification
Market analysis	Aggregated data across platforms improves price discovery research
Liquidity routing	Smart order routing requires knowing equivalent markets

1.3 Empirical Evidence: Text Similarity Fails¶

We tested text similarity (Jaccard + Levenshtein) on known market pairs:

Kalshi Title	Polymarket Title	Jaccard	Combined Score
"Will Trump buy Greenland?"	"Will the US acquire part of Greenland in 2026?"	8.3%	22.1%
"Will Washington win the 2026 Pro Football Championship?"	"Super Bowl Champion 2026"	9.1%	18.5%
"Fed rate cut before June 2026?"	"FOMC to lower rates in Q2 2026?"	12.4%	24.8%

Conclusion: Text similarity algorithms cannot reliably identify semantically equivalent markets. A 60% threshold would miss all valid matches; a 10% threshold would generate thousands of false positives.

2. Industry Landscape¶

2.1 Commercial Solutions¶

Tool	Approach	Matching Method	Limitations
Matchr	Curated aggregator	Human-curated database of 1,500+ matched markets	Not programmable, no API
Dome API	Unified API	Manual market mapping by Dome team	Subscription cost, external dependency
EventArb	Arbitrage calculator	Manual market selection by user	No automated discovery
Verso	Terminal UI	Internal normalization layer	Closed source, no matching API

2.2 Open Source Solutions¶

Tool	Approach	Matching Method	Limitations
pmxt	Unified library	Slug-based configuration	Manual matching required
Polymarket-Kalshi-Arbitrage-Bot	Arbitrage bot	Entity extraction + text similarity	Limited documentation
Various GitHub bots	Custom implementations	Heuristic string/date matching	Brittle, unmaintained

2.3 Key Industry Insight¶

No production tool relies solely on text similarity. All successful implementations use one or more of:

Manual curation: Human-verified match databases (Matchr, Dome)
Slug/ID configuration: User specifies which markets to compare (pmxt)
Entity extraction: Extract structured fields and match on semantics
Hybrid approaches: Combine multiple signals with human verification

3. Solution Options Analysis¶

3.1 Option A: Pure Text Similarity¶

Approach: Compute string similarity (Jaccard, Levenshtein, cosine) between market titles.

score = 0.6 × Jaccard(tokens) + 0.4 × Levenshtein_normalized

Criterion	Assessment
Accuracy	Low (8-9% on real pairs)
Cost	No per-match API costs
Latency	< 1ms
Complexity	Low
Explainability	High
Production Ready	No

Why it fails: - Synonym blindness: "Super Bowl" ≠ "Pro Football Championship" - Paraphrase blindness: "buy" ≠ "acquire" - Stop word dilution: Signal words overwhelmed by common words

Verdict: Insufficient for production use.

3.2 Option B: Fingerprint-Based Matching¶

Approach: Extract structured "fingerprints" from markets and match on canonical fields.

struct MarketFingerprint {
    entity: String,              // "Trump", "Bitcoin", "Fed"
    secondary_entities: Vec<String>,
    event_type: EventType,       // Election, PriceTarget, Economic
    metric: Option<MetricSpec>,  // "price >= $100,000"
    resolution_date: Option<Date>,
    resolution_source: Option<String>,
    outcome_type: OutcomeType,   // Binary, Multi, Range
}

Criterion	Assessment
Accuracy	High (matches on semantics)
Cost	No per-match API costs (local on cached data)
Latency	~10ms (entity extraction)
Complexity	Medium
Explainability	High (field-by-field)
Production Ready	Yes

Algorithm: 1. Extract fingerprint from each market title + rules 2. Generate candidates by keyword/date overlap (fast) 3. Score fingerprint similarity with weighted fields 4. Create candidate for human review if score ≥ 0.70

Field weights (empirically tuned): - Entity match: 30% - Date match: 25% - Threshold match: 20% - Outcome structure: 15% - Resolution source: 10%

Verdict: Recommended for production.

3.3 Option C: Embedding-Based Semantic Matching¶

Approach: Generate dense vector embeddings of market titles and compute cosine similarity.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

emb_kalshi = model.encode("Will Trump buy Greenland?")
emb_poly = model.encode("Will the US acquire part of Greenland in 2026?")
similarity = cosine_similarity(emb_kalshi, emb_poly)
# Expected: ~0.75-0.85 (much better than Jaccard)

Criterion	Assessment
Accuracy	Highest (captures semantic meaning)
Cost	~$0.0001/embedding (API) or free (local)
Latency	50-200ms per embedding
Complexity	High (embedding service, vector DB)
Explainability	Low (black box similarity)
Production Ready	Yes, but overkill for MVP

Advantages: - Captures semantic equivalence automatically - No manual entity pattern maintenance - Works on novel market types

Disadvantages: - Black box: hard to explain why two markets match - Requires embedding infrastructure - May match semantically similar but not identical markets

Verdict: Recommended for Phase 3 enhancement (after fingerprint foundation in Phase 2).

3.4 Option D: LLM-Based Verification¶

Approach: Use an LLM to verify whether two markets are equivalent.

Prompt: Are these two markets about the same event?
Market A: "Will Trump buy Greenland?"
Market B: "Will the US acquire part of Greenland in 2026?"

Consider:
1. Are they about the same underlying event?
2. Do they have compatible resolution criteria?
3. Would a "Yes" on one correspond to "Yes" on the other?

Output: { "match": true, "confidence": 0.92, "warnings": ["Different date scopes"] }

Criterion	Assessment
Accuracy	Highest (human-level reasoning)
Cost	~$0.01-0.05 per verification
Latency	200-500ms per call
Complexity	Low (API call)
Explainability	High (LLM provides reasoning)
Production Ready	Yes, for high-value verification

Use cases: - Final verification before approving high-value matches - Edge cases where fingerprint matching is uncertain - Resolution criteria comparison

Verdict: Recommended for high-confidence final verification.

3.5 Option E: External Service Integration¶

Approach: Use existing matching services (Matchr, Dome) as data sources.

Service	Integration Method	Data Quality	Dependency Risk
Matchr	Scrape or unofficial API	High (curated)	Medium (no official API)
Dome	Official SDK	High	High (paid, availability)
pmxt	NPM library	Medium	Low (open source)

Criterion	Assessment
Accuracy	High (curated by experts)
Cost	$0-$500/month depending on service
Latency	100-500ms per query
Complexity	Low (API integration)
Explainability	Medium (external black box)
Production Ready	Yes, as validation source

Advantages: - Immediate access to curated match database - No matching logic maintenance - Validation source for our own matching

Disadvantages: - External dependency (availability, pricing changes) - May not cover all markets we care about - No customization of matching logic

Verdict: Recommended as validation/fallback source.

4. Recommended Architecture¶

4.1 Three-Stage Pipeline¶

┌─────────────────────────────────────────────────────────────┐
│                    Stage 1: Discovery                        │
│  ┌──────────────┐        ┌──────────────┐                   │
│  │ Polymarket   │        │   Kalshi     │                   │
│  │ Gamma API    │        │ /v2/markets  │                   │
│  └──────┬───────┘        └──────┬───────┘                   │
│         │                       │                            │
│         └───────────┬───────────┘                            │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │  Market Enumeration   │                           │
│         │  - Pagination         │                           │
│         │  - mve_filter=exclude │                           │
│         │  - Category filtering │                           │
│         └───────────┬───────────┘                           │
└─────────────────────┼───────────────────────────────────────┘
                      │
┌─────────────────────┼───────────────────────────────────────┐
│                     ▼         Stage 2: Matching             │
│         ┌───────────────────────┐                           │
│         │ Fingerprint Extractor │                           │
│         │  - Entity NER         │                           │
│         │  - Date parsing       │                           │
│         │  - Threshold parsing  │                           │
│         └───────────┬───────────┘                           │
│                     │                                        │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │ Candidate Generation  │                           │
│         │  - Keyword index      │                           │
│         │  - Date proximity     │                           │
│         │  - Top N candidates   │                           │
│         └───────────┬───────────┘                           │
│                     │                                        │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │ Fingerprint Matching  │                           │
│         │  - Field-by-field     │                           │
│         │  - Weighted scoring   │                           │
│         │  - Threshold ≥ 0.70   │                           │
│         └───────────┬───────────┘                           │
└─────────────────────┼───────────────────────────────────────┘
                      │
┌─────────────────────┼───────────────────────────────────────┐
│                     ▼       Stage 3: Verification           │
│         ┌───────────────────────┐                           │
│         │  Semantic Warnings    │                           │
│         │  - Resolution diff    │                           │
│         │  - Date diff          │                           │
│         │  - Source diff        │                           │
│         └───────────┬───────────┘                           │
│                     │                                        │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │   Human Review CLI    │                           │
│         │  - Fingerprint diff   │                           │
│         │  - Warning ack        │                           │
│         │  - Approve/Reject     │                           │
│         └───────────┬───────────┘                           │
│                     │                                        │
│                     ▼                                        │
│         ┌───────────────────────┐                           │
│         │  Verified Mapping     │                           │
│         │  - MappingManager     │                           │
│         │  - Audit log          │                           │
│         └───────────────────────┘                           │
└─────────────────────────────────────────────────────────────┘

4.2 Fingerprint Schema¶

/// Canonical market fingerprint for cross-platform matching
pub struct MarketFingerprint {
    /// Primary entity (person, asset, institution)
    pub entity: Entity,

    /// Secondary entities (locations, counterparties)
    pub secondary_entities: Vec<Entity>,

    /// Event classification
    pub event_type: EventType,

    /// Numeric metric and threshold
    pub metric: Option<MetricSpec>,

    /// Geographic or jurisdictional scope
    pub scope: Option<Scope>,

    /// Resolution timing
    pub resolution: ResolutionSpec,

    /// Outcome structure
    pub outcomes: OutcomeSpec,

    /// Original market data (for reference)
    pub source: SourceData,
}

pub struct Entity {
    pub name: String,
    pub entity_type: EntityType,
    pub aliases: Vec<String>,
}

pub enum EntityType {
    Person,      // Trump, Biden, Musk
    Asset,       // Bitcoin, ETH, Gold
    Institution, // Fed, FOMC, ECB
    Team,        // Chiefs, Eagles, Lakers
    Location,    // Greenland, Ukraine, Taiwan
    Event,       // Super Bowl, World Series
}

pub enum EventType {
    Election,
    PriceTarget,
    EconomicIndicator,
    SportOutcome,
    Acquisition,
    PolicyDecision,
    WeatherEvent,
    Other(String),
}

pub struct MetricSpec {
    pub name: String,
    pub direction: Direction,
    pub threshold: f64,
    pub unit: Option<String>,
}

pub enum Direction {
    Above,    // >= threshold
    Below,    // <= threshold
    Between,  // within range
    Exactly,  // == threshold
}

pub struct ResolutionSpec {
    pub date: Option<NaiveDate>,
    pub time: Option<NaiveTime>,
    pub timezone: Option<Tz>,
    pub source: Option<String>,
    pub criteria: Option<String>,
}

pub struct OutcomeSpec {
    pub outcome_type: OutcomeType,
    pub outcomes: Vec<String>,
}

pub enum OutcomeType {
    Binary,       // Yes/No
    MultiOutcome, // Multiple options
    Range,        // Numeric ranges
}

4.3 Matching Algorithm¶

impl FingerprintMatcher {
    pub fn match_score(&self, fp1: &MarketFingerprint, fp2: &MarketFingerprint) -> MatchResult {
        let mut score = 0.0;
        let mut field_scores = HashMap::new();

        // Primary entity match (30%)
        let entity_score = self.entity_similarity(&fp1.entity, &fp2.entity);
        field_scores.insert("entity", entity_score);
        score += 0.30 * entity_score;

        // Resolution date match (25%)
        let date_score = self.date_similarity(&fp1.resolution, &fp2.resolution);
        field_scores.insert("date", date_score);
        score += 0.25 * date_score;

        // Metric/threshold match (20%)
        let metric_score = self.metric_similarity(&fp1.metric, &fp2.metric);
        field_scores.insert("metric", metric_score);
        score += 0.20 * metric_score;

        // Outcome structure match (15%)
        let outcome_score = self.outcome_similarity(&fp1.outcomes, &fp2.outcomes);
        field_scores.insert("outcome", outcome_score);
        score += 0.15 * outcome_score;

        // Resolution source match (10%)
        let source_score = self.source_similarity(&fp1.resolution, &fp2.resolution);
        field_scores.insert("source", source_score);
        score += 0.10 * source_score;

        // Generate warnings
        let warnings = self.generate_warnings(fp1, fp2);

        MatchResult {
            score,
            field_scores,
            warnings,
            is_candidate: score >= self.threshold,
        }
    }

    fn entity_similarity(&self, e1: &Entity, e2: &Entity) -> f64 {
        // Exact match
        if e1.name.to_lowercase() == e2.name.to_lowercase() {
            return 1.0;
        }

        // Alias match
        for alias in &e1.aliases {
            if alias.to_lowercase() == e2.name.to_lowercase() {
                return 0.95;
            }
        }
        for alias in &e2.aliases {
            if alias.to_lowercase() == e1.name.to_lowercase() {
                return 0.95;
            }
        }

        // Same type, different entity
        if e1.entity_type == e2.entity_type {
            // Could add fuzzy string matching here
            return 0.0;
        }

        0.0
    }

    fn date_similarity(&self, r1: &ResolutionSpec, r2: &ResolutionSpec) -> f64 {
        match (&r1.date, &r2.date) {
            (Some(d1), Some(d2)) => {
                let diff = (*d1 - *d2).num_days().abs();
                match diff {
                    0 => 1.0,
                    1..=7 => 0.8,
                    8..=14 => 0.6,
                    15..=30 => 0.4,
                    _ => 0.0,
                }
            }
            (None, None) => 0.5, // Both unspecified
            _ => 0.2, // One specified, one not
        }
    }
}

5. Implementation Considerations¶

5.1 Entity Extraction Strategy¶

Phase 1: Rule-Based NER

const ENTITY_PATTERNS: &[(&str, EntityType)] = &[
    // Persons
    (r"(?i)\b(Trump|Biden|Harris|Obama|Musk|Zuckerberg)\b", EntityType::Person),

    // Assets
    (r"(?i)\b(Bitcoin|BTC|Ethereum|ETH|Gold|S&P|SPX)\b", EntityType::Asset),

    // Institutions
    (r"(?i)\b(Fed|FOMC|ECB|BoE|SEC|FTC)\b", EntityType::Institution),

    // Economic indicators
    (r"(?i)\b(CPI|GDP|NFP|unemployment|inflation)\b", EntityType::EconomicIndicator),

    // Sports
    (r"(?i)\b(Super Bowl|World Series|NBA Finals|Stanley Cup)\b", EntityType::Event),

    // Price targets
    (r"\$[\d,]+(?:\.\d+)?(?:k|K|M|B)?", EntityType::PriceTarget),

    // Dates/years
    (r"(?i)\b(20\d{2}|Q[1-4]|January|February|...)\b", EntityType::Date),
];

Phase 2: ML-Based NER (if rule-based insufficient)

# Using spaCy with custom training
import spacy
nlp = spacy.load("en_core_web_lg")

# Add custom patterns for prediction market entities
ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [
    {"label": "CRYPTO", "pattern": [{"LOWER": {"IN": ["bitcoin", "btc", "ethereum", "eth"]}}]},
    {"label": "INDICATOR", "pattern": [{"LOWER": {"IN": ["cpi", "gdp", "nfp", "fomc"]}}]},
]
ruler.add_patterns(patterns)

5.2 Synonym and Alias Handling¶

Build an entity alias database:

lazy_static! {
    static ref ENTITY_ALIASES: HashMap<&'static str, Vec<&'static str>> = {
        let mut m = HashMap::new();
        m.insert("Bitcoin", vec!["BTC", "bitcoin", "₿"]);
        m.insert("Ethereum", vec!["ETH", "ethereum", "Ether"]);
        m.insert("Super Bowl", vec!["Pro Football Championship", "NFL Championship"]);
        m.insert("Trump", vec!["Donald Trump", "President Trump", "DJT"]);
        m.insert("Fed", vec!["Federal Reserve", "FOMC", "Jerome Powell"]);
        m
    };
}

5.3 Date Parsing¶

Handle various date formats:

fn parse_resolution_date(text: &str) -> Option<NaiveDate> {
    let patterns = [
        // ISO format
        r"(\d{4}-\d{2}-\d{2})",
        // US format
        r"(January|February|...) (\d{1,2}),? (\d{4})",
        // Quarter
        r"Q([1-4]) (\d{4})",
        // End of year
        r"end of (\d{4})",
        // Before date
        r"before (January|February|...) (\d{1,2}),? (\d{4})",
    ];

    for pattern in patterns {
        if let Some(caps) = Regex::new(pattern).unwrap().captures(text) {
            return parse_captures(&caps);
        }
    }

    None
}

5.4 Performance Optimization¶

Candidate Generation with Inverted Index

pub struct MarketIndex {
    // Inverted index: keyword -> market IDs
    keyword_index: HashMap<String, Vec<MarketId>>,

    // Date index: date -> market IDs
    date_index: BTreeMap<NaiveDate, Vec<MarketId>>,

    // Entity index: entity -> market IDs
    entity_index: HashMap<String, Vec<MarketId>>,
}

impl MarketIndex {
    pub fn find_candidates(&self, market: &DiscoveredMarket, limit: usize) -> Vec<MarketId> {
        let mut scores: HashMap<MarketId, f32> = HashMap::new();

        // Keyword overlap
        for keyword in extract_keywords(&market.title) {
            if let Some(ids) = self.keyword_index.get(&keyword) {
                for id in ids {
                    *scores.entry(*id).or_default() += 1.0;
                }
            }
        }

        // Date proximity boost
        if let Some(date) = market.resolution_date {
            for (d, ids) in self.date_index.range(date - 14..=date + 14) {
                let proximity = 1.0 - ((*d - date).num_days().abs() as f32 / 14.0);
                for id in ids {
                    *scores.entry(*id).or_default() += proximity;
                }
            }
        }

        // Return top N by score
        let mut candidates: Vec<_> = scores.into_iter().collect();
        candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        candidates.into_iter().take(limit).map(|(id, _)| id).collect()
    }
}

6. Evaluation Framework¶

6.1 Test Data Set¶

Create a "golden set" of known market pairs for validation:

ID	Kalshi Market	Polymarket Market	Expected Match
1	KXGREENLAND-29	greenland-2026	Yes
2	KXSB-26-KC	super-bowl-2026-chiefs	Yes
3	KXBTC-100K	btc-100k-2026	Yes
4	KXFOMC-JAN26	fed-rate-cut-jan-2026	Yes
5	KXGREENLAND-29	trump-second-term	No (different event)
6	KXSB-26-KC	nba-finals-2026	No (different sport)

6.2 Metrics¶

Metric	Definition	Target
Precision	True matches / All proposed matches	≥ 95%
Recall	True matches / All actual matches	≥ 80%
F1 Score	2 × (P × R) / (P + R)	≥ 0.87
False Positive Rate	False matches / All proposed	≤ 5%
Latency	Time per market pair comparison	≤ 50ms

6.3 Evaluation Protocol¶

# Run evaluation against golden set
cargo run --features discovery -- --evaluate-matching --golden-set data/golden_pairs.json

# Output:
# Precision: 96.2%
# Recall: 82.4%
# F1 Score: 0.888
# False Positive Rate: 3.8%
# Average Latency: 12.3ms

7. Phase 3: Embedding-Based Semantic Matching¶

7.1 Overview¶

Embedding-based matching captures semantic similarity that fingerprint matching may miss. By representing market titles as dense vectors in a semantic space, we can identify matches even when there's no lexical overlap.

Key Insight: Embeddings trained on general text understand that "Super Bowl" and "Pro Football Championship" are semantically related, even though they share no words.

7.2 Model Selection¶

Candidate Models¶

Model	Dimensions	Latency	Domain Fit	Cost
`all-MiniLM-L6-v2`	384	15ms	Medium	Free (local)
`all-mpnet-base-v2`	768	40ms	High	Free (local)
`text-embedding-3-small`	1536	50ms	High	$0.00002/1K tokens
`voyage-finance-2`	1024	60ms	High (finance)	$0.00012/1K tokens
`e5-large-v2`	1024	35ms	High	Free (local)

Selection Criteria¶

def evaluate_model(model_name: str, golden_pairs: list) -> ModelMetrics:
    """Evaluate embedding model on prediction market pairs."""
    model = load_model(model_name)

    # Compute embeddings
    embeddings = {}
    for pair in golden_pairs:
        embeddings[pair.kalshi_id] = model.encode(pair.kalshi_title)
        embeddings[pair.poly_id] = model.encode(pair.poly_title)

    # Calculate metrics
    true_positives = 0
    false_positives = 0

    for pair in golden_pairs:
        sim = cosine_similarity(
            embeddings[pair.kalshi_id],
            embeddings[pair.poly_id]
        )
        if pair.is_match:
            if sim >= 0.70:
                true_positives += 1
        else:
            if sim >= 0.70:
                false_positives += 1

    return ModelMetrics(
        precision=true_positives / (true_positives + false_positives),
        recall=true_positives / sum(p.is_match for p in golden_pairs),
        avg_latency=measure_latency(model)
    )

Recommended Model¶

Primary: all-mpnet-base-v2 for local deployment (best accuracy/latency tradeoff) Alternative: text-embedding-3-small if API latency is acceptable

7.3 Vector Storage Architecture¶

Option A: SQLite with sqlite-vec (Simple)¶

-- Schema extension for embeddings
CREATE VIRTUAL TABLE market_embeddings USING vec0(
    market_id TEXT PRIMARY KEY,
    embedding FLOAT[768]  -- Match model dimensions
);

-- Fast ANN search
SELECT market_id, distance
FROM market_embeddings
WHERE embedding MATCH ?
  AND k = 50  -- Top 50 candidates
ORDER BY distance;

Pros: Simple, single-file database, no additional infrastructure Cons: In-memory index, limited scalability

Option B: PostgreSQL with pgvector (Production)¶

-- Enable extension
CREATE EXTENSION vector;

-- Add embedding column
ALTER TABLE discovered_markets
ADD COLUMN embedding vector(768);

-- Create IVFFlat index for ANN search
CREATE INDEX ON discovered_markets
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Query similar markets
SELECT platform_id, title,
       1 - (embedding <=> query_embedding) AS similarity
FROM discovered_markets
WHERE platform = 'polymarket'
ORDER BY embedding <=> query_embedding
LIMIT 50;

Pros: Scalable, mature, supports filtering during search Cons: Requires PostgreSQL infrastructure

7.4 Hybrid Scoring Algorithm¶

/// Combine fingerprint, embedding, and text similarity scores
pub struct HybridMatcher {
    fingerprint_matcher: FingerprintMatcher,
    embedding_matcher: EmbeddingMatcher,
    text_matcher: TextSimilarityMatcher,

    // Configurable weights (tuned via feedback)
    weights: HybridWeights,
}

#[derive(Clone)]
pub struct HybridWeights {
    pub fingerprint: f64,  // Default: 0.50
    pub embedding: f64,    // Default: 0.40
    pub text: f64,         // Default: 0.10
}

impl HybridMatcher {
    pub async fn score(&self, kalshi: &Market, poly: &Market) -> HybridScore {
        // Run all matchers in parallel
        let (fp_score, emb_score, text_score) = tokio::join!(
            self.fingerprint_matcher.score(kalshi, poly),
            self.embedding_matcher.score(kalshi, poly),
            self.text_matcher.score(kalshi, poly),
        );

        let combined = self.weights.fingerprint * fp_score.score
                     + self.weights.embedding * emb_score.similarity
                     + self.weights.text * text_score.combined;

        HybridScore {
            combined,
            fingerprint: fp_score,
            embedding: emb_score,
            text: text_score,
            is_candidate: combined >= 0.70,
        }
    }
}

7.5 Confidence Calibration¶

Raw similarity scores need calibration to meaningful confidence levels:

from sklearn.isotonic import IsotonicRegression

class ConfidenceCalibrator:
    def __init__(self):
        self.calibrator = IsotonicRegression(out_of_bounds='clip')

    def fit(self, scores: list[float], labels: list[bool]):
        """Fit calibrator on historical match decisions."""
        self.calibrator.fit(scores, [1.0 if l else 0.0 for l in labels])

    def calibrate(self, score: float) -> float:
        """Convert raw score to calibrated probability."""
        return self.calibrator.predict([score])[0]

Target Calibration: A score of 0.80 should mean "80% of pairs with this score are true matches"

7.6 Fine-Tuning Pipeline¶

When sufficient training data is available (500+ pairs), fine-tune the embedding model:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def fine_tune_on_matches(
    base_model: str,
    positive_pairs: list[tuple[str, str]],
    negative_pairs: list[tuple[str, str]],
    output_path: str
):
    """Fine-tune embedding model on prediction market pairs."""
    model = SentenceTransformer(base_model)

    # Create training examples
    train_examples = []
    for k_title, p_title in positive_pairs:
        train_examples.append(InputExample(texts=[k_title, p_title], label=1.0))
    for k_title, p_title in negative_pairs:
        train_examples.append(InputExample(texts=[k_title, p_title], label=0.0))

    # Use contrastive loss
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.ContrastiveLoss(model)

    # Fine-tune
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100,
        output_path=output_path
    )

    return model

Expected Improvement: +5-10% F1 score on domain-specific pairs after fine-tuning

8. Phase 4: LLM-Based Verification¶

8.1 Overview¶

LLM verification provides human-level reasoning for complex cases where algorithmic matching is uncertain. It excels at: - Understanding paraphrased questions - Comparing resolution criteria semantically - Identifying subtle differences in scope or timing

8.2 Verification Prompt Engineering¶

Primary Verification Prompt¶

<system>
You are an expert analyst for prediction markets. Your task is to determine if two markets from different platforms are semantically equivalent - meaning they will resolve the same way for the same real-world outcome.
</system>

<user>
Compare these two prediction markets:

**MARKET A (Kalshi)**
- Title: {kalshi_title}
- Resolution Criteria: {kalshi_rules}
- Expiration: {kalshi_expiration}

**MARKET B (Polymarket)**
- Title: {poly_title}
- Resolution Criteria: {poly_rules}
- Expiration: {poly_expiration}

Analyze the following:
1. **Same Event?** Are both markets about the identical real-world event (not just similar topics)?
2. **Outcome Alignment?** Would "Yes" on Market A always correspond to "Yes" on Market B?
3. **Resolution Compatibility?** Are the resolution criteria functionally equivalent?
4. **Timing Differences?** Could different resolution timing cause different outcomes?
5. **Scope Differences?** Do they cover the same geographic/temporal/jurisdictional scope?

Respond in JSON format:
{
  "equivalent": true|false,
  "confidence": 0.0-1.0,
  "same_event": true|false,
  "outcome_aligned": true|false,
  "resolution_compatible": true|false,
  "reasoning": "Brief explanation",
  "warnings": ["List of potential issues"],
  "resolution_differences": ["Specific criteria differences if any"]
}
</user>

Resolution Deep-Dive Prompt (for complex cases)¶

<system>
You are a legal analyst specializing in prediction market resolution criteria. Analyze resolution clauses for semantic equivalence.
</system>

<user>
Compare these resolution criteria in detail:

**Criteria A:**
{criteria_a}

**Criteria B:**
{criteria_b}

Analyze:
1. **Resolution Source**: Who/what determines the outcome? Same authority?
2. **Resolution Timing**: When is the outcome determined? Same timeframe?
3. **Threshold Definition**: What constitutes Yes vs No? Same threshold?
4. **Edge Cases**: How are ambiguous situations handled? Compatible?
5. **Invalidation Conditions**: What causes market cancellation? Same conditions?

Provide structured comparison with compatibility score (0-100).
</user>

8.3 Cost-Optimized Invocation Strategy¶

Tiered Model Selection¶

pub struct LlmVerifier {
    haiku_client: AnthropicClient,   // ~$0.001/verification
    sonnet_client: AnthropicClient,  // ~$0.01/verification
    daily_budget: AtomicU64,
    daily_spend: AtomicU64,
}

impl LlmVerifier {
    pub async fn verify(&self, pair: &CandidateMatch) -> Result<LlmResult, Error> {
        // Check budget
        if self.daily_spend.load(Ordering::SeqCst) >= self.daily_budget.load(Ordering::SeqCst) {
            return Err(Error::BudgetExceeded);
        }

        // Use Haiku for initial screening
        let haiku_result = self.verify_with_haiku(pair).await?;

        // Escalate to Sonnet if uncertain or high-value
        if haiku_result.confidence < 0.85 || pair.estimated_volume > 10_000.0 {
            let sonnet_result = self.verify_with_sonnet(pair).await?;
            return Ok(sonnet_result);
        }

        Ok(haiku_result)
    }
}

Invocation Rules¶

Condition	Action	Estimated Cost
Fingerprint score < 0.60	Skip LLM (reject)	$0
Fingerprint score 0.60-0.85	Invoke Haiku	$0.001
Fingerprint score > 0.85	Skip LLM (approve)	$0
Haiku uncertain (<0.85)	Escalate to Sonnet	$0.01
High-value market (>$10k volume)	Always use Sonnet	$0.01
Semantic warnings present	Always use Sonnet	$0.01

Budget Management¶

pub struct BudgetManager {
    daily_limit_cents: u64,
    current_spend_cents: AtomicU64,
    last_reset: AtomicU64,
}

impl BudgetManager {
    pub fn can_spend(&self, amount_cents: u64) -> bool {
        self.maybe_reset_daily();
        let current = self.current_spend_cents.load(Ordering::SeqCst);
        current + amount_cents <= self.daily_limit_cents
    }

    pub fn record_spend(&self, amount_cents: u64) {
        self.current_spend_cents.fetch_add(amount_cents, Ordering::SeqCst);
    }
}

Default Budget: $50/day (~5,000 Haiku calls or ~500 Sonnet calls)

8.4 Response Parsing and Validation¶

#[derive(Deserialize, Debug)]
pub struct LlmVerificationResult {
    pub equivalent: bool,
    pub confidence: f64,
    pub same_event: bool,
    pub outcome_aligned: bool,
    pub resolution_compatible: bool,
    pub reasoning: String,
    pub warnings: Vec<String>,
    pub resolution_differences: Vec<String>,
}

impl LlmVerificationResult {
    /// Validate LLM response for consistency
    pub fn validate(&self) -> Result<(), ValidationError> {
        // Confidence must be between 0 and 1
        if self.confidence < 0.0 || self.confidence > 1.0 {
            return Err(ValidationError::InvalidConfidence);
        }

        // If equivalent is true, all sub-checks should be true
        if self.equivalent && (!self.same_event || !self.outcome_aligned) {
            return Err(ValidationError::InconsistentFlags);
        }

        // Must have reasoning
        if self.reasoning.is_empty() {
            return Err(ValidationError::MissingReasoning);
        }

        Ok(())
    }

    /// Convert to human-readable report
    pub fn to_report(&self) -> String {
        format!(
            "Equivalent: {} (confidence: {:.0}%)\n\
             Reasoning: {}\n\
             Warnings: {}\n\
             Resolution Differences: {}",
            if self.equivalent { "Yes" } else { "No" },
            self.confidence * 100.0,
            self.reasoning,
            self.warnings.join(", "),
            self.resolution_differences.join("; ")
        )
    }
}

8.5 Human Review of LLM Decisions¶

Initially, all LLM-verified matches require human confirmation:

┌─────────────────────────────────────────────────────────────┐
│  LLM Verification Result                                    │
├─────────────────────────────────────────────────────────────┤
│  Kalshi: "Will Trump buy Greenland?"                       │
│  Polymarket: "Will the US acquire part of Greenland?"      │
│                                                             │
│  LLM Says: EQUIVALENT (confidence: 92%)                    │
│                                                             │
│  Reasoning: Both markets resolve on US acquisition of      │
│  Greenland territory. Kalshi frames as "Trump" action,     │
│  Polymarket as "US" action, but resolution criteria        │
│  both require actual transfer of territory.                │
│                                                             │
│  Warnings:                                                 │
│  - Different expiration dates (2029 vs 2026)               │
│                                                             │
│  [✓ Approve]  [✗ Reject]  [🔍 View Details]                │
└─────────────────────────────────────────────────────────────┘

Auto-Approval Criteria (after calibration)¶

After 100+ LLM decisions have been human-reviewed:

impl AutoApprovalPolicy {
    pub fn can_auto_approve(&self, result: &LlmVerificationResult) -> bool {
        // High confidence
        if result.confidence < 0.95 {
            return false;
        }

        // No warnings
        if !result.warnings.is_empty() {
            return false;
        }

        // No resolution differences
        if !result.resolution_differences.is_empty() {
            return false;
        }

        // Historical accuracy check
        if self.llm_historical_accuracy() < 0.98 {
            return false;
        }

        true
    }
}

9. Phase 5: Reinforcement Learning from Human Feedback¶

9.1 Overview¶

Human approval decisions are a rich source of training data. By systematically capturing and learning from these decisions, we can continuously improve all matching components.

Key Insight: Every human approval/rejection is a labeled training example that improves future matching accuracy.

9.2 Feedback Data Schema¶

CREATE TABLE match_decisions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    candidate_id UUID REFERENCES candidates(id),
    created_at TIMESTAMP DEFAULT now(),

    -- Decision
    decision TEXT CHECK (decision IN ('approved', 'rejected', 'modified')),
    reviewer_id TEXT NOT NULL,

    -- Context at decision time (for ML features)
    fingerprint_score REAL,
    embedding_similarity REAL,
    text_similarity REAL,
    llm_confidence REAL,
    llm_equivalent BOOLEAN,

    -- Human feedback
    rejection_reason TEXT,
    modification_notes TEXT,

    -- Entity corrections (for alias learning)
    entity_corrections JSONB,
    -- Example: {"kalshi_entity": "BTC", "poly_entity": "Bitcoin", "canonical": "Bitcoin"}

    -- Resolution analysis
    resolution_compatible BOOLEAN,
    resolution_notes TEXT,

    -- Training flags
    include_in_training BOOLEAN DEFAULT true,
    training_weight REAL DEFAULT 1.0  -- Higher for difficult cases
);

CREATE INDEX idx_decisions_training ON match_decisions(include_in_training)
    WHERE include_in_training = true;

9.3 Feedback Collection Pipeline¶

pub struct FeedbackCollector {
    storage: Arc<Storage>,
}

impl FeedbackCollector {
    /// Record a human decision with all context
    pub async fn record_decision(
        &self,
        candidate: &CandidateMatch,
        decision: Decision,
        reviewer: &str,
    ) -> Result<Uuid, Error> {
        let feedback = MatchDecision {
            id: Uuid::new_v4(),
            candidate_id: candidate.id,
            decision: decision.verdict,
            reviewer_id: reviewer.to_string(),

            // Capture all scores for feature analysis
            fingerprint_score: candidate.fingerprint_score,
            embedding_similarity: candidate.embedding_similarity,
            text_similarity: candidate.text_score,
            llm_confidence: candidate.llm_result.as_ref().map(|r| r.confidence),
            llm_equivalent: candidate.llm_result.as_ref().map(|r| r.equivalent),

            // Human feedback
            rejection_reason: decision.rejection_reason,
            entity_corrections: decision.entity_corrections,
            resolution_notes: decision.resolution_notes,

            include_in_training: true,
            training_weight: self.compute_training_weight(&candidate, &decision),
        };

        self.storage.insert_decision(&feedback).await?;

        // Trigger incremental learning if enabled
        if self.incremental_learning_enabled {
            self.trigger_incremental_update(&feedback).await?;
        }

        Ok(feedback.id)
    }

    /// Compute training weight (prioritize difficult/educational cases)
    fn compute_training_weight(&self, candidate: &CandidateMatch, decision: &Decision) -> f64 {
        let mut weight = 1.0;

        // Hard negatives (high score but rejected) are valuable
        if decision.verdict == "rejected" && candidate.fingerprint_score > 0.6 {
            weight *= 2.0;
        }

        // Cases with entity corrections are valuable
        if decision.entity_corrections.is_some() {
            weight *= 1.5;
        }

        // Edge cases near threshold are valuable
        if (candidate.fingerprint_score - 0.70).abs() < 0.1 {
            weight *= 1.5;
        }

        weight.min(5.0)  // Cap at 5x
    }
}

9.4 Automatic Alias Learning¶

pub struct AliasLearner {
    alias_db: Arc<RwLock<AliasDatabase>>,
}

impl AliasLearner {
    /// Learn from approved matches with entity differences
    pub async fn learn_from_approval(&self, decision: &MatchDecision) {
        // Learn from explicit corrections
        if let Some(corrections) = &decision.entity_corrections {
            for correction in corrections {
                self.add_alias(
                    &correction.canonical,
                    &correction.kalshi_entity,
                ).await;
                self.add_alias(
                    &correction.canonical,
                    &correction.poly_entity,
                ).await;
            }
        }

        // Learn implicit aliases from matched entity pairs
        let kalshi_entities = self.extract_entities(&decision.kalshi_title);
        let poly_entities = self.extract_entities(&decision.poly_title);

        for (k_ent, p_ent) in self.align_entities(&kalshi_entities, &poly_entities) {
            if k_ent.name != p_ent.name
                && k_ent.entity_type == p_ent.entity_type
                && self.string_similarity(&k_ent.name, &p_ent.name) < 0.5
            {
                // Different strings, same type, low similarity = alias
                log::info!("Learned alias: {} <-> {}", k_ent.name, p_ent.name);
                self.add_bidirectional_alias(&k_ent.name, &p_ent.name).await;
            }
        }
    }

    async fn add_alias(&self, canonical: &str, alias: &str) {
        let mut db = self.alias_db.write().await;
        db.add(canonical, alias);
    }
}

9.5 Fingerprint Weight Optimization¶

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def optimize_weights(decisions: list[MatchDecision]) -> dict[str, float]:
    """
    Use logistic regression to find optimal field weights.

    Each decision provides features (field scores) and label (approved/rejected).
    The learned coefficients indicate optimal weights.
    """
    # Extract features
    X = np.array([
        [d.entity_score, d.date_score, d.threshold_score,
         d.outcome_score, d.source_score]
        for d in decisions
    ])

    # Labels: 1 for approved, 0 for rejected
    y = np.array([1 if d.decision == 'approved' else 0 for d in decisions])

    # Fit logistic regression
    model = LogisticRegression(penalty='l2', C=1.0)
    model.fit(X, y)

    # Cross-validation
    scores = cross_val_score(model, X, y, cv=5)
    print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")

    # Extract and normalize weights
    raw_weights = np.abs(model.coef_[0])
    normalized = raw_weights / raw_weights.sum()

    return {
        'entity': normalized[0],
        'date': normalized[1],
        'threshold': normalized[2],
        'outcome': normalized[3],
        'source': normalized[4],
    }

9.6 Embedding Model Retraining¶

def retrain_embedding_model(
    base_model_path: str,
    decisions: list[MatchDecision],
    output_path: str
) -> EvaluationMetrics:
    """
    Retrain embedding model on accumulated human decisions.
    """
    # Prepare training data
    positive_pairs = [
        (d.kalshi_title, d.poly_title)
        for d in decisions if d.decision == 'approved'
    ]
    negative_pairs = [
        (d.kalshi_title, d.poly_title)
        for d in decisions if d.decision == 'rejected'
    ]

    # Load base model
    model = SentenceTransformer(base_model_path)

    # Create training examples with weights
    train_examples = []
    for d in decisions:
        if d.decision == 'approved':
            train_examples.append(
                InputExample(
                    texts=[d.kalshi_title, d.poly_title],
                    label=1.0
                )
            )
        elif d.decision == 'rejected' and d.fingerprint_score > 0.4:
            # Include hard negatives only
            train_examples.append(
                InputExample(
                    texts=[d.kalshi_title, d.poly_title],
                    label=0.0
                )
            )

    # Fine-tune
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.ContrastiveLoss(model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=2,
        warmup_steps=50
    )

    # Save
    model.save(output_path)

    # Evaluate on held-out test set
    return evaluate_model(output_path, test_pairs)

9.7 Continuous Improvement Pipeline¶

┌─────────────────────────────────────────────────────────────┐
│                  Weekly Improvement Cycle                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Monday: Data Export                                        │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ Export new decisions from past week                    │ │
│  │ Update golden set with new test cases                  │ │
│  │ Calculate current metrics baseline                     │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
│  Tuesday: Model Training                                    │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ Retrain embedding model on accumulated decisions       │ │
│  │ Optimize fingerprint weights via logistic regression   │ │
│  │ Update alias database with learned aliases             │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
│  Wednesday: Validation                                      │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ Evaluate new models on golden set                      │ │
│  │ Compare metrics to baseline                            │ │
│  │ Flag any regressions                                   │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
│  Thursday-Saturday: A/B Testing                             │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ Deploy new model to 10% of traffic                     │ │
│  │ Monitor precision/recall in production                 │ │
│  │ Collect additional feedback                            │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
│  Sunday: Promotion Decision                                 │
│  ┌───────────────────────────────────────────────────────┐ │
│  │ If A/B metrics improve: promote new model to 100%      │ │
│  │ If metrics regress: rollback to previous version       │ │
│  │ Update model registry with results                     │ │
│  └───────────────────────────────────────────────────────┘ │
│                                                             │
└─────────────────────────────────────────────────────────────┘

9.8 Model Versioning and Rollback¶

pub struct ModelRegistry {
    storage: Arc<Storage>,
    current_version: AtomicU64,
}

impl ModelRegistry {
    /// Register a new model version
    pub async fn register_version(
        &self,
        model_type: ModelType,
        artifact_path: &str,
        metrics: &EvaluationMetrics,
    ) -> Result<ModelVersion, Error> {
        let version = ModelVersion {
            id: Uuid::new_v4(),
            model_type,
            artifact_path: artifact_path.to_string(),
            precision: metrics.precision,
            recall: metrics.recall,
            f1_score: metrics.f1_score,
            created_at: Utc::now(),
            is_active: false,
            training_decisions_count: metrics.training_size,
        };

        self.storage.insert_model_version(&version).await?;
        Ok(version)
    }

    /// Promote a version to active (with automatic rollback on failure)
    pub async fn promote(&self, version_id: Uuid) -> Result<(), Error> {
        let previous = self.get_active_version().await?;

        // Activate new version
        self.storage.set_active_version(version_id).await?;

        // Monitor for 1 hour
        tokio::time::sleep(Duration::from_secs(3600)).await;

        // Check if metrics degraded
        let live_metrics = self.collect_live_metrics().await?;
        if live_metrics.f1_score < previous.f1_score - 0.02 {
            log::warn!("New model degraded metrics, rolling back");
            self.storage.set_active_version(previous.id).await?;
            return Err(Error::RollbackTriggered);
        }

        Ok(())
    }
}

10. Operational Excellence¶

This section covers deployment, monitoring, security, and operational considerations for production deployment of the market discovery system.

Availability Target: 99.9% uptime (43.8 minutes/month downtime allowed)

10.1 Deployment Architecture¶

The discovery feature deploys as part of the Trading Core ECS service with optional feature flag enablement:

┌─────────────────────────────────────────────────────────────┐
│                      AWS Region                              │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐   │
│  │                  ECS Cluster                         │   │
│  │  ┌───────────────────┐  ┌───────────────────┐       │   │
│  │  │   Trading Core    │  │   Trading Core    │       │   │
│  │  │  (--features      │  │  (--features      │       │   │
│  │  │   discovery)      │  │   discovery)      │       │   │
│  │  │                   │  │                   │       │   │
│  │  │  ┌─────────────┐  │  │  ┌─────────────┐  │       │   │
│  │  │  │  Scanner    │  │  │  │  Scanner    │  │       │   │
│  │  │  │  Actor      │  │  │  │  Actor      │  │       │   │
│  │  │  └─────────────┘  │  │  └─────────────┘  │       │   │
│  │  │                   │  │                   │       │   │
│  │  │  ┌─────────────┐  │  │  ┌─────────────┐  │       │   │
│  │  │  │  SQLite     │  │  │  │  SQLite     │  │       │   │
│  │  │  │  (local)    │  │  │  │  (local)    │  │       │   │
│  │  │  └─────────────┘  │  │  └─────────────┘  │       │   │
│  │  └─────────┬─────────┘  └─────────┬─────────┘       │   │
│  │            │                      │                  │   │
│  └────────────┼──────────────────────┼──────────────────┘   │
│               │                      │                       │
│               ▼                      ▼                       │
│  ┌─────────────────────────────────────────────────────┐   │
│  │           Aurora PostgreSQL (future)                 │   │
│  │     (shared state for multi-instance scaling)        │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                 AWS Secrets Manager                  │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │   │
│  │  │ Kalshi Keys │  │ Poly Keys   │  │ LLM API Keys│  │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Single-Instance MVP: - One scanner active at a time (ECS desired count = 1) - SQLite local storage sufficient for ~10,000 markets - No coordination needed between instances

Multi-Instance (Future Scaling): - PostgreSQL for shared candidate storage - Distributed locking for scan coordination (Redis) - Leader election for single-scanner pattern

10.2 CI/CD Pipeline¶

Feature-Gated Testing:

# .github/workflows/ci.yml
name: CI

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  base-tests:
    name: Base Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-action@stable
      - run: cargo test --manifest-path arbiter-engine/Cargo.toml

  discovery-tests:
    name: Discovery Feature Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-action@stable

      - name: Run unit tests
        run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery

      - name: Run integration tests
        run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored
        env:
          KALSHI_DEMO_KEY_ID: ${{ secrets.KALSHI_DEMO_KEY_ID }}
          KALSHI_DEMO_PRIVATE_KEY: ${{ secrets.KALSHI_DEMO_PRIVATE_KEY }}

  security-audit:
    name: Security Audit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo install cargo-audit
      - run: cargo audit --manifest-path arbiter-engine/Cargo.toml

Deployment Pipeline:

┌─────────────────────────────────────────────────────────────┐
│                    Deployment Pipeline                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐         │
│  │    PR      │───►│   CI/CD    │───►│   Review   │         │
│  │  Created   │    │   Tests    │    │  Required  │         │
│  └────────────┘    └────────────┘    └────────────┘         │
│                           │                  │               │
│                           ▼                  ▼               │
│                    ┌────────────┐    ┌────────────┐         │
│                    │   Merge    │◄───│  Approval  │         │
│                    │  to main   │    │            │         │
│                    └────────────┘    └────────────┘         │
│                           │                                  │
│                           ▼                                  │
│                    ┌────────────┐                            │
│                    │   Build    │                            │
│                    │   Docker   │                            │
│                    └────────────┘                            │
│                           │                                  │
│              ┌────────────┼────────────┐                    │
│              ▼            ▼            ▼                    │
│       ┌──────────┐ ┌──────────┐ ┌──────────┐               │
│       │  Staging │ │    E2E   │ │  Council │               │
│       │  Deploy  │ │  Tests   │ │  Review  │               │
│       └──────────┘ └──────────┘ └──────────┘               │
│              │            │            │                    │
│              └────────────┼────────────┘                    │
│                           ▼                                  │
│                    ┌────────────┐                            │
│                    │ Production │                            │
│                    │   Deploy   │                            │
│                    └────────────┘                            │
│                                                              │
└─────────────────────────────────────────────────────────────┘

10.3 Monitoring & Observability¶

CloudWatch Metrics:

Metric	Type	Description	Dashboard
`discovery/scan/duration`	Timer	Scan cycle duration	Discovery Health
`discovery/scan/errors`	Counter	Scan failures	Discovery Health
`discovery/candidates/count`	Gauge	Candidates generated	Candidate Funnel
`discovery/candidates/pending`	Gauge	Awaiting review	Candidate Funnel
`discovery/candidates/approved`	Counter	Approved matches	Candidate Funnel
`discovery/candidates/rejected`	Counter	Rejected candidates	Candidate Funnel
`discovery/api/rate_limits`	Counter	Rate limit errors	API Performance
`discovery/api/latency`	Timer	API response time	API Performance
`discovery/approvals/rate`	Gauge	Approval percentage	Quality Metrics

Structured Logging:

// Example structured log output
tracing::info!(
    scan_id = %scan_id,
    platform = "polymarket",
    markets_fetched = markets.len(),
    candidates_generated = candidates.len(),
    duration_ms = elapsed.as_millis(),
    "Discovery scan completed"
);

Log Levels: - ERROR: Scan failures, API errors, database corruption - WARN: Rate limit warnings, retry attempts, degraded mode - INFO: Scan completions, candidate counts, approval decisions - DEBUG: Individual market processing, matching scores

Dashboard Panels:

Discovery Health Overview
Scan success rate (24h rolling)
Average scan duration
Error count by type
Candidate Funnel
Generated → Pending → Approved/Rejected
Conversion rates
Time in pending state
API Performance
Latency p50/p95/p99 by platform
Rate limit error rate
Request volume

10.4 Security Hardening¶

API Security:

Platform	Authentication	Credential Storage	Rotation
Polymarket Gamma	None (public)	N/A	N/A
Kalshi	RSA-PSS	AWS Secrets Manager	90 days
LLM APIs (Phase 4)	API Key	AWS Secrets Manager	30 days

Rate Limiting: - Token bucket implementation prevents API abuse - Configurable limits per platform - Automatic backoff on 429 responses

Audit Trail (FR-MD-009):

{
  "timestamp": "2026-01-23T10:30:00Z",
  "event_type": "candidate_approved",
  "candidate_id": "uuid-here",
  "reviewer_id": "operator@example.com",
  "kalshi_market": "KXGREENLAND-29",
  "poly_market": "greenland-2026",
  "fingerprint_score": 0.82,
  "warnings_acknowledged": ["Different expiration dates"],
  "decision_notes": "Verified resolution criteria compatible"
}

Access Control (Future): - Discovery CLI requires shell access (current) - RBAC for approval workflow (Phase 2) - Audit trail for all decisions

10.5 Scaling Strategy¶

Phase 1 (MVP - Current):

Parameter	Value
Scanner instances	1
Markets per platform	~2,000
Scan interval	1 hour
Storage	SQLite (local)
Estimated cost	~$50/month (ECS)

Phase 2 (Scale):

Parameter	Value
Scanner instances	2-3 (leader election)
Markets per platform	~10,000
Scan interval	15 minutes
Storage	PostgreSQL (Aurora)
Estimated cost	~$200/month

Phase 3 (Embedding):

Parameter	Value
Embedding service	Separate container
Vector database	pgvector extension
GPU acceleration	Optional (batch jobs)
Estimated cost	~$50/month additional

Phase 4 (LLM):

Parameter	Value
LLM service	Claude API
Budget controls	$50/day default
Caching	Response cache (24h)
Estimated cost	~$50-150/month

10.6 Disaster Recovery¶

Backup Strategy:

Data	Backup Frequency	Retention	Storage
SQLite DB	Daily	30 days	S3 Standard
Audit logs	Hourly	90 days hot, 7 years cold	S3 + Glacier
Configuration	On change	1 year	S3 + Git

Recovery Procedures:

Scenario	RTO	RPO	Procedure
Scanner crash	5 min	0	ECS auto-restart, health checks
Database corruption	30 min	24 hr	Restore from S3, verify integrity
API outage (external)	N/A	N/A	Graceful degradation, alerts
Region failure	4 hr	1 hr	Cross-region restore, DNS failover

Graceful Degradation Modes:

Polymarket API Down:
Continue scanning Kalshi only
Alert on-call
Retry with exponential backoff
Kalshi API Down:
Continue scanning Polymarket only
Alert on-call
Retry with exponential backoff
Database Unavailable:
Enter read-only mode
Serve cached candidates
Alert P1
Embedding Service Down (Phase 3):
Fallback to fingerprint-only matching
Log degraded mode
No data loss

11. Risk Analysis¶

11.1 Technical Risks¶

Risk	Probability	Impact	Mitigation
Entity extraction misses novel entities	Medium	Medium	Extensible pattern system, ML fallback
False positive matches lead to bad trades	Low	High	Human verification required (FR-MD-003)
API rate limiting blocks discovery	Medium	Low	Configurable backoff, caching
Resolution criteria differ subtly	High	High	Semantic warning system

11.2 Operational Risks¶

Risk	Probability	Impact	Mitigation
External service (Matchr/Dome) becomes unavailable	Medium	Low	Local matching is primary
Market format changes break parsing	Low	Medium	Robust error handling, alerts
High volume of candidates overwhelms reviewers	Medium	Medium	Confidence thresholds, batching

11.3 Business Risks¶

Risk	Probability	Impact	Mitigation
Competitors adopt better matching	Medium	Medium	Modular architecture allows upgrades
Platform TOS prohibit cross-platform arbitrage	Low	High	Legal review, compliance monitoring

12. Conclusion¶

12.1 Recommended Approach¶

Based on our analysis, we recommend a five-phase approach with progressively sophisticated matching:

Phase 1: Text similarity matching ✅ (Implemented)
Phase 2: Fingerprint-based matching with rule-based NER (Proposed)
Phase 3: Embedding-based semantic matching (hybrid scoring) (Proposed)
Phase 4: LLM verification for uncertain/high-value matches (Proposed)
Phase 5: Continuous improvement via human feedback learning (Proposed)

This approach: - Addresses the fundamental failure of text similarity - Aligns with industry best practices (entity extraction) - Preserves human-in-the-loop safety requirements - Creates a virtuous cycle where human decisions improve future matching - Provides graceful degradation (each phase works independently)

12.2 Implementation Priority¶

Phase	Scope	Priority	Effort
2a	Fingerprint schema + rule-based NER	Must	Medium
2b	Fingerprint matcher + weighted scoring	Must	Medium
2c	Golden set validation + tuning	Must	Low
3a	Embedding infrastructure + model selection	Should	Medium
3b	Hybrid scoring integration	Should	Medium
3c	Embedding fine-tuning pipeline	Could	High
4a	LLM verification prompts + integration	Should	Medium
4b	Automated escalation rules	Should	Low
4c	Resolution deep analysis	Could	Medium
5a	Feedback data collection	Should	Low
5b	Automatic alias/weight learning	Should	Medium
5c	Continuous retraining pipeline	Could	High

12.3 Success Criteria¶

Metric	Phase 2 Target	Phase 3+ Target
Recall	≥ 70%	≥ 85%
Precision	≥ 90%	≥ 95%
F1 Score	≥ 0.78	≥ 0.90
Latency (p99)	≤ 50ms	≤ 200ms
Human verification	100%	100% (safety preserved)

12.4 Key Innovation: Learning from Human Decisions¶

The most significant architectural decision is treating human approvals/rejections as training data:

                    ┌─────────────────┐
                    │  Human Reviews  │
                    └────────┬────────┘
                             │
        ┌────────────────────┼────────────────────┐
        ▼                    ▼                    ▼
┌───────────────┐  ┌─────────────────┐  ┌───────────────┐
│ Alias Updates │  │  Weight Tuning  │  │ Model Retrain │
└───────────────┘  └─────────────────┘  └───────────────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                    ┌─────────────────┐
                    │ Improved Models │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │ Better Matches  │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Human Reviews  │ ← (cycle continues)
                    └─────────────────┘

This creates a data flywheel where each human decision makes the system smarter, reducing future human workload while maintaining safety.

13. References¶

Industry Tools¶

pmxt - Unified API for prediction markets
Dome API - Developer infrastructure
Matchr - Cross-platform aggregator
EventArb - Arbitrage calculator
Polymarket-Kalshi-Arbitrage-Bot

Research¶

API Documentation¶

Appendix A: Entity Pattern Reference¶

// Full entity pattern list for rule-based NER
const ENTITY_PATTERNS: &[(&str, EntityType)] = &[
    // Politicians
    (r"(?i)\bTrump\b", EntityType::Person),
    (r"(?i)\bBiden\b", EntityType::Person),
    (r"(?i)\bHarris\b", EntityType::Person),
    (r"(?i)\bObama\b", EntityType::Person),
    (r"(?i)\bDeSantis\b", EntityType::Person),
    (r"(?i)\bNewsom\b", EntityType::Person),

    // Tech figures
    (r"(?i)\bMusk\b", EntityType::Person),
    (r"(?i)\bZuckerberg\b", EntityType::Person),
    (r"(?i)\bAltman\b", EntityType::Person),

    // Cryptocurrencies
    (r"(?i)\b(Bitcoin|BTC)\b", EntityType::Asset),
    (r"(?i)\b(Ethereum|ETH)\b", EntityType::Asset),
    (r"(?i)\b(Solana|SOL)\b", EntityType::Asset),
    (r"(?i)\b(XRP|Ripple)\b", EntityType::Asset),

    // Stocks/Indices
    (r"(?i)\b(S&P|SPX|SPY)\b", EntityType::Asset),
    (r"(?i)\b(Nasdaq|QQQ)\b", EntityType::Asset),
    (r"(?i)\b(Tesla|TSLA)\b", EntityType::Asset),
    (r"(?i)\b(Nvidia|NVDA)\b", EntityType::Asset),

    // Central banks
    (r"(?i)\b(Fed|Federal Reserve|FOMC)\b", EntityType::Institution),
    (r"(?i)\b(ECB)\b", EntityType::Institution),
    (r"(?i)\b(BoE|Bank of England)\b", EntityType::Institution),
    (r"(?i)\b(BoJ|Bank of Japan)\b", EntityType::Institution),

    // Economic indicators
    (r"(?i)\bCPI\b", EntityType::EconomicIndicator),
    (r"(?i)\bGDP\b", EntityType::EconomicIndicator),
    (r"(?i)\bNFP\b", EntityType::EconomicIndicator),
    (r"(?i)\b(unemployment|jobless)\b", EntityType::EconomicIndicator),
    (r"(?i)\binflation\b", EntityType::EconomicIndicator),

    // Sports events
    (r"(?i)\bSuper Bowl\b", EntityType::Event),
    (r"(?i)\bWorld Series\b", EntityType::Event),
    (r"(?i)\bNBA Finals\b", EntityType::Event),
    (r"(?i)\bStanley Cup\b", EntityType::Event),
    (r"(?i)\bWorld Cup\b", EntityType::Event),

    // Locations
    (r"(?i)\bGreenland\b", EntityType::Location),
    (r"(?i)\bUkraine\b", EntityType::Location),
    (r"(?i)\bTaiwan\b", EntityType::Location),
    (r"(?i)\bPanama\b", EntityType::Location),

    // Price targets
    (r"\$[\d,]+(?:\.\d+)?(?:k|K|M|B)?", EntityType::PriceTarget),

    // Dates
    (r"(?i)\b(20\d{2})\b", EntityType::Year),
    (r"(?i)\bQ[1-4]\b", EntityType::Quarter),
];

Appendix B: Sample Fingerprint Extraction¶

Input (Kalshi):

Title: "Will Trump buy Greenland?"
Rules: "Resolves Yes if US purchases at least part of Greenland from Denmark before January 20, 2029"

Output:

{
  "entity": {
    "name": "Trump",
    "entity_type": "Person",
    "aliases": ["Donald Trump", "DJT"]
  },
  "secondary_entities": [
    { "name": "Greenland", "entity_type": "Location" },
    { "name": "Denmark", "entity_type": "Location" },
    { "name": "US", "entity_type": "Location" }
  ],
  "event_type": "Acquisition",
  "metric": null,
  "scope": { "region": "US", "jurisdiction": "Federal" },
  "resolution": {
    "date": "2029-01-20",
    "timezone": null,
    "source": null,
    "criteria": "US purchases at least part of Greenland from Denmark"
  },
  "outcomes": {
    "outcome_type": "Binary",
    "outcomes": ["Yes", "No"]
  }
}

End of White Paper