Market Discovery Phase 2: Text Matching Engine

This post covers Phase 2 of ADR-017 - implementing the text similarity matching engine that powers automated market discovery between Polymarket and Kalshi.

The Problem

Phase 1 established the data types and storage layer. Now we need to actually find matching markets across platforms. The challenge:

  1. Fuzzy matching - Market titles differ in phrasing ("Will Trump win?" vs "Trump wins 2024?")
  2. False positives - Similar titles may have different settlement criteria
  3. Scalability - Must compare thousands of markets efficiently

Algorithm Design

Combined Similarity Scoring

We use a weighted combination of two complementary algorithms:

score = 0.6 × Jaccard + 0.4 × Levenshtein

Jaccard similarity (0.6 weight) measures token set overlap:

let intersection = set_a.intersection(&set_b).count();
let union = set_a.union(&set_b).count();
// Cast before dividing: integer division would truncate the ratio to 0 or 1
let jaccard = intersection as f64 / union as f64;

Because it ignores token order, the score stays high when two titles use the same words in a different order.

Levenshtein similarity (0.4 weight) measures edit distance:

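let distance = levenshtein(&norm_a, &norm_b);
// Assumes max_length is the character count of the longer normalized string
let max_length = norm_a.chars().count().max(norm_b.chars().count());
let levenshtein_sim = 1.0 - (distance as f64 / max_length as f64);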

This catches typos and minor variations.
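
The two measures and their weighted combination can be sketched end to end as follows. This is a minimal, self-contained illustration rather than the code in matcher.rs: the function names, the treatment of empty inputs, and the whitespace tokenization are assumptions made for the example. Only the 0.6 and 0.4 weights come from the formula above.

```rust
use std::collections::HashSet;

/// Jaccard similarity over token sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &[&str], b: &[&str]) -> f64 {
    let set_a: HashSet<&str> = a.iter().copied().collect();
    let set_b: HashSet<&str> = b.iter().copied().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 1.0; // assumption: two empty token sets count as identical
    }
    let intersection = set_a.intersection(&set_b).count();
    intersection as f64 / union as f64
}

/// Edit distance over chars, using the two-row dynamic-programming table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1).min(curr[j - 1] + 1).min(prev[j - 1] + cost);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Edit distance rescaled to [0, 1] by the longer string's length.
fn levenshtein_similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0; // assumption: two empty strings count as identical
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

/// Weighted combination; inputs are expected to be normalized already.
fn combined_score(norm_a: &str, norm_b: &str) -> f64 {
    let tokens_a: Vec<&str> = norm_a.split_whitespace().collect();
    let tokens_b: Vec<&str> = norm_b.split_whitespace().collect();
    0.6 * jaccard(&tokens_a, &tokens_b) + 0.4 * levenshtein_similarity(norm_a, norm_b)
}
```

Two titles with the same words in a different order get a Jaccard score of 1.0, so their combined score never drops below 0.6, even when the character-level edit distance is large.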

Text Normalization

Before comparison, titles are normalized:

impl TextNormalizer {
    pub fn normalize(&self, text: &str) -> String {
        // 1. Lowercase
        // 2. Replace punctuation with spaces
        // 3. Collapse whitespace
    }

    pub fn tokenize(&self, text: &str) -> Vec<String> {
        // 4. Split into words
        // 5. Filter stop words (a, an, the, will, be, ...)
    }
}

Example: "Will Bitcoin reach $100k?" → ["bitcoin", "reach", "100k"]
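
The normalization steps listed above could be filled in roughly like this. It is a sketch written as free functions rather than methods on TextNormalizer, and the stop-word list contains only the five words named in the comment above; the real list is longer and is not reproduced here.

```rust
// Assumption: only the stop words explicitly named in the post.
const STOP_WORDS: &[&str] = &["a", "an", "the", "will", "be"];

/// Lowercase, replace punctuation with spaces, collapse whitespace.
fn normalize(text: &str) -> String {
    let cleaned: String = text
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() || c.is_whitespace() { c } else { ' ' })
        .collect();
    // split_whitespace drops empty pieces, so joining collapses runs of spaces
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Split the normalized text into words and drop stop words.
fn tokenize(text: &str) -> Vec<String> {
    normalize(text)
        .split_whitespace()
        .filter(|w| !STOP_WORDS.contains(w))
        .map(String::from)
        .collect()
}
```

With these definitions, tokenize("Will Bitcoin reach $100k?") yields the three tokens from the example: the "$" and "?" become spaces, and "will" is dropped as a stop word.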

Pre-Filtering

Before scoring, candidates are filtered to reduce false positives:

Filter               | Default    | Purpose
Expiration tolerance | ±7 days    | Markets must settle around the same time
Outcome count        | Must match | Binary vs multi-outcome
Category match       | Optional   | Same topic area
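
A pre-filter with these defaults might look like the sketch below. The Market struct, the field names on PreFilterConfig, and the use of Unix seconds for expiration are assumptions for illustration; only the three filters and their defaults come from the table.

```rust
/// Hypothetical, trimmed-down market record for this example.
struct Market {
    expires_at: i64, // Unix seconds (assumption)
    outcome_count: usize,
    category: Option<String>,
}

/// Hypothetical field layout; the real PreFilterConfig may differ.
struct PreFilterConfig {
    expiration_tolerance_days: i64,
    require_outcome_match: bool,
    require_category_match: bool,
}

impl Default for PreFilterConfig {
    fn default() -> Self {
        Self {
            expiration_tolerance_days: 7, // ±7 days
            require_outcome_match: true,  // must match
            require_category_match: false, // optional
        }
    }
}

/// Returns true when the pair is worth scoring.
fn passes_pre_filter(cfg: &PreFilterConfig, a: &Market, b: &Market) -> bool {
    const SECS_PER_DAY: i64 = 86_400;
    let gap = (a.expires_at - b.expires_at).abs();
    if gap > cfg.expiration_tolerance_days * SECS_PER_DAY {
        return false;
    }
    if cfg.require_outcome_match && a.outcome_count != b.outcome_count {
        return false;
    }
    if cfg.require_category_match && a.category != b.category {
        return false;
    }
    true
}
```

Running the cheap checks before the string comparison also helps with the scalability concern: pairs that fail on expiration or outcome count never reach the edit-distance computation.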

Semantic Warning Detection (FR-MD-008)

Even similar titles may have different settlement criteria. We detect and flag:

Conditional language mismatches:

Polymarket: "Will Fed announce rate cut?"
Kalshi:     "Will Fed cut rates?"
⚠️ Warning: Settlement trigger mismatch - one market references 'announce'

Resolution source differences:

Polymarket resolution: "Associated Press"
Kalshi resolution:     "Official FEC results"
⚠️ Warning: Resolution source differs

Expiration differences:

⚠️ Warning: Expiration differs by 3 day(s)

These warnings flow to the human reviewer (FR-MD-003) for acknowledgment before approval.
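
The three warning types can be sketched as a single function. The MarketInfo struct, the one-entry trigger-word list, and the case-insensitive source comparison are assumptions for the example; the warning wording follows the samples above.

```rust
// Assumption: only the trigger word shown in the post's example.
const TRIGGER_WORDS: &[&str] = &["announce"];

/// Hypothetical subset of market fields needed for warning detection.
struct MarketInfo<'a> {
    title: &'a str,
    resolution_source: &'a str,
    expires_at: i64, // Unix seconds (assumption)
}

fn detect_warnings(a: &MarketInfo<'_>, b: &MarketInfo<'_>) -> Vec<String> {
    let mut warnings = Vec::new();

    // Conditional language: a trigger word present in exactly one title.
    let (title_a, title_b) = (a.title.to_lowercase(), b.title.to_lowercase());
    for &word in TRIGGER_WORDS {
        if title_a.contains(word) != title_b.contains(word) {
            warnings.push(format!(
                "Settlement trigger mismatch - one market references '{}'",
                word
            ));
        }
    }

    // Resolution source: compared case-insensitively.
    if !a.resolution_source.eq_ignore_ascii_case(b.resolution_source) {
        warnings.push("Resolution source differs".to_string());
    }

    // Expiration: report the gap in whole days, if any.
    let days = (a.expires_at - b.expires_at).abs() / 86_400;
    if days > 0 {
        warnings.push(format!("Expiration differs by {} day(s)", days));
    }

    warnings
}
```

The function only collects messages and never rejects a pair, which keeps the decision with the human reviewer.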

Implementation

SimilarityScorer

pub struct SimilarityScorer {
    jaccard_weight: f64,      // 0.6
    levenshtein_weight: f64,  // 0.4
    threshold: f64,           // 0.6
    normalizer: TextNormalizer,
    pre_filter: PreFilterConfig,
}

impl SimilarityScorer {
    pub fn find_matches(
        &self,
        market: &DiscoveredMarket,
        candidates: &[DiscoveredMarket],
    ) -> Vec<CandidateMatch> {
        candidates.iter()
            .filter(|c| c.platform != market.platform)  // Cross-platform only
            .filter(|c| self.passes_pre_filter(market, c))
            .filter_map(|c| {
                let score = self.score(&market.title, &c.title);
                if score >= self.threshold {
                    let warnings = self.detect_warnings(market, c);
                    Some(CandidateMatch::new(/*...*/).with_warnings(warnings))
                } else {
                    None
                }
            })
            .collect()
    }
}

Match Reason Classification

let match_reason = if score >= 0.95 {
    MatchReason::ExactTitle
} else {
    MatchReason::HighTextSimilarity { score: (score * 100.0) as u32 }
};
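
Wrapped in a function, the classification is easy to exercise at the boundary. The enum definition and the classify name below are assumptions that mirror the snippet above.

```rust
/// Hypothetical definition matching the two variants used above.
#[derive(Debug, PartialEq)]
enum MatchReason {
    ExactTitle,
    HighTextSimilarity { score: u32 },
}

fn classify(score: f64) -> MatchReason {
    if score >= 0.95 {
        MatchReason::ExactTitle
    } else {
        // `as u32` truncates toward zero, so 0.949 is reported as 94, not 95
        MatchReason::HighTextSimilarity { score: (score * 100.0) as u32 }
    }
}
```

Because the cast truncates rather than rounds, the reported percentage is always below 95 for this variant.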

Test Coverage

Phase 2 adds 10 tests (22 total for the discovery module):

Module        | Tests | Focus
normalizer.rs | 3     | Lowercase, punctuation, tokenization
matcher.rs    | 7     | Jaccard, Levenshtein, combined score, filtering, warnings

Key test:

#[test]
fn test_semantic_warning_announcement() {
    let scorer = SimilarityScorer::default();

    let poly = create_market(Platform::Polymarket, "Will Fed announce rate cut?");
    let kalshi = create_market(Platform::Kalshi, "Will Fed cut rates?");

    let warnings = scorer.detect_warnings(&poly, &kalshi);
    assert!(warnings.iter().any(|w| w.contains("announce")));
}

What's Next

Phase 3 will implement the API clients:

  • Polymarket Gamma API client (FR-MD-006)
  • Kalshi /v2/markets API client (FR-MD-007)
  • Rate limiting and pagination

Council Review

Phase 2 passed council verification with confidence 0.85. Key findings:

  • No unsafe code
  • Human-in-the-loop preserved (find_matches returns candidates, not verified mappings)
  • Semantic warnings properly flag settlement differences
  • All tests passing (22 total)

Implementation: arbiter-engine/src/discovery/{normalizer,matcher}.rs | Issue: #43 | ADR: 017