Market Discovery Phase 2: Text Matching Engine

This post covers Phase 2 of ADR-017: the text similarity matching engine that powers automated market discovery between Polymarket and Kalshi.

The Problem

Phase 1 established the data types and storage layer. Now we need to find matching markets across platforms. The challenges:

Fuzzy matching - Market titles differ in phrasing ("Will Trump win?" vs "Trump wins 2024?")
False positives - Similar titles may have different settlement criteria
Scalability - Must compare thousands of markets efficiently

Algorithm Design

Combined Similarity Scoring

We use a weighted combination of two complementary algorithms:

score = 0.6 × Jaccard + 0.4 × Levenshtein

Jaccard similarity (0.6 weight) measures token set overlap:

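let intersection = set_a.intersection(&set_b).count();
let union = set_a.union(&set_b).count();
// Cast to f64 so integer division does not truncate the ratio; two empty sets score 0.
let jaccard = if union == 0 { 0.0 } else { intersection as f64 / union as f64 };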

This captures shared vocabulary even when words are reordered, because sets ignore order. It measures lexical overlap rather than meaning.

Levenshtein similarity (0.4 weight) measures edit distance:

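let distance = levenshtein(&norm_a, &norm_b);
// Cast to f64 before dividing; max_length is the length of the longer normalized title.
let levenshtein_sim = 1.0 - (distance as f64 / max_length as f64);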

This catches typos and minor variations.
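To make the weighting concrete, here is a minimal, dependency-free sketch of the combined score. It is not the code in `matcher.rs`: the function names (`jaccard`, `levenshtein_similarity`, `combined_score`) and the hand-rolled edit distance are assumptions for illustration, and the real engine filters stop words before building the token sets.

```rust
use std::collections::HashSet;

/// Token-set overlap: |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty.
fn jaccard(a: &[&str], b: &[&str]) -> f64 {
    let set_a: HashSet<&str> = a.iter().copied().collect();
    let set_b: HashSet<&str> = b.iter().copied().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 0.0;
    }
    set_a.intersection(&set_b).count() as f64 / union as f64
}

/// Classic dynamic-programming edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, deletion, insertion
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Edit distance rescaled to [0, 1]; two empty strings count as identical.
fn levenshtein_similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

/// Weighted blend from this post: 0.6 × Jaccard + 0.4 × Levenshtein.
/// Assumes both inputs are already normalized (lowercase, no punctuation).
fn combined_score(norm_a: &str, norm_b: &str) -> f64 {
    let tokens_a: Vec<&str> = norm_a.split_whitespace().collect();
    let tokens_b: Vec<&str> = norm_b.split_whitespace().collect();
    0.6 * jaccard(&tokens_a, &tokens_b) + 0.4 * levenshtein_similarity(norm_a, norm_b)
}
```

With these weights, two titles that share every token but in a different order always score at least 0.6, since the Jaccard term alone contributes its full weight. That is why the two measures complement each other: Jaccard tolerates reordering, while Levenshtein penalizes it but forgives small spelling differences.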

Text Normalization

Before comparison, titles are normalized:

impl TextNormalizer {
    pub fn normalize(&self, text: &str) -> String {
        // 1. Lowercase
        // 2. Replace punctuation with spaces
        // 3. Collapse whitespace
    }

    pub fn tokenize(&self, text: &str) -> Vec<String> {
        // 4. Split into words
        // 5. Filter stop words (a, an, the, will, be, ...)
    }
}

Example: "Will Bitcoin reach $100k?" → ["bitcoin", "reach", "100k"]
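The five steps above could be filled in roughly as follows. This is a sketch under assumptions, not the contents of `normalizer.rs`: it uses free functions rather than `TextNormalizer` methods, and the `STOP_WORDS` list contains only the words named in the comment above, since the full list is not shown in this post.

```rust
/// Hypothetical stop-word list; the post only names "a, an, the, will, be, ...".
const STOP_WORDS: &[&str] = &["a", "an", "the", "will", "be"];

/// Steps 1-3: lowercase, turn every non-alphanumeric character into a space,
/// then collapse runs of whitespace into single spaces.
fn normalize(text: &str) -> String {
    let cleaned: String = text
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Steps 4-5: split the normalized text into words and drop stop words.
fn tokenize(text: &str) -> Vec<String> {
    normalize(text)
        .split_whitespace()
        .filter(|w| !STOP_WORDS.contains(w))
        .map(str::to_string)
        .collect()
}
```

Calling `tokenize("Will Bitcoin reach $100k?")` on this sketch yields `["bitcoin", "reach", "100k"]`, matching the example above. One consequence of replacing all punctuation with spaces is that a value such as "2.5%" splits into two tokens, "2" and "5", which may or may not be desirable for numeric thresholds.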

Pre-Filtering

Before scoring, candidates are filtered to reduce false positives:

Filter	Default	Purpose
Expiration tolerance	±7 days	Markets must settle around same time
Outcome count	Must match	Binary vs multi-outcome
Category match	Optional	Same topic area
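As an illustration of how the three filters in the table might combine, here is a small sketch. The `Market` struct, its field names, the Unix-timestamp representation of expiry, and the boolean flags on `PreFilterConfig` are all assumptions made for this example; the real `DiscoveredMarket` and `PreFilterConfig` types are not shown in this post.

```rust
/// Hypothetical, trimmed-down market record; field names are assumptions.
struct Market {
    expires_at_unix: i64,     // settlement time, seconds since the Unix epoch
    outcome_count: usize,     // 2 for a binary market
    category: Option<String>, // topic area, if the platform provides one
}

/// Hypothetical pre-filter settings mirroring the defaults in the table.
struct PreFilterConfig {
    expiration_tolerance_days: i64,
    require_outcome_match: bool,
    require_category_match: bool,
}

impl Default for PreFilterConfig {
    fn default() -> Self {
        Self {
            expiration_tolerance_days: 7,
            require_outcome_match: true,
            require_category_match: false, // listed as optional in the table
        }
    }
}

const SECONDS_PER_DAY: i64 = 86_400;

/// Returns true only if the pair survives every enabled filter.
fn passes_pre_filter(cfg: &PreFilterConfig, a: &Market, b: &Market) -> bool {
    let gap = (a.expires_at_unix - b.expires_at_unix).abs();
    if gap > cfg.expiration_tolerance_days * SECONDS_PER_DAY {
        return false;
    }
    if cfg.require_outcome_match && a.outcome_count != b.outcome_count {
        return false;
    }
    if cfg.require_category_match && a.category != b.category {
        return false;
    }
    true
}
```

Because these checks are cheap integer and equality comparisons, running them before the string scoring also trims the number of pairs that reach the more expensive Levenshtein computation.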

Semantic Warning Detection (FR-MD-008)

Even similar titles may have different settlement criteria. We detect and flag:

Conditional language mismatches:

Polymarket: "Will Fed announce rate cut?"
Kalshi:     "Will Fed cut rates?"
⚠️ Warning: Settlement trigger mismatch - one market references 'announce'

Resolution source differences:

Polymarket resolution: "Associated Press"
Kalshi resolution:     "Official FEC results"
⚠️ Warning: Resolution source differs

Expiration differences:

⚠️ Warning: Expiration differs by 3 day(s)

These warnings flow to the human reviewer (FR-MD-003) for acknowledgment before approval.
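The three checks can be sketched as a single function that collects warning strings. Everything here is an assumption for illustration: the `MarketInfo` struct, the `TRIGGER_WORDS` list (only "announce" appears in this post), and the use of plain substring matching. The actual `detect_warnings` in `matcher.rs` may work differently.

```rust
/// Hypothetical view of the fields the warning checks need.
struct MarketInfo {
    title: String,
    resolution_source: Option<String>,
    expires_at_unix: i64, // seconds since the Unix epoch
}

/// Hypothetical words that change what event triggers settlement.
const TRIGGER_WORDS: &[&str] = &["announce", "confirm", "propose"];

const SECONDS_PER_DAY: i64 = 86_400;

fn detect_warnings(a: &MarketInfo, b: &MarketInfo) -> Vec<String> {
    let mut warnings = Vec::new();

    // Conditional language: flag a trigger word present in exactly one title.
    let title_a = a.title.to_lowercase();
    let title_b = b.title.to_lowercase();
    for word in TRIGGER_WORDS.iter().copied() {
        if title_a.contains(word) != title_b.contains(word) {
            warnings.push(format!(
                "Settlement trigger mismatch - one market references '{}'",
                word
            ));
        }
    }

    // Resolution source: compare only when both platforms report one.
    if let (Some(src_a), Some(src_b)) = (&a.resolution_source, &b.resolution_source) {
        if !src_a.eq_ignore_ascii_case(src_b) {
            warnings.push("Resolution source differs".to_string());
        }
    }

    // Expiration: report the gap in whole days, ignoring sub-day differences.
    let gap_days = (a.expires_at_unix - b.expires_at_unix).abs() / SECONDS_PER_DAY;
    if gap_days > 0 {
        warnings.push(format!("Expiration differs by {} day(s)", gap_days));
    }

    warnings
}
```

Substring matching keeps the sketch short but is deliberately loose: "announce" also matches "announcement" and "announced". Erring toward extra warnings seems a reasonable trade-off here, since a human reviewer sees every flag before approval.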

Implementation

SimilarityScorer

pub struct SimilarityScorer {
    jaccard_weight: f64,      // 0.6
    levenshtein_weight: f64,  // 0.4
    threshold: f64,           // 0.6
    normalizer: TextNormalizer,
    pre_filter: PreFilterConfig,
}

impl SimilarityScorer {
    pub fn find_matches(
        &self,
        market: &DiscoveredMarket,
        candidates: &[DiscoveredMarket],
    ) -> Vec<CandidateMatch> {
        candidates.iter()
            .filter(|c| c.platform != market.platform)  // Cross-platform only
            .filter(|c| self.passes_pre_filter(market, c))
            .filter_map(|c| {
                let score = self.score(&market.title, &c.title);
                if score >= self.threshold {
                    let warnings = self.detect_warnings(market, c);
                    Some(CandidateMatch::new(/*...*/).with_warnings(warnings))
                } else {
                    None
                }
            })
            .collect()
    }
}

Match Reason Classification

let match_reason = if score >= 0.95 {
    MatchReason::ExactTitle
} else {
    MatchReason::HighTextSimilarity { score: (score * 100.0) as u32 }
};

Test Coverage

Phase 2 adds 10 tests, bringing the discovery module to 22 in total:

Module	Tests	Focus
`normalizer.rs`	3	Lowercase, punctuation, tokenization
`matcher.rs`	7	Jaccard, Levenshtein, combined score, filtering, warnings

Key test:

#[test]
fn test_semantic_warning_announcement() {
    let scorer = SimilarityScorer::default();

    let poly = create_market(Platform::Polymarket, "Will Fed announce rate cut?");
    let kalshi = create_market(Platform::Kalshi, "Will Fed cut rates?");

    let warnings = scorer.detect_warnings(&poly, &kalshi);
    assert!(warnings.iter().any(|w| w.contains("announce")));
}

What's Next

Phase 3 will implement the API clients:

Polymarket Gamma API client (FR-MD-006)
Kalshi /v2/markets API client (FR-MD-007)
Rate limiting and pagination

Council Review

Phase 2 passed council verification with confidence 0.85. Key findings:

No unsafe code
Human-in-the-loop preserved (find_matches returns candidates, not verified mappings)
Semantic warnings properly flag settlement differences
All tests passing (22 total)

Implementation: arbiter-engine/src/discovery/{normalizer,matcher}.rs | Issue: #43 | ADR: 017