Cross-Platform Prediction Market Matching: A Technical Analysis¶
Version: 1.0 | Date: 2026-01-23 | Authors: Arbiter-Bot Engineering Team | Status: Draft for Council Review
Executive Summary¶
This white paper analyzes approaches for automatically matching equivalent prediction markets across Polymarket and Kalshi. We evaluate five solution approaches based on accuracy, cost, complexity, and production readiness. Our analysis, informed by industry research and empirical testing, recommends a fingerprint-based matching pipeline as the optimal approach for production deployment.
Key findings:
- Pure text similarity (Jaccard/Levenshtein) achieves only 8-9% similarity on semantically equivalent markets
- Industry tools universally use entity extraction, manual curation, or hybrid approaches
- A three-stage pipeline (candidate generation → fingerprint matching → human verification) balances accuracy, explainability, and safety
Table of Contents¶
- Problem Statement
- Industry Landscape
- Solution Options Analysis
- Recommended Architecture
- Implementation Considerations
- Evaluation Framework
- Phase 3: Embedding-Based Semantic Matching
- Phase 4: LLM-Based Verification
- Phase 5: Reinforcement Learning from Human Feedback
- Operational Excellence
- Risk Analysis
- Conclusion
- References
1. Problem Statement¶
1.1 The Matching Challenge¶
Prediction markets on Polymarket and Kalshi frequently cover the same real-world events but use different:
- Market titles: "Will Trump buy Greenland?" vs "Will the US acquire part of Greenland in 2026?"
- Resolution criteria: "OPM shutdown announcement" vs "actual shutdown exceeding 24 hours"
- Outcome structures: Binary (Yes/No) vs multi-outcome ranges
- Identifiers: No shared ID system across platforms
1.2 Why Matching Matters¶
| Use Case | Impact |
|---|---|
| Arbitrage detection | Price discrepancies between equivalent markets create profit opportunities |
| Portfolio hedging | Cross-platform positions require matched market identification |
| Market analysis | Aggregated data across platforms improves price discovery research |
| Liquidity routing | Smart order routing requires knowing equivalent markets |
1.3 Empirical Evidence: Text Similarity Fails¶
We tested text similarity (Jaccard + Levenshtein) on known market pairs:
| Kalshi Title | Polymarket Title | Jaccard | Combined Score |
|---|---|---|---|
| "Will Trump buy Greenland?" | "Will the US acquire part of Greenland in 2026?" | 8.3% | 22.1% |
| "Will Washington win the 2026 Pro Football Championship?" | "Super Bowl Champion 2026" | 9.1% | 18.5% |
| "Fed rate cut before June 2026?" | "FOMC to lower rates in Q2 2026?" | 12.4% | 24.8% |
Conclusion: Text similarity algorithms cannot reliably identify semantically equivalent markets. A 60% threshold would miss all valid matches; a 10% threshold would generate thousands of false positives.
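For reference, the token-level Jaccard score behind the table above can be reproduced with a minimal sketch (assuming naive whitespace tokenization with no punctuation stripping, which is one plausible implementation consistent with the 8.3% figure):

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Only "will" survives as a shared token; "greenland?" != "greenland"
# under naive tokenization, so the score collapses to 1/12 ≈ 0.083.
score = jaccard("Will Trump buy Greenland?",
                "Will the US acquire part of Greenland in 2026?")
```

Note how a single stray question mark is enough to lose the one content word the titles share, which is exactly the stop-word-dilution failure mode described below.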
2. Industry Landscape¶
2.1 Commercial Solutions¶
| Tool | Approach | Matching Method | Limitations |
|---|---|---|---|
| Matchr | Curated aggregator | Human-curated database of 1,500+ matched markets | Not programmable, no API |
| Dome API | Unified API | Manual market mapping by Dome team | Subscription cost, external dependency |
| EventArb | Arbitrage calculator | Manual market selection by user | No automated discovery |
| Verso | Terminal UI | Internal normalization layer | Closed source, no matching API |
2.2 Open Source Solutions¶
| Tool | Approach | Matching Method | Limitations |
|---|---|---|---|
| pmxt | Unified library | Slug-based configuration | Manual matching required |
| Polymarket-Kalshi-Arbitrage-Bot | Arbitrage bot | Entity extraction + text similarity | Limited documentation |
| Various GitHub bots | Custom implementations | Heuristic string/date matching | Brittle, unmaintained |
2.3 Key Industry Insight¶
No production tool relies solely on text similarity. All successful implementations use one or more of:
- Manual curation: Human-verified match databases (Matchr, Dome)
- Slug/ID configuration: User specifies which markets to compare (pmxt)
- Entity extraction: Extract structured fields and match on semantics
- Hybrid approaches: Combine multiple signals with human verification
3. Solution Options Analysis¶
3.1 Option A: Pure Text Similarity¶
Approach: Compute string similarity (Jaccard, Levenshtein, cosine) between market titles.
| Criterion | Assessment |
|---|---|
| Accuracy | Low (8-9% on real pairs) |
| Cost | No per-match API costs |
| Latency | < 1ms |
| Complexity | Low |
| Explainability | High |
| Production Ready | No |
Why it fails:
- Synonym blindness: "Super Bowl" ≠ "Pro Football Championship"
- Paraphrase blindness: "buy" ≠ "acquire"
- Stop word dilution: Signal words overwhelmed by common words
Verdict: Insufficient for production use.
3.2 Option B: Fingerprint-Based Matching¶
Approach: Extract structured "fingerprints" from markets and match on canonical fields.
struct MarketFingerprint {
entity: String, // "Trump", "Bitcoin", "Fed"
secondary_entities: Vec<String>,
event_type: EventType, // Election, PriceTarget, Economic
metric: Option<MetricSpec>, // "price >= $100,000"
resolution_date: Option<Date>,
resolution_source: Option<String>,
outcome_type: OutcomeType, // Binary, Multi, Range
}
| Criterion | Assessment |
|---|---|
| Accuracy | High (matches on semantics) |
| Cost | No per-match API costs (local on cached data) |
| Latency | ~10ms (entity extraction) |
| Complexity | Medium |
| Explainability | High (field-by-field) |
| Production Ready | Yes |
Algorithm:
1. Extract fingerprint from each market title + rules
2. Generate candidates by keyword/date overlap (fast)
3. Score fingerprint similarity with weighted fields
4. Create candidate for human review if score ≥ 0.70
Field weights (empirically tuned):
- Entity match: 30%
- Date match: 25%
- Threshold match: 20%
- Outcome structure: 15%
- Resolution source: 10%
Verdict: Recommended for production.
3.3 Option C: Embedding-Based Semantic Matching¶
Approach: Generate dense vector embeddings of market titles and compute cosine similarity.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
emb_kalshi = model.encode("Will Trump buy Greenland?")
emb_poly = model.encode("Will the US acquire part of Greenland in 2026?")
similarity = util.cos_sim(emb_kalshi, emb_poly).item()
# Expected: ~0.75-0.85 (much better than Jaccard)
| Criterion | Assessment |
|---|---|
| Accuracy | Highest (captures semantic meaning) |
| Cost | ~$0.0001/embedding (API) or free (local) |
| Latency | 50-200ms per embedding |
| Complexity | High (embedding service, vector DB) |
| Explainability | Low (black box similarity) |
| Production Ready | Yes, but overkill for MVP |
Advantages:
- Captures semantic equivalence automatically
- No manual entity pattern maintenance
- Works on novel market types
Disadvantages:
- Black box: hard to explain why two markets match
- Requires embedding infrastructure
- May match semantically similar but not identical markets
Verdict: Recommended for Phase 3 enhancement (after fingerprint foundation in Phase 2).
3.4 Option D: LLM-Based Verification¶
Approach: Use an LLM to verify whether two markets are equivalent.
Prompt: Are these two markets about the same event?
Market A: "Will Trump buy Greenland?"
Market B: "Will the US acquire part of Greenland in 2026?"
Consider:
1. Are they about the same underlying event?
2. Do they have compatible resolution criteria?
3. Would a "Yes" on one correspond to "Yes" on the other?
Output: { "match": true, "confidence": 0.92, "warnings": ["Different date scopes"] }
| Criterion | Assessment |
|---|---|
| Accuracy | Highest (human-level reasoning) |
| Cost | ~$0.01-0.05 per verification |
| Latency | 200-500ms per call |
| Complexity | Low (API call) |
| Explainability | High (LLM provides reasoning) |
| Production Ready | Yes, for high-value verification |
Use cases:
- Final verification before approving high-value matches
- Edge cases where fingerprint matching is uncertain
- Resolution criteria comparison
Verdict: Recommended for high-confidence final verification.
3.5 Option E: External Service Integration¶
Approach: Use existing matching services (Matchr, Dome) as data sources.
| Service | Integration Method | Data Quality | Dependency Risk |
|---|---|---|---|
| Matchr | Scrape or unofficial API | High (curated) | Medium (no official API) |
| Dome | Official SDK | High | High (paid, availability) |
| pmxt | NPM library | Medium | Low (open source) |
| Criterion | Assessment |
|---|---|
| Accuracy | High (curated by experts) |
| Cost | $0-$500/month depending on service |
| Latency | 100-500ms per query |
| Complexity | Low (API integration) |
| Explainability | Medium (external black box) |
| Production Ready | Yes, as validation source |
Advantages:
- Immediate access to curated match database
- No matching logic maintenance
- Validation source for our own matching
Disadvantages:
- External dependency (availability, pricing changes)
- May not cover all markets we care about
- No customization of matching logic
Verdict: Recommended as validation/fallback source.
4. Recommended Architecture¶
4.1 Three-Stage Pipeline¶
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: Discovery │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Polymarket │ │ Kalshi │ │
│ │ Gamma API │ │ /v2/markets │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Market Enumeration │ │
│ │ - Pagination │ │
│ │ - mve_filter=exclude │ │
│ │ - Category filtering │ │
│ └───────────┬───────────┘ │
└─────────────────────┼───────────────────────────────────────┘
│
┌─────────────────────┼───────────────────────────────────────┐
│ ▼ Stage 2: Matching │
│ ┌───────────────────────┐ │
│ │ Fingerprint Extractor │ │
│ │ - Entity NER │ │
│ │ - Date parsing │ │
│ │ - Threshold parsing │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Candidate Generation │ │
│ │ - Keyword index │ │
│ │ - Date proximity │ │
│ │ - Top N candidates │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Fingerprint Matching │ │
│ │ - Field-by-field │ │
│ │ - Weighted scoring │ │
│ │ - Threshold ≥ 0.70 │ │
│ └───────────┬───────────┘ │
└─────────────────────┼───────────────────────────────────────┘
│
┌─────────────────────┼───────────────────────────────────────┐
│ ▼ Stage 3: Verification │
│ ┌───────────────────────┐ │
│ │ Semantic Warnings │ │
│ │ - Resolution diff │ │
│ │ - Date diff │ │
│ │ - Source diff │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Human Review CLI │ │
│ │ - Fingerprint diff │ │
│ │ - Warning ack │ │
│ │ - Approve/Reject │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Verified Mapping │ │
│ │ - MappingManager │ │
│ │ - Audit log │ │
│ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
4.2 Fingerprint Schema¶
/// Canonical market fingerprint for cross-platform matching
pub struct MarketFingerprint {
/// Primary entity (person, asset, institution)
pub entity: Entity,
/// Secondary entities (locations, counterparties)
pub secondary_entities: Vec<Entity>,
/// Event classification
pub event_type: EventType,
/// Numeric metric and threshold
pub metric: Option<MetricSpec>,
/// Geographic or jurisdictional scope
pub scope: Option<Scope>,
/// Resolution timing
pub resolution: ResolutionSpec,
/// Outcome structure
pub outcomes: OutcomeSpec,
/// Original market data (for reference)
pub source: SourceData,
}
pub struct Entity {
pub name: String,
pub entity_type: EntityType,
pub aliases: Vec<String>,
}
pub enum EntityType {
Person, // Trump, Biden, Musk
Asset, // Bitcoin, ETH, Gold
Institution, // Fed, FOMC, ECB
Team, // Chiefs, Eagles, Lakers
Location, // Greenland, Ukraine, Taiwan
Event, // Super Bowl, World Series
}
pub enum EventType {
Election,
PriceTarget,
EconomicIndicator,
SportOutcome,
Acquisition,
PolicyDecision,
WeatherEvent,
Other(String),
}
pub struct MetricSpec {
pub name: String,
pub direction: Direction,
pub threshold: f64,
pub unit: Option<String>,
}
pub enum Direction {
Above, // >= threshold
Below, // <= threshold
Between, // within range
Exactly, // == threshold
}
pub struct ResolutionSpec {
pub date: Option<NaiveDate>,
pub time: Option<NaiveTime>,
pub timezone: Option<Tz>,
pub source: Option<String>,
pub criteria: Option<String>,
}
pub struct OutcomeSpec {
pub outcome_type: OutcomeType,
pub outcomes: Vec<String>,
}
pub enum OutcomeType {
Binary, // Yes/No
MultiOutcome, // Multiple options
Range, // Numeric ranges
}
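To make the schema concrete, here is a hypothetical fingerprint (Python sketch mirroring the Rust types; field values are illustrative, not taken from real market data) for the BTC-$100k pair from the golden set. Because both titles reduce to the same canonical fields, dataclass equality already detects the match:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class MetricSpec:
    name: str
    direction: str      # "above" | "below" | "between" | "exactly"
    threshold: float
    unit: Optional[str] = None

@dataclass
class MarketFingerprint:
    entity: str
    event_type: str
    metric: Optional[MetricSpec] = None
    resolution_date: Optional[date] = None
    outcome_type: str = "binary"

# Hypothetical extraction results for the two platforms' titles:
kalshi = MarketFingerprint("Bitcoin", "price_target",
                           MetricSpec("price", "above", 100_000.0, "USD"),
                           date(2026, 12, 31))
poly = MarketFingerprint("Bitcoin", "price_target",
                         MetricSpec("price", "above", 100_000.0, "USD"),
                         date(2026, 12, 31))
assert kalshi == poly  # exact field equality is the trivial best case
```

In practice fields rarely align exactly, which is why the matcher in §4.3 scores each field separately instead of requiring equality.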
4.3 Matching Algorithm¶
impl FingerprintMatcher {
pub fn match_score(&self, fp1: &MarketFingerprint, fp2: &MarketFingerprint) -> MatchResult {
let mut score = 0.0;
let mut field_scores = HashMap::new();
// Primary entity match (30%)
let entity_score = self.entity_similarity(&fp1.entity, &fp2.entity);
field_scores.insert("entity", entity_score);
score += 0.30 * entity_score;
// Resolution date match (25%)
let date_score = self.date_similarity(&fp1.resolution, &fp2.resolution);
field_scores.insert("date", date_score);
score += 0.25 * date_score;
// Metric/threshold match (20%)
let metric_score = self.metric_similarity(&fp1.metric, &fp2.metric);
field_scores.insert("metric", metric_score);
score += 0.20 * metric_score;
// Outcome structure match (15%)
let outcome_score = self.outcome_similarity(&fp1.outcomes, &fp2.outcomes);
field_scores.insert("outcome", outcome_score);
score += 0.15 * outcome_score;
// Resolution source match (10%)
let source_score = self.source_similarity(&fp1.resolution, &fp2.resolution);
field_scores.insert("source", source_score);
score += 0.10 * source_score;
// Generate warnings
let warnings = self.generate_warnings(fp1, fp2);
MatchResult {
score,
field_scores,
warnings,
is_candidate: score >= self.threshold,
}
}
fn entity_similarity(&self, e1: &Entity, e2: &Entity) -> f64 {
// Exact match
if e1.name.to_lowercase() == e2.name.to_lowercase() {
return 1.0;
}
// Alias match
for alias in &e1.aliases {
if alias.to_lowercase() == e2.name.to_lowercase() {
return 0.95;
}
}
for alias in &e2.aliases {
if alias.to_lowercase() == e1.name.to_lowercase() {
return 0.95;
}
}
// Same type, different entity
if e1.entity_type == e2.entity_type {
// Could add fuzzy string matching here
return 0.0;
}
0.0
}
fn date_similarity(&self, r1: &ResolutionSpec, r2: &ResolutionSpec) -> f64 {
match (&r1.date, &r2.date) {
(Some(d1), Some(d2)) => {
let diff = (*d1 - *d2).num_days().abs();
match diff {
0 => 1.0,
1..=7 => 0.8,
8..=14 => 0.6,
15..=30 => 0.4,
_ => 0.0,
}
}
(None, None) => 0.5, // Both unspecified
_ => 0.2, // One specified, one not
}
}
}
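As a worked numeric example of the weighting above (the per-field similarities here are hypothetical, but the weights are the ones used by `FingerprintMatcher`):

```python
WEIGHTS = {"entity": 0.30, "date": 0.25, "metric": 0.20,
           "outcome": 0.15, "source": 0.10}

def match_score(field_scores: dict) -> float:
    """Weighted sum of per-field similarities, as in FingerprintMatcher."""
    return sum(WEIGHTS[f] * s for f, s in field_scores.items())

# Same entity, dates 3 days apart (0.8), neither side has a metric (0.5),
# same outcome structure, one resolution source unspecified (0.5):
score = match_score({"entity": 1.0, "date": 0.8, "metric": 0.5,
                     "outcome": 1.0, "source": 0.5})
# 0.30 + 0.20 + 0.10 + 0.15 + 0.05 = 0.80 → clears the 0.70 threshold
```

Note that the weights sum to 1.0, so a perfect pair scores exactly 1.0 and the 0.70 threshold is directly interpretable as a fraction of the maximum.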
5. Implementation Considerations¶
5.1 Entity Extraction Strategy¶
Phase 1: Rule-Based NER
const ENTITY_PATTERNS: &[(&str, EntityType)] = &[
// NOTE: EconomicIndicator, PriceTarget, and Date below assume EntityType is
// extended beyond the variants listed in §4.2.
// Persons
(r"(?i)\b(Trump|Biden|Harris|Obama|Musk|Zuckerberg)\b", EntityType::Person),
// Assets
(r"(?i)\b(Bitcoin|BTC|Ethereum|ETH|Gold|S&P|SPX)\b", EntityType::Asset),
// Institutions
(r"(?i)\b(Fed|FOMC|ECB|BoE|SEC|FTC)\b", EntityType::Institution),
// Economic indicators
(r"(?i)\b(CPI|GDP|NFP|unemployment|inflation)\b", EntityType::EconomicIndicator),
// Sports
(r"(?i)\b(Super Bowl|World Series|NBA Finals|Stanley Cup)\b", EntityType::Event),
// Price targets
(r"\$[\d,]+(?:\.\d+)?(?:k|K|M|B)?", EntityType::PriceTarget),
// Dates/years
(r"(?i)\b(20\d{2}|Q[1-4]|January|February|...)\b", EntityType::Date),
];
Phase 2: ML-Based NER (if rule-based insufficient)
# Using spaCy with custom training
import spacy
nlp = spacy.load("en_core_web_lg")
# Add custom patterns for prediction market entities
ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [
{"label": "CRYPTO", "pattern": [{"LOWER": {"IN": ["bitcoin", "btc", "ethereum", "eth"]}}]},
{"label": "INDICATOR", "pattern": [{"LOWER": {"IN": ["cpi", "gdp", "nfp", "fomc"]}}]},
]
ruler.add_patterns(patterns)
5.2 Synonym and Alias Handling¶
Build an entity alias database:
lazy_static! {
static ref ENTITY_ALIASES: HashMap<&'static str, Vec<&'static str>> = {
let mut m = HashMap::new();
m.insert("Bitcoin", vec!["BTC", "bitcoin", "₿"]);
m.insert("Ethereum", vec!["ETH", "ethereum", "Ether"]);
m.insert("Super Bowl", vec!["Pro Football Championship", "NFL Championship"]);
m.insert("Trump", vec!["Donald Trump", "President Trump", "DJT"]);
m.insert("Fed", vec!["Federal Reserve", "FOMC", "Jerome Powell"]);
m
};
}
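At match time the alias table is easiest to consume inverted, mapping every alias (and each canonical name itself) to its canonical form. A minimal sketch, using a subset of the table above:

```python
ENTITY_ALIASES = {
    "Bitcoin": ["BTC", "bitcoin", "₿"],
    "Super Bowl": ["Pro Football Championship", "NFL Championship"],
    "Fed": ["Federal Reserve", "FOMC", "Jerome Powell"],
}

# Invert: lowercased alias -> canonical name
CANONICAL = {alias.lower(): canon
             for canon, aliases in ENTITY_ALIASES.items()
             for alias in aliases + [canon]}

def canonicalize(name: str) -> str:
    """Map an extracted entity to its canonical form (identity if unknown)."""
    return CANONICAL.get(name.lower(), name)
```

With this in place, `entity_similarity` can compare `canonicalize(e1.name)` against `canonicalize(e2.name)` instead of walking alias lists per comparison. One caveat inherited from the table: collapsing "FOMC" and "Jerome Powell" into "Fed" trades precision for recall, which the human-review stage is there to backstop.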
5.3 Date Parsing¶
Handle various date formats:
fn parse_resolution_date(text: &str) -> Option<NaiveDate> {
let patterns = [
// ISO format
r"(\d{4}-\d{2}-\d{2})",
// US format
r"(January|February|...) (\d{1,2}),? (\d{4})",
// Quarter
r"Q([1-4]) (\d{4})",
// End of year
r"end of (\d{4})",
// Before date
r"before (January|February|...) (\d{1,2}),? (\d{4})",
];
for pattern in patterns {
// Illustrative only: in production, compile each pattern once (e.g., via
// once_cell) rather than calling Regex::new inside the loop.
if let Some(caps) = Regex::new(pattern).unwrap().captures(text) {
return parse_captures(&caps);
}
}
None
}
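A runnable sketch of the same idea in Python, covering the ISO and quarter formats only (resolving a quarter to its last day is our assumption here, not a platform rule):

```python
import re
from datetime import date
from typing import Optional

# Last (month, day) of each calendar quarter
QUARTER_END = {1: (3, 31), 2: (6, 30), 3: (9, 30), 4: (12, 31)}

def parse_resolution_date(text: str) -> Optional[date]:
    # ISO format: "2026-06-30"
    m = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
    if m:
        return date(int(m[1]), int(m[2]), int(m[3]))
    # Quarter: "Q2 2026" -> last day of that quarter
    m = re.search(r"\bQ([1-4])\s+(\d{4})\b", text)
    if m:
        month, day = QUARTER_END[int(m[1])]
        return date(int(m[2]), month, day)
    return None

parse_resolution_date("FOMC to lower rates in Q2 2026?")  # → 2026-06-30
```

The quarter convention matters for `date_similarity`: mapping "Q2 2026" to 2026-06-30 keeps it within the 0-day bucket of a market that resolves "by end of June 2026".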
5.4 Performance Optimization¶
Candidate Generation with Inverted Index
pub struct MarketIndex {
// Inverted index: keyword -> market IDs
keyword_index: HashMap<String, Vec<MarketId>>,
// Date index: date -> market IDs
date_index: BTreeMap<NaiveDate, Vec<MarketId>>,
// Entity index: entity -> market IDs
entity_index: HashMap<String, Vec<MarketId>>,
}
impl MarketIndex {
pub fn find_candidates(&self, market: &DiscoveredMarket, limit: usize) -> Vec<MarketId> {
let mut scores: HashMap<MarketId, f32> = HashMap::new();
// Keyword overlap
for keyword in extract_keywords(&market.title) {
if let Some(ids) = self.keyword_index.get(&keyword) {
for id in ids {
*scores.entry(*id).or_default() += 1.0;
}
}
}
// Date proximity boost
if let Some(date) = market.resolution_date {
// NaiveDate arithmetic requires chrono::Duration; `date - 14` does not compile
for (d, ids) in self.date_index.range(date - Duration::days(14)..=date + Duration::days(14)) {
let proximity = 1.0 - ((*d - date).num_days().abs() as f32 / 14.0);
for id in ids {
*scores.entry(*id).or_default() += proximity;
}
}
}
// Return top N by score
let mut candidates: Vec<_> = scores.into_iter().collect();
candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
candidates.into_iter().take(limit).map(|(id, _)| id).collect()
}
}
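The keyword-overlap stage of the index can be prototyped in a few lines (hypothetical titles; the date-proximity boost and stop-word filtering are omitted for brevity):

```python
from collections import Counter, defaultdict

titles = {
    "K1": "Will Trump buy Greenland?",
    "K2": "Fed rate cut before June 2026?",
    "P1": "Will the US acquire part of Greenland in 2026?",
}

def keywords(title):
    """Crude tokenizer; production also drops stop words."""
    return [t.strip("?,.").lower() for t in title.split()]

# Inverted index: keyword -> set of market IDs
index = defaultdict(set)
for mid, title in titles.items():
    for kw in keywords(title):
        index[kw].add(mid)

def candidates(title, limit=2):
    """Rank markets by raw keyword-overlap count."""
    scores = Counter()
    for kw in keywords(title):
        for mid in index[kw]:
            scores[mid] += 1
    return [m for m, _ in scores.most_common(limit)]
```

For the query "Greenland acquisition in 2026", `P1` ranks first (three shared tokens), illustrating why candidate generation can be deliberately loose: the fingerprint scorer, not the index, makes the final call.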
6. Evaluation Framework¶
6.1 Test Data Set¶
Create a "golden set" of known market pairs for validation:
| ID | Kalshi Market | Polymarket Market | Expected Match |
|---|---|---|---|
| 1 | KXGREENLAND-29 | greenland-2026 | Yes |
| 2 | KXSB-26-KC | super-bowl-2026-chiefs | Yes |
| 3 | KXBTC-100K | btc-100k-2026 | Yes |
| 4 | KXFOMC-JAN26 | fed-rate-cut-jan-2026 | Yes |
| 5 | KXGREENLAND-29 | trump-second-term | No (different event) |
| 6 | KXSB-26-KC | nba-finals-2026 | No (different sport) |
6.2 Metrics¶
| Metric | Definition | Target |
|---|---|---|
| Precision | True matches / All proposed matches | ≥ 95% |
| Recall | True matches / All actual matches | ≥ 80% |
| F1 Score | 2 × (P × R) / (P + R) | ≥ 0.87 |
| False Positive Rate | False matches / All proposed | ≤ 5% |
| Latency | Time per market pair comparison | ≤ 50ms |
6.3 Evaluation Protocol¶
# Run evaluation against golden set
cargo run --features discovery -- --evaluate-matching --golden-set data/golden_pairs.json
# Output:
# Precision: 96.2%
# Recall: 82.4%
# F1 Score: 0.888
# False Positive Rate: 3.8%
# Average Latency: 12.3ms
7. Phase 3: Embedding-Based Semantic Matching¶
7.1 Overview¶
Embedding-based matching captures semantic similarity that fingerprint matching may miss. By representing market titles as dense vectors in a semantic space, we can identify matches even when there's no lexical overlap.
Key Insight: Embeddings trained on general text understand that "Super Bowl" and "Pro Football Championship" are semantically related, even though they share no words.
7.2 Model Selection¶
Candidate Models¶
| Model | Dimensions | Latency | Domain Fit | Cost |
|---|---|---|---|---|
| `all-MiniLM-L6-v2` | 384 | 15ms | Medium | Free (local) |
| `all-mpnet-base-v2` | 768 | 40ms | High | Free (local) |
| `text-embedding-3-small` | 1536 | 50ms | High | $0.00002/1K tokens |
| `voyage-finance-2` | 1024 | 60ms | High (finance) | $0.00012/1K tokens |
| `e5-large-v2` | 1024 | 35ms | High | Free (local) |
Selection Criteria¶
def evaluate_model(model_name: str, golden_pairs: list) -> ModelMetrics:
"""Evaluate embedding model on prediction market pairs."""
model = load_model(model_name)
# Compute embeddings
embeddings = {}
for pair in golden_pairs:
embeddings[pair.kalshi_id] = model.encode(pair.kalshi_title)
embeddings[pair.poly_id] = model.encode(pair.poly_title)
# Calculate metrics
true_positives = 0
false_positives = 0
for pair in golden_pairs:
sim = cosine_similarity(
embeddings[pair.kalshi_id],
embeddings[pair.poly_id]
)
if pair.is_match:
if sim >= 0.70:
true_positives += 1
else:
if sim >= 0.70:
false_positives += 1
return ModelMetrics(
precision=true_positives / (true_positives + false_positives),
recall=true_positives / sum(p.is_match for p in golden_pairs),
avg_latency=measure_latency(model)
)
Recommended Model¶
Primary: all-mpnet-base-v2 for local deployment (best accuracy/latency tradeoff)
Alternative: text-embedding-3-small if API latency is acceptable
7.3 Vector Storage Architecture¶
Option A: SQLite with sqlite-vec (Simple)¶
-- Schema extension for embeddings
CREATE VIRTUAL TABLE market_embeddings USING vec0(
market_id TEXT PRIMARY KEY,
embedding FLOAT[768] -- Match model dimensions
);
-- Fast ANN search
SELECT market_id, distance
FROM market_embeddings
WHERE embedding MATCH ?
AND k = 50 -- Top 50 candidates
ORDER BY distance;
**Pros:** Simple, single-file database, no additional infrastructure
**Cons:** In-memory index, limited scalability
Option B: PostgreSQL with pgvector (Production)¶
-- Enable extension
CREATE EXTENSION vector;
-- Add embedding column
ALTER TABLE discovered_markets
ADD COLUMN embedding vector(768);
-- Create IVFFlat index for ANN search
CREATE INDEX ON discovered_markets
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Query similar markets
SELECT platform_id, title,
1 - (embedding <=> query_embedding) AS similarity
FROM discovered_markets
WHERE platform = 'polymarket'
ORDER BY embedding <=> query_embedding
LIMIT 50;
**Pros:** Scalable, mature, supports filtering during search
**Cons:** Requires PostgreSQL infrastructure
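At current market counts (low tens of thousands), a brute-force cosine scan is a reasonable interim step before committing to either storage option; a dependency-free sketch with toy 2-dimensional vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def top_k(query, corpus, k=2):
    """corpus: list of (market_id, embedding); returns k nearest by cosine."""
    return sorted(corpus, key=lambda item: cosine(query, item[1]),
                  reverse=True)[:k]

corpus = [("m1", [1.0, 0.0]), ("m2", [0.7, 0.7]), ("m3", [0.0, 1.0])]
[mid for mid, _ in top_k([1.0, 0.1], corpus)]  # → ["m1", "m2"]
```

An exhaustive scan over 50k markets with 768-dim vectors is on the order of 10^7 multiply-adds per query, so ANN indexing only becomes necessary once the catalog or query rate grows well beyond that.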
7.4 Hybrid Scoring Algorithm¶
/// Combine fingerprint, embedding, and text similarity scores
pub struct HybridMatcher {
fingerprint_matcher: FingerprintMatcher,
embedding_matcher: EmbeddingMatcher,
text_matcher: TextSimilarityMatcher,
// Configurable weights (tuned via feedback)
weights: HybridWeights,
}
#[derive(Clone)]
pub struct HybridWeights {
pub fingerprint: f64, // Default: 0.50
pub embedding: f64, // Default: 0.40
pub text: f64, // Default: 0.10
}
impl HybridMatcher {
pub async fn score(&self, kalshi: &Market, poly: &Market) -> HybridScore {
// Run all matchers in parallel
let (fp_score, emb_score, text_score) = tokio::join!(
self.fingerprint_matcher.score(kalshi, poly),
self.embedding_matcher.score(kalshi, poly),
self.text_matcher.score(kalshi, poly),
);
let combined = self.weights.fingerprint * fp_score.score
+ self.weights.embedding * emb_score.similarity
+ self.weights.text * text_score.combined;
HybridScore {
combined,
fingerprint: fp_score,
embedding: emb_score,
text: text_score,
is_candidate: combined >= 0.70,
}
}
}
7.5 Confidence Calibration¶
Raw similarity scores need calibration to meaningful confidence levels:
from sklearn.isotonic import IsotonicRegression
class ConfidenceCalibrator:
def __init__(self):
self.calibrator = IsotonicRegression(out_of_bounds='clip')
def fit(self, scores: list[float], labels: list[bool]):
"""Fit calibrator on historical match decisions."""
self.calibrator.fit(scores, [1.0 if l else 0.0 for l in labels])
def calibrate(self, score: float) -> float:
"""Convert raw score to calibrated probability."""
return self.calibrator.predict([score])[0]
Target Calibration: A score of 0.80 should mean "80% of pairs with this score are true matches"
7.6 Fine-Tuning Pipeline¶
When sufficient training data is available (500+ pairs), fine-tune the embedding model:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
def fine_tune_on_matches(
base_model: str,
positive_pairs: list[tuple[str, str]],
negative_pairs: list[tuple[str, str]],
output_path: str
):
"""Fine-tune embedding model on prediction market pairs."""
model = SentenceTransformer(base_model)
# Create training examples
train_examples = []
for k_title, p_title in positive_pairs:
train_examples.append(InputExample(texts=[k_title, p_title], label=1.0))
for k_title, p_title in negative_pairs:
train_examples.append(InputExample(texts=[k_title, p_title], label=0.0))
# Use contrastive loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model)
# Fine-tune
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100,
output_path=output_path
)
return model
Expected Improvement: +5-10% F1 score on domain-specific pairs after fine-tuning
8. Phase 4: LLM-Based Verification¶
8.1 Overview¶
LLM verification provides human-level reasoning for complex cases where algorithmic matching is uncertain. It excels at:
- Understanding paraphrased questions
- Comparing resolution criteria semantically
- Identifying subtle differences in scope or timing
8.2 Verification Prompt Engineering¶
Primary Verification Prompt¶
<system>
You are an expert analyst for prediction markets. Your task is to determine if two markets from different platforms are semantically equivalent - meaning they will resolve the same way for the same real-world outcome.
</system>
<user>
Compare these two prediction markets:
**MARKET A (Kalshi)**
- Title: {kalshi_title}
- Resolution Criteria: {kalshi_rules}
- Expiration: {kalshi_expiration}
**MARKET B (Polymarket)**
- Title: {poly_title}
- Resolution Criteria: {poly_rules}
- Expiration: {poly_expiration}
Analyze the following:
1. **Same Event?** Are both markets about the identical real-world event (not just similar topics)?
2. **Outcome Alignment?** Would "Yes" on Market A always correspond to "Yes" on Market B?
3. **Resolution Compatibility?** Are the resolution criteria functionally equivalent?
4. **Timing Differences?** Could different resolution timing cause different outcomes?
5. **Scope Differences?** Do they cover the same geographic/temporal/jurisdictional scope?
Respond in JSON format:
{
"equivalent": true|false,
"confidence": 0.0-1.0,
"same_event": true|false,
"outcome_aligned": true|false,
"resolution_compatible": true|false,
"reasoning": "Brief explanation",
"warnings": ["List of potential issues"],
"resolution_differences": ["Specific criteria differences if any"]
}
</user>
Resolution Deep-Dive Prompt (for complex cases)¶
<system>
You are a legal analyst specializing in prediction market resolution criteria. Analyze resolution clauses for semantic equivalence.
</system>
<user>
Compare these resolution criteria in detail:
**Criteria A:**
{criteria_a}
**Criteria B:**
{criteria_b}
Analyze:
1. **Resolution Source**: Who/what determines the outcome? Same authority?
2. **Resolution Timing**: When is the outcome determined? Same timeframe?
3. **Threshold Definition**: What constitutes Yes vs No? Same threshold?
4. **Edge Cases**: How are ambiguous situations handled? Compatible?
5. **Invalidation Conditions**: What causes market cancellation? Same conditions?
Provide structured comparison with compatibility score (0-100).
</user>
8.3 Cost-Optimized Invocation Strategy¶
Tiered Model Selection¶
pub struct LlmVerifier {
haiku_client: AnthropicClient, // ~$0.001/verification
sonnet_client: AnthropicClient, // ~$0.01/verification
daily_budget: AtomicU64,
daily_spend: AtomicU64,
}
impl LlmVerifier {
pub async fn verify(&self, pair: &CandidateMatch) -> Result<LlmResult, Error> {
// Check budget
if self.daily_spend.load(Ordering::SeqCst) >= self.daily_budget.load(Ordering::SeqCst) {
return Err(Error::BudgetExceeded);
}
// Use Haiku for initial screening
let haiku_result = self.verify_with_haiku(pair).await?;
// Escalate to Sonnet if uncertain or high-value
if haiku_result.confidence < 0.85 || pair.estimated_volume > 10_000.0 {
let sonnet_result = self.verify_with_sonnet(pair).await?;
return Ok(sonnet_result);
}
Ok(haiku_result)
}
}
Invocation Rules¶
| Condition | Action | Estimated Cost |
|---|---|---|
| Fingerprint score < 0.60 | Skip LLM (reject) | $0 |
| Fingerprint score 0.60-0.85 | Invoke Haiku | $0.001 |
| Fingerprint score > 0.85 | Skip LLM (approve) | $0 |
| Haiku uncertain (<0.85) | Escalate to Sonnet | $0.01 |
| High-value market (>$10k volume) | Always use Sonnet | $0.01 |
| Semantic warnings present | Always use Sonnet | $0.01 |
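The invocation rules above reduce to a small routing function. This is a sketch: the thresholds come from the table, the function and return values are illustrative, and the precedence where rules overlap (e.g., a low fingerprint score on a high-value market) is our assumption:

```python
def route(fp_score: float, est_volume: float, has_warnings: bool) -> str:
    """Decide LLM usage for a candidate pair per the tiered rules."""
    # High-value markets and warned pairs always get the stronger model,
    # even when the fingerprint score would otherwise short-circuit.
    if has_warnings or est_volume > 10_000:
        return "sonnet"
    if fp_score < 0.60:
        return "reject"   # skip LLM entirely
    if fp_score > 0.85:
        return "approve"  # confident enough without an LLM
    return "haiku"        # cheap screening; LlmVerifier escalates if uncertain

route(0.72, 500.0, False)     # → "haiku"
route(0.90, 50_000.0, False)  # → "sonnet"
```

The middle band (0.60-0.85) is where the $0.001 Haiku call buys the most information per dollar, since both rejection and approval are genuinely uncertain there.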
Budget Management¶
pub struct BudgetManager {
daily_limit_cents: u64,
current_spend_cents: AtomicU64,
last_reset: AtomicU64,
}
impl BudgetManager {
pub fn can_spend(&self, amount_cents: u64) -> bool {
self.maybe_reset_daily();
let current = self.current_spend_cents.load(Ordering::SeqCst);
current + amount_cents <= self.daily_limit_cents
}
pub fn record_spend(&self, amount_cents: u64) {
self.current_spend_cents.fetch_add(amount_cents, Ordering::SeqCst);
}
}
Default Budget: $50/day (~50,000 Haiku calls or ~5,000 Sonnet calls at the per-verification costs above)
8.4 Response Parsing and Validation¶
#[derive(Deserialize, Debug)]
pub struct LlmVerificationResult {
pub equivalent: bool,
pub confidence: f64,
pub same_event: bool,
pub outcome_aligned: bool,
pub resolution_compatible: bool,
pub reasoning: String,
pub warnings: Vec<String>,
pub resolution_differences: Vec<String>,
}
impl LlmVerificationResult {
/// Validate LLM response for consistency
pub fn validate(&self) -> Result<(), ValidationError> {
// Confidence must be between 0 and 1
if self.confidence < 0.0 || self.confidence > 1.0 {
return Err(ValidationError::InvalidConfidence);
}
// If equivalent is true, all sub-checks should be true
if self.equivalent && (!self.same_event || !self.outcome_aligned) {
return Err(ValidationError::InconsistentFlags);
}
// Must have reasoning
if self.reasoning.is_empty() {
return Err(ValidationError::MissingReasoning);
}
Ok(())
}
/// Convert to human-readable report
pub fn to_report(&self) -> String {
format!(
"Equivalent: {} (confidence: {:.0}%)\n\
Reasoning: {}\n\
Warnings: {}\n\
Resolution Differences: {}",
if self.equivalent { "Yes" } else { "No" },
self.confidence * 100.0,
self.reasoning,
self.warnings.join(", "),
self.resolution_differences.join("; ")
)
}
}
8.5 Human Review of LLM Decisions¶
Initially, all LLM-verified matches require human confirmation:
┌─────────────────────────────────────────────────────────────┐
│ LLM Verification Result │
├─────────────────────────────────────────────────────────────┤
│ Kalshi: "Will Trump buy Greenland?" │
│ Polymarket: "Will the US acquire part of Greenland?" │
│ │
│ LLM Says: EQUIVALENT (confidence: 92%) │
│ │
│ Reasoning: Both markets resolve on US acquisition of │
│ Greenland territory. Kalshi frames as "Trump" action, │
│ Polymarket as "US" action, but resolution criteria │
│ both require actual transfer of territory. │
│ │
│ Warnings: │
│ - Different expiration dates (2029 vs 2026) │
│ │
│ [✓ Approve] [✗ Reject] [🔍 View Details] │
└─────────────────────────────────────────────────────────────┘
Auto-Approval Criteria (after calibration)¶
After 100+ LLM decisions have been human-reviewed:
impl AutoApprovalPolicy {
pub fn can_auto_approve(&self, result: &LlmVerificationResult) -> bool {
// High confidence
if result.confidence < 0.95 {
return false;
}
// No warnings
if !result.warnings.is_empty() {
return false;
}
// No resolution differences
if !result.resolution_differences.is_empty() {
return false;
}
// Historical accuracy check
if self.llm_historical_accuracy() < 0.98 {
return false;
}
true
}
}
9. Phase 5: Reinforcement Learning from Human Feedback¶
9.1 Overview¶
Human approval decisions are a rich source of training data. By systematically capturing and learning from these decisions, we can continuously improve all matching components.
Key Insight: Every human approval/rejection is a labeled training example that improves future matching accuracy.
9.2 Feedback Data Schema¶
CREATE TABLE match_decisions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    candidate_id UUID REFERENCES candidates(id),
    created_at TIMESTAMP DEFAULT now(),
    -- Decision
    decision TEXT CHECK (decision IN ('approved', 'rejected', 'modified')),
    reviewer_id TEXT NOT NULL,
    -- Context at decision time (for ML features)
    fingerprint_score REAL,
    embedding_similarity REAL,
    text_similarity REAL,
    llm_confidence REAL,
    llm_equivalent BOOLEAN,
    -- Human feedback
    rejection_reason TEXT,
    modification_notes TEXT,
    -- Entity corrections (for alias learning)
    entity_corrections JSONB,
    -- Example: {"kalshi_entity": "BTC", "poly_entity": "Bitcoin", "canonical": "Bitcoin"}
    -- Resolution analysis
    resolution_compatible BOOLEAN,
    resolution_notes TEXT,
    -- Training flags
    include_in_training BOOLEAN DEFAULT true,
    training_weight REAL DEFAULT 1.0  -- Higher for difficult cases
);

CREATE INDEX idx_decisions_training ON match_decisions(include_in_training)
    WHERE include_in_training = true;
9.3 Feedback Collection Pipeline¶
pub struct FeedbackCollector {
    storage: Arc<Storage>,
    incremental_learning_enabled: bool,
}

impl FeedbackCollector {
    /// Record a human decision with all context
    pub async fn record_decision(
        &self,
        candidate: &CandidateMatch,
        decision: Decision,
        reviewer: &str,
    ) -> Result<Uuid, Error> {
        // Compute the weight before moving fields out of `decision`
        let training_weight = self.compute_training_weight(candidate, &decision);
        let feedback = MatchDecision {
            id: Uuid::new_v4(),
            candidate_id: candidate.id,
            decision: decision.verdict,
            reviewer_id: reviewer.to_string(),
            // Capture all scores for feature analysis
            fingerprint_score: candidate.fingerprint_score,
            embedding_similarity: candidate.embedding_similarity,
            text_similarity: candidate.text_score,
            llm_confidence: candidate.llm_result.as_ref().map(|r| r.confidence),
            llm_equivalent: candidate.llm_result.as_ref().map(|r| r.equivalent),
            // Human feedback
            rejection_reason: decision.rejection_reason,
            entity_corrections: decision.entity_corrections,
            resolution_notes: decision.resolution_notes,
            include_in_training: true,
            training_weight,
        };
        self.storage.insert_decision(&feedback).await?;
        // Trigger incremental learning if enabled
        if self.incremental_learning_enabled {
            self.trigger_incremental_update(&feedback).await?;
        }
        Ok(feedback.id)
    }

    /// Compute training weight (prioritize difficult/educational cases)
    fn compute_training_weight(&self, candidate: &CandidateMatch, decision: &Decision) -> f64 {
        let mut weight = 1.0;
        // Hard negatives (high score but rejected) are valuable
        if decision.verdict == "rejected" && candidate.fingerprint_score > 0.6 {
            weight *= 2.0;
        }
        // Cases with entity corrections are valuable
        if decision.entity_corrections.is_some() {
            weight *= 1.5;
        }
        // Edge cases near threshold are valuable
        if (candidate.fingerprint_score - 0.70).abs() < 0.1 {
            weight *= 1.5;
        }
        weight.min(5.0) // Cap at 5x
    }
}
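The weighting rules can be exercised in a few lines; an illustrative Python translation of `compute_training_weight` above (not the production code):

```python
def training_weight(verdict: str, fingerprint_score: float,
                    has_entity_corrections: bool) -> float:
    """Python rendition of the training-weight rules (illustrative only)."""
    weight = 1.0
    # Hard negatives: confidently scored but rejected by a human
    if verdict == "rejected" and fingerprint_score > 0.6:
        weight *= 2.0
    # Decisions that came with entity corrections teach the alias learner
    if has_entity_corrections:
        weight *= 1.5
    # Cases near the 0.70 threshold are the most informative
    if abs(fingerprint_score - 0.70) < 0.1:
        weight *= 1.5
    return min(weight, 5.0)  # cap at 5x
```

For example, a rejected candidate scoring 0.72 with entity corrections hits all three multipliers (2.0 × 1.5 × 1.5 = 4.5).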
9.4 Automatic Alias Learning¶
pub struct AliasLearner {
    alias_db: Arc<RwLock<AliasDatabase>>,
}

impl AliasLearner {
    /// Learn from approved matches with entity differences
    pub async fn learn_from_approval(&self, decision: &MatchDecision) {
        // Learn from explicit corrections
        if let Some(corrections) = &decision.entity_corrections {
            for correction in corrections {
                self.add_alias(&correction.canonical, &correction.kalshi_entity).await;
                self.add_alias(&correction.canonical, &correction.poly_entity).await;
            }
        }
        // Learn implicit aliases from matched entity pairs
        let kalshi_entities = self.extract_entities(&decision.kalshi_title);
        let poly_entities = self.extract_entities(&decision.poly_title);
        for (k_ent, p_ent) in self.align_entities(&kalshi_entities, &poly_entities) {
            if k_ent.name != p_ent.name
                && k_ent.entity_type == p_ent.entity_type
                && self.string_similarity(&k_ent.name, &p_ent.name) < 0.5
            {
                // Different strings, same type, low similarity = alias
                log::info!("Learned alias: {} <-> {}", k_ent.name, p_ent.name);
                self.add_bidirectional_alias(&k_ent.name, &p_ent.name).await;
            }
        }
    }

    async fn add_alias(&self, canonical: &str, alias: &str) {
        let mut db = self.alias_db.write().await;
        db.add(canonical, alias);
    }
}
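A toy in-memory version of the alias store shows the intended lookup behavior (Python, purely illustrative; the real `AliasDatabase` is persistent and Rust-side):

```python
from collections import defaultdict

class AliasDatabase:
    """Toy bidirectional alias store: alias -> canonical and back."""
    def __init__(self) -> None:
        self.canonical: dict[str, str] = {}            # lowercased alias -> canonical
        self.aliases: defaultdict[str, set] = defaultdict(set)

    def add(self, canonical: str, alias: str) -> None:
        self.canonical[alias.lower()] = canonical
        self.aliases[canonical].add(alias)

    def resolve(self, name: str) -> str:
        """Map any known alias back to its canonical form; pass through unknowns."""
        return self.canonical.get(name.lower(), name)

db = AliasDatabase()
db.add("Bitcoin", "BTC")
db.add("Bitcoin", "bitcoin")
```

Lowercasing on insert and lookup makes resolution case-insensitive, which matches the case-insensitive entity patterns used elsewhere in the pipeline.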
9.5 Fingerprint Weight Optimization¶
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def optimize_weights(decisions: list[MatchDecision]) -> dict[str, float]:
    """
    Use logistic regression to find optimal field weights.

    Each decision provides features (field scores) and a label (approved/rejected).
    The learned coefficients indicate optimal weights.
    """
    # Extract features
    X = np.array([
        [d.entity_score, d.date_score, d.threshold_score,
         d.outcome_score, d.source_score]
        for d in decisions
    ])
    # Labels: 1 for approved, 0 for rejected
    y = np.array([1 if d.decision == 'approved' else 0 for d in decisions])
    # Fit logistic regression
    model = LogisticRegression(penalty='l2', C=1.0)
    model.fit(X, y)
    # Cross-validation
    scores = cross_val_score(model, X, y, cv=5)
    print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
    # Extract and normalize weights
    raw_weights = np.abs(model.coef_[0])
    normalized = raw_weights / raw_weights.sum()
    return {
        'entity': normalized[0],
        'date': normalized[1],
        'threshold': normalized[2],
        'outcome': normalized[3],
        'source': normalized[4],
    }
9.6 Embedding Model Retraining¶
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def retrain_embedding_model(
    base_model_path: str,
    decisions: list[MatchDecision],
    test_pairs: list,
    output_path: str,
) -> EvaluationMetrics:
    """
    Retrain embedding model on accumulated human decisions.
    """
    # Load base model
    model = SentenceTransformer(base_model_path)
    # Create training examples from human decisions
    train_examples = []
    for d in decisions:
        if d.decision == 'approved':
            train_examples.append(
                InputExample(texts=[d.kalshi_title, d.poly_title], label=1.0)
            )
        elif d.decision == 'rejected' and d.fingerprint_score > 0.4:
            # Include hard negatives only
            train_examples.append(
                InputExample(texts=[d.kalshi_title, d.poly_title], label=0.0)
            )
    # Fine-tune
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.ContrastiveLoss(model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=2,
        warmup_steps=50,
    )
    # Save
    model.save(output_path)
    # Evaluate on held-out test set
    return evaluate_model(output_path, test_pairs)
9.7 Continuous Improvement Pipeline¶
┌─────────────────────────────────────────────────────────────┐
│ Weekly Improvement Cycle │
├─────────────────────────────────────────────────────────────┤
│ │
│ Monday: Data Export │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Export new decisions from past week │ │
│ │ Update golden set with new test cases │ │
│ │ Calculate current metrics baseline │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Tuesday: Model Training │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Retrain embedding model on accumulated decisions │ │
│ │ Optimize fingerprint weights via logistic regression │ │
│ │ Update alias database with learned aliases │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Wednesday: Validation │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Evaluate new models on golden set │ │
│ │ Compare metrics to baseline │ │
│ │ Flag any regressions │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Thursday-Saturday: A/B Testing │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Deploy new model to 10% of traffic │ │
│ │ Monitor precision/recall in production │ │
│ │ Collect additional feedback │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Sunday: Promotion Decision │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ If A/B metrics improve: promote new model to 100% │ │
│ │ If metrics regress: rollback to previous version │ │
│ │ Update model registry with results │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
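The Sunday promotion step reduces to a simple comparison. A Python sketch, assuming the 0.02 regression tolerance from the model registry (section 9.8) also governs A/B promotion:

```python
def promotion_decision(baseline_f1: float, candidate_f1: float,
                       max_regression: float = 0.02) -> str:
    """Promote on improvement, roll back on regression, otherwise keep testing.

    The 0.02 tolerance mirrors the rollback threshold used by the model
    registry; its use here for A/B promotion is an assumption.
    """
    if candidate_f1 >= baseline_f1:
        return "promote"
    if baseline_f1 - candidate_f1 > max_regression:
        return "rollback"
    return "hold"  # within tolerance: keep A/B running, gather more data
```

The "hold" branch avoids whipsawing on noisy week-over-week metrics when the candidate is statistically indistinguishable from the baseline.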
9.8 Model Versioning and Rollback¶
pub struct ModelRegistry {
    storage: Arc<Storage>,
    current_version: AtomicU64,
}

impl ModelRegistry {
    /// Register a new model version
    pub async fn register_version(
        &self,
        model_type: ModelType,
        artifact_path: &str,
        metrics: &EvaluationMetrics,
    ) -> Result<ModelVersion, Error> {
        let version = ModelVersion {
            id: Uuid::new_v4(),
            model_type,
            artifact_path: artifact_path.to_string(),
            precision: metrics.precision,
            recall: metrics.recall,
            f1_score: metrics.f1_score,
            created_at: Utc::now(),
            is_active: false,
            training_decisions_count: metrics.training_size,
        };
        self.storage.insert_model_version(&version).await?;
        Ok(version)
    }

    /// Promote a version to active (with automatic rollback on failure)
    pub async fn promote(&self, version_id: Uuid) -> Result<(), Error> {
        let previous = self.get_active_version().await?;
        // Activate new version
        self.storage.set_active_version(version_id).await?;
        // Monitor for 1 hour
        tokio::time::sleep(Duration::from_secs(3600)).await;
        // Check if metrics degraded
        let live_metrics = self.collect_live_metrics().await?;
        if live_metrics.f1_score < previous.f1_score - 0.02 {
            log::warn!("New model degraded metrics, rolling back");
            self.storage.set_active_version(previous.id).await?;
            return Err(Error::RollbackTriggered);
        }
        Ok(())
    }
}
10. Operational Excellence¶
This section covers deployment, monitoring, security, and related operational practices for running the market discovery system in production.
Availability Target: 99.9% uptime (43.8 minutes/month downtime allowed)
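As a sanity check, the downtime budget follows directly from the availability target, taking an average month of 730.5 hours:

```python
# 99.9% availability leaves a 0.1% downtime budget per average month
hours_per_month = 365.25 * 24 / 12            # 730.5 hours
downtime_minutes = hours_per_month * 60 * (1 - 0.999)
# ~43.8 minutes/month, matching the stated target
```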
10.1 Deployment Architecture¶
The discovery feature deploys as part of the Trading Core ECS service with optional feature flag enablement:
┌─────────────────────────────────────────────────────────────┐
│ AWS Region │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐ │
│ │ ECS Cluster │ │
│ │ ┌───────────────────┐ ┌───────────────────┐ │ │
│ │ │ Trading Core │ │ Trading Core │ │ │
│ │ │ (--features │ │ (--features │ │ │
│ │ │ discovery) │ │ discovery) │ │ │
│ │ │ │ │ │ │ │
│ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ │
│ │ │ │ Scanner │ │ │ │ Scanner │ │ │ │
│ │ │ │ Actor │ │ │ │ Actor │ │ │ │
│ │ │ └─────────────┘ │ │ └─────────────┘ │ │ │
│ │ │ │ │ │ │ │
│ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ │
│ │ │ │ SQLite │ │ │ │ SQLite │ │ │ │
│ │ │ │ (local) │ │ │ │ (local) │ │ │ │
│ │ │ └─────────────┘ │ │ └─────────────┘ │ │ │
│ │ └─────────┬─────────┘ └─────────┬─────────┘ │ │
│ │ │ │ │ │
│ └────────────┼──────────────────────┼──────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Aurora PostgreSQL (future) │ │
│ │ (shared state for multi-instance scaling) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ AWS Secrets Manager │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Kalshi Keys │ │ Poly Keys │ │ LLM API Keys│ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Single-Instance MVP: - One scanner active at a time (ECS desired count = 1) - SQLite local storage sufficient for ~10,000 markets - No coordination needed between instances
Multi-Instance (Future Scaling): - PostgreSQL for shared candidate storage - Distributed locking for scan coordination (Redis) - Leader election for single-scanner pattern
10.2 CI/CD Pipeline¶
Feature-Gated Testing:
# .github/workflows/ci.yml
name: CI
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
jobs:
  base-tests:
    name: Base Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo test --manifest-path arbiter-engine/Cargo.toml
  discovery-tests:
    name: Discovery Feature Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Run unit tests
        run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery
      - name: Run integration tests
        run: cargo test --manifest-path arbiter-engine/Cargo.toml --features discovery -- --ignored
        env:
          KALSHI_DEMO_KEY_ID: ${{ secrets.KALSHI_DEMO_KEY_ID }}
          KALSHI_DEMO_PRIVATE_KEY: ${{ secrets.KALSHI_DEMO_PRIVATE_KEY }}
  security-audit:
    name: Security Audit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo install cargo-audit
      - run: cargo audit
        working-directory: arbiter-engine
Deployment Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ Deployment Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ PR │───►│ CI/CD │───►│ Review │ │
│ │ Created │ │ Tests │ │ Required │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Merge │◄───│ Approval │ │
│ │ to main │ │ │ │
│ └────────────┘ └────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────┐ │
│ │ Build │ │
│ │ Docker │ │
│ └────────────┘ │
│ │ │
│ ┌────────────┼────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Staging │ │ E2E │ │ Council │ │
│ │ Deploy │ │ Tests │ │ Review │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ └────────────┼────────────┘ │
│ ▼ │
│ ┌────────────┐ │
│ │ Production │ │
│ │ Deploy │ │
│ └────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
10.3 Monitoring & Observability¶
CloudWatch Metrics:
| Metric | Type | Description | Dashboard |
|---|---|---|---|
| `discovery/scan/duration` | Timer | Scan cycle duration | Discovery Health |
| `discovery/scan/errors` | Counter | Scan failures | Discovery Health |
| `discovery/candidates/count` | Gauge | Candidates generated | Candidate Funnel |
| `discovery/candidates/pending` | Gauge | Awaiting review | Candidate Funnel |
| `discovery/candidates/approved` | Counter | Approved matches | Candidate Funnel |
| `discovery/candidates/rejected` | Counter | Rejected candidates | Candidate Funnel |
| `discovery/api/rate_limits` | Counter | Rate limit errors | API Performance |
| `discovery/api/latency` | Timer | API response time | API Performance |
| `discovery/approvals/rate` | Gauge | Approval percentage | Quality Metrics |
Structured Logging:
// Example structured log output
tracing::info!(
scan_id = %scan_id,
platform = "polymarket",
markets_fetched = markets.len(),
candidates_generated = candidates.len(),
duration_ms = elapsed.as_millis(),
"Discovery scan completed"
);
Log Levels:
- ERROR: Scan failures, API errors, database corruption
- WARN: Rate limit warnings, retry attempts, degraded mode
- INFO: Scan completions, candidate counts, approval decisions
- DEBUG: Individual market processing, matching scores
Dashboard Panels:
- Discovery Health Overview
  - Scan success rate (24h rolling)
  - Average scan duration
  - Error count by type
- Candidate Funnel
  - Generated → Pending → Approved/Rejected
  - Conversion rates
  - Time in pending state
- API Performance
  - Latency p50/p95/p99 by platform
  - Rate limit error rate
  - Request volume
10.4 Security Hardening¶
API Security:
| Platform | Authentication | Credential Storage | Rotation |
|---|---|---|---|
| Polymarket Gamma | None (public) | N/A | N/A |
| Kalshi | RSA-PSS | AWS Secrets Manager | 90 days |
| LLM APIs (Phase 4) | API Key | AWS Secrets Manager | 30 days |
Rate Limiting: - Token bucket implementation prevents API abuse - Configurable limits per platform - Automatic backoff on 429 responses
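A minimal token-bucket sketch in Python (the rate and capacity values are placeholders; the production limiter is Rust-side):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: refill continuously, spend per request."""
    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should back off (e.g. after a 429)

# Allow bursts of 5 requests, refilling at 10 requests/second (placeholder values)
bucket = TokenBucket(rate=10.0, capacity=5.0)
```

A full bucket absorbs a burst up to `capacity`; sustained load is then throttled to `rate` requests per second, which is what keeps the scanner under each platform's published limits.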
Audit Trail (FR-MD-009):
{
  "timestamp": "2026-01-23T10:30:00Z",
  "event_type": "candidate_approved",
  "candidate_id": "uuid-here",
  "reviewer_id": "operator@example.com",
  "kalshi_market": "KXGREENLAND-29",
  "poly_market": "greenland-2026",
  "fingerprint_score": 0.82,
  "warnings_acknowledged": ["Different expiration dates"],
  "decision_notes": "Verified resolution criteria compatible"
}
Access Control (Future): - Discovery CLI requires shell access (current) - RBAC for approval workflow (Phase 2) - Audit trail for all decisions
10.5 Scaling Strategy¶
Phase 1 (MVP - Current):
| Parameter | Value |
|---|---|
| Scanner instances | 1 |
| Markets per platform | ~2,000 |
| Scan interval | 1 hour |
| Storage | SQLite (local) |
| Estimated cost | ~$50/month (ECS) |
Phase 2 (Scale):
| Parameter | Value |
|---|---|
| Scanner instances | 2-3 (leader election) |
| Markets per platform | ~10,000 |
| Scan interval | 15 minutes |
| Storage | PostgreSQL (Aurora) |
| Estimated cost | ~$200/month |
Phase 3 (Embedding):
| Parameter | Value |
|---|---|
| Embedding service | Separate container |
| Vector database | pgvector extension |
| GPU acceleration | Optional (batch jobs) |
| Estimated cost | ~$50/month additional |
Phase 4 (LLM):
| Parameter | Value |
|---|---|
| LLM service | Claude API |
| Budget controls | $50/day default |
| Caching | Response cache (24h) |
| Estimated cost | ~$50-150/month |
10.6 Disaster Recovery¶
Backup Strategy:
| Data | Backup Frequency | Retention | Storage |
|---|---|---|---|
| SQLite DB | Daily | 30 days | S3 Standard |
| Audit logs | Hourly | 90 days hot, 7 years cold | S3 + Glacier |
| Configuration | On change | 1 year | S3 + Git |
Recovery Procedures:
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Scanner crash | 5 min | 0 | ECS auto-restart, health checks |
| Database corruption | 30 min | 24 hr | Restore from S3, verify integrity |
| API outage (external) | N/A | N/A | Graceful degradation, alerts |
| Region failure | 4 hr | 1 hr | Cross-region restore, DNS failover |
Graceful Degradation Modes:
- Polymarket API Down:
  - Continue scanning Kalshi only
  - Alert on-call
  - Retry with exponential backoff
- Kalshi API Down:
  - Continue scanning Polymarket only
  - Alert on-call
  - Retry with exponential backoff
- Database Unavailable:
  - Enter read-only mode
  - Serve cached candidates
  - Alert P1
- Embedding Service Down (Phase 3):
  - Fallback to fingerprint-only matching
  - Log degraded mode
  - No data loss
11. Risk Analysis¶
11.1 Technical Risks¶
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Entity extraction misses novel entities | Medium | Medium | Extensible pattern system, ML fallback |
| False positive matches lead to bad trades | Low | High | Human verification required (FR-MD-003) |
| API rate limiting blocks discovery | Medium | Low | Configurable backoff, caching |
| Resolution criteria differ subtly | High | High | Semantic warning system |
11.2 Operational Risks¶
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| External service (Matchr/Dome) becomes unavailable | Medium | Low | Local matching is primary |
| Market format changes break parsing | Low | Medium | Robust error handling, alerts |
| High volume of candidates overwhelms reviewers | Medium | Medium | Confidence thresholds, batching |
11.3 Business Risks¶
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Competitors adopt better matching | Medium | Medium | Modular architecture allows upgrades |
| Platform TOS prohibit cross-platform arbitrage | Low | High | Legal review, compliance monitoring |
12. Conclusion¶
12.1 Recommended Approach¶
Based on our analysis, we recommend a five-phase approach with progressively sophisticated matching:
- Phase 1: Text similarity matching ✅ (Implemented)
- Phase 2: Fingerprint-based matching with rule-based NER (Proposed)
- Phase 3: Embedding-based semantic matching (hybrid scoring) (Proposed)
- Phase 4: LLM verification for uncertain/high-value matches (Proposed)
- Phase 5: Continuous improvement via human feedback learning (Proposed)
This approach: - Addresses the fundamental failure of text similarity - Aligns with industry best practices (entity extraction) - Preserves human-in-the-loop safety requirements - Creates a virtuous cycle where human decisions improve future matching - Provides graceful degradation (each phase works independently)
12.2 Implementation Priority¶
| Phase | Scope | Priority | Effort |
|---|---|---|---|
| 2a | Fingerprint schema + rule-based NER | Must | Medium |
| 2b | Fingerprint matcher + weighted scoring | Must | Medium |
| 2c | Golden set validation + tuning | Must | Low |
| 3a | Embedding infrastructure + model selection | Should | Medium |
| 3b | Hybrid scoring integration | Should | Medium |
| 3c | Embedding fine-tuning pipeline | Could | High |
| 4a | LLM verification prompts + integration | Should | Medium |
| 4b | Automated escalation rules | Should | Low |
| 4c | Resolution deep analysis | Could | Medium |
| 5a | Feedback data collection | Should | Low |
| 5b | Automatic alias/weight learning | Should | Medium |
| 5c | Continuous retraining pipeline | Could | High |
12.3 Success Criteria¶
| Metric | Phase 2 Target | Phase 3+ Target |
|---|---|---|
| Recall | ≥ 70% | ≥ 85% |
| Precision | ≥ 90% | ≥ 95% |
| F1 Score | ≥ 0.78 | ≥ 0.90 |
| Latency (p99) | ≤ 50ms | ≤ 200ms |
| Human verification | 100% | 100% (safety preserved) |
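Since F1 is the harmonic mean of precision and recall, the targets above can be sanity-checked in a couple of lines: the Phase 2 floors alone clear the ≥ 0.78 target, while the Phase 3+ floors land just under 0.90, so at least one of precision or recall must modestly exceed its floor to hit the F1 target.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

phase2 = f1(0.90, 0.70)   # 0.7875 -> meets the >= 0.78 Phase 2 target
phase3 = f1(0.95, 0.85)   # ~0.897 -> just under the >= 0.90 Phase 3+ target
```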
12.4 Key Innovation: Learning from Human Decisions¶
The most significant architectural decision is treating human approvals/rejections as training data:
┌─────────────────┐
│ Human Reviews │
└────────┬────────┘
│
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌─────────────────┐ ┌───────────────┐
│ Alias Updates │ │ Weight Tuning │ │ Model Retrain │
└───────────────┘ └─────────────────┘ └───────────────┘
│ │ │
└────────────────────┼────────────────────┘
▼
┌─────────────────┐
│ Improved Models │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Better Matches │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Human Reviews │ ← (cycle continues)
└─────────────────┘
This creates a data flywheel where each human decision makes the system smarter, reducing future human workload while maintaining safety.
13. References¶
Industry Tools¶
- pmxt - Unified API for prediction markets
- Dome API - Developer infrastructure
- Matchr - Cross-platform aggregator
- EventArb - Arbitrage calculator
- Polymarket-Kalshi-Arbitrage-Bot
Research¶
- Awesome Prediction Market Tools
- Semantic Non-Fungibility research
- Semantic Trading research
- Prediction Market Arbitrage Guide
API Documentation¶
Appendix A: Entity Pattern Reference¶
// Full entity pattern list for rule-based NER
const ENTITY_PATTERNS: &[(&str, EntityType)] = &[
    // Politicians
    (r"(?i)\bTrump\b", EntityType::Person),
    (r"(?i)\bBiden\b", EntityType::Person),
    (r"(?i)\bHarris\b", EntityType::Person),
    (r"(?i)\bObama\b", EntityType::Person),
    (r"(?i)\bDeSantis\b", EntityType::Person),
    (r"(?i)\bNewsom\b", EntityType::Person),
    // Tech figures
    (r"(?i)\bMusk\b", EntityType::Person),
    (r"(?i)\bZuckerberg\b", EntityType::Person),
    (r"(?i)\bAltman\b", EntityType::Person),
    // Cryptocurrencies
    (r"(?i)\b(Bitcoin|BTC)\b", EntityType::Asset),
    (r"(?i)\b(Ethereum|ETH)\b", EntityType::Asset),
    (r"(?i)\b(Solana|SOL)\b", EntityType::Asset),
    (r"(?i)\b(XRP|Ripple)\b", EntityType::Asset),
    // Stocks/Indices
    (r"(?i)\b(S&P|SPX|SPY)\b", EntityType::Asset),
    (r"(?i)\b(Nasdaq|QQQ)\b", EntityType::Asset),
    (r"(?i)\b(Tesla|TSLA)\b", EntityType::Asset),
    (r"(?i)\b(Nvidia|NVDA)\b", EntityType::Asset),
    // Central banks
    (r"(?i)\b(Fed|Federal Reserve|FOMC)\b", EntityType::Institution),
    (r"(?i)\b(ECB)\b", EntityType::Institution),
    (r"(?i)\b(BoE|Bank of England)\b", EntityType::Institution),
    (r"(?i)\b(BoJ|Bank of Japan)\b", EntityType::Institution),
    // Economic indicators
    (r"(?i)\bCPI\b", EntityType::EconomicIndicator),
    (r"(?i)\bGDP\b", EntityType::EconomicIndicator),
    (r"(?i)\bNFP\b", EntityType::EconomicIndicator),
    (r"(?i)\b(unemployment|jobless)\b", EntityType::EconomicIndicator),
    (r"(?i)\binflation\b", EntityType::EconomicIndicator),
    // Sports events
    (r"(?i)\bSuper Bowl\b", EntityType::Event),
    (r"(?i)\bWorld Series\b", EntityType::Event),
    (r"(?i)\bNBA Finals\b", EntityType::Event),
    (r"(?i)\bStanley Cup\b", EntityType::Event),
    (r"(?i)\bWorld Cup\b", EntityType::Event),
    // Locations
    (r"(?i)\bGreenland\b", EntityType::Location),
    (r"(?i)\bUkraine\b", EntityType::Location),
    (r"(?i)\bTaiwan\b", EntityType::Location),
    (r"(?i)\bPanama\b", EntityType::Location),
    // Price targets
    (r"\$[\d,]+(?:\.\d+)?(?:k|K|M|B)?", EntityType::PriceTarget),
    // Dates
    (r"(?i)\b(20\d{2})\b", EntityType::Year),
    (r"(?i)\bQ[1-4]\b", EntityType::Quarter),
];
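The pattern table can be exercised directly; a minimal Python equivalent of the rule-based extractor, using four of the patterns above (the Rust extractor is the actual implementation):

```python
import re

# A few of the patterns from the table above, transcribed to Python re syntax
PATTERNS = [
    (r"(?i)\bTrump\b", "Person"),
    (r"(?i)\bGreenland\b", "Location"),
    (r"\$[\d,]+(?:\.\d+)?(?:k|K|M|B)?", "PriceTarget"),
    (r"(?i)\b(20\d{2})\b", "Year"),
]

def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (matched_text, entity_type) pairs, grouped by pattern."""
    found = []
    for pattern, entity_type in PATTERNS:
        for m in re.finditer(pattern, text):
            found.append((m.group(0), entity_type))
    return found

ents = extract_entities("Will Trump buy Greenland before 2029 for $1.5B?")
```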
Appendix B: Sample Fingerprint Extraction¶
Input (Kalshi):
Title: "Will Trump buy Greenland?"
Rules: "Resolves Yes if US purchases at least part of Greenland from Denmark before January 20, 2029"
Output:
{
  "entity": {
    "name": "Trump",
    "entity_type": "Person",
    "aliases": ["Donald Trump", "DJT"]
  },
  "secondary_entities": [
    { "name": "Greenland", "entity_type": "Location" },
    { "name": "Denmark", "entity_type": "Location" },
    { "name": "US", "entity_type": "Location" }
  ],
  "event_type": "Acquisition",
  "metric": null,
  "scope": { "region": "US", "jurisdiction": "Federal" },
  "resolution": {
    "date": "2029-01-20",
    "timezone": null,
    "source": null,
    "criteria": "US purchases at least part of Greenland from Denmark"
  },
  "outcomes": {
    "outcome_type": "Binary",
    "outcomes": ["Yes", "No"]
  }
}
End of White Paper