ASK KNOX
LESSON 268

Semantic Market Matching for Calibration

The Metaculus/Manifold calibrator pattern. Embedding-based matching with all-MiniLM-L6-v2, a 0.60 cosine threshold, semantic reranking of top candidates. Why it works 3.8% of the time for political markets and what that means for calibration strategy.

8 min read

The Hermes calibrator is a semantic matching pipeline. Given a Polymarket political question like "Will Donald Trump win the 2026 midterm?", it searches Metaculus and Manifold for equivalent community-forecasted questions and uses their probabilities as an external calibration anchor. The design is clean. The execution works. And it fires on only 3.8% of signals because the data sources are sparse — not because the pipeline is broken.

The Pipeline

  1. Fetch the Polymarket question text — the title, the description, and the outcome labels.
  2. Embed the question using all-MiniLM-L6-v2 via sentence-transformers. Produces a 384-dimensional vector.
  3. Query Metaculus and Manifold for all active questions in the relevant categories (politics, elections, US governance).
  4. Embed every candidate question with the same model. Cache aggressively — candidate embeddings can be stored and reused across signal evaluations.
  5. Compute cosine similarity between the Polymarket vector and every candidate vector. Keep candidates above 0.60.
  6. Semantically rerank the top 5 candidates with a more careful comparison (an LLM call or a stricter lexical check). This catches false positives from the embedding layer: questions that are worded similarly but ask different things.
  7. Extract community probability from the top-ranked candidate. Compare to the Hermes internal estimate. Score based on agreement.
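
Steps 5 and 6 can be sketched as a filter followed by a shortlist. This is a minimal illustration, not the Hermes implementation: the function names `cosine` and `shortlist` are hypothetical, and the toy 3-dimensional vectors stand in for the 384-dimensional embeddings that all-MiniLM-L6-v2 would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)

def shortlist(query_vec, candidates, threshold=0.60, top_k=5):
    """Return up to top_k (similarity, candidate_id) pairs above threshold, best first."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates.items()]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(reverse=True)
    return kept[:top_k]

# Toy vectors (assumption): real inputs would be cached sentence embeddings.
query = [1.0, 0.0, 0.0]
candidates = {
    "near_duplicate": [0.9, 0.1, 0.0],
    "related_topic": [0.5, 0.9, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}
matches = shortlist(query, candidates)
print([cid for _, cid in matches])
```

Only the shortlist goes to the reranker, so the expensive comparison runs on at most five candidates per signal. An empty shortlist is the no-match case.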

The 0.60 threshold was tuned empirically. Below 0.60 the matches are mostly noise. Above 0.60 the semantic rerank has a real candidate to evaluate. The threshold is one of the few tuning knobs in the whole pipeline.
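
The lesson does not describe how the threshold was tuned. One plausible approach, sketched here under that assumption, is to hand-label a sample of question pairs and sweep the cutoff. The `sweep` helper and the labeled pairs below are made up for illustration.

```python
def sweep(labeled, thresholds):
    """For each threshold, report how many pairs pass and the fraction that are true matches."""
    rows = []
    for t in thresholds:
        passed = [is_match for sim, is_match in labeled if sim > t]
        precision = sum(passed) / len(passed) if passed else None
        rows.append((t, len(passed), precision))
    return rows

# Made-up (similarity, hand-labeled equivalence) pairs; not real Hermes data.
labeled = [
    (0.35, False), (0.48, False), (0.55, False), (0.58, True),
    (0.63, True), (0.71, False), (0.78, True), (0.90, True),
]
rows = sweep(labeled, [0.40, 0.50, 0.60, 0.70])
for t, kept, precision in rows:
    print(t, kept, precision)
```

The count column matters as much as precision: a stricter cutoff hands the reranker fewer candidates, which lowers an already low fire rate.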

Why 3.8% Fire Rate

The matcher is not broken. The data is sparse. Metaculus and Manifold have strong coverage of:

  • AI and technology forecasting
  • Geopolitical events (war, elections at the country level)
  • Science and research milestones
  • Economic indicators

They have weak coverage of:

  • Individual US House and Senate races
  • State-level ballot initiatives
  • Specific legislative vote outcomes
  • County-level policy questions

Hermes's Polymarket diet is heavily weighted toward the weak-coverage categories — exactly the markets where retail traders create liquidity and Hermes might find edge. The 96.2% no-match rate is a structural consequence of asymmetric coverage, not a bug.
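
The two headline numbers are complements of each other. A minimal sketch of the arithmetic, assuming a hypothetical batch of 1,000 signals with counts chosen to reproduce the 3.8% figure; `fire_rate` and `below_floor` are illustrative names, and the floor values are not from the lesson.

```python
def fire_rate(outcomes):
    """Fraction of signals for which the calibrator found a match (True = fired)."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

def below_floor(outcomes, floor):
    """True when the observed fire rate is under an alerting floor."""
    return fire_rate(outcomes) < floor

# Hypothetical batch: 38 matched signals out of 1000.
outcomes = [True] * 38 + [False] * 962
rate = fire_rate(outcomes)
print(f"fire rate {rate:.1%}, no-match rate {1 - rate:.1%}")
```

Tracking this per market category, rather than as one global number, would show the coverage asymmetry directly.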

Inline Diagram: Matching Pipeline

[Diagram: calibrator pipeline, semantic match. Result: 96.2% no match due to data sparsity; political races are underrepresented on both platforms.]

The Strategic Implication

The Hermes story has a clean lesson: you can ship a semantically correct calibrator that is mathematically doomed by the distribution of its inputs. Detect that with ceiling analysis (Lesson 259). Mitigate it with rebalancing (Lesson 262). Monitor it with fire-rate alerts (Lesson 258). All three lessons are necessary because unit tests check the pipeline's logic, not the distribution of its inputs. A sparse calibrator passes every standard unit test and still rarely fires.

The Rule

Semantic matching is a first-class pattern for external calibration. The stack is embedding + cosine filter + semantic rerank. Tune the threshold empirically. Monitor the fire rate obsessively. Treat the output as additive unless the data landscape guarantees high match rates — which, for political markets on Polymarket, it does not.
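
One reading of "additive" is that the calibrator contributes a bonus when it fires and nothing when it does not, so a missing match never blocks or penalizes a signal. The sketch below follows that reading. The agreement formula, the `max_bonus` parameter, and the function name are assumptions, not the Hermes scoring rule.

```python
from typing import Optional

def calibration_bonus(internal_p: float, community_p: Optional[float],
                      max_bonus: float = 0.10) -> float:
    """Additive score adjustment: 0.0 with no match, up to max_bonus for full agreement."""
    if community_p is None:
        return 0.0  # no match: the signal is neither rewarded nor penalized
    # Assumed agreement measure: 1.0 when the probabilities are equal, 0.0 when maximally apart.
    agreement = 1.0 - abs(internal_p - community_p)
    return max_bonus * agreement

print(calibration_bonus(0.60, None))   # no external anchor found
print(calibration_bonus(0.60, 0.55))   # community forecast close to the internal estimate
```

With a 96.2% no-match rate, a design that required a match would discard almost every signal. An additive term lets the rare matches help without making the rest unscoreable.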