Semantic Market Matching for Calibration

The Agent Framework calibrator is a semantic matching pipeline. Given a Polymarket political question like "Will Republicans control the House after the 2026 midterms?", it searches Metaculus and Manifold for equivalent community-forecasted questions and uses their probabilities as an external calibration anchor. The design is clean. The execution works. And it fires on only 3.8% of signals because the data sources are sparse — not because the pipeline is broken.

The Pipeline

Fetch the Polymarket question text — the title, the description, and the outcome labels.
Embed the question using all-MiniLM-L6-v2 via sentence-transformers. Produces a 384-dimensional vector.
Query Metaculus and Manifold for all active questions in the relevant categories (politics, elections, US governance).
Embed every candidate question with the same model. Cache aggressively — candidate embeddings can be stored and reused across signal evaluations.
Compute cosine similarity between the Polymarket vector and every candidate vector. Keep candidates above 0.60.
Semantic rerank the top 5 candidates via a more careful comparison (LLM call or stricter lexical check). This catches false positives from the embedding layer — questions that are syntactically similar but semantically different.
Extract community probability from the top-ranked candidate. Compare to the Agent Framework internal estimate. Score based on agreement.

The 0.60 threshold was tuned empirically. Below 0.60 the matches are mostly noise. Above 0.60 the semantic rerank has a real candidate to evaluate. The threshold is one of the few tuning knobs in the whole pipeline.

Why 3.8% Fire Rate

The matcher is not broken. The data is sparse. Metaculus and Manifold have strong coverage of:

AI and technology forecasting
Geopolitical events (war, elections at the country level)
Science and research milestones
Economic indicators

They have weak coverage of:

Individual US House and Senate races
State-level ballot initiatives
Specific legislative vote outcomes
County-level policy questions

Agent Framework's Polymarket diet is heavily weighted toward the weak-coverage categories — exactly the markets where retail traders create liquidity and Agent Framework might find edge. The 96.2% no-match rate is a structural consequence of asymmetric coverage, not a bug.

Inline Diagram — Matching Pipeline

The Strategic Implication

The Agent Framework story has a clean lesson: you can ship a semantically correct calibrator that is mathematically doomed by the distribution of its inputs. Detect that with ceiling analysis (the “Ceiling Analysis Before Shipping” lesson in the Quantitative Scoring track). Mitigate it with rebalancing (the “Rebalance Without Moving the Threshold” lesson in that same track). Monitor it with fire-rate alerts (the “Fire-Rate Monitoring” lesson there as well). All three are necessary precisely because a sparse calibrator is not diagnosable by any standard unit test.

The Rule

Semantic matching is a first-class pattern for external calibration. The stack is embedding + cosine filter + semantic rerank. Tune the threshold empirically. Monitor the fire rate obsessively. Treat the output as additive unless the data landscape guarantees high match rates — which, for political markets on Polymarket, it does not.

The Null-Match Fallback

When the pipeline finds no candidate above the 0.60 cosine threshold, the correct behavior is to abstain — score the signal as flat (0.5) and log the null-match event. This is not a pipeline failure. It is a deliberate fallback path that should be treated as a first-class outcome.

The alternative — forcing the nearest-neighbor match regardless of similarity score — is the failure mode. A forced match with cosine 0.42 does not mean "weak signal from a related question." It means the matching layer found nothing useful and picked the least-bad option from a sea of irrelevant questions. Using that probability as a calibration anchor injects structured noise into the downstream score. Over many cycles, forced matches poison the calibration distribution in ways that are difficult to diagnose because each individual match looks plausible.

The diagnostic signature of the forced-match trap: calibration scores cluster slightly above or below 0.5 in a way that feels like signal but produces no lift when you back-test against outcomes. The average anchors are subtly wrong, consistently, in the direction of the nearest-neighbor bias.

The Abstain Pattern

COSINE_THRESHOLD = 0.60

def match_or_abstain(
    query_embedding: list[float],
    candidates: list[dict],
    model,
) -> dict:
    """
    Return the best match if above threshold, otherwise return a flat abstain.
    Never force a match below the threshold.
    """
    if not candidates:
        return {"matched": False, "reason": "no_candidates", "score": 0.5}

    # Compute cosine similarity for each candidate
    similarities = [
        cosine_similarity(query_embedding, c["embedding"])
        for c in candidates
    ]
    best_idx = max(range(len(similarities)), key=lambda i: similarities[i])
    best_score = similarities[best_idx]

    if best_score < COSINE_THRESHOLD:
        return {
            "matched": False,
            "reason": "below_threshold",
            "best_cosine": best_score,
            "score": 0.5,          # flat — no calibration signal
        }

    # Above threshold — proceed to semantic rerank
    best_candidate = candidates[best_idx]
    return {
        "matched": True,
        "source": best_candidate["source"],
        "community_probability": best_candidate["probability"],
        "cosine": best_score,
        "score": best_candidate["probability"],
    }

The score: 0.5 on a null match is intentional. It tells the downstream calibrator "this component abstained — treat it as a neutral contribution." In this scoring system, non-firing components are imputed at 0.5 (a neutral prior — no evidence either way), which is exactly why an explicit 0.5 abstain is arithmetically identical to the component not firing: both contribute 0.5 × weight to the weighted sum. This is preferable to omitting the return value entirely, because downstream consumers do not need special null-handling logic.

Null-Match Rate as a Health Metric

The null-match rate — the fraction of signal evaluations where no confident match was found — is a first-class health metric for the calibrator. It belongs in the same monitoring layer as the fire rate.

class MatchMetrics:
    def __init__(self):
        self.total = 0
        self.matched = 0
        self.null_matched = 0

    def record(self, result: dict) -> None:
        self.total += 1
        if result["matched"]:
            self.matched += 1
        else:
            self.null_matched += 1

    def null_match_rate(self) -> float:
        return self.null_matched / self.total if self.total > 0 else 0.0

    def log_cycle(self, cycle: int) -> None:
        rate = self.null_match_rate()
        logger.info(
            "calibrator_metrics cycle=%d total=%d matched=%d null=%d null_rate=%.3f",
            cycle, self.total, self.matched, self.null_matched, rate,
        )
        if rate > 0.97:
            logger.warning(
                "null_match_rate=%.3f above 97pct — data sources may be stale or offline",
                rate,
            )

Baseline. For Agent Framework's political-market diet, the null-match rate is approximately 96.2% — meaning 96.2% of evaluations find no confident external anchor. That is not alarming; it reflects the data sparsity described earlier in this lesson. The baseline is the reference point.

Alert thresholds. A null-match rate that rises from baseline — say from 96% to 99% — is a signal that data sources may be down, rate-limited, or stale. The pipeline is running but finding nothing at all, which is worse than the usual 96%. Alert on a statistically significant increase above the established baseline.

A null-match rate that falls — say from 96% to 40% — is also a signal, but an optimistic one: either the data sources have improved coverage, or the cosine threshold has been lowered accidentally. Investigate either direction when the rate moves more than a few percentage points from the baseline window.

Why Silent Forced Matches Poison Downstream Calibration

The forced-nearest-neighbor trap is particularly dangerous in a weighted-sum scoring system because the corruption is diffuse. A single forced match with cosine 0.35 produces a community probability from a loosely related question — maybe 0.65 where the true calibration anchor would be 0.48. That 0.17 shift propagates through the weight and ends up as a small positive nudge to the final score. On any single signal, this is invisible noise.

Across 200 evaluations in a cycle, 50% of which are forced matches (hypothetically), the calibration component is consistently biased by 0.05–0.10 in the direction of the nearest-neighbor distribution. The scoring system appears to be working — signals are scored, thresholds are cleared, trades are placed. But the calibration is systematically wrong in a direction that is difficult to reverse-engineer from outcomes alone.

The fix is not better matching. It is a strict abstain-or-match boundary with no forced fallthrough.