Arithmetic Backtesting from Stored Components
The Hermes PR #28 backtest took five minutes of SQL instead of two hours of re-running 64 Perplexity and Grok API calls. If your scoring is deterministic and your components are persisted, backtesting a rebalance is arithmetic, not another agent run.
The Hermes PR #28 rebalance had to be validated before merging. The tempting approach was to run the 64 most recent signals back through Grok, Perplexity, Metaculus, and Manifold with the new weights applied. That would have taken two hours and burned API quota.
The actual approach took five minutes.
The Insight
Hermes persisted every component value alongside every signal. The database already contained, for each signal, the exact grok_score, perplexity_score, news_score, and calibration_score that went into the original composite. The composite was a deterministic weighted sum, which means a rebalance is pure arithmetic:
SELECT
  signal_id,
  grok_score + perplexity_score + news_score + calibration_score
    AS old_composite,
  (grok_score * 35.0 / 30.0) AS new_grok,
  (perplexity_score * 50.0 / 30.0) AS new_perplexity,
  news_score AS new_news,
  (calibration_score * 20.0 / 25.0) AS new_calibration,
  (grok_score * 35.0 / 30.0)
    + (perplexity_score * 50.0 / 30.0)
    + news_score
    + (calibration_score * 20.0 / 25.0) AS new_composite
FROM hermes_signals
WHERE created_at > NOW() - INTERVAL '14 days'
ORDER BY new_composite DESC;
One query. Every signal in the last two weeks. Old composite and new composite side by side. Filter by new_composite > 70 to count how many cleared the threshold. Zero API calls. Zero pipeline reruns.
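The same arithmetic can be sketched end to end with an in-memory SQLite table. This is a toy reproduction, not the Hermes schema: the column names follow the article, the signal IDs and scores are invented, and the 14-day WHERE clause is dropped since SQLite lacks Postgres interval syntax.

```python
# Toy arithmetic backtest: old and new composites from stored components.
# Schema and values are hypothetical; the weight ratios (30->35, 30->50,
# news unchanged, 25->20) match the rebalance described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hermes_signals (
        signal_id TEXT,
        grok_score REAL,
        perplexity_score REAL,
        news_score REAL,
        calibration_score REAL
    )
""")
conn.executemany(
    "INSERT INTO hermes_signals VALUES (?, ?, ?, ?, ?)",
    [
        ("sig-a", 24.0, 27.0, 12.0, 20.0),  # strong perplexity: gains under rebalance
        ("sig-b", 18.0, 12.0, 10.0, 15.0),  # weak overall: stays low either way
    ],
)

rows = conn.execute("""
    SELECT signal_id,
           grok_score + perplexity_score + news_score + calibration_score
               AS old_composite,
           grok_score * 35.0 / 30.0
               + perplexity_score * 50.0 / 30.0
               + news_score
               + calibration_score * 20.0 / 25.0
               AS new_composite
    FROM hermes_signals
    ORDER BY new_composite DESC
""").fetchall()

for signal_id, old, new in rows:
    print(signal_id, round(old, 1), round(new, 1))
# prints: sig-a 83.0 101.0
#         sig-b 55.0 63.0
```

No API call touches this path: the rebalance is re-scored for every stored signal by multiplying each persisted component by its new-to-old weight ratio.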
[Inline diagram: the two paths, a full two-hour pipeline rerun versus a single five-minute SQL query]
The Prerequisite
Arithmetic backtesting only works if the scoring system has been designed for it. Two conditions must hold:
- Determinism. Given the same component values, the composite must always produce the same result. No randomness, no time-dependent modifiers, no hidden state.
- Persistence. Every component's individual contribution is stored alongside the final composite, not just the composite alone.
Both conditions are cheap to meet. Determinism is usually free — you would have to go out of your way to make scoring nondeterministic. Persistence is one extra column per component in your signals table, or a single JSON blob if the components are variable. The storage cost is trivial compared to the optionality it unlocks.
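A minimal sketch of the persistence side, under assumed names: a `signals` table that stores the components as a JSON blob next to the composite, with a write path that guarantees the composite is always reproducible from the parts.

```python
# Hypothetical persistence sketch: components stored alongside the
# composite, so any future rebalance can be computed from this row alone.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE signals (
        signal_id TEXT PRIMARY KEY,
        components TEXT,   -- JSON blob, for when the component set varies
        composite REAL
    )
""")

def persist(conn, signal_id, components):
    # Deterministic rule assumed: composite is the sum of already-weighted
    # component contributions. No randomness, no hidden state.
    composite = sum(components.values())
    conn.execute(
        "INSERT INTO signals VALUES (?, ?, ?)",
        (signal_id, json.dumps(components), composite),
    )
    return composite

persist(conn, "sig-1",
        {"grok": 28.0, "perplexity": 45.0, "news": 12.0, "calibration": 16.0})

blob, composite = conn.execute(
    "SELECT components, composite FROM signals WHERE signal_id = 'sig-1'"
).fetchone()
# The stored composite must be recoverable from the stored components;
# if this ever fails, arithmetic backtesting is off the table.
assert sum(json.loads(blob).values()) == composite
```

The closing assertion doubles as a cheap consistency check worth running before any arithmetic backtest: if recomputing from stored parts does not reproduce the stored composite, one of the two prerequisites has been violated.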
Where This Pattern Applies
This is not just a Hermes trick. It works for any scoring, ranking, or classification system where the final score is a deterministic function of component inputs. Leverage's PolyEdge signal stack, Foresight's conviction tiers, Shiva's calibration Brier scores — all of them can be arithmetically backtested from stored components. The InDecision engine's conviction inversion was caught by a similar arithmetic path: correlate stored labels with actual PnL, no rerun required.
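The InDecision-style check reduces to the same pattern. A hedged sketch with invented trade data and hypothetical field names: group realized PnL by the stored conviction label and see whether the ordering holds, no reruns required.

```python
# Toy version of the conviction-vs-PnL check: purely arithmetic over
# stored rows. Field names and values are illustrative, not InDecision's.
from statistics import mean

trades = [
    {"conviction": 3, "pnl": -1.2},
    {"conviction": 3, "pnl": -0.8},
    {"conviction": 1, "pnl": 0.9},
    {"conviction": 1, "pnl": 1.4},
]

by_tier = {}
for t in trades:
    by_tier.setdefault(t["conviction"], []).append(t["pnl"])

avg = {tier: mean(pnls) for tier, pnls in by_tier.items()}
# If the highest-conviction tier underperforms the lowest, the labels
# are inverted -- exactly the bug a rerun would have burned quota to find.
inverted = avg[max(avg)] < avg[min(avg)]
print(avg, inverted)
```

For this toy data the high-conviction tier averages -1.0 against +1.15 for the low tier, so the inversion flag trips; on real stored trades the same few lines of grouping answer the same question.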
The Rule
If the system is deterministic and the components are persisted, backtests are SQL queries. Never burn API quota to validate a change that can be computed from data you already have. And if your current scoring system does not persist components, fix that before you ship the next rebalance. You are paying the storage cost once to unlock every future backtest for free.