Arithmetic Backtesting from Stored Components
The Hermes PR #28 backtest took five minutes of SQL instead of two hours of re-running 64 Perplexity and Grok API calls. If your scoring is deterministic and your components are persisted, backtesting a rebalance is arithmetic, not another agent run.
The Hermes PR #28 rebalance had to be validated before merging. The tempting approach was to run the 64 most recent signals back through Grok, Perplexity, Metaculus, and Manifold with the new weights applied. That would have taken two hours and burned API quota.
The actual approach took five minutes.
The Insight
Hermes persisted every component value alongside every signal. The database already contained, for each signal, the exact grok_score, perplexity_score, news_score, and calibration_score that went into the original composite. The composite was a deterministic weighted sum, which means a rebalance is pure arithmetic:
SELECT
  signal_id,
  grok_score + perplexity_score + news_score + calibration_score
    AS old_composite,
  (grok_score * 35.0 / 30.0) AS new_grok,
  (perplexity_score * 50.0 / 30.0) AS new_perplexity,
  news_score AS new_news,
  (calibration_score * 20.0 / 25.0) AS new_calibration,
  (grok_score * 35.0 / 30.0)
    + (perplexity_score * 50.0 / 30.0)
    + news_score
    + (calibration_score * 20.0 / 25.0) AS new_composite
FROM hermes_signals
WHERE created_at > NOW() - INTERVAL '14 days'
ORDER BY new_composite DESC;
One query. Every signal in the last two weeks. Old composite and new composite side by side. Filter by new_composite > 70 to count how many cleared the threshold. Zero API calls. Zero pipeline reruns.
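The same arithmetic can be sketched end to end with an in-memory SQLite table. This is a toy reproduction, not the Hermes schema: the column names follow the article, the signal IDs and scores are invented, and the 14-day WHERE clause is dropped since SQLite lacks Postgres interval syntax.

```python
# Toy arithmetic backtest: old and new composites from stored components.
# Schema and values are hypothetical; the weight ratios (30->35, 30->50,
# news unchanged, 25->20) match the rebalance described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hermes_signals (
        signal_id TEXT,
        grok_score REAL,
        perplexity_score REAL,
        news_score REAL,
        calibration_score REAL
    )
""")
conn.executemany(
    "INSERT INTO hermes_signals VALUES (?, ?, ?, ?, ?)",
    [
        ("sig-a", 24.0, 27.0, 12.0, 20.0),  # strong perplexity: gains under rebalance
        ("sig-b", 18.0, 12.0, 10.0, 15.0),  # weak overall: stays low either way
    ],
)

rows = conn.execute("""
    SELECT signal_id,
           grok_score + perplexity_score + news_score + calibration_score
               AS old_composite,
           grok_score * 35.0 / 30.0
               + perplexity_score * 50.0 / 30.0
               + news_score
               + calibration_score * 20.0 / 25.0
               AS new_composite
    FROM hermes_signals
    ORDER BY new_composite DESC
""").fetchall()

for signal_id, old, new in rows:
    print(signal_id, round(old, 1), round(new, 1))
# prints: sig-a 83.0 101.0
#         sig-b 55.0 63.0
```

No API call touches this path: the rebalance is re-scored for every stored signal by multiplying each persisted component by its new-to-old weight ratio.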
[Inline diagram: the two paths, a full two-hour pipeline rerun versus a single five-minute SQL query]
The Prerequisite
Arithmetic backtesting only works if the scoring system has been designed for it. Two conditions must hold:
- Determinism. Given the same component values, the composite must always produce the same result. No randomness, no time-dependent modifiers, no hidden state.
- Persistence. Every component's individual contribution is stored alongside the final composite, not just the composite alone.
Both conditions are cheap to meet. Determinism is usually free — you would have to go out of your way to make scoring nondeterministic. Persistence is one extra column per component in your signals table, or a single JSON blob if the components are variable. The storage cost is trivial compared to the optionality it unlocks.
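A minimal sketch of the persistence side, under assumed names: a `signals` table that stores the components as a JSON blob next to the composite, with a write path that guarantees the composite is always reproducible from the parts.

```python
# Hypothetical persistence sketch: components stored alongside the
# composite, so any future rebalance can be computed from this row alone.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE signals (
        signal_id TEXT PRIMARY KEY,
        components TEXT,   -- JSON blob, for when the component set varies
        composite REAL
    )
""")

def persist(conn, signal_id, components):
    # Deterministic rule assumed: composite is the sum of already-weighted
    # component contributions. No randomness, no hidden state.
    composite = sum(components.values())
    conn.execute(
        "INSERT INTO signals VALUES (?, ?, ?)",
        (signal_id, json.dumps(components), composite),
    )
    return composite

persist(conn, "sig-1",
        {"grok": 28.0, "perplexity": 45.0, "news": 12.0, "calibration": 16.0})

blob, composite = conn.execute(
    "SELECT components, composite FROM signals WHERE signal_id = 'sig-1'"
).fetchone()
# The stored composite must be recoverable from the stored components;
# if this ever fails, arithmetic backtesting is off the table.
assert sum(json.loads(blob).values()) == composite
```

The closing assertion doubles as a cheap consistency check worth running before any arithmetic backtest: if recomputing from stored parts does not reproduce the stored composite, one of the two prerequisites has been violated.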
Where This Pattern Applies
This is not just a Hermes trick. It works for any scoring, ranking, or classification system where the final score is a deterministic function of component inputs. Leverage's PolyEdge signal stack, Foresight's conviction tiers, Shiva's calibration Brier scores — all of them can be arithmetically backtested from stored components. The InDecision engine's conviction inversion was caught by a similar arithmetic path: correlate stored labels with actual PnL, no rerun required.
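The InDecision-style check reduces to the same pattern. A hedged sketch with invented trade data and hypothetical field names: group realized PnL by the stored conviction label and see whether the ordering holds, no reruns required.

```python
# Toy version of the conviction-vs-PnL check: purely arithmetic over
# stored rows. Field names and values are illustrative, not InDecision's.
from statistics import mean

trades = [
    {"conviction": 3, "pnl": -1.2},
    {"conviction": 3, "pnl": -0.8},
    {"conviction": 1, "pnl": 0.9},
    {"conviction": 1, "pnl": 1.4},
]

by_tier = {}
for t in trades:
    by_tier.setdefault(t["conviction"], []).append(t["pnl"])

avg = {tier: mean(pnls) for tier, pnls in by_tier.items()}
# If the highest-conviction tier underperforms the lowest, the labels
# are inverted -- exactly the bug a rerun would have burned quota to find.
inverted = avg[max(avg)] < avg[min(avg)]
print(avg, inverted)
```

For this toy data the high-conviction tier averages -1.0 against +1.15 for the low tier, so the inversion flag trips; on real stored trades the same few lines of grouping answer the same question.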
The Rule
If the system is deterministic and the components are persisted, backtests are SQL queries. Never burn API quota to validate a change that can be computed from data you already have. And if your current scoring system does not persist components, fix that before you ship the next rebalance. You are paying the storage cost once to unlock every future backtest for free.