Backtest vs Live Divergence
The InDecision 54% to 21.5% drift. Three root causes. Why alignment tests should be a deploy gate on every scoring system. Detection: compare backtest PnL distribution to live PnL distribution nightly, before the divergence compounds into months of bad trades.
The InDecision engine had a published backtest showing a 54% win rate on strong-conviction signals. The live system, measured over comparable signals across an equivalent time window, was actually winning on 21.5% of trades.
A 32.5-point gap. Not noise. Not model decay. Not bad luck. The backtest and the live system were solving fundamentally different problems, and the gap had been compounding for weeks before anyone thought to do a shape comparison.
The Three Root Causes
When the divergence was finally investigated, three compounding bugs surfaced:
- Data leakage in the backtest. The backtest was computing features using data that would not have been available at the time of the signal. A feature that included same-day closing price was being computed on historical data that already had the close, producing an artificially strong signal. Live inference did not have access to the close (it was running mid-day) and produced weaker scores.
- Execution cost mismatch. The backtest assumed orders filled at midpoint prices with zero slippage. Live execution hit the ask (for buys) and the bid (for sells), losing 1-2% per round trip. Compounded across dozens of trades, this accounted for roughly 15 percentage points of the gap.
- Feature misalignment. Two features were computed with subtly different windows in training and inference. A 14-day rolling average in training was being compared against a 10-day rolling average in inference because of a bug in the live feature pipeline. The scores looked similar but encoded different information.
None of the three bugs alone would have caused a 32.5-point gap. Together they compounded. And none of them would have been caught by standard unit tests: each component individually passed its tests; the gap was in how the components composed at the system level.
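The leakage bug in particular is easy to reproduce in miniature. The sketch below is illustrative, not the InDecision codebase: a backtest feature built from a completed daily bar already contains the close, while the live path only sees the price observed at signal time. All names (`bars`, `backtest_feature`, `live_feature`) are hypothetical.

```python
# Illustrative daily bar: historical data already contains the close.
bars = {
    "2024-03-01": {"open": 100.0, "close": 104.0},
}

def backtest_feature(date):
    # BUG pattern: the "intraday momentum" feature is computed from a
    # completed bar, so it secretly knows how the day ended.
    bar = bars[date]
    return (bar["close"] - bar["open"]) / bar["open"]

def live_feature(date, last_price):
    # Live inference runs mid-day: only prices observed so far are available.
    bar = bars[date]
    return (last_price - bar["open"]) / bar["open"]

# Same signal, same day, different information sets:
print(backtest_feature("2024-03-01"))     # 0.04 (knows the close)
print(live_feature("2024-03-01", 101.0))  # 0.01 (mid-day price only)
```

The fix is point-in-time feature computation: the backtest must reconstruct exactly the data that was visible at signal time, not the data visible today.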
[Inline diagram: the backtest vs live win-rate gap]
The Alignment Test
The fix is a nightly alignment job that runs the backtest and the live system on the same window of signals and compares the output distributions:
```sql
-- Backtest rerun on last week
SELECT AVG(win) AS backtest_win_rate,
       COUNT(*) AS backtest_signals
FROM backtest_simulated_trades
WHERE created_at > NOW() - INTERVAL '7 days';

-- Live results on same window
SELECT AVG(CASE WHEN realized_pnl > 0 THEN 1 ELSE 0 END) AS live_win_rate,
       COUNT(*) AS live_signals
FROM live_trades
WHERE created_at > NOW() - INTERVAL '7 days';
```
If ABS(backtest_win_rate - live_win_rate) > 0.05, alert. A five-point gap is the trigger for investigation; a ten-point gap is the trigger to halt new trades until the divergence is explained.
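The two-threshold policy is small enough to state in a few lines. A sketch of the comparison step, assuming the two win rates have already been pulled from queries like the ones above (the function name and threshold defaults are illustrative):

```python
def alignment_status(backtest_win_rate, live_win_rate,
                     investigate_at=0.05, halt_at=0.10):
    """Map the backtest/live win-rate gap to an action."""
    gap = abs(backtest_win_rate - live_win_rate)
    if gap > halt_at:
        return "halt"         # stop new trades until the divergence is explained
    if gap > investigate_at:
        return "investigate"  # alert and dig in, but keep trading
    return "ok"

print(alignment_status(0.54, 0.215))  # "halt" (gap = 0.325)
```

Run nightly, the InDecision gap would have tripped the halt threshold on day one instead of compounding for weeks.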
The Three Compounding Detectors
Beyond the top-line win rate, the alignment test should also compare:
- Feature distribution statistics. Mean and standard deviation of each feature in training vs inference. Drift here catches alignment bugs early.
- Signal score distribution. Histogram of composite scores. A shift in the mean or tail of this histogram often appears before the win rate drops.
- Execution cost. Average slippage per trade, actual vs assumed. Drift here usually means the backtest is assuming ideal fills.
All three can be computed from already-persisted data in one query each, and they take less than a minute to run. Skipping them is how a 32.5-point gap hides for weeks.
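Assuming per-trade features, composite scores, and realized fills are already persisted, the three detectors reduce to summary statistics. A minimal sketch with hypothetical thresholds (none of these names come from the InDecision system):

```python
import statistics

def feature_drift(train_vals, live_vals, z_tol=0.5):
    """Flag if the live feature mean drifts more than z_tol training stdevs."""
    mu = statistics.mean(train_vals)
    sigma = statistics.stdev(train_vals)
    return abs(statistics.mean(live_vals) - mu) > z_tol * sigma

def score_shift(backtest_scores, live_scores, tol=0.05):
    """Flag a shift in the composite-score mean between backtest and live."""
    return abs(statistics.mean(backtest_scores)
               - statistics.mean(live_scores)) > tol

def slippage_drift(assumed_cost, realized_costs, tol=0.005):
    """Flag if realized per-trade slippage exceeds the backtest assumption."""
    return statistics.mean(realized_costs) - assumed_cost > tol

# Example: the backtest assumed midpoint fills (zero cost), while live
# trades lost roughly 1.5% per round trip.
print(slippage_drift(0.0, [0.012, 0.018, 0.015]))  # True
```

Comparing full histograms (e.g. with a KS statistic) catches tail shifts the mean misses, but even these one-line summaries would have flagged all three InDecision bugs.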
The Rule
Nightly alignment tests are mandatory on every scoring system with a backtest. The comparison covers win rate, feature distributions, score distribution, and execution cost. Any gap above tolerance halts deploys. The InDecision 54% to 21.5% divergence is the cautionary example, and the exact reason alignment testing is not optional.