Backtest vs Live Divergence
The InDecision 54% to 21.5% drift. Three root causes. Why alignment tests should be a deploy gate on every scoring system. Detection: compare backtest PnL distribution to live PnL distribution nightly, before the divergence compounds into months of bad trades.
The InDecision engine had a published backtest showing a 54% win rate on strong-conviction signals. The live system, measured over comparable signals across an equivalent time window, was actually winning on 21.5% of trades.
A 32.5-point gap. Not noise. Not model decay. Not bad luck. The backtest and the live system were solving fundamentally different problems, and the gap had been compounding for weeks before anyone thought to do a shape comparison.
The Three Root Causes
When the divergence was finally investigated, three compounding bugs surfaced:
- Data leakage in the backtest. The backtest was computing features using data that would not have been available at the time of the signal. A feature that included same-day closing price was being computed on historical data that already had the close, producing an artificially strong signal. Live inference did not have access to the close (it was running mid-day) and produced weaker scores.
- Execution cost mismatch. The backtest assumed orders filled at midpoint prices with zero slippage. Live execution hit the ask (for buys) and the bid (for sells), losing 1-2% per round trip. Compounded across dozens of trades, this accounted for roughly 15 percentage points of the gap.
- Feature misalignment. Two features were computed with subtly different windows in training and inference. A 14-day rolling average in training was being compared against a 10-day rolling average in inference because of a bug in the live feature pipeline. The scores looked similar but encoded different information.
None of the three bugs alone would have caused a 32.5-point gap. Together they compounded. And none of them would have been caught by standard unit tests: each component individually passed its tests; the gap was in how the components composed at the system level.
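The leakage bug in particular is easy to reproduce in miniature. The sketch below is illustrative, not the InDecision codebase: a backtest feature built from a completed daily bar already contains the close, while the live path only sees the price observed at signal time. All names (`bars`, `backtest_feature`, `live_feature`) are hypothetical.

```python
# Illustrative daily bar: historical data already contains the close.
bars = {
    "2024-03-01": {"open": 100.0, "close": 104.0},
}

def backtest_feature(date):
    # BUG pattern: the "intraday momentum" feature is computed from a
    # completed bar, so it secretly knows how the day ended.
    bar = bars[date]
    return (bar["close"] - bar["open"]) / bar["open"]

def live_feature(date, last_price):
    # Live inference runs mid-day: only prices observed so far are available.
    bar = bars[date]
    return (last_price - bar["open"]) / bar["open"]

# Same signal, same day, different information sets:
print(backtest_feature("2024-03-01"))     # 0.04 (knows the close)
print(live_feature("2024-03-01", 101.0))  # 0.01 (mid-day price only)
```

The fix is point-in-time feature computation: the backtest must reconstruct exactly the data that was visible at signal time, not the data visible today.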
[Inline diagram: the backtest vs live win-rate gap]
The Alignment Test
The fix is a nightly alignment job that runs the backtest and the live system on the same window of signals and compares the output distributions:
```sql
-- Backtest rerun on last week
SELECT AVG(win) AS backtest_win_rate,
       COUNT(*) AS backtest_signals
FROM backtest_simulated_trades
WHERE created_at > NOW() - INTERVAL '7 days';

-- Live results on same window
SELECT AVG(CASE WHEN realized_pnl > 0 THEN 1 ELSE 0 END) AS live_win_rate,
       COUNT(*) AS live_signals
FROM live_trades
WHERE created_at > NOW() - INTERVAL '7 days';
```
If ABS(backtest_win_rate - live_win_rate) > 0.05, alert. A five-point gap is the trigger for investigation; a ten-point gap is the trigger to halt new trades until the divergence is explained.
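The two-threshold policy is small enough to state in a few lines. A sketch of the comparison step, assuming the two win rates have already been pulled from queries like the ones above (the function name and threshold defaults are illustrative):

```python
def alignment_status(backtest_win_rate, live_win_rate,
                     investigate_at=0.05, halt_at=0.10):
    """Map the backtest/live win-rate gap to an action."""
    gap = abs(backtest_win_rate - live_win_rate)
    if gap > halt_at:
        return "halt"         # stop new trades until the divergence is explained
    if gap > investigate_at:
        return "investigate"  # alert and dig in, but keep trading
    return "ok"

print(alignment_status(0.54, 0.215))  # "halt" (gap = 0.325)
```

Run nightly, the InDecision gap would have tripped the halt threshold on day one instead of compounding for weeks.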
The Three Compounding Detectors
Beyond the top-line win rate, the alignment test should also compare:
- Feature distribution statistics. Mean and standard deviation of each feature in training vs inference. Drift here catches alignment bugs early.
- Signal score distribution. Histogram of composite scores. A shift in the mean or tail of this histogram often appears before the win rate drops.
- Execution cost. Average slippage per trade, actual vs assumed. Drift here usually means the backtest is assuming ideal fills.
All three can be computed from already-persisted data in one query each, and they take less than a minute to run. Skipping them is how a 32.5-point gap hides for weeks.
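Assuming per-trade features, composite scores, and realized fills are already persisted, the three detectors reduce to summary statistics. A minimal sketch with hypothetical thresholds (none of these names come from the InDecision system):

```python
import statistics

def feature_drift(train_vals, live_vals, z_tol=0.5):
    """Flag if the live feature mean drifts more than z_tol training stdevs."""
    mu = statistics.mean(train_vals)
    sigma = statistics.stdev(train_vals)
    return abs(statistics.mean(live_vals) - mu) > z_tol * sigma

def score_shift(backtest_scores, live_scores, tol=0.05):
    """Flag a shift in the composite-score mean between backtest and live."""
    return abs(statistics.mean(backtest_scores)
               - statistics.mean(live_scores)) > tol

def slippage_drift(assumed_cost, realized_costs, tol=0.005):
    """Flag if realized per-trade slippage exceeds the backtest assumption."""
    return statistics.mean(realized_costs) - assumed_cost > tol

# Example: the backtest assumed midpoint fills (zero cost), while live
# trades lost roughly 1.5% per round trip.
print(slippage_drift(0.0, [0.012, 0.018, 0.015]))  # True
```

Comparing full histograms (e.g. with a KS statistic) catches tail shifts the mean misses, but even these one-line summaries would have flagged all three InDecision bugs.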
The Rule
Nightly alignment tests are mandatory on every scoring system with a backtest. The comparison covers win rate, feature distributions, score distribution, and execution cost. Any gap above tolerance halts deploys. The InDecision 54% to 21.5% divergence is the cautionary example, and the exact reason alignment testing is not optional.