Ceiling Analysis Before Shipping
The pre-flight check every scoring system needs. Build the spreadsheet before you write the scoring code. What is the maximum achievable score without each component? Can the threshold be cleared on realistic fire rates? The Hermes failure would have been caught in five minutes by anyone willing to do arithmetic first.
Before you write a scoring function, build the spreadsheet.
Five minutes of arithmetic would have caught the Hermes calibrator failure before any code shipped. Ceiling analysis is cheaper than unit tests, faster than backtesting, and catches a failure mode that neither of those layers can detect.
The Spreadsheet
Every scoring system gets a ceiling analysis before the first line of code is written. It has five columns and one row per component:
| Component | Max | Expected Fire Rate | Effective Contribution | Ceiling Without This |
|---|---|---|---|---|
| Grok | 30 | 82% | 24.6 | 70 |
| Perplexity | 30 | 75% | 22.5 | 70 |
| News | 15 | 90% | 13.5 | 85 |
| Calibrator | 25 | 3.8% | 0.95 | 75 |
| Total | 100 | — | 61.55 | — |
| Threshold | — | — | 70 | — |
Three numbers jump out of this table, and together they settle the ship/no-ship question:
- Effective contribution is 61.55. That is the expected score on an average signal. It is below the threshold of 70. The average signal cannot clear the bar.
- The calibrator's expected contribution is 0.95. Twenty-five points of maximum, less than one point of expected contribution. That is a dead component sitting in the middle of the rubric.
- The ceiling without the calibrator is 75, only 5 points above the threshold. The 96.2% of signals that fail to match a calibrator market must score at least 70 of a possible 75 on everything else — near-perfect on every remaining component — so in practice they are locked out.
Every one of those numbers is visible before you write a single if statement. The analysis takes five minutes. It costs nothing except the honesty of writing down real fire rates instead of aspirational ones.
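The whole spreadsheet is a few lines of arithmetic. A minimal sketch in Python, using the illustrative weights and fire rates from the table above:

```python
# Ceiling analysis: compute expected contribution per component and the
# ceiling with each component removed. Weights and fire rates are the
# illustrative numbers from the table above.
components = {
    "Grok":       {"max": 30, "fire_rate": 0.82},
    "Perplexity": {"max": 30, "fire_rate": 0.75},
    "News":       {"max": 15, "fire_rate": 0.90},
    "Calibrator": {"max": 25, "fire_rate": 0.038},
}
THRESHOLD = 70

total_max = sum(c["max"] for c in components.values())
expected = sum(c["max"] * c["fire_rate"] for c in components.values())

print(f"Expected score on an average signal: {expected:.2f} (threshold {THRESHOLD})")
for name, c in components.items():
    contribution = c["max"] * c["fire_rate"]
    ceiling_without = total_max - c["max"]
    print(f"  {name:11s} contributes {contribution:5.2f}; ceiling without it: {ceiling_without}")
```

Running it reproduces the table: an expected score of 61.55 against a threshold of 70, and a ceiling of 75 without the calibrator.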
*Inline diagram: The Gap (61.55 effective contribution vs. the 70-point threshold).*
The Questions to Ask
When building the spreadsheet, be ruthlessly honest about fire rates. The easy mistake is to write down aspirational rates — "the calibrator will match 60% of markets once we tune the matcher" — and then never come back to verify. Use the rates you actually expect on day one, before any tuning. If the rubric cannot clear the threshold on day-one fire rates, it is broken and no amount of tuning will save it.
Then ask the load-bearing questions for every component:
- What is the ceiling with this component removed entirely?
- Is that ceiling above the threshold with meaningful headroom?
- If not, is this component guaranteed to fire for reasons I can verify?
A component that is load-bearing and cannot be guaranteed to fire is a ticking clock. Fix it before shipping — rebalance the weights, or add a fallback path, or remove the dependency entirely.
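The ticking-clock check can be mechanized in a few lines. This sketch reuses the illustrative table numbers; the `HEADROOM` and `GUARANTEED` cutoffs are hypothetical knobs for what "meaningful headroom" and "guaranteed to fire" mean in your system:

```python
# Flag load-bearing components: removal drops the ceiling below
# threshold-plus-headroom while the day-one fire rate is far from
# guaranteed. Weights and fire rates are the illustrative table values.
components = {
    "Grok":       {"max": 30, "fire_rate": 0.82},
    "Perplexity": {"max": 30, "fire_rate": 0.75},
    "News":       {"max": 15, "fire_rate": 0.90},
    "Calibrator": {"max": 25, "fire_rate": 0.038},
}
THRESHOLD = 70
HEADROOM = 10     # hypothetical: minimum margin above the threshold
GUARANTEED = 0.99  # hypothetical: fire rate treated as "guaranteed"

total_max = sum(c["max"] for c in components.values())
flags = [
    name
    for name, c in components.items()
    if total_max - c["max"] < THRESHOLD + HEADROOM  # load-bearing
    and c["fire_rate"] < GUARANTEED                 # not guaranteed to fire
]
print("Ticking clocks:", flags)
```

With these cutoffs the check flags not only the calibrator but also Grok and Perplexity, since the ceiling without either is exactly the threshold — zero headroom. Only News, whose removal still leaves a ceiling of 85, passes.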
The Rule
Ceiling analysis is a pre-flight check. It runs before code. It runs before tests. It runs before the design review. It is the cheapest quality gate in the entire scoring-system lifecycle, and it catches the one failure mode that all the other gates miss: a mathematically doomed rubric that ships looking perfect.
Build the spreadsheet. Check the numbers. Ship with headroom.