Ceiling Analysis Before Shipping
The pre-flight check every scoring system needs. Build the spreadsheet before you write the scoring code. What is the maximum achievable score without each component? Can the threshold be cleared on realistic fire rates? The Hermes failure would have been caught in five minutes by anyone willing to do arithmetic first.
Before you write a scoring function, build the spreadsheet.
Five minutes of arithmetic would have caught the Hermes calibrator failure before any code shipped. Ceiling analysis is cheaper than unit tests, faster than backtesting, and catches a failure mode that neither of those layers can detect.
The Spreadsheet
Every scoring system gets a ceiling analysis before the first line of code is written. It has five columns and one row per component:
| Component | Max | Expected Fire Rate | Effective Contribution | Ceiling Without This |
|---|---|---|---|---|
| Grok | 30 | 82% | 24.6 | 70 |
| Perplexity | 30 | 75% | 22.5 | 70 |
| News | 15 | 90% | 13.5 | 85 |
| Calibrator | 25 | 3.8% | 0.95 | 75 |
| Total | 100 | — | 61.55 | — |
| Threshold | — | — | 70 | — |
Three numbers jump out of this table, and together they settle the ship/no-ship question:
- Effective contribution is 61.55. That is the expected score on an average signal. It is below the threshold of 70. The average signal cannot clear the bar.
- The calibrator's expected contribution is 0.95. Twenty-five points of maximum, less than one point of expected contribution. That is a dead component sitting in the middle of the rubric.
- The ceiling without the calibrator is 75, only 5 points above the threshold. The 96.2% of signals that fail to match a calibrator market must score at least 70 of a possible 75 on everything else — near-perfect on every remaining component — so in practice they are locked out.
Every one of those numbers is visible before you write a single if statement. The analysis takes five minutes. It costs nothing except the honesty of writing down real fire rates instead of aspirational ones.
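The whole spreadsheet is a few lines of arithmetic. A minimal sketch in Python, using the illustrative weights and fire rates from the table above:

```python
# Ceiling analysis: compute expected contribution per component and the
# ceiling with each component removed. Weights and fire rates are the
# illustrative numbers from the table above.
components = {
    "Grok":       {"max": 30, "fire_rate": 0.82},
    "Perplexity": {"max": 30, "fire_rate": 0.75},
    "News":       {"max": 15, "fire_rate": 0.90},
    "Calibrator": {"max": 25, "fire_rate": 0.038},
}
THRESHOLD = 70

total_max = sum(c["max"] for c in components.values())
expected = sum(c["max"] * c["fire_rate"] for c in components.values())

print(f"Expected score on an average signal: {expected:.2f} (threshold {THRESHOLD})")
for name, c in components.items():
    contribution = c["max"] * c["fire_rate"]
    ceiling_without = total_max - c["max"]
    print(f"  {name:11s} contributes {contribution:5.2f}; ceiling without it: {ceiling_without}")
```

Running it reproduces the table: an expected score of 61.55 against a threshold of 70, and a ceiling of 75 without the calibrator.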
*Inline diagram: The Gap (61.55 effective contribution vs. the 70-point threshold).*
The Questions to Ask
When building the spreadsheet, be ruthlessly honest about fire rates. The easy mistake is to write down aspirational rates — "the calibrator will match 60% of markets once we tune the matcher" — and then never come back to verify. Use the rates you actually expect on day one, before any tuning. If the rubric cannot clear the threshold on day-one fire rates, it is broken and no amount of tuning will save it.
Then ask the load-bearing questions for every component:
- What is the ceiling with this component removed entirely?
- Is that ceiling above the threshold with meaningful headroom?
- If not, is this component guaranteed to fire for reasons I can verify?
A component that is load-bearing and cannot be guaranteed to fire is a ticking clock. Fix it before shipping — rebalance the weights, or add a fallback path, or remove the dependency entirely.
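The ticking-clock check can be mechanized in a few lines. This sketch reuses the illustrative table numbers; the `HEADROOM` and `GUARANTEED` cutoffs are hypothetical knobs for what "meaningful headroom" and "guaranteed to fire" mean in your system:

```python
# Flag load-bearing components: removal drops the ceiling below
# threshold-plus-headroom while the day-one fire rate is far from
# guaranteed. Weights and fire rates are the illustrative table values.
components = {
    "Grok":       {"max": 30, "fire_rate": 0.82},
    "Perplexity": {"max": 30, "fire_rate": 0.75},
    "News":       {"max": 15, "fire_rate": 0.90},
    "Calibrator": {"max": 25, "fire_rate": 0.038},
}
THRESHOLD = 70
HEADROOM = 10     # hypothetical: minimum margin above the threshold
GUARANTEED = 0.99  # hypothetical: fire rate treated as "guaranteed"

total_max = sum(c["max"] for c in components.values())
flags = [
    name
    for name, c in components.items()
    if total_max - c["max"] < THRESHOLD + HEADROOM  # load-bearing
    and c["fire_rate"] < GUARANTEED                 # not guaranteed to fire
]
print("Ticking clocks:", flags)
```

With these cutoffs the check flags not only the calibrator but also Grok and Perplexity, since the ceiling without either is exactly the threshold — zero headroom. Only News, whose removal still leaves a ceiling of 85, passes.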
The Rule
Ceiling analysis is a pre-flight check. It runs before code. It runs before tests. It runs before the design review. It is the cheapest quality gate in the entire scoring-system lifecycle, and it catches the one failure mode that all the other gates miss: a mathematically doomed rubric that ships looking perfect.
Build the spreadsheet. Check the numbers. Ship with headroom.