ASK KNOX
LESSON 284

The 340-Test Bot That Never Traded

The Hermes story as case study. 95% coverage, 5 docs, a runbook, a watchdog — and zero real trades for weeks. What test coverage measures, what it does not, and why this gap is the cultural failure that defines production ML.

8 min read

This is the story that made the other five tracks in this set of lessons necessary.

Hermes had every marker of a well-engineered system. 340 unit and integration tests, passing on every commit. 95% code coverage. A comprehensive runbook. A Horus watchdog service monitoring the process. Five documentation files covering architecture, operations, and API usage. CI gates on every PR. Code review on every merge.

It had never placed a real trade.

For weeks — possibly longer — the bot started every day, evaluated signals, computed composite scores, and produced zero trades. The watchdog reported it as healthy. The tests kept passing. The dashboards looked green. And the trade count stayed at zero.

The Gap

Test coverage and operational validity are orthogonal axes.

| Test coverage tells you... | Operational validity tells you... |
| --- | --- |
| Does the code compile? | Is the process running? |
| Do the functions return expected values for known inputs? | Are real-world inputs producing real-world outputs? |
| Are the error paths handled? | Is the system doing the thing it was built for? |
| Will refactors break behavior? | Did the last signal actually turn into a trade? |

Nothing in the left column implies anything in the right. A bot can score perfectly on the left and score zero on the right. Hermes did.
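
The orthogonality is easy to reproduce. Here is a minimal, hypothetical sketch — the names, weights, and threshold are illustrative, not Hermes's actual code (they are chosen so the no-calibrator ceiling is 75 against a threshold of 70, matching the figures quoted later in this lesson). The unit tests pass for known inputs while every realistic production score lands under the trade threshold:

```python
# Hypothetical sketch, not Hermes's actual code. Weights, threshold, and
# component names are illustrative.

THRESHOLD = 70
WEIGHTS = {"signal": 0.75, "calibrator": 0.25}

def composite_score(signal: float, calibrator: float) -> float:
    """Weighted composite of two components, each scored 0-100."""
    return WEIGHTS["signal"] * signal + WEIGHTS["calibrator"] * calibrator

def should_trade(score: float) -> bool:
    return score >= THRESHOLD

# Unit tests: known inputs, expected outputs. All green.
assert composite_score(80, 100) == 85.0
assert should_trade(composite_score(80, 100))

# Production: the calibrator almost never fires, so it contributes ~0.
# Even a strong raw signal now scores below threshold, and the bot
# silently produces zero trades -- exactly the gap coverage cannot see.
assert composite_score(90, 0) == 67.5
assert not should_trade(composite_score(90, 0))
```

The tests and the production behavior are both exactly as designed; only an instrument watching outcomes, not return values, can tell them apart.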

The deeper cultural failure is that "tests passing" became a proxy for "system working" — and the proxy became load-bearing. Engineers built elaborate test infrastructure and treated green CI as the end state. They did not build equivalent infrastructure for validating that the running system actually produced its intended outcomes. That work was considered less interesting, or less measurable, or just assumed to follow from having good tests. It did not.

Inline Diagram — The Two Axes

[2x2 quadrant chart: CODE INTEGRITY (x-axis) vs OPERATIONAL VALIDITY (y-axis)]

  - HIGH BOTH — GOAL: tests pass AND trades happen
  - HIGH OPS / LOW CODE — RARE: the running-but-untested startup
  - LOW BOTH: dead code
  - HIGH CODE / LOW OPS — HERMES: 340 tests, 95% coverage, 0 real trades for weeks

What Would Have Caught It

Three instrumentation changes would have surfaced the failure within a day of it appearing:

  1. Fire-rate metric on every scoring component (Lesson 258). The calibrator's 3.8% fire rate would have paged immediately.
  2. "Last real trade" as a first-class metric. A bot that has gone 48 hours without a trade is either in a quiet market or broken. Either way, someone should look.
  3. Ceiling analysis as a pre-flight check (Lesson 259). The effective-ceiling-without-calibrator number (75, just 5 above threshold) would have flagged the load-bearing risk before any signal was scored.
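
A sketch of what those three instruments could look like, with plain Python state standing in for real metric and paging backends. Every name here (`OpsInstruments`, `silent_components`, and so on) is hypothetical; the thresholds reuse the numbers from this lesson (a ~4% fire rate like the calibrator's 3.8%, a 48-hour trade-staleness window, and a ceiling of 75 against an implied trade threshold of 70):

```python
# Hypothetical operational instruments; none of this existed in Hermes.
import time

class OpsInstruments:
    def __init__(self, min_fire_rate=0.10, stale_after_s=48 * 3600):
        self.evaluated = {}   # component -> times consulted
        self.fired = {}       # component -> times it actually contributed
        self.last_trade_ts = None
        self.min_fire_rate = min_fire_rate
        self.stale_after_s = stale_after_s

    def record_component(self, name, fired):
        self.evaluated[name] = self.evaluated.get(name, 0) + 1
        self.fired[name] = self.fired.get(name, 0) + int(fired)

    def record_trade(self, now=None):
        self.last_trade_ts = time.time() if now is None else now

    # 1. Fire-rate metric on every scoring component: page on near-silent ones.
    def silent_components(self, min_samples=100):
        return [name for name, n in self.evaluated.items()
                if n >= min_samples and self.fired[name] / n < self.min_fire_rate]

    # 2. "Last real trade" as a first-class metric: quiet market or broken bot,
    #    either way someone should look after 48 hours.
    def trade_is_stale(self, now=None):
        now = time.time() if now is None else now
        return self.last_trade_ts is None or now - self.last_trade_ts > self.stale_after_s

# 3. Ceiling analysis as a pre-flight check: best score achievable if one
#    component never fires (assuming each contributes at most 100 * weight).
def ceiling_without(weights, threshold, component):
    ceiling = 100 * sum(w for name, w in weights.items() if name != component)
    return ceiling, ceiling - threshold

ops = OpsInstruments()
for i in range(1000):
    ops.record_component("calibrator", fired=(i % 26 == 0))  # fires ~3.9% of the time
assert ops.silent_components() == ["calibrator"]   # instrument 1 would page

assert ops.trade_is_stale()                        # instrument 2: no trade ever recorded

ceiling, margin = ceiling_without({"signal": 0.75, "calibrator": 0.25}, 70, "calibrator")
assert (ceiling, margin) == (75.0, 5.0)            # instrument 3: load-bearing, flag it
```

None of this is sophisticated; the point is that each check asks about the running system's outcomes, which is precisely the question the 340 tests never asked.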

None of these are tests. They are operational instruments. They measure whether the running system is doing its job — which is a question tests cannot answer no matter how many of them you write.

Tests are worthless without a working system — but the work of writing them still matters, because it produces the code integrity that the working system depends on. You need both. Skipping either is how Hermes happened.

The Rule

A bot is not working because its tests pass. A bot is working because the running system produces its intended outcome. Track both metrics. Treat a gap between them as the cultural failure it is — and close the gap by instrumenting operational validity, not by adding more tests to a system that already passes all of them.