ASK KNOX
LESSON 284

The 340-Test Bot That Never Traded

The Hermes story as case study. 95% coverage, 5 docs, a runbook, a watchdog — and zero real trades for weeks. What test coverage measures, what it does not, and why this gap is the cultural failure that defines production ML.

8 min read

This is the story that made the other five tracks in this set of lessons necessary.

Hermes had every marker of a well-engineered system. 340 unit and integration tests, passing on every commit. 95% code coverage. A comprehensive runbook. A Horus watchdog service monitoring the process. Five documentation files covering architecture, operations, and API usage. CI gates on every PR. Code review on every merge.

It had never placed a real trade.

For weeks — possibly longer — the bot started every day, evaluated signals, computed composite scores, and produced zero trades. The watchdog reported it as healthy. The tests kept passing. The dashboards looked green. And the trade count stayed at zero.

The Gap

Test coverage and operational validity are orthogonal axes.

| Test coverage tells you... | Operational validity tells you... |
| --- | --- |
| Does the code compile? | Is the process running? |
| Do the functions return expected values for known inputs? | Are real-world inputs producing real-world outputs? |
| Are the error paths handled? | Is the system doing the thing it was built for? |
| Will refactors break behavior? | Did the last signal actually turn into a trade? |

Nothing in the left column implies anything in the right. A bot can score perfectly on the left and score zero on the right. Hermes did.
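
The orthogonality is easy to reproduce. Here is a minimal, hypothetical sketch — the names, weights, and threshold are illustrative, not Hermes's actual code (they are chosen so the no-calibrator ceiling is 75 against a threshold of 70, matching the figures quoted later in this lesson). The unit tests pass for known inputs while every realistic production score lands under the trade threshold:

```python
# Hypothetical sketch, not Hermes's actual code. Weights, threshold, and
# component names are illustrative.

THRESHOLD = 70
WEIGHTS = {"signal": 0.75, "calibrator": 0.25}

def composite_score(signal: float, calibrator: float) -> float:
    """Weighted composite of two components, each scored 0-100."""
    return WEIGHTS["signal"] * signal + WEIGHTS["calibrator"] * calibrator

def should_trade(score: float) -> bool:
    return score >= THRESHOLD

# Unit tests: known inputs, expected outputs. All green.
assert composite_score(80, 100) == 85.0
assert should_trade(composite_score(80, 100))

# Production: the calibrator almost never fires, so it contributes ~0.
# Even a strong raw signal now scores below threshold, and the bot
# silently produces zero trades -- exactly the gap coverage cannot see.
assert composite_score(90, 0) == 67.5
assert not should_trade(composite_score(90, 0))
```

The tests and the production behavior are both exactly as designed; only an instrument watching outcomes, not return values, can tell them apart.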

The deeper cultural failure is that "tests passing" became a proxy for "system working" — and the proxy became load-bearing. Engineers built elaborate test infrastructure and treated green CI as the end state. They did not build equivalent infrastructure for validating that the running system actually produced its intended outcomes. That work was considered less interesting, or less measurable, or just assumed to follow from having good tests. It did not.

Inline Diagram — The Two Axes

[2x2 quadrant chart: CODE INTEGRITY (x-axis) vs OPERATIONAL VALIDITY (y-axis)]

  - HIGH BOTH — GOAL: tests pass AND trades happen
  - HIGH OPS / LOW CODE — RARE: the running-but-untested startup
  - LOW BOTH: dead code
  - HIGH CODE / LOW OPS — HERMES: 340 tests, 95% coverage, 0 real trades for weeks

What Would Have Caught It

Three instrumentation changes would have surfaced the failure within a day of it appearing:

  1. Fire-rate metric on every scoring component (Lesson 258). The calibrator's 3.8% fire rate would have paged immediately.
  2. "Last real trade" as a first-class metric. A bot that has gone 48 hours without a trade is either in a quiet market or broken. Either way, someone should look.
  3. Ceiling analysis as a pre-flight check (Lesson 259). The effective-ceiling-without-calibrator number (75, just 5 above threshold) would have flagged the load-bearing risk before any signal was scored.
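
A sketch of what those three instruments could look like, with plain Python state standing in for real metric and paging backends. Every name here (`OpsInstruments`, `silent_components`, and so on) is hypothetical; the thresholds reuse the numbers from this lesson (a ~4% fire rate like the calibrator's 3.8%, a 48-hour trade-staleness window, and a ceiling of 75 against an implied trade threshold of 70):

```python
# Hypothetical operational instruments; none of this existed in Hermes.
import time

class OpsInstruments:
    def __init__(self, min_fire_rate=0.10, stale_after_s=48 * 3600):
        self.evaluated = {}   # component -> times consulted
        self.fired = {}       # component -> times it actually contributed
        self.last_trade_ts = None
        self.min_fire_rate = min_fire_rate
        self.stale_after_s = stale_after_s

    def record_component(self, name, fired):
        self.evaluated[name] = self.evaluated.get(name, 0) + 1
        self.fired[name] = self.fired.get(name, 0) + int(fired)

    def record_trade(self, now=None):
        self.last_trade_ts = time.time() if now is None else now

    # 1. Fire-rate metric on every scoring component: page on near-silent ones.
    def silent_components(self, min_samples=100):
        return [name for name, n in self.evaluated.items()
                if n >= min_samples and self.fired[name] / n < self.min_fire_rate]

    # 2. "Last real trade" as a first-class metric: quiet market or broken bot,
    #    either way someone should look after 48 hours.
    def trade_is_stale(self, now=None):
        now = time.time() if now is None else now
        return self.last_trade_ts is None or now - self.last_trade_ts > self.stale_after_s

# 3. Ceiling analysis as a pre-flight check: best score achievable if one
#    component never fires (assuming each contributes at most 100 * weight).
def ceiling_without(weights, threshold, component):
    ceiling = 100 * sum(w for name, w in weights.items() if name != component)
    return ceiling, ceiling - threshold

ops = OpsInstruments()
for i in range(1000):
    ops.record_component("calibrator", fired=(i % 26 == 0))  # fires ~3.9% of the time
assert ops.silent_components() == ["calibrator"]   # instrument 1 would page

assert ops.trade_is_stale()                        # instrument 2: no trade ever recorded

ceiling, margin = ceiling_without({"signal": 0.75, "calibrator": 0.25}, 70, "calibrator")
assert (ceiling, margin) == (75.0, 5.0)            # instrument 3: load-bearing, flag it
```

None of this is sophisticated; the point is that each check asks about the running system's outcomes, which is precisely the question the 340 tests never asked.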

None of these are tests. They are operational instruments. They measure whether the running system is doing its job — which is a question tests cannot answer no matter how many of them you write.

Tests are worthless without a working system — but the work of writing them still matters, because it produces the code integrity that the working system depends on. You need both. Skipping either is how Hermes happened.

The Rule

A bot is not working because its tests pass. A bot is working because the running system produces its intended outcome. Track both metrics. Treat a gap between them as the cultural failure it is — and close the gap by instrumenting operational validity, not by adding more tests to a system that already passes all of them.