Operational Validation Checklist
What 'done' actually means: the running system produces the intended outcome. A merged PR is not done. Passing tests are not done. A deploy is not done. This is the cultural correction that closes the loop Hermes exposed.
The cultural fix from the Hermes retro is a checklist. It is short. It is non-negotiable. Every scoring bot — and, with minor adaptations, every production system — runs through it before a feature is called done.
The Checklist
A feature is not done until:
- Code is implemented. Not stubbed. Not TODO'd. Not NotImplementedError'd.
- Tests exist and pass. Unit tests for each new function. Integration tests for each new interaction. All passing locally and in CI.
- PR is merged. After review, after CI is green, after any CodeRabbit comments on math or boundaries are resolved.
- Deploy is complete. The new code is on the target machine, the process has restarted with the new code, and the new version is visibly running.
- Running system produces the intended outcome. This is the step everyone skips. Query the database. Check the log. Look at the dashboard. Confirm that the thing the feature was supposed to do is actually happening, on real inputs, right now.
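Step 5 can be made mechanical rather than aspirational. The sketch below is a minimal, hypothetical version of that check for a scoring bot: query the database for signals produced recently and fail loudly if none exist. The `signals` table, its columns, and the thresholds are illustrative assumptions, not the real Hermes schema; an in-memory SQLite database stands in for the production one.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def verify_recent_signals(conn, max_age_hours=1, min_expected=1):
    """Step 5 check: confirm the running system actually produced
    signals within the window, instead of trusting the deploy.
    Raises if the outcome is missing; returns the count if not."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    # ISO-8601 strings in the same format compare correctly as text.
    row = conn.execute(
        "SELECT COUNT(*) FROM signals WHERE created_at >= ?",
        (cutoff.isoformat(),),
    ).fetchone()
    count = row[0]
    if count < min_expected:
        raise AssertionError(
            f"step 5 failed: {count} signals in last {max_age_hours}h "
            f"(expected >= {min_expected})"
        )
    return count

# Demo against an in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (id INTEGER, created_at TEXT)")
now = datetime.now(timezone.utc)
conn.execute("INSERT INTO signals VALUES (1, ?)", (now.isoformat(),))
print(verify_recent_signals(conn))
```

The point of raising instead of logging is that a failed step 5 should block the "done" claim the same way a failed CI run blocks a merge.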
Hermes had steps 1 through 4. It did not have step 5. That gap is the entire story.
[Diagram: The Missing Step]
What Step 5 Looks Like
For different domains, step 5 takes different shapes:
- Scoring bot: Query the database for signals produced in the last hour. Confirm the composite scores look right and the clearance count is within the expected range. Check the fire rate dashboard. Confirm the 'hours since last real trade' counter has reset if this feature was expected to unblock trades.
- Data pipeline: Run the pipeline end-to-end on current inputs. Confirm outputs land in the expected destination with the expected schema. Compare the output distribution to the last known-good run.
- Frontend change: Navigate to the affected page in a real browser (ideally via Playwright). Confirm the new feature is visible and interactive. Screenshot the state for the retro.
- API change: Call the endpoint from curl with a realistic payload. Confirm the response shape, the status code, and any side effects (database writes, events emitted).
- Scheduled job: Wait for the next scheduled run (or trigger manually) and confirm the expected artifacts appear — the blog post, the digest, the metric, the notification.
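The API-change case above can be scripted the same way as a curl call. This is a hedged sketch, not the project's actual harness: a throwaway local HTTP server stands in for the deployed service, and the `/score` path and payload fields are invented for illustration. In practice the URL would point at the real deployment.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Stand-in for the deployed service, so the sketch is runnable."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        resp = json.dumps({"status": "ok", "echo": payload}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(resp)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def verify_endpoint(url, payload):
    """Step 5 for an API change: send a realistic payload, then check
    the status code and response shape, not just that the deploy ran."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        assert resp.status == 200, f"unexpected status {resp.status}"
        body = json.loads(resp.read())
    assert "status" in body, "response missing 'status' field"
    return body

url = f"http://127.0.0.1:{server.server_port}/score"
result = verify_endpoint(url, {"symbol": "XYZ", "qty": 1})
print(result["status"])
server.shutdown()
```

Checking side effects (database writes, emitted events) would follow the same pattern: observe the effect directly, then assert on it.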
Every case takes minutes at most. The cost is small; the discipline of always running it is what prevents the Hermes pattern.
The "Not Done Yet" Clause
A feature labeled "shipped" that cannot point to a concrete instance of its intended outcome is not shipped. It is deployed. Those are different words. Use the right one. "Hermes rebalance shipped" is a different statement from "Hermes rebalance produced 8 cleared signals in the last 24 hours." The second statement is a delivery report. The first is an update on intermediate state.
The Cultural Correction
The whole track exists because "tests pass and PR merged" was treated as delivery for weeks while the actual delivery — a working trading bot — did not exist. Fixing this is not a tooling problem. It is a cultural correction that has to be adopted as a discipline: before calling anything done, run step 5. Always. Even under time pressure. Especially under time pressure.
The Rule
Done means the running system produces the intended outcome. Every other state is intermediate. Step 5 of the checklist is mandatory and non-negotiable. The Hermes failure is the permanent reminder of what skipping step 5 costs — and the five tracks in this set of lessons exist to close the gap for every other bot, scoring system, and feature that follows.