Operational Validation Checklist
What 'done' actually means: the running system produces the intended outcome. A merged PR is not done. Passing tests are not done. A deploy is not done. This is the cultural correction that closes the loop Hermes exposed.
The cultural fix from the Hermes retro is a checklist. It is short. It is non-negotiable. Every scoring bot — and, with minor adaptations, every production system — runs through it before a feature is called done.
The Checklist
A feature is not done until:
- Code is implemented. Not stubbed. Not TODO'd. Not NotImplementedError'd.
- Tests exist and pass. Unit tests for each new function. Integration tests for each new interaction. All passing locally and in CI.
- PR is merged. After review, after CI is green, after any CodeRabbit comments on math or boundaries are resolved.
- Deploy is complete. The new code is on the target machine, the process has restarted with the new code, and the new version is visibly running.
- Running system produces the intended outcome. This is the step everyone skips. Query the database. Check the log. Look at the dashboard. Confirm that the thing the feature was supposed to do is actually happening, on real inputs, right now.
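Step 5 can be made mechanical rather than aspirational. The sketch below is a minimal, hypothetical version of that check for a scoring bot: query the database for signals produced recently and fail loudly if none exist. The `signals` table, its columns, and the thresholds are illustrative assumptions, not the real Hermes schema; an in-memory SQLite database stands in for the production one.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def verify_recent_signals(conn, max_age_hours=1, min_expected=1):
    """Step 5 check: confirm the running system actually produced
    signals within the window, instead of trusting the deploy.
    Raises if the outcome is missing; returns the count if not."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    # ISO-8601 strings in the same format compare correctly as text.
    row = conn.execute(
        "SELECT COUNT(*) FROM signals WHERE created_at >= ?",
        (cutoff.isoformat(),),
    ).fetchone()
    count = row[0]
    if count < min_expected:
        raise AssertionError(
            f"step 5 failed: {count} signals in last {max_age_hours}h "
            f"(expected >= {min_expected})"
        )
    return count

# Demo against an in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (id INTEGER, created_at TEXT)")
now = datetime.now(timezone.utc)
conn.execute("INSERT INTO signals VALUES (1, ?)", (now.isoformat(),))
print(verify_recent_signals(conn))
```

The point of raising instead of logging is that a failed step 5 should block the "done" claim the same way a failed CI run blocks a merge.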
Hermes had steps 1 through 4. It did not have step 5. That gap is the entire story.
[Diagram: The Missing Step]
What Step 5 Looks Like
For different domains, step 5 takes different shapes:
- Scoring bot: Query the database for signals produced in the last hour. Confirm the composite scores look right and the clearance count is within the expected range. Check the fire rate dashboard. Confirm the 'hours since last real trade' counter has reset if this feature was expected to unblock trades.
- Data pipeline: Run the pipeline end-to-end on current inputs. Confirm outputs land in the expected destination with the expected schema. Compare the output distribution to the last known-good run.
- Frontend change: Navigate to the affected page in a real browser (ideally via Playwright). Confirm the new feature is visible and interactive. Screenshot the state for the retro.
- API change: Call the endpoint from curl with a realistic payload. Confirm the response shape, the status code, and any side effects (database writes, events emitted).
- Scheduled job: Wait for the next scheduled run (or trigger manually) and confirm the expected artifacts appear — the blog post, the digest, the metric, the notification.
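The API-change case above can be scripted the same way as a curl call. This is a hedged sketch, not the project's actual harness: a throwaway local HTTP server stands in for the deployed service, and the `/score` path and payload fields are invented for illustration. In practice the URL would point at the real deployment.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Stand-in for the deployed service, so the sketch is runnable."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        resp = json.dumps({"status": "ok", "echo": payload}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(resp)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def verify_endpoint(url, payload):
    """Step 5 for an API change: send a realistic payload, then check
    the status code and response shape, not just that the deploy ran."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        assert resp.status == 200, f"unexpected status {resp.status}"
        body = json.loads(resp.read())
    assert "status" in body, "response missing 'status' field"
    return body

url = f"http://127.0.0.1:{server.server_port}/score"
result = verify_endpoint(url, {"symbol": "XYZ", "qty": 1})
print(result["status"])
server.shutdown()
```

Checking side effects (database writes, emitted events) would follow the same pattern: observe the effect directly, then assert on it.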
Every case takes minutes at most. The cost is small; the discipline of always running it is what prevents the Hermes pattern.
The "Not Done Yet" Clause
A feature labeled "shipped" that cannot point to a concrete instance of its intended outcome is not shipped. It is deployed. Those are different words. Use the right one. "Hermes rebalance shipped" is a different statement from "Hermes rebalance produced 8 cleared signals in the last 24 hours." The second statement is a delivery report. The first is an update on intermediate state.
The Cultural Correction
The whole track exists because "tests pass and PR merged" was treated as delivery for weeks while the actual delivery — a working trading bot — did not exist. Fixing this is not a tooling problem. It is a cultural correction that has to be adopted as a discipline: before calling anything done, run step 5. Always. Even under time pressure. Especially under time pressure.
The Rule
Done means the running system produces the intended outcome. Every other state is intermediate. Step 5 of the checklist is mandatory and non-negotiable. The Hermes failure is the permanent reminder of what skipping step 5 costs — and the five tracks in this set of lessons exist to close the gap for every other bot, scoring system, and feature that follows.