The CI Gate — Why Non-Negotiable Quality Floors Work
A quality gate that can be bypassed is not a quality gate — it is a suggestion. The CI gates that actually protect production are the ones that block the build unconditionally: 90% unit coverage, 100% safety-critical tests passing, and zero linting errors. Here is why those specific thresholds exist and what they protect.
Every team says they care about code quality. Most teams mean it. The gap is not intention — it is enforcement.
A quality guideline says "aim for 90% test coverage." A quality gate says "if coverage is below 90%, the build fails and no one merges." The second statement is structurally different from the first. It removes the human decision from the loop. There is no "just this once" on a deadline. There is no "we'll add tests next sprint." The gate is the standard, and the standard is enforced by the CI system, not by individual judgment under pressure.
The Principal Broker's CI gate has three requirements. Each one exists because a specific class of failure was observed. Understanding why each threshold was set is as important as knowing the threshold.
Gate 1 — 90% Unit Coverage
The 90% coverage floor is a deployment gate. CI will not pass, and therefore code will not merge to main, if coverage drops below 90%.
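One common way to wire this floor into CI (a sketch of the mechanism, not necessarily the Principal Broker's actual configuration) is coverage.py's fail_under setting, which makes the coverage report exit non-zero when the total drops below the threshold, failing the CI job:

```ini
; setup.cfg style; pytest-cov's --cov-fail-under=90 flag is equivalent.
; coverage.py exits with status 2 when total coverage is below fail_under,
; so the CI step fails without any custom scripting.
[coverage:report]
fail_under = 90
show_missing = True
```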
Why 90% Specifically
The 90% threshold is a practical calibration, not an arbitrary number. It was arrived at by asking: what coverage level ensures that the most consequential code paths have been exercised in a controlled environment?
At 80% coverage, it is entirely possible for entire subsystems to have zero tests. In the Principal Broker at 88%, the kill switch API had 35% coverage, the dispatcher had 53% coverage, and the audit API had 45% coverage. These are not edge cases — they are core operational components. The 10 points between 80% and 90% is not a small margin. It is the difference between a codebase where everything critical has been run in a test environment and a codebase where significant operational components have never been exercised.
At 95%+ coverage, the marginal tests are often testing error paths that require highly specific mock setups for minimal benefit. The cost-to-benefit ratio shifts. 90% sits in the zone where every meaningful code path is covered and the additional effort to push higher would be spent on diminishing returns.
What 90% Actually Prevents
Test coverage does not prevent bugs in covered code. A function can be called in a test, pass, and still have bugs that emerge under different conditions. Coverage prevents a specific, distinct failure class: code that has never been executed in a controlled environment reaching production.
Uncovered code in a production system is code that has only ever been run in production. You do not know what it does under error conditions. You do not know if it handles edge cases correctly. You do not know if a refactor broke it, because no test will catch the regression.
The 90% floor ensures that this class of unknown code represents less than 10% of the codebase — and that the remaining 10% is concentrated in low-risk areas, not in auth middleware and safety pipelines.
The Coverage Gap Analysis
When coverage falls below the floor, the diagnostic is straightforward. Look at the term-missing coverage report (coverage.py's report --show-missing, or pytest-cov's --cov-report=term-missing) and find the files with the most uncovered statements and the lowest coverage percentages. Fix those files first.
In the Principal Broker case, three files accounted for 72 of the 252 uncovered statements:
- broker/api/kill_switch.py — 34 missed statements, 35% coverage
- broker/core/dispatcher.py — 26 missed statements, 53% coverage
- broker/api/audit.py — 12 missed statements, 45% coverage
Covering these three files completely would push total coverage from 88% to approximately 92%. This is the ROI calculation: spend three to four hours adding tests for three files, close the coverage gap entirely, and gain three well-tested critical components as a side effect.
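The ranking step can be automated against coverage.py's JSON report (coverage json writes a coverage.json file with a "files" mapping). A minimal sketch, using the three files from the audit as illustrative data:

```python
# Rank coverage gaps by missed statements, descending. The input shape
# mirrors coverage.py's JSON report ("files" -> {"summary": {...}}).
def rank_gaps(files: dict) -> list[tuple[str, int, float]]:
    """Sort files by missing statements (descending) to find the best ROI."""
    ranked = [
        (name, data["summary"]["missing_lines"], data["summary"]["percent_covered"])
        for name, data in files.items()
    ]
    return sorted(ranked, key=lambda r: -r[1])

# Illustrative data: the three files identified in the Principal Broker audit.
report_files = {
    "broker/api/kill_switch.py": {"summary": {"missing_lines": 34, "percent_covered": 35.0}},
    "broker/core/dispatcher.py": {"summary": {"missing_lines": 26, "percent_covered": 53.0}},
    "broker/api/audit.py":       {"summary": {"missing_lines": 12, "percent_covered": 45.0}},
}

for name, missing, pct in rank_gaps(report_files):
    print(f"{name}: {missing} missed, {pct:.0f}% covered")
```

In a real run you would load coverage.json with json.load and pass its "files" entry in; the sort key is deliberately missed statements rather than percentage, since a tiny file at 0% matters less than a large file at 50%.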
Gate 2 — 100% Safety-Critical Tests Passing
The overall coverage floor allows 10% of code to be untested. The safety-critical gate does not. Every test in the safety-critical suite must pass. No exceptions.
What defines "safety-critical" for a system:
- The code that enforces authorization decisions
- The code that triggers, maintains, and resolves emergency states
- The code that logs and audits system behavior for compliance
- The composition of safety components in the runtime message handler
For the Principal Broker, this means:
- tests/unit/test_kill_switch_api.py — all tests must pass
- tests/unit/test_message_pipeline.py — all 18 tests (15 pipeline + 3 heartbeat) must pass
- tests/security/ — all security tests must pass (the P0 finding that test_security_suite_placeholder contained only pass was itself a deployment blocker)
Why 100% Here But Not Everywhere
The 90% floor accepts that some code is difficult to test with high ROI. A deeply nested error handler in a rarely-executed boot path may not be worth the mock complexity to cover. The floor provides slack for these decisions.
Safety-critical paths do not get this slack. A wiring bug in the kill switch API has a specific, unacceptable failure mode: the kill switch fires, the API layer reports success, and nothing happens. A wiring bug in the message pipeline composition means the safety checks that the system is designed to enforce are executed in the wrong order — or not at all.
These failure modes have no acceptable manifestation in production. Therefore the tests that prevent them must pass unconditionally.
What the Message Pipeline Tests Actually Verify
The 15 tests added to test_message_pipeline.py cover the 7-step composition in _make_message_handler:
1. Deserialize incoming message
2. Validate message structure
3. Check hard blocks
4. Check authority ceiling
5. Route to destination
6. Write audit entry
7. Dispatch via NATS
Each test verifies that the composition works correctly:
- A hard-blocked message does not reach the authority check
- An authority violation does not proceed to dispatch
- Every message — including rejected ones — generates an audit entry
- A deserialization failure produces an error audit entry, not an unhandled exception
These tests would have caught the specific scenario the audit identified as most dangerous: a wiring bug where two steps are in the wrong order, or where an exception in one step suppresses subsequent steps. The individual component unit tests would not catch this. Only composition tests can.
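A composition test of this kind can be sketched with a call log: record the order in which steps execute, then assert on the order itself. The pipeline here is a hypothetical stand-in for the real handler; the technique, not the names, is the point.

```python
# Sketch: verify step ordering and short-circuiting by recording a trace.
# Unit tests of individual steps cannot catch a wiring bug; only a test
# that observes the composed execution order can.
def make_handler(steps, audit):
    """Chain ordered (name, check) steps; a failing check rejects and audits."""
    def handle(msg, trace):
        for name, check in steps:
            trace.append(name)                     # record execution order
            if not check(msg):
                audit.append(("rejected", name))   # rejections are still audited
                return False
        audit.append(("accepted", None))
        return True
    return handle

def test_hard_block_short_circuits_authority():
    audit, trace = [], []
    steps = [
        ("hard_blocks", lambda m: not m.get("blocked")),
        ("authority",   lambda m: m.get("authorized", True)),
    ]
    handle = make_handler(steps, audit)
    assert handle({"blocked": True}, trace) is False
    assert trace == ["hard_blocks"]                # authority never ran
    assert audit == [("rejected", "hard_blocks")]  # rejection generated an audit entry

test_hard_block_short_circuits_authority()
```

A wiring bug that swapped the two steps, or that let a rejection fall through to the next step, fails the trace assertion immediately, even if each step passes its own unit tests.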
Gate 3 — Flake8 Clean
Zero linting errors before merge. No exceptions.
This is the most mechanical of the three gates and often the most controversial. Linting errors rarely represent functional bugs. A missing blank line between functions does not break anything. Why block merge over it?
The answer is not that the individual lint error matters. It is that the practice of tolerating lint errors produces a codebase where everyone tolerates lint errors. You quickly accumulate 200 lint errors, and addressing them becomes a migration project rather than a maintenance task. The --extend-ignore flags start appearing. The CI job gets commented out. The flake8 config starts exceeding the complexity of the code it is reviewing.
The clean gate prevents this accumulation by making zero violations the invariant. Engineers learn to write lint-clean code by default because the gate is always there. The cognitive overhead becomes negligible. The alternative — periodic lint cleanups — is both more expensive and less effective.
The Flake8 Configuration
The Principal Broker's .flake8 configuration:
[flake8]
max-line-length = 100
extend-ignore = E203, W503
exclude = .git, __pycache__, .venv, build
E203 and W503 are ignored because they conflict with black's formatting decisions on slice notation and line break placement. max-line-length = 100 rather than the default 79 reflects modern screen widths. These are the minimal deviations from default that reflect practical Python development in 2026.
Everything else is enforced. Unused imports. Undefined names. Ambiguous variable names. These are not pedantic style rules — they are functional indicators. An unused import is often a stale dependency or a forgotten refactor. An undefined name in a type annotation is a silent NameError waiting for a type checker to catch. The linter catches them cheaply before the runtime does.
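The annotation case can be demonstrated directly. A typo in a string annotation imports cleanly and runs silently; flake8 (via pyflakes) flags it immediately as F821, but at runtime it only surfaces when something resolves the annotation. The function name here is illustrative.

```python
import typing

def handle(msg: "Mesage") -> None:  # typo: flake8 reports F821 undefined name 'Mesage'
    pass

handle("anything")  # imports and runs fine; the typo is completely silent

try:
    typing.get_type_hints(handle)  # resolving the annotation finally raises
except NameError as exc:
    print(f"surfaced only at introspection time: {exc}")
```

The linter catches this at merge time for the cost of a CI step; without it, the error waits until the first consumer of the type hints, which may be a production code path.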
The Gates as a System
The three gates work together:
- 90% coverage ensures that code has been exercised in a controlled environment
- 100% safety-critical tests ensures that the specific failure modes that would be catastrophic in production are prevented unconditionally
- Flake8 clean ensures that the codebase accumulates no low-level correctness issues
No single gate is sufficient. High coverage with no safety-critical gate means a perfectly covered codebase where the kill switch stub passes tests by asserting True. Perfect safety-critical tests with poor overall coverage means the safety paths are verified but the surrounding code is opaque. Clean linting with no coverage means a stylistically consistent codebase where large swaths of logic have never been run.
Together, they define a standard that a production AI safety system must meet. The Principal Broker's resolution session took coverage from 88% to 90.16%, brought the kill switch API from 35% to 100%, and fixed both security placeholder tests. All three gates passed on the final CI run.
Lesson 220 Drill
Audit your current CI configuration. For each of the three gates:
- Is it enforced by CI (build fails) or advisory (warning only)?
- If advisory, what is the mechanism for ensuring it is addressed before merge?
- What specific failure class does each gate prevent?
For any gate that is currently advisory, estimate the cost of converting it to enforcement — the test additions, the lint fixes, the initial churn. Then estimate the cost of the failure class it prevents. The ROI calculation is almost always clear.
Bottom Line
Quality gates that block CI are quality gates. Quality guidelines that can be bypassed are not. The three-gate standard — 90% unit coverage, 100% safety-critical tests, flake8 clean — was designed to prevent specific failure classes observed in production AI systems. Each threshold exists for a reason. The enforcement mechanism is what gives them teeth.