ASK KNOX
beta
LESSON 203

100% Safety Test Coverage

The kill switch tests are non-negotiable. 100% coverage, not 90%. Here's why the floor is higher for safety code, what the test suite actually covers, and the 2am test that makes it real.

12 min read·Agent Authority & Safety Systems

The Principal Broker has a 90% test coverage floor. That is the standard across the codebase. tests/safety/ has a different standard: 100%. Non-negotiable.

The kill switch tests are the only part of the entire system where a missed code path is treated as a blocking failure rather than a coverage metric to address in a future PR. The CI configuration enforces this separately:

# 90% floor for the full codebase
pytest tests/ --cov=broker --cov-fail-under=90

# 100% required for safety code
pytest tests/safety/ --cov=broker/safety --cov-fail-under=100

Both commands run in CI. Both must pass. A PR that improves general coverage but drops kill switch coverage below 100% does not merge.

Why 100% and Not 90%

The 90% floor exists because the last 10% of coverage is often structurally difficult — error handling for OS errors that are hard to trigger in tests, defensive code paths for conditions that don't occur in practice. Chasing 100% in those cases produces brittle tests that mock too much and test too little.

Safety code is different. The whole point of safety code is that it handles the conditions that don't occur in normal operation. The code paths you can't easily trigger in tests are precisely the ones that matter most:

  • What happens when launchctl stop throws an OSError?
  • What happens when the SSH connection to the trading server times out?
  • What happens when the SQLite registry database doesn't exist?
  • What happens when _revoke_all_tokens encounters a malformed schema?
  • What happens when you call level_1_halt after level_2_halt?

These are the scenarios where an untested code path becomes a live incident. 100% coverage is the only way to have confidence that when any of these happen, the code does what the tests say it does.
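The last scenario above is the easiest one to sketch. A minimal, hypothetical illustration of the "levels never decrease" invariant — the real KillSwitch tracks far more state than this; only the clamping logic is shown:

```python
class KillSwitchSketch:
    """Hypothetical sketch: a later, lower-level halt must never undo a higher one."""

    def __init__(self) -> None:
        self.active_level = 0

    def _set_level(self, level: int) -> None:
        # Clamp upward: the active level only ever escalates.
        self.active_level = max(self.active_level, level)

ks = KillSwitchSketch()
ks._set_level(2)  # level_2_halt
ks._set_level(1)  # level_1_halt called afterwards
assert ks.active_level == 2  # still halted at Level 2
```

The test for this invariant is cheap to write and catches a whole class of ordering bugs that a happy-path suite never exercises.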

The Test Structure

The test file opens with the design constraint stated explicitly:

"""
Kill Switch tests — 100% coverage REQUIRED. Non-negotiable.

Tests all 4 levels, confirmation phrase, protected daemons,
trading server SSH halt, token revocation, env locking, and timing.
"""

The test classes map directly to the code's responsibility surfaces:

  • TestLevel1AssetHalt — Level 1 behavior
  • TestLevel2TradingHalt — Level 2 behavior, including partial failure
  • TestPinEnforcement — PIN checking for Levels 3 and 4
  • TestLevel3AgentFreeze — Level 3 behavior, protected daemon exclusion
  • TestLevel4FullStop — Level 4 full sequence, confirmation gate, each failure mode
  • TestStopDaemon — The daemon stop primitive, including error handling
  • TestHaltTradingServer — SSH halt, batching, unreachability
  • TestTokenRevocation — All three DB states (success, missing DB, missing config)
  • TestRestoreFromDb — State persistence and restoration
  • TestPersist — DB persistence
  • TestEnvLocking — File locking, success and failure
  • TestStaticMethods — Class methods that return daemon lists
  • TestLevelProgression — The invariant that levels never decrease
  • TestHaltResult — The result dataclass defaults and properties
  • TestResume — The resume/reset path
  • TestHaltDispatch — The unified halt() dispatch method

Each class is a focused test group. The structure makes it obvious what is being tested and why.

What "Comprehensive" Actually Looks Like

Consider TestLevel4FullStop. A naive approach would be a single test that triggers a successful Level 4 and asserts success is True. That covers the happy path and nothing else.

The actual test class has 10 tests:

class TestLevel4FullStop:
    def test_full_sequence(...)                   # happy path
    def test_wrong_confirmation_rejected(...)     # wrong phrase
    def test_empty_confirmation_rejected(...)     # empty string
    def test_case_sensitive_confirmation(...)     # lowercase
    def test_watchdog_excluded_from_level4(...)   # protected daemon
    def test_nats_excluded_from_level4(...)       # protected daemon
    def test_failed_revocation_marks_failure(...) # revoke fails
    def test_trading_server_unreachable_noted(...)  # SSH fails but shutdown succeeds
    def test_daemon_failure_in_level4(...)        # partial daemon failure
    def test_active_level_set(...)                # state mutation

The confirmation gate tests three distinct input variations because the string comparison is case-sensitive:

def test_wrong_confirmation_rejected(self, ks):
    result = ks.level_4_shutdown("wrong phrase", "knox")
    assert result.success is False
    assert "Invalid confirmation" in result.error

def test_empty_confirmation_rejected(self, ks):
    result = ks.level_4_shutdown("", "knox")
    assert result.success is False

def test_case_sensitive_confirmation(self, ks):
    result = ks.level_4_shutdown("shutdown invictus", "knox")
    assert result.success is False

Each is a different code path. The empty string test verifies the gate doesn't have a bug where empty strings pass. The lowercase test verifies case sensitivity is enforced. These are not redundant — they are testing three distinct entry points into the same conditional.
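A minimal sketch of the gate these three tests exercise — the phrase, the result type, and the function shape here are all assumptions for illustration, not the real implementation:

```python
from dataclasses import dataclass

# Assumed phrase for illustration only; the real confirmation phrase may differ.
CONFIRMATION_PHRASE = "SHUTDOWN INVICTUS"

@dataclass
class HaltResult:
    success: bool
    error: str = ""

def check_confirmation(confirmation: str) -> HaltResult:
    # One case-sensitive equality check. "", "wrong phrase", and the
    # lowercase phrase all take the same early-return branch, but each
    # test pins down a different way the comparison could be broken.
    if confirmation != CONFIRMATION_PHRASE:
        return HaltResult(success=False, error="Invalid confirmation phrase")
    return HaltResult(success=True)
```

If someone later "helpfully" normalizes the input with .lower() or treats an empty string as a skip, one of the three tests fails immediately.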

The 2am Test

There is an informal metric beyond coverage percentage. The kill switch must work at 2am when Knox is half-asleep and something is on fire.

What does this mean in practice? It means the kill switch must work from:

  • A phone screen with brightness turned down
  • A terminal with no IDE assistance
  • A mental state that is not operating at full capacity

This is not a test you write in pytest. It is a design constraint that shapes every decision in the implementation. Why the confirmation phrase is a human-readable sentence and not a UUID. Why principal-halt is a bash script and not a Python module you need to invoke correctly. Why the log file goes to /tmp where you can always find it without knowing the project structure.

The 2am test is a thought experiment you apply to every decision: "If Knox has been asleep for 3 hours and something is on fire, can he do this without making a mistake?"

The test suite enforces the code is correct. The 2am test enforces the code is usable.

Testing the Unhappy Paths

The most instructive tests are not the happy path tests. They are the ones that verify correct behavior when things go wrong.

OSError in _stop_daemon

@patch("subprocess.run", side_effect=OSError("not found"))
def test_os_error_handled(self, mock_run, ks):
    result = ks._stop_daemon("com.host.test")
    assert result is False

This verifies that an OSError from subprocess.run — which can happen if launchctl is missing from the system, or if the process lacks permission to execute it — returns False rather than propagating the exception. During a Level 4 shutdown, an unhandled exception in _stop_daemon would abort the loop before it reached the remaining daemons.

Token Revocation with Missing DB

def test_revoke_missing_db_returns_false(self, tmp_path, config):
    config.registry_db_path = str(tmp_path / "no_such_file.db")
    ks = KillSwitch(config)
    assert ks._revoke_all_tokens() is False

def test_revoke_no_db_path_returns_false(self, config):
    config.registry_db_path = None
    ks = KillSwitch(config)
    assert ks._revoke_all_tokens() is False

Two different failure modes: the config has a path but the file doesn't exist, versus the config has no path at all. Both must return False rather than raising. And importantly, this False propagates into result.success for Level 4, making the failure visible in the halt result.

Token Revocation with Corrupted Schema

def test_revoke_exception_returns_false(self, tmp_path, config):
    import sqlite3

    db_file = tmp_path / "bad_registry.db"
    conn = sqlite3.connect(str(db_file))
    conn.execute("CREATE TABLE agent_registry (agent_id TEXT PRIMARY KEY)")
    # Missing auth_token_hash column — UPDATE will fail
    conn.execute("INSERT INTO agent_registry VALUES (?)", ("foresight",))
    conn.commit()
    conn.close()

    config.registry_db_path = str(db_file)
    ks = KillSwitch(config)
    result = ks._revoke_all_tokens()
    assert result is False

This test creates a real SQLite database with the wrong schema. The UPDATE agent_registry SET auth_token_hash = ? statement will fail because the column doesn't exist. The test verifies that the exception is caught and False is returned, rather than letting it propagate up to the Level 4 shutdown and abort the entire sequence.

Env Locking with Permission Error

def test_lock_handles_os_error(self, ks):
    with patch("os.chmod", side_effect=OSError("perm denied")):
        with patch("pathlib.Path.exists", return_value=True):
            result = ks._lock_env_files()
            assert result is False

chmod can fail if the process doesn't own the file. The test verifies this returns False rather than raising, and that result.success in Level 4 will be False if env files cannot be locked.

Token Revocation Integration Test

The token revocation test is notable because it uses a real SQLite database, not a mock:

def test_revoke_returns_true(self, tmp_path, config):
    import sqlite3

    db_file = tmp_path / "registry.db"
    conn = sqlite3.connect(str(db_file))
    conn.execute("""
        CREATE TABLE agent_registry (
            agent_id TEXT PRIMARY KEY,
            auth_token_hash TEXT NOT NULL
        )
    """)
    conn.execute(
        "INSERT INTO agent_registry VALUES (?, ?)",
        ("foresight", "$2b$12$abcdef1234567890"),
    )
    conn.commit()
    conn.close()

    config.registry_db_path = str(db_file)
    ks = KillSwitch(config)
    result = ks._revoke_all_tokens()
    assert result is True

    # Verify the token hash was actually changed
    conn2 = sqlite3.connect(str(db_file))
    row = conn2.execute(
        "SELECT auth_token_hash FROM agent_registry WHERE agent_id='foresight'"
    ).fetchone()
    conn2.close()
    assert row is not None
    assert row[0] != "$2b$12$abcdef1234567890", (
        "Token hash should have been replaced"
    )

This test does not mock sqlite3. It creates a real database, populates it with a real row, runs token revocation, then reads back the database to verify the hash was actually changed. Mocking sqlite3 would give you confidence that the code calls the right methods — not confidence that the actual revocation works.

This is the difference between testing behavior and testing implementation. For safety code, you test behavior.
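To make that difference concrete, here is a deliberately buggy hypothetical revocation function (everything in it is illustrative, not the real code). The mock-based test passes; only the real-database test catches the missing commit:

```python
import os
import sqlite3
import tempfile
from unittest.mock import MagicMock, patch

def revoke(db_path: str) -> bool:
    # Deliberately buggy: commit() is missing, so close() rolls the
    # UPDATE back and nothing is persisted.
    conn = sqlite3.connect(db_path)
    conn.execute("UPDATE agent_registry SET auth_token_hash = 'REVOKED'")
    conn.close()
    return True

# Implementation-style test: mock sqlite3 and confirm the calls happen.
# It passes despite the bug.
with patch("sqlite3.connect", return_value=MagicMock()):
    assert revoke("any.db") is True

# Behavior-style test: a real database exposes the bug.
db = os.path.join(tempfile.mkdtemp(), "registry.db")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE agent_registry (agent_id TEXT, auth_token_hash TEXT)")
conn.execute("INSERT INTO agent_registry VALUES ('foresight', 'old-hash')")
conn.commit()
conn.close()
revoke(db)
row = sqlite3.connect(db).execute(
    "SELECT auth_token_hash FROM agent_registry"
).fetchone()
assert row[0] == "old-hash"  # the "revocation" never stuck
```

Both tests exercise every line of revoke. Only one of them would stop this bug from shipping.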

What 100% Coverage Doesn't Guarantee

100% line coverage means every line was executed by at least one test. It does not mean:

  • Every combination of state was tested
  • The tests themselves are correct
  • The mocks accurately reflect real system behavior
  • The timing and ordering assumptions hold in production

100% coverage is a floor, not a ceiling. The comprehensive test suite for the kill switch goes beyond line coverage — it tests behavior contracts, error propagation, state invariants, and integration with real databases.
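The first bullet is worth a concrete illustration. In this hypothetical, two tests execute every line — and even every branch — yet one combination of inputs is never exercised:

```python
def apply_flags(a: bool, b: bool) -> int:
    # Two independent branches: four input combinations, but only two
    # tests are needed to execute every line.
    result = 0
    if a:
        result += 1
    if b:
        result += 2
    return result

# These two calls yield 100% line and branch coverage...
assert apply_flags(True, False) == 1
assert apply_flags(False, True) == 2
# ...while the combination (True, True) — where an interaction bug
# would live — is never tested.
```

This is why the kill switch suite tests state combinations explicitly (Level 1 after Level 2, SSH failure during an otherwise successful shutdown) instead of trusting the coverage number.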

The tests are the specification. When something breaks in production, the first question is: "Does a test for this scenario exist?" If not, the fix requires both the code change and a new test before the PR can merge.

Coverage enforces that tests exist. Thoughtful test design determines whether the tests are worth having.