ASK KNOX
beta
LESSON 203

100% Safety Test Coverage

The kill switch tests are non-negotiable. 100% coverage, not 90%. Here's why the floor is higher for safety code, what the test suite actually covers, and the 2am test that makes it real.

12 min read·Agent Authority & Safety Systems

The Principal Broker has a 90% test coverage floor. That is the standard across the codebase. tests/safety/ has a different standard: 100%. Non-negotiable.

The kill switch tests are the only part of the entire system where a missed code path is treated as a blocking failure rather than a coverage metric to address in a future PR. The CI configuration enforces this separately:

# 90% floor for the full codebase
pytest tests/ --cov=broker --cov-fail-under=90

# 100% required for safety code
pytest tests/safety/ --cov=broker/safety --cov-fail-under=100

Both commands run in CI. Both must pass. A PR that improves general coverage but drops kill switch coverage below 100% does not merge.

Why 100% and Not 90%

The 90% floor exists because the last 10% of coverage is often structurally difficult — error handling for OS errors that are hard to trigger in tests, defensive code paths for conditions that don't occur in practice. Chasing 100% in those cases produces brittle tests that mock too much and test too little.

Safety code is different. The whole point of safety code is that it handles the conditions that don't occur in normal operation. The code paths you can't easily trigger in tests are precisely the ones that matter most:

  • What happens when launchctl stop throws an OSError?
  • What happens when the SSH connection to the trading server times out?
  • What happens when the SQLite registry database doesn't exist?
  • What happens when _revoke_all_tokens encounters a malformed schema?
  • What happens when you call level_1_halt after level_2_halt?

These are the scenarios where an untested code path becomes a live incident. 100% coverage is the only way to have confidence that when any of these happen, the code does what the tests say it does.
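The last scenario above is the easiest one to sketch. A minimal, hypothetical illustration of the "levels never decrease" invariant — the real KillSwitch tracks far more state than this; only the clamping logic is shown:

```python
class KillSwitchSketch:
    """Hypothetical sketch: a later, lower-level halt must never undo a higher one."""

    def __init__(self) -> None:
        self.active_level = 0

    def _set_level(self, level: int) -> None:
        # Clamp upward: the active level only ever escalates.
        self.active_level = max(self.active_level, level)

ks = KillSwitchSketch()
ks._set_level(2)  # level_2_halt
ks._set_level(1)  # level_1_halt called afterwards
assert ks.active_level == 2  # still halted at Level 2
```

The test for this invariant is cheap to write and catches a whole class of ordering bugs that a happy-path suite never exercises.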

The Test Structure

The test file opens with the design constraint stated explicitly:

"""
Kill Switch tests — 100% coverage REQUIRED. Non-negotiable.

Tests all 4 levels, confirmation phrase, protected daemons,
trading server SSH halt, token revocation, env locking, and timing.
"""

The test classes map directly to the code's responsibility surfaces:

  • TestLevel1AssetHalt — Level 1 behavior
  • TestLevel2TradingHalt — Level 2 behavior, including partial failure
  • TestPinEnforcement — PIN checking for Levels 3 and 4
  • TestLevel3AgentFreeze — Level 3 behavior, protected daemon exclusion
  • TestLevel4FullStop — Level 4 full sequence, confirmation gate, each failure mode
  • TestStopDaemon — The daemon stop primitive, including error handling
  • TestHaltTradingServer — SSH halt, batching, unreachability
  • TestTokenRevocation — All three DB states (success, missing DB, missing config)
  • TestRestoreFromDb — State persistence and restoration
  • TestPersist — DB persistence
  • TestEnvLocking — File locking, success and failure
  • TestStaticMethods — Class methods that return daemon lists
  • TestLevelProgression — The invariant that levels never decrease
  • TestHaltResult — The result dataclass defaults and properties
  • TestResume — The resume/reset path
  • TestHaltDispatch — The unified halt() dispatch method

Each class is a focused test group. The structure makes it obvious what is being tested and why.

What "Comprehensive" Actually Looks Like

Consider TestLevel4FullStop. A naive approach would be a single test that triggers a successful Level 4 and asserts success is True. That covers the happy path and nothing else.

The actual test class has 10 tests:

class TestLevel4FullStop:
    def test_full_sequence(...)                   # happy path
    def test_wrong_confirmation_rejected(...)     # wrong phrase
    def test_empty_confirmation_rejected(...)     # empty string
    def test_case_sensitive_confirmation(...)     # lowercase
    def test_watchdog_excluded_from_level4(...)   # protected daemon
    def test_nats_excluded_from_level4(...)       # protected daemon
    def test_failed_revocation_marks_failure(...) # revoke fails
    def test_trading_server_unreachable_noted(...)  # SSH fails but shutdown succeeds
    def test_daemon_failure_in_level4(...)        # partial daemon failure
    def test_active_level_set(...)                # state mutation

The confirmation gate tests three distinct input variations because the string comparison is case-sensitive:

def test_wrong_confirmation_rejected(self, ks):
    result = ks.level_4_shutdown("wrong phrase", "knox")
    assert result.success is False
    assert "Invalid confirmation" in result.error

def test_empty_confirmation_rejected(self, ks):
    result = ks.level_4_shutdown("", "knox")
    assert result.success is False

def test_case_sensitive_confirmation(self, ks):
    result = ks.level_4_shutdown("shutdown invictus", "knox")
    assert result.success is False

Each is a different code path. The empty string test verifies the gate doesn't have a bug where empty strings pass. The lowercase test verifies case sensitivity is enforced. These are not redundant — they are testing three distinct entry points into the same conditional.
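A minimal sketch of the gate these three tests exercise — the phrase, the result type, and the function shape here are all assumptions for illustration, not the real implementation:

```python
from dataclasses import dataclass

# Assumed phrase for illustration only; the real confirmation phrase may differ.
CONFIRMATION_PHRASE = "SHUTDOWN INVICTUS"

@dataclass
class HaltResult:
    success: bool
    error: str = ""

def check_confirmation(confirmation: str) -> HaltResult:
    # One case-sensitive equality check. "", "wrong phrase", and the
    # lowercase phrase all take the same early-return branch, but each
    # test pins down a different way the comparison could be broken.
    if confirmation != CONFIRMATION_PHRASE:
        return HaltResult(success=False, error="Invalid confirmation phrase")
    return HaltResult(success=True)
```

If someone later "helpfully" normalizes the input with .lower() or treats an empty string as a skip, one of the three tests fails immediately.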

The 2am Test

There is an informal metric beyond coverage percentage. The kill switch must work at 2am when Knox is half-asleep and something is on fire.

What does this mean in practice? It means the kill switch must work from:

  • A phone screen with brightness turned down
  • A terminal with no IDE assistance
  • A mental state that is not operating at full capacity

This is not a test you write in pytest. It is a design constraint that shapes every decision in the implementation. Why the confirmation phrase is a human-readable sentence and not a UUID. Why principal-halt is a bash script and not a Python module you need to invoke correctly. Why the log file goes to /tmp where you can always find it without knowing the project structure.

The 2am test is a thought experiment you apply to every decision: "If Knox has been asleep for 3 hours and something is on fire, can he do this without making a mistake?"

The test suite enforces the code is correct. The 2am test enforces the code is usable.

Testing the Unhappy Paths

The most instructive tests are not the happy path tests. They are the ones that verify correct behavior when things go wrong.

OSError in _stop_daemon

@patch("subprocess.run", side_effect=OSError("not found"))
def test_os_error_handled(self, mock_run, ks):
    result = ks._stop_daemon("com.host.test")
    assert result is False

This verifies that an OSError from subprocess.run — which can happen if launchctl is missing from the system, or if the process lacks permission to execute it — returns False rather than propagating the exception. During a Level 4 shutdown, an unhandled exception in _stop_daemon would abort the loop before it reached the remaining daemons.

Token Revocation with Missing DB

def test_revoke_missing_db_returns_false(self, tmp_path, config):
    config.registry_db_path = str(tmp_path / "no_such_file.db")
    ks = KillSwitch(config)
    assert ks._revoke_all_tokens() is False

def test_revoke_no_db_path_returns_false(self, config):
    config.registry_db_path = None
    ks = KillSwitch(config)
    assert ks._revoke_all_tokens() is False

Two different failure modes: the config has a path but the file doesn't exist, versus the config has no path at all. Both must return False rather than raising. And importantly, this False propagates into result.success for Level 4, making the failure visible in the halt result.

Token Revocation with Corrupted Schema

def test_revoke_exception_returns_false(self, tmp_path, config):
    import sqlite3

    db_file = tmp_path / "bad_registry.db"
    conn = sqlite3.connect(str(db_file))
    conn.execute("CREATE TABLE agent_registry (agent_id TEXT PRIMARY KEY)")
    # Missing auth_token_hash column — UPDATE will fail
    conn.execute("INSERT INTO agent_registry VALUES (?)", ("foresight",))
    conn.commit()
    conn.close()

    config.registry_db_path = str(db_file)
    ks = KillSwitch(config)
    result = ks._revoke_all_tokens()
    assert result is False

This test creates a real SQLite database with the wrong schema. The UPDATE agent_registry SET auth_token_hash = ? statement will fail because the column doesn't exist. The test verifies that the exception is caught and False is returned, rather than letting it propagate up to the Level 4 shutdown and abort the entire sequence.

Env Locking with Permission Error

def test_lock_handles_os_error(self, ks):
    with patch("os.chmod", side_effect=OSError("perm denied")):
        with patch("pathlib.Path.exists", return_value=True):
            result = ks._lock_env_files()
            assert result is False

chmod can fail if the process doesn't own the file. The test verifies this returns False rather than raising, and that result.success in Level 4 will be False if env files cannot be locked.

Token Revocation Integration Test

The token revocation test is notable because it uses a real SQLite database, not a mock:

def test_revoke_returns_true(self, tmp_path, config):
    import sqlite3

    db_file = tmp_path / "registry.db"
    conn = sqlite3.connect(str(db_file))
    conn.execute("""
        CREATE TABLE agent_registry (
            agent_id TEXT PRIMARY KEY,
            auth_token_hash TEXT NOT NULL
        )
    """)
    conn.execute(
        "INSERT INTO agent_registry VALUES (?, ?)",
        ("foresight", "$2b$12$abcdef1234567890"),
    )
    conn.commit()
    conn.close()

    config.registry_db_path = str(db_file)
    ks = KillSwitch(config)
    result = ks._revoke_all_tokens()
    assert result is True

    # Verify the token hash was actually changed
    conn2 = sqlite3.connect(str(db_file))
    row = conn2.execute(
        "SELECT auth_token_hash FROM agent_registry WHERE agent_id='foresight'"
    ).fetchone()
    conn2.close()
    assert row is not None
    assert row[0] != "$2b$12$abcdef1234567890", (
        "Token hash should have been replaced"
    )

This test does not mock sqlite3. It creates a real database, populates it with a real row, runs token revocation, then reads back the database to verify the hash was actually changed. Mocking sqlite3 would give you confidence that the code calls the right methods — not confidence that the actual revocation works.

This is the difference between testing behavior and testing implementation. For safety code, you test behavior.
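To make that difference concrete, here is a deliberately buggy hypothetical revocation function (everything in it is illustrative, not the real code). The mock-based test passes; only the real-database test catches the missing commit:

```python
import os
import sqlite3
import tempfile
from unittest.mock import MagicMock, patch

def revoke(db_path: str) -> bool:
    # Deliberately buggy: commit() is missing, so close() rolls the
    # UPDATE back and nothing is persisted.
    conn = sqlite3.connect(db_path)
    conn.execute("UPDATE agent_registry SET auth_token_hash = 'REVOKED'")
    conn.close()
    return True

# Implementation-style test: mock sqlite3 and confirm the calls happen.
# It passes despite the bug.
with patch("sqlite3.connect", return_value=MagicMock()):
    assert revoke("any.db") is True

# Behavior-style test: a real database exposes the bug.
db = os.path.join(tempfile.mkdtemp(), "registry.db")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE agent_registry (agent_id TEXT, auth_token_hash TEXT)")
conn.execute("INSERT INTO agent_registry VALUES ('foresight', 'old-hash')")
conn.commit()
conn.close()
revoke(db)
row = sqlite3.connect(db).execute(
    "SELECT auth_token_hash FROM agent_registry"
).fetchone()
assert row[0] == "old-hash"  # the "revocation" never stuck
```

Both tests exercise every line of revoke. Only one of them would stop this bug from shipping.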

What 100% Coverage Doesn't Guarantee

100% line coverage means every line was executed by at least one test. It does not mean:

  • Every combination of state was tested
  • The tests themselves are correct
  • The mocks accurately reflect real system behavior
  • The timing and ordering assumptions hold in production

100% coverage is a floor, not a ceiling. The comprehensive test suite for the kill switch goes beyond line coverage — it tests behavior contracts, error propagation, state invariants, and integration with real databases.
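The first bullet is worth a concrete illustration. In this hypothetical, two tests execute every line — and even every branch — yet one combination of inputs is never exercised:

```python
def apply_flags(a: bool, b: bool) -> int:
    # Two independent branches: four input combinations, but only two
    # tests are needed to execute every line.
    result = 0
    if a:
        result += 1
    if b:
        result += 2
    return result

# These two calls yield 100% line and branch coverage...
assert apply_flags(True, False) == 1
assert apply_flags(False, True) == 2
# ...while the combination (True, True) — where an interaction bug
# would live — is never tested.
```

This is why the kill switch suite tests state combinations explicitly (Level 1 after Level 2, SSH failure during an otherwise successful shutdown) instead of trusting the coverage number.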

The tests are the specification. When something breaks in production, the first question is: "Does a test for this scenario exist?" If not, the fix requires both the code change and a new test before the PR can merge.

Coverage enforces that tests exist. Thoughtful test design determines whether the tests are worth having.