Kill Switch Levels 1–4
Asset halt. Trading halt. Agent freeze. Nuclear. Four levels of response for four classes of emergency — and the real implementation behind each one.
The kill switch is not a feature. It is a commitment.
When you build a system that moves real money, runs 24/7, and operates with increasing autonomy — you are implicitly committing to having an answer to the question: "What do I do when something goes badly wrong?"
The Principal Broker's answer is a four-level kill switch. Each level corresponds to a class of emergency. Each level has a precise scope, a precise implementation, and a precise set of things that survive it.
The design goal is stated directly in the code: must work at 2am from a phone with one hand.
The KillSwitch Class
```python
class KillSwitch:
    """
    4-level kill switch system.

    Must work at 2am from a phone with one hand.
    """

    def __init__(self, config, halt_store=None):
        self.config = config
        self._active_level: int = 0
        self._halt_store = halt_store
        self._confirmation = getattr(
            config, "level_4_confirmation", "SHUTDOWN INVICTUS"
        )
```
The _active_level starts at zero (normal operation) and only increases through halt operations: triggering a Level 2 halt while the system sits at Level 3 leaves the level at 3. The level decreases only via an explicit resume() call after investigation and deliberate restart.
_halt_store provides persistence. On restart, the broker calls restore_from_db() to reload the halt state. This means a system that was at Level 3 when it crashed comes back up at Level 3 — it does not automatically resume trading.
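The source does not show the persistence layer itself, so here is a minimal sketch of what _persist/restore_from_db could look like, assuming a simple record store; the HaltStore class and field names are stand-ins, not the broker's actual schema:

```python
import time


class HaltStore:
    """Hypothetical persistence layer -- the real halt_store's API
    is not shown in the source; this is a minimal stand-in."""

    def __init__(self):
        self._record = None  # would be a database row in production

    def save(self, level: int, reason: str) -> None:
        self._record = {"level": level, "reason": reason, "ts": time.time()}

    def load(self):
        return self._record


def restore_from_db(kill_switch, store: HaltStore) -> int:
    """Reload the persisted halt level on startup: a system that
    crashed at Level 3 comes back up at Level 3, not at 0."""
    record = store.load()
    if record is not None:
        kill_switch._active_level = max(
            kill_switch._active_level, record["level"]
        )
    return kill_switch._active_level
```

The key property is the max() on restore: a freshly booted process never lowers a persisted halt level.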
Protected Daemons
Before walking through the levels, the most important constant in the file:
```python
PROTECTED_DAEMONS = {"com.host.watchdog", "com.host.nats"}
```
The watchdog service and NATS survive every shutdown level, including Level 4. This is not an oversight. It is the architecture.
NATS is the message transport. After a halt, recovery requires agents to communicate. If NATS is down, the broker cannot send restart signals, agents cannot receive directives, and the recovery sequence cannot be orchestrated. NATS must survive so recovery is possible.
The watchdog service monitors which services are down after a halt. When you run principal-start and services come back up, the watchdog service validates that they are healthy and restarts any that fail to start correctly. Without it, recovery is a manual process with no feedback loop.
Every call to _stop_daemon() checks this list first:
```python
def _stop_daemon(self, daemon_name: str) -> bool:
    if daemon_name in PROTECTED_DAEMONS:
        logger.info(f"Skipping protected daemon: {daemon_name}")
        return False
    # ... launchctl stop ...
```
The check happens at the lowest level so it cannot be bypassed by any path through the code.
Level 1: Asset Halt
```python
def level_1_halt(self, assets: list[str]) -> HaltResult:
    """
    Level 1: Stop trading on specific assets.

    Does NOT stop daemons — sends halt signal per asset.
    """
    self._active_level = max(self._active_level, 1)
    logger.warning(f"KILL SWITCH Level 1: Halting assets {assets}")
    self._persist(1, reason=f"Asset halt: {assets}")

    return HaltResult(
        level=1,
        success=True,
        daemons_stopped=[],
    )
```
Level 1 is the scalpel. Something is wrong with a specific asset — a market is behaving oddly, there's an oracle issue, a position has moved against you in a way that doesn't make sense — and you need to stop trading that asset immediately without touching anything else.
Note what Level 1 does NOT do: it does not stop any daemons. The trading bots continue running. They continue trading other assets. Only the specified assets are halted, and the halt is communicated via a signal — not via killing processes.
The broker records the intent. The actual asset-level halt is bot-specific. Foresight has its own halt mechanism for individual markets. The sports prediction agent has its own. Level 1 via the broker is the directive; the bots implement the enforcement.
When to use it: An individual market has a data quality problem. One asset is spiking in a way that looks like an error. You want to freeze a single position while you investigate without disrupting the rest of the portfolio.
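The source says each bot implements its own enforcement, without showing how. A plausible bot-side sketch is a halted-asset set updated from the broker's directive and checked before every order; the payload field names here are assumptions, not the broker's actual message schema:

```python
class AssetHaltGuard:
    """Bot-side enforcement sketch (hypothetical -- the source only
    states that each bot has its own halt mechanism)."""

    def __init__(self):
        self._halted: set[str] = set()

    def on_halt_directive(self, payload: dict) -> None:
        # e.g. payload = {"level": 1, "assets": ["BTC"]} -- field
        # names are illustrative assumptions
        if payload.get("level") == 1:
            self._halted.update(payload.get("assets", []))

    def may_trade(self, asset: str) -> bool:
        # Checked before every order submission
        return asset not in self._halted
```

Because enforcement is a set-membership check in the bot's order path, a Level 1 halt takes effect on the next order attempt without killing any process.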
Level 2: Trading Halt
```python
TRADING_DAEMONS = [
    "com.host.foresight",
    "com.host.sports-agent",
    "com.host.political-agent",
    "com.host.perpetuals-bot",
]

def level_2_halt(self) -> HaltResult:
    """
    Level 2: Stop all trading bots. Content/infra continue.
    """
    self._active_level = max(self._active_level, 2)
    logger.warning("KILL SWITCH Level 2: Halting all trading")
    self._persist(2, reason="Full trading halt")

    result = HaltResult(level=2, success=True)
    for daemon in TRADING_DAEMONS:
        stopped = self._stop_daemon(daemon)
        if stopped:
            result.daemons_stopped.append(daemon)
        else:
            result.daemons_failed.append(daemon)

    result.success = len(result.daemons_failed) == 0
    return result
```
Level 2 stops all four trading bots. Content pipelines continue. The Discord bot continues. The semantic memory layer continues. The infrastructure does not care that trading has stopped. Only the four revenue-generating bots are affected.
The success flag is strict: if any daemon fails to stop, success is False. A partial trading halt is not a safe state. If com.host.foresight fails to stop, the system reports failure immediately so the operator knows manual intervention is required.
When to use it: A market-wide event — unexpected crypto crash, exchange outage, regulatory announcement — where you want to pull out of everything simultaneously while keeping the rest of the stack running. Also the appropriate response if you observe correlated unusual behavior across multiple bots.
Level 3: Agent Freeze
```python
ALL_HOST_DAEMONS = [
    "com.host.foresight",
    "com.host.sports-agent",
    "com.host.political-agent",
    "com.host.perpetuals-bot",
    "com.host.indecision-bot",
    "com.host.indecision-price-alerts",
    "com.host.djed",
    "com.host.principal-broker",
]

def level_3_halt(self, pin: Optional[str] = None) -> HaltResult:
    """
    Level 3: Freeze all agents except sentinel + watchdog.

    Knox must ACK within 10 minutes or auto-Level 4.
    """
    if self.config.kill_switch_pin and pin != self.config.kill_switch_pin:
        return HaltResult(
            level=3,
            success=False,
            error="Invalid PIN",
        )

    self._active_level = max(self._active_level, 3)
    logger.critical("KILL SWITCH Level 3: Agent freeze")
    self._persist(3, reason="Agent freeze")

    result = HaltResult(level=3, success=True)
    for daemon in ALL_HOST_DAEMONS:
        stopped = self._stop_daemon(daemon)
        if stopped:
            result.daemons_stopped.append(daemon)
        else:
            result.daemons_failed.append(daemon)

    result.success = len(result.daemons_failed) == 0
    return result
```
Level 3 requires a PIN. Everything stops except the watchdog service, NATS, and Sentinel (the monitoring suite). Even the principal broker itself is in ALL_HOST_DAEMONS — after Level 3, the broker is down. The kill switch is now operating without its primary orchestration layer, which is why the CLI fallback exists.
The 10-minute ACK window is the operational protocol: after a Level 3, Knox has 10 minutes to confirm the situation and either resume or escalate to Level 4. This window prevents Level 3 from becoming a permanent frozen state while still giving enough time to assess.
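The source describes the ACK protocol but not its implementation. A minimal sketch of the window, using a threading.Timer that fires an escalation callback unless acknowledged in time (the class and method names are hypothetical):

```python
import threading

ACK_WINDOW_SECONDS = 600  # 10 minutes


class AckEscalator:
    """Sketch of the Level 3 ACK window: if acknowledge() is not
    called before the window expires, the escalation callback fires.
    (Hypothetical -- the source states the protocol, not the code.)"""

    def __init__(self, escalate, window: float = ACK_WINDOW_SECONDS):
        self._timer = threading.Timer(window, escalate)
        self._timer.daemon = True

    def start(self) -> None:
        self._timer.start()

    def acknowledge(self) -> None:
        # Knox confirmed in time -- cancel the auto-Level 4
        self._timer.cancel()
```

One design consequence: the timer must live outside the broker process, since the broker itself is stopped at Level 3; in practice this belongs in Sentinel or another surviving component.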
When to use it: You've observed behavior that suggests a compromised agent, an authentication problem, or an attack on the system. You're not ready for nuclear, but you need everything stopped while you investigate. The infrastructure layer stays up; the agent layer goes dark.
Level 4: Full Stop
```python
def level_4_shutdown(
    self,
    confirmation: str,
    triggered_by: str = "knox",
    pin: Optional[str] = None,
) -> HaltResult:
    """
    Level 4: Full stop. Requires exact confirmation phrase.

    Stops primary host + trading server. Revokes tokens. Locks env files.
    """
    if confirmation != self._confirmation:
        return HaltResult(
            level=4,
            success=False,
            error=(
                f"Invalid confirmation phrase. "
                f"Required: '{self._confirmation}'"
            ),
        )

    if self.config.kill_switch_pin and pin != self.config.kill_switch_pin:
        return HaltResult(
            level=4,
            success=False,
            error="Invalid PIN",
        )

    self._active_level = 4
    logger.critical(
        f"KILL SWITCH Level 4: FULL STOP by {triggered_by}"
    )
    self._persist(4, reason=f"Full stop by {triggered_by}")

    result = HaltResult(level=4, success=True)

    # Step 1: Stop all primary host daemons
    for daemon in ALL_HOST_DAEMONS:
        stopped = self._stop_daemon(daemon)
        if stopped:
            result.daemons_stopped.append(daemon)
        else:
            result.daemons_failed.append(daemon)

    # Step 2: Halt trading server via SSH
    result.trading_server_halted = self._halt_trading_server()
    result.trading_server_unreachable = not result.trading_server_halted

    # Step 3: Revoke all tokens
    result.tokens_revoked = self._revoke_all_tokens()

    # Step 4: Lock env files
    result.env_locked = self._lock_env_files()

    result.success = (
        len(result.daemons_failed) == 0
        and result.tokens_revoked
        and result.env_locked
    )
    return result
```
Level 4 has two gates: the confirmation phrase ("SHUTDOWN INVICTUS", case-sensitive) and the PIN. Both must pass. This prevents accidental Level 4 from a misclick or an automated system.
The four steps run in sequence. Steps 1, 3, and 4 must all succeed for result.success to be True; step 2 (the trading server halt) is recorded on the result but cannot fail the shutdown:
Step 1 stops all primary host daemons in the defined order. The order matters — trading bots first, then supporting services, then the broker itself.
Step 2 halts the trading server via a batched SSH command. All trading daemons on the trading server get a single SSH session with chained stop commands:
```python
stop_cmds = "; ".join(
    f"launchctl stop {d}" for d in TRADING_SERVER_DAEMONS
)
```
This was a deliberate optimization — four separate SSH connections with 10-second timeouts each becomes one 15-second connection. If the trading server is unreachable, this is noted but does not fail the shutdown. Primary host daemons stop regardless.
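A sketch of what the surrounding _halt_trading_server helper might look like, with subprocess and a hard timeout; the daemon labels and host handling are assumptions, since the source only shows the command-join line:

```python
import subprocess

# Placeholder labels -- the real trading-server daemon names
# are not listed in the source
TRADING_SERVER_DAEMONS = ["com.trade.bot-a", "com.trade.bot-b"]


def build_halt_command(daemons: list[str]) -> str:
    """One chained command so the halt costs a single SSH session
    instead of one connection (and one timeout) per daemon."""
    return "; ".join(f"launchctl stop {d}" for d in daemons)


def halt_trading_server(host: str, timeout: int = 15) -> bool:
    """Returns False (unreachable) rather than raising, so a dead
    trading server does not abort the rest of the Level 4 sequence."""
    cmd = build_halt_command(TRADING_SERVER_DAEMONS)
    try:
        proc = subprocess.run(
            ["ssh", host, cmd],
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False
```

Catching the timeout and returning a boolean, instead of letting the exception propagate, is what lets Step 2 be "noted but not fatal" in the shutdown result.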
Step 3 revokes all agent auth tokens by replacing every agent's token hash in the registry database with a random bcrypt hash that will never match any outstanding bearer token. This means even if an agent somehow survived the launchctl stop, its next API call to the broker will be rejected. Token revocation is the belt to launchctl's suspenders.
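A stdlib-only sketch of the revocation idea: overwrite every token hash with the hash of a random secret nobody holds. The production version writes a random bcrypt hash; SHA-256 of random bytes stands in here to avoid a third-party dependency, and the table and column names are assumptions:

```python
import hashlib
import secrets
import sqlite3


def revoke_all_tokens(db_path: str) -> bool:
    """Replace every agent's token hash with the hash of a random
    value -- no outstanding bearer token can ever match it.
    (Sketch: production uses bcrypt; schema names are assumed.)"""
    conn = sqlite3.connect(db_path)
    try:
        dead_hash = hashlib.sha256(secrets.token_bytes(32)).hexdigest()
        conn.execute("UPDATE agents SET token_hash = ?", (dead_hash,))
        conn.commit()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```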
Step 4 locks env files by chmod 000 on .principal/.env, principal-broker/.env, and .config/openclaw/.env. No process can read credentials after this step.
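The locking step is a chmod loop; a sketch of what _lock_env_files might look like, where the path resolution and error handling are assumptions (the source names only the three files):

```python
import logging
import os

logger = logging.getLogger(__name__)

ENV_FILES = [
    ".principal/.env",
    "principal-broker/.env",
    ".config/openclaw/.env",
]


def lock_env_files(home: str) -> bool:
    """chmod 000 each env file so no process can read credentials.
    (Sketch -- the real _lock_env_files' path handling is assumed.)"""
    ok = True
    for rel in ENV_FILES:
        path = os.path.join(home, rel)
        try:
            os.chmod(path, 0o000)
        except OSError as exc:
            logger.error(f"Failed to lock {path}: {exc}")
            ok = False
    return ok
```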
When to use it: Active security incident, confirmed system compromise, or any situation where you need absolute certainty that nothing is running. The recovery process is manual and deliberate — Level 4 is designed to be hard to come back from quickly, so the threshold for triggering it should be correspondingly high.
The Level-Never-Decreases Property
```python
self._active_level = max(self._active_level, 1)
```
Every level method uses max() when setting the active level (except Level 4, which sets it directly to 4). This means if you're at Level 3 and somehow trigger a Level 1 halt, the level stays at 3. You cannot accidentally de-escalate by triggering a lower-level action.
This is an invariant with a test:
```python
class TestLevelProgression:
    def test_level_never_decreases(self, ks):
        with patch.object(ks, "_stop_daemon", return_value=True):
            ks.level_2_halt()
            assert ks.active_level == 2

            ks.level_1_halt(["BTC"])
            assert ks.active_level == 2  # stays at 2
```
To decrease the level, you call resume() explicitly after investigation. The level does not decrease on its own.
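The source names resume() but does not show it. A minimal stand-in illustrating the explicit de-escalation path, PIN-gated like Levels 3 and 4 (the signature and the PIN check are assumptions):

```python
class ResumableKillSwitch:
    """Minimal stand-in to illustrate explicit de-escalation;
    the real resume() signature is not shown in the source."""

    def __init__(self, pin=None):
        self.kill_switch_pin = pin
        self._active_level = 3  # e.g. frozen at Level 3
        self.halt_log = []      # stand-in for the persisted halt store

    def resume(self, pin=None) -> bool:
        # PIN-gated, like Level 3/4 themselves (an assumption)
        if self.kill_switch_pin and pin != self.kill_switch_pin:
            return False
        self.halt_log.append((0, "Explicit resume after investigation"))
        self._active_level = 0
        return True
```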