Kill Switch Levels 1–4
Asset halt. Trading halt. Agent freeze. Nuclear. Four levels of response for four classes of emergency — and the real implementation behind each one.
The kill switch is not a feature. It is a commitment.
When you build a system that moves real money, runs 24/7, and operates with increasing autonomy — you are implicitly committing to having an answer to the question: "What do I do when something goes badly wrong?"
The Principal Broker's answer is a four-level kill switch. Each level corresponds to a class of emergency. Each level has a precise scope, a precise implementation, and a precise set of things that survive it.
The design goal is stated directly in the code: must work at 2am from a phone with one hand.
The KillSwitch Class
```python
class KillSwitch:
    """
    4-level kill switch system.

    Must work at 2am from a phone with one hand.
    """

    def __init__(self, config, halt_store=None):
        self.config = config
        self._active_level: int = 0
        self._halt_store = halt_store
        self._confirmation = getattr(
            config, "level_4_confirmation", "SHUTDOWN INVICTUS"
        )
```
The _active_level starts at zero (normal operation) and only increases through halt operations: triggering a Level 2 halt while the system sits at Level 3 leaves the level at 3. The level decreases only via an explicit resume() call after investigation and deliberate restart.
_halt_store provides persistence. On restart, the broker calls restore_from_db() to reload the halt state. This means a system that was at Level 3 when it crashed comes back up at Level 3 — it does not automatically resume trading.
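The source does not show the persistence layer itself, so here is a minimal sketch of what _persist/restore_from_db could look like, assuming a simple record store; the HaltStore class and field names are stand-ins, not the broker's actual schema:

```python
import time


class HaltStore:
    """Hypothetical persistence layer -- the real halt_store's API
    is not shown in the source; this is a minimal stand-in."""

    def __init__(self):
        self._record = None  # would be a database row in production

    def save(self, level: int, reason: str) -> None:
        self._record = {"level": level, "reason": reason, "ts": time.time()}

    def load(self):
        return self._record


def restore_from_db(kill_switch, store: HaltStore) -> int:
    """Reload the persisted halt level on startup: a system that
    crashed at Level 3 comes back up at Level 3, not at 0."""
    record = store.load()
    if record is not None:
        kill_switch._active_level = max(
            kill_switch._active_level, record["level"]
        )
    return kill_switch._active_level
```

The key property is the max() on restore: a freshly booted process never lowers a persisted halt level.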
Protected Daemons
Before walking through the levels, the most important constant in the file:
```python
PROTECTED_DAEMONS = {"com.host.watchdog", "com.host.nats"}
```
The watchdog service and NATS survive every shutdown level, including Level 4. This is not an oversight. It is the architecture.
NATS is the message transport. After a halt, recovery requires agents to communicate. If NATS is down, the broker cannot send restart signals, agents cannot receive directives, and the recovery sequence cannot be orchestrated. NATS must survive so recovery is possible.
The watchdog service monitors which services are down after a halt. When you run principal-start and services come back up, the watchdog service validates that they are healthy and restarts any that fail to start correctly. Without it, recovery is a manual process with no feedback loop.
Every call to _stop_daemon() checks this list first:
```python
def _stop_daemon(self, daemon_name: str) -> bool:
    if daemon_name in PROTECTED_DAEMONS:
        logger.info(f"Skipping protected daemon: {daemon_name}")
        return False
    # ... launchctl stop ...
```
The check happens at the lowest level so it cannot be bypassed by any path through the code.
Level 1: Asset Halt
```python
def level_1_halt(self, assets: list[str]) -> HaltResult:
    """
    Level 1: Stop trading on specific assets.

    Does NOT stop daemons — sends halt signal per asset.
    """
    self._active_level = max(self._active_level, 1)
    logger.warning(f"KILL SWITCH Level 1: Halting assets {assets}")
    self._persist(1, reason=f"Asset halt: {assets}")

    return HaltResult(
        level=1,
        success=True,
        daemons_stopped=[],
    )
```
Level 1 is the scalpel. Something is wrong with a specific asset — a market is behaving oddly, there's an oracle issue, a position has moved against you in a way that doesn't make sense — and you need to stop trading that asset immediately without touching anything else.
Note what Level 1 does NOT do: it does not stop any daemons. The trading bots continue running. They continue trading other assets. Only the specified assets are halted, and the halt is communicated via a signal — not via killing processes.
The broker records the intent. The actual asset-level halt is bot-specific. Foresight has its own halt mechanism for individual markets. The sports prediction agent has its own. Level 1 via the broker is the directive; the bots implement the enforcement.
When to use it: An individual market has a data quality problem. One asset is spiking in a way that looks like an error. You want to freeze a single position while you investigate without disrupting the rest of the portfolio.
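The source says each bot implements its own enforcement, without showing how. A plausible bot-side sketch is a halted-asset set updated from the broker's directive and checked before every order; the payload field names here are assumptions, not the broker's actual message schema:

```python
class AssetHaltGuard:
    """Bot-side enforcement sketch (hypothetical -- the source only
    states that each bot has its own halt mechanism)."""

    def __init__(self):
        self._halted: set[str] = set()

    def on_halt_directive(self, payload: dict) -> None:
        # e.g. payload = {"level": 1, "assets": ["BTC"]} -- field
        # names are illustrative assumptions
        if payload.get("level") == 1:
            self._halted.update(payload.get("assets", []))

    def may_trade(self, asset: str) -> bool:
        # Checked before every order submission
        return asset not in self._halted
```

Because enforcement is a set-membership check in the bot's order path, a Level 1 halt takes effect on the next order attempt without killing any process.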
Level 2: Trading Halt
```python
TRADING_DAEMONS = [
    "com.host.foresight",
    "com.host.sports-agent",
    "com.host.political-agent",
    "com.host.perpetuals-bot",
]

def level_2_halt(self) -> HaltResult:
    """
    Level 2: Stop all trading bots. Content/infra continue.
    """
    self._active_level = max(self._active_level, 2)
    logger.warning("KILL SWITCH Level 2: Halting all trading")
    self._persist(2, reason="Full trading halt")

    result = HaltResult(level=2, success=True)
    for daemon in TRADING_DAEMONS:
        stopped = self._stop_daemon(daemon)
        if stopped:
            result.daemons_stopped.append(daemon)
        else:
            result.daemons_failed.append(daemon)

    result.success = len(result.daemons_failed) == 0
    return result
```
Level 2 stops all four trading bots. Content pipelines continue. The Discord bot continues. The semantic memory layer continues. The infrastructure does not care that trading has stopped. Only the four revenue-generating bots are affected.
The success flag is strict: if any daemon fails to stop, success is False. A partial trading halt is not a safe state. If com.host.foresight fails to stop, the system reports failure immediately so the operator knows manual intervention is required.
When to use it: A market-wide event — unexpected crypto crash, exchange outage, regulatory announcement — where you want to pull out of everything simultaneously while keeping the rest of the stack running. Also the appropriate response if you observe correlated unusual behavior across multiple bots.
Level 3: Agent Freeze
```python
ALL_HOST_DAEMONS = [
    "com.host.foresight",
    "com.host.sports-agent",
    "com.host.political-agent",
    "com.host.perpetuals-bot",
    "com.host.indecision-bot",
    "com.host.indecision-price-alerts",
    "com.host.djed",
    "com.host.principal-broker",
]

def level_3_halt(self, pin: Optional[str] = None) -> HaltResult:
    """
    Level 3: Freeze all agents except sentinel + watchdog.

    Knox must ACK within 10 minutes or auto-Level 4.
    """
    if self.config.kill_switch_pin and pin != self.config.kill_switch_pin:
        return HaltResult(
            level=3,
            success=False,
            error="Invalid PIN",
        )

    self._active_level = max(self._active_level, 3)
    logger.critical("KILL SWITCH Level 3: Agent freeze")
    self._persist(3, reason="Agent freeze")

    result = HaltResult(level=3, success=True)
    for daemon in ALL_HOST_DAEMONS:
        stopped = self._stop_daemon(daemon)
        if stopped:
            result.daemons_stopped.append(daemon)
        else:
            result.daemons_failed.append(daemon)

    result.success = len(result.daemons_failed) == 0
    return result
```
Level 3 requires a PIN. Everything stops except the watchdog service, NATS, and Sentinel (the monitoring suite). Even the principal broker itself is in ALL_HOST_DAEMONS — after Level 3, the broker is down. The kill switch is now operating without its primary orchestration layer, which is why the CLI fallback exists.
The 10-minute ACK window is the operational protocol: after a Level 3, Knox has 10 minutes to confirm the situation and either resume or escalate to Level 4. This window prevents Level 3 from becoming a permanent frozen state while still giving enough time to assess.
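The source describes the ACK protocol but not its implementation. A minimal sketch of the window, using a threading.Timer that fires an escalation callback unless acknowledged in time (the class and method names are hypothetical):

```python
import threading

ACK_WINDOW_SECONDS = 600  # 10 minutes


class AckEscalator:
    """Sketch of the Level 3 ACK window: if acknowledge() is not
    called before the window expires, the escalation callback fires.
    (Hypothetical -- the source states the protocol, not the code.)"""

    def __init__(self, escalate, window: float = ACK_WINDOW_SECONDS):
        self._timer = threading.Timer(window, escalate)
        self._timer.daemon = True

    def start(self) -> None:
        self._timer.start()

    def acknowledge(self) -> None:
        # Knox confirmed in time -- cancel the auto-Level 4
        self._timer.cancel()
```

One design consequence: the timer must live outside the broker process, since the broker itself is stopped at Level 3; in practice this belongs in Sentinel or another surviving component.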
When to use it: You've observed behavior that suggests a compromised agent, an authentication problem, or an attack on the system. You're not ready for nuclear, but you need everything stopped while you investigate. The infrastructure layer stays up; the agent layer goes dark.
Level 4: Full Stop
```python
def level_4_shutdown(
    self,
    confirmation: str,
    triggered_by: str = "knox",
    pin: Optional[str] = None,
) -> HaltResult:
    """
    Level 4: Full stop. Requires exact confirmation phrase.

    Stops primary host + trading server. Revokes tokens. Locks env files.
    """
    if confirmation != self._confirmation:
        return HaltResult(
            level=4,
            success=False,
            error=(
                f"Invalid confirmation phrase. "
                f"Required: '{self._confirmation}'"
            ),
        )

    if self.config.kill_switch_pin and pin != self.config.kill_switch_pin:
        return HaltResult(
            level=4,
            success=False,
            error="Invalid PIN",
        )

    self._active_level = 4
    logger.critical(
        f"KILL SWITCH Level 4: FULL STOP by {triggered_by}"
    )
    self._persist(4, reason=f"Full stop by {triggered_by}")

    result = HaltResult(level=4, success=True)

    # Step 1: Stop all primary host daemons
    for daemon in ALL_HOST_DAEMONS:
        stopped = self._stop_daemon(daemon)
        if stopped:
            result.daemons_stopped.append(daemon)
        else:
            result.daemons_failed.append(daemon)

    # Step 2: Halt trading server via SSH
    result.trading_server_halted = self._halt_trading_server()
    result.trading_server_unreachable = not result.trading_server_halted

    # Step 3: Revoke all tokens
    result.tokens_revoked = self._revoke_all_tokens()

    # Step 4: Lock env files
    result.env_locked = self._lock_env_files()

    result.success = (
        len(result.daemons_failed) == 0
        and result.tokens_revoked
        and result.env_locked
    )
    return result
```
Level 4 has two gates: the confirmation phrase ("SHUTDOWN INVICTUS", case-sensitive) and the PIN. Both must pass. This prevents accidental Level 4 from a misclick or an automated system.
The four steps run in sequence. Steps 1, 3, and 4 must all succeed for result.success to be True; step 2 (the trading server halt) is recorded on the result but cannot fail the shutdown:
Step 1 stops all primary host daemons in the defined order. The order matters — trading bots first, then supporting services, then the broker itself.
Step 2 halts the trading server via a batched SSH command. All trading daemons on the trading server get a single SSH session with chained stop commands:
```python
stop_cmds = "; ".join(
    f"launchctl stop {d}" for d in TRADING_SERVER_DAEMONS
)
```
This was a deliberate optimization — four separate SSH connections with 10-second timeouts each becomes one 15-second connection. If the trading server is unreachable, this is noted but does not fail the shutdown. Primary host daemons stop regardless.
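A sketch of what the surrounding _halt_trading_server helper might look like, with subprocess and a hard timeout; the daemon labels and host handling are assumptions, since the source only shows the command-join line:

```python
import subprocess

# Placeholder labels -- the real trading-server daemon names
# are not listed in the source
TRADING_SERVER_DAEMONS = ["com.trade.bot-a", "com.trade.bot-b"]


def build_halt_command(daemons: list[str]) -> str:
    """One chained command so the halt costs a single SSH session
    instead of one connection (and one timeout) per daemon."""
    return "; ".join(f"launchctl stop {d}" for d in daemons)


def halt_trading_server(host: str, timeout: int = 15) -> bool:
    """Returns False (unreachable) rather than raising, so a dead
    trading server does not abort the rest of the Level 4 sequence."""
    cmd = build_halt_command(TRADING_SERVER_DAEMONS)
    try:
        proc = subprocess.run(
            ["ssh", host, cmd],
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False
```

Catching the timeout and returning a boolean, instead of letting the exception propagate, is what lets Step 2 be "noted but not fatal" in the shutdown result.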
Step 3 revokes all agent auth tokens by replacing every agent's token hash in the registry database with a random bcrypt hash that will never match any outstanding bearer token. This means even if an agent somehow survived the launchctl stop, its next API call to the broker will be rejected. Token revocation is the belt to launchctl's suspenders.
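A stdlib-only sketch of the revocation idea: overwrite every token hash with the hash of a random secret nobody holds. The production version writes a random bcrypt hash; SHA-256 of random bytes stands in here to avoid a third-party dependency, and the table and column names are assumptions:

```python
import hashlib
import secrets
import sqlite3


def revoke_all_tokens(db_path: str) -> bool:
    """Replace every agent's token hash with the hash of a random
    value -- no outstanding bearer token can ever match it.
    (Sketch: production uses bcrypt; schema names are assumed.)"""
    conn = sqlite3.connect(db_path)
    try:
        dead_hash = hashlib.sha256(secrets.token_bytes(32)).hexdigest()
        conn.execute("UPDATE agents SET token_hash = ?", (dead_hash,))
        conn.commit()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```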
Step 4 locks env files by chmod 000 on .principal/.env, principal-broker/.env, and .config/openclaw/.env. No process can read credentials after this step.
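The locking step is a chmod loop; a sketch of what _lock_env_files might look like, where the path resolution and error handling are assumptions (the source names only the three files):

```python
import logging
import os

logger = logging.getLogger(__name__)

ENV_FILES = [
    ".principal/.env",
    "principal-broker/.env",
    ".config/openclaw/.env",
]


def lock_env_files(home: str) -> bool:
    """chmod 000 each env file so no process can read credentials.
    (Sketch -- the real _lock_env_files' path handling is assumed.)"""
    ok = True
    for rel in ENV_FILES:
        path = os.path.join(home, rel)
        try:
            os.chmod(path, 0o000)
        except OSError as exc:
            logger.error(f"Failed to lock {path}: {exc}")
            ok = False
    return ok
```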
When to use it: Active security incident, confirmed system compromise, or any situation where you need absolute certainty that nothing is running. The recovery process is manual and deliberate — Level 4 is designed to be hard to come back from quickly, so the threshold for triggering it should be correspondingly high.
The Level-Never-Decreases Property
```python
self._active_level = max(self._active_level, 1)
```
Every level method uses max() when setting the active level (except Level 4, which sets it directly to 4). This means if you're at Level 3 and somehow trigger a Level 1 halt, the level stays at 3. You cannot accidentally de-escalate by triggering a lower-level action.
This is an invariant with a test:
```python
class TestLevelProgression:
    def test_level_never_decreases(self, ks):
        with patch.object(ks, "_stop_daemon", return_value=True):
            ks.level_2_halt()
            assert ks.active_level == 2

            ks.level_1_halt(["BTC"])
            assert ks.active_level == 2  # stays at 2
```
To decrease the level, you call resume() explicitly after investigation. The level does not decrease on its own.
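The source names resume() but does not show it. A minimal stand-in illustrating the explicit de-escalation path, PIN-gated like Levels 3 and 4 (the signature and the PIN check are assumptions):

```python
class ResumableKillSwitch:
    """Minimal stand-in to illustrate explicit de-escalation;
    the real resume() signature is not shown in the source."""

    def __init__(self, pin=None):
        self.kill_switch_pin = pin
        self._active_level = 3  # e.g. frozen at Level 3
        self.halt_log = []      # stand-in for the persisted halt store

    def resume(self, pin=None) -> bool:
        # PIN-gated, like Level 3/4 themselves (an assumption)
        if self.kill_switch_pin and pin != self.kill_switch_pin:
            return False
        self.halt_log.append((0, "Explicit resume after investigation"))
        self._active_level = 0
        return True
```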