
Every safety system contains an assumption: that its infrastructure is available when you need it.

The Agent Broker's kill switch works through a Python class, FastAPI endpoints, NATS subscriptions, and database connections. Under normal conditions, this is fine. But a kill switch is needed precisely when something is badly wrong, and that is the scenario where your infrastructure is least likely to be fully operational.

The broker could be the thing that crashed. The broker could be the thing that's misbehaving. The broker could be hanging while the trading bots continue running with no oversight. In any of these cases, triggering the kill switch through the broker is not an option.

broker-halt exists for this. It is a bash script. It has no dependencies beyond launchctl, ssh, and sqlite3, all of which ship with macOS. It implements the same four-level halt sequence as the Python kill switch, independently, with no shared code.

Architecture of the Fallback

The script's daemon lists are defined in pure bash arrays and kept in sync with the Python module by hand:

# ---- Daemon lists (sourced from broker/safety/kill_switch.py) ----------------

TRADING_DAEMONS=(
    "com.operator.foresight"
    "com.operator.sports-bot"
    "com.operator.signal-engine"
    "com.operator.perp-bot"
)

ALL_FLEET_DAEMONS=(
    "com.operator.foresight"
    "com.operator.sports-bot"
    "com.operator.signal-engine"
    "com.operator.perp-bot"
    "com.operator.indecision-bot"
    "com.operator.indecision-price-alerts"
    "com.operator.event-bus"
    "com.operator.agent-broker"
)

# These survive ALL shutdown levels, including Level 4
PROTECTED_DAEMONS=(
    "com.operator.watchdog"
    "com.operator.nats"
)

# These additionally survive Level 3 (monitor keeps observability up during freeze)
LEVEL_3_PROTECTED=(
    "com.operator.watchdog"
    "com.operator.nats"
    "com.operator.monitor"
)

This duplication is intentional. The script does not import from the Python module. It does not call the Python module. It does not curl the broker. It is completely standalone. The cost of this is that when the daemon lists change in Python, they must be updated in the script too. This is tracked via the comment "sourced from broker/safety/kill_switch.py" — it is a maintenance obligation, not an oversight.
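
The maintenance obligation described above could be checked mechanically, without coupling the script to the Python module at runtime. What follows is a minimal sketch, not part of broker-halt: the helper names are hypothetical, and it assumes the Python module spells each launchd label as a double-quoted "com.operator.<name>" literal, which the source does not confirm. It compares the overall set of labels in the two files, not the membership of each individual list.

```shell
#!/usr/bin/env bash
# Hypothetical drift check: compare the set of launchd labels mentioned in
# two files. Assumes labels appear as double-quoted "com.operator.<name>".

extract_labels() {
    # Print every distinct com.operator.* label found in file $1, sorted.
    grep -o '"com\.operator\.[a-z0-9-]*"' "$1" | tr -d '"' | sort -u
}

check_daemon_drift() {
    local script_file="$1" python_file="$2"
    local a b
    a=$(extract_labels "$script_file")
    b=$(extract_labels "$python_file")
    if [ "$a" = "$b" ]; then
        echo "OK: daemon labels match"
        return 0
    fi
    echo "DRIFT: daemon labels differ between $script_file and $python_file" >&2
    return 1
}
```

A check like this belongs in CI or a pre-commit hook, not in the halt path itself, so the emergency script stays standalone.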

The protected daemons check is implemented in pure bash. stop_daemon accepts the name of the protected list as an optional second argument, so Level 3 can pass LEVEL_3_PROTECTED while Levels 2 and 4 use the base PROTECTED_DAEMONS:

is_protected() {
    local daemon="$1"
    # Indirect expansion rather than `local -n`: namerefs need bash 4.3+,
    # and the /bin/bash that ships with macOS is 3.2.
    local list_ref="${2:-PROTECTED_DAEMONS}[@]"
    local p
    for p in "${!list_ref}"; do
        [[ "$p" == "$daemon" ]] && return 0
    done
    return 1
}

stop_daemon() {
    local daemon="$1"
    local list_name="${2:-PROTECTED_DAEMONS}"
    if is_protected "$daemon" "$list_name"; then
        log "  SKIP (protected): $daemon"
        return 0
    fi
    if launchctl stop "$daemon" 2>/dev/null; then
        log "  STOPPED: $daemon"
        return 0
    else
        log "  WARN: launchctl stop $daemon returned non-zero (may not be loaded)"
        return 0
    fi
}

Level 3 calls stop_daemon "$d" LEVEL_3_PROTECTED; Levels 2 and 4 call stop_daemon "$d" (defaulting to PROTECTED_DAEMONS). This keeps the Monitoring System running during an investigation freeze while halting it on a full Level 4 stop.

Notice that stop_daemon returns 0 even when launchctl stop fails. This is intentional. A non-zero return from launchctl stop typically means the daemon was not loaded — which is fine during an emergency halt. The important thing is that the script doesn't exit early on a warning. It logs it and continues to the next daemon.
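
The continue-on-failure behaviour can be exercised away from a production machine. The sketch below is an illustration only: launchctl is replaced by a stub function, the FLEET list is a shortened hypothetical sample, and is_protected is simplified to a single protected list.

```shell
#!/usr/bin/env bash
# Illustration only: launchctl is stubbed so the control flow can be run
# anywhere. The stub and the shortened FLEET list are hypothetical.

PROTECTED_DAEMONS=("com.operator.watchdog" "com.operator.nats")
FLEET=("com.operator.foresight" "com.operator.nats" "com.operator.perp-bot")

log() { echo "$*"; }

# Stub: pretend foresight is not loaded; every other stop succeeds.
launchctl() { [[ "$2" != "com.operator.foresight" ]]; }

is_protected() {
    local daemon="$1" p
    for p in "${PROTECTED_DAEMONS[@]}"; do
        [[ "$p" == "$daemon" ]] && return 0
    done
    return 1
}

stop_daemon() {
    local daemon="$1"
    if is_protected "$daemon"; then
        log "SKIP (protected): $daemon"
    elif launchctl stop "$daemon" 2>/dev/null; then
        log "STOPPED: $daemon"
    else
        log "WARN: could not stop $daemon"
    fi
    return 0   # never abort the halt sequence on a single failure
}

halt_fleet() {
    local d
    for d in "${FLEET[@]}"; do
        stop_daemon "$d"
    done
}
```

Calling halt_fleet logs a WARN for com.operator.foresight, a SKIP for com.operator.nats, and a STOPPED for com.operator.perp-bot. Because stop_daemon always returns 0, the loop reaches the last daemon even though the first stop failed, and that holds under set -e as well.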

Audit Trail Without a Database

The Python kill switch persists halt state to SQLite via HaltStateStore. The CLI cannot depend on that. Instead, it writes to a simple append-only text file:

AUDIT_LOG="${HOME}/.gateway/shutdown_audit.log"

write_audit() {
    local level="$1"
    local detail="$2"
    ensure_audit_dir
    printf '%s | LEVEL %d | manual-cli | %s\n' "$(ts)" "$level" "$detail" \
        >> "$AUDIT_LOG"
}

The format is: timestamp, level, source (manual-cli so it's distinguishable from broker-triggered halts), and detail. This file is not structured data — it is a plain text record that can be read with cat from anywhere without needing database access.

The ensure_audit_dir call creates the directory if it does not exist. This handles the case where the script is run before the broker has ever initialized its state directory.
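
The two helpers that write_audit relies on are not shown in this article. A minimal sketch of how they might look follows; both bodies are assumptions, as are the AUDIT_LOG environment override and the audit_entries_for_level reader. The date format is the one the script uses for ts.

```shell
#!/usr/bin/env bash
# Sketch of the helpers behind write_audit. ensure_audit_dir, the AUDIT_LOG
# override, and audit_entries_for_level are assumptions for illustration.

AUDIT_LOG="${AUDIT_LOG:-${HOME}/.gateway/shutdown_audit.log}"

# ISO 8601 UTC timestamp
ts() { date -u "+%Y-%m-%dT%H:%M:%SZ"; }

ensure_audit_dir() {
    # mkdir -p is a no-op when the directory already exists.
    mkdir -p "$(dirname "$AUDIT_LOG")"
}

write_audit() {
    local level="$1" detail="$2"
    ensure_audit_dir
    printf '%s | LEVEL %d | manual-cli | %s\n' "$(ts)" "$level" "$detail" \
        >> "$AUDIT_LOG"
}

# Read back only the CLI-triggered entries for a given level.
audit_entries_for_level() {
    grep -F "| LEVEL $1 | manual-cli |" "$AUDIT_LOG"
}
```

Because every field is separated by " | ", the file stays readable with cat while still being easy to filter with grep during a review.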

Level 1: Intent Logging

The Level 1 implementation is where the CLI fallback is most honest about its limitations:

level_1() {
    shift  # remove the level arg
    local assets=("$@")
    if [[ ${#assets[@]} -eq 0 ]]; then
        echo "Usage: broker-halt 1 <asset1> [asset2 ...]" >&2
        exit 1
    fi
    local asset_list asset_json
    asset_list=$(IFS=,; echo "${assets[*]}")
    # Quote each asset separately so the suggested payload is a valid JSON
    # array: ["BTC","ETH"], not the single string ["BTC,ETH"].
    asset_json=$(printf '"%s",' "${assets[@]}")
    asset_json="${asset_json%,}"
    log "Level 1: halting assets [${asset_list}]"
    log "NOTE: Asset-level halts require bot-specific commands. This event has"
    log "      been logged. Send halt directives to each bot manually if broker"
    log "      is unavailable, or use: curl -X POST http://localhost:8400/halt"
    log "      with {\"level\": 1, \"assets\": [${asset_json}]}"
    write_audit 1 "Asset halt requested: ${asset_list}"
    log "Level 1 logged successfully."
}

Level 1 in the CLI does not actually stop asset trading — it logs the intent and provides the operator with the correct curl command if the broker is reachable. Asset-level halts are bot-specific mechanisms that cannot be replicated in pure bash without calling each bot's own API.

This is the right tradeoff. The CLI correctly identifies what it cannot do and gives the operator the information to do it manually. Claiming to halt assets when it cannot actually do so would be worse than being transparent about the limitation.
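
For an operator working from the log message, a copy-pasteable command is easier than a URL plus a payload description. The sketch below prints such a command without running it, which keeps the script itself independent of the broker. The URL and payload shape come from the log message; the Content-Type header and both helper names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: print (never run) a complete curl command for a Level 1 halt.
# The Content-Type header and the helper names are assumptions.

build_halt_payload() {
    # Emit {"level": 1, "assets": ["A","B",...]} with each asset quoted
    # separately. Assumes asset names contain no quotes or backslashes,
    # and that the caller has already rejected an empty asset list.
    local items
    items=$(printf '"%s",' "$@")
    printf '{"level": 1, "assets": [%s]}' "${items%,}"
}

print_halt_command() {
    printf "curl -X POST http://localhost:8400/halt -H 'Content-Type: application/json' -d '%s'\n" \
        "$(build_halt_payload "$@")"
}
```

Running print_halt_command BTC ETH prints a single line that can be pasted into a terminal once the broker is confirmed reachable.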

Level 4: Token Revocation in Bash

The Python _revoke_all_tokens() uses bcrypt to generate a properly-formatted invalid hash. Bash does not have bcrypt. The script generates a random hex string instead:

# Step 3: Revoke agent tokens by zeroing registry DB auth hashes
log "--- Step 3: Revoking agent tokens ---"
if [[ -f "$REGISTRY_DB" ]]; then
    # Generate a random 64-char hex string to overwrite all token hashes.
    # This is not a valid bcrypt hash, so all bearer token validations fail.
    local poison_hash
    poison_hash=$(LC_ALL=C tr -dc 'a-f0-9' < /dev/urandom 2>/dev/null | head -c 64 || true)
    if [[ -z "$poison_hash" ]]; then
        # Fallback: hash timestamp + PID. -a 256 yields 64 hex chars; the
        # shasum default (SHA-1) yields only 40 plus a trailing "  -".
        poison_hash=$(printf '%s-%d-REVOKED' "$(ts)" "$$" | shasum -a 256 | cut -c1-64 || echo "REVOKED-BY-BROKER-HALT-LEVEL4")
    fi
    if sqlite3 "$REGISTRY_DB" \
        "UPDATE agent_registry SET auth_token_hash = '${poison_hash}' WHERE 1=1;" 2>/dev/null; then
        log "  Agent tokens revoked in registry DB."
    else
        log "  WARN: Could not update registry DB at ${REGISTRY_DB}. Tokens may still be valid."
    fi
else
    log "  NOTE: Registry DB not found at ${REGISTRY_DB} — broker not yet installed or path differs."
fi

A random 64-character hex string is not a valid bcrypt hash format. Any token validation that checks the stored hash against a presented bearer token will fail. The tokens are effectively revoked.

The fallback for entropy generation (shasum of timestamp and PID) exists because /dev/urandom with tr can occasionally fail in constrained environments. The double fallback (echo "REVOKED-BY-BROKER-HALT-LEVEL4") is a last resort — even a constant string will invalidate all tokens since it will never match a valid bcrypt hash.
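
The claim that the poison value can never pass as a bcrypt hash can be checked directly. A bcrypt hash in modular crypt format starts with a $2a$, $2b$, or $2y$ prefix, followed by a two-digit cost and 53 characters of salt and digest. The sketch below generates the poison value and tests it against that shape; both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: generate the poison value and confirm it cannot be mistaken for a
# bcrypt hash. Both helper names are hypothetical.

gen_poison_hash() {
    # 64 random lowercase hex characters. tr is killed by SIGPIPE once head
    # exits; || true keeps that from failing under set -o pipefail.
    LC_ALL=C tr -dc 'a-f0-9' < /dev/urandom 2>/dev/null | head -c 64 || true
}

looks_like_bcrypt() {
    # Modular crypt format: $2a$ / $2b$ / $2y$, a two-digit cost, then
    # 53 characters (22 salt + 31 digest) drawn from ./A-Za-z0-9.
    printf '%s' "$1" | grep -Eq '^\$2[aby]\$[0-9]{2}\$[./A-Za-z0-9]{53}$'
}

poison=$(gen_poison_hash)
if looks_like_bcrypt "$poison"; then
    echo "unexpected: poison value has bcrypt shape" >&2
else
    echo "poison value is not bcrypt-shaped"
fi
```

Hex output can never contain the leading $ that the format requires, so the same holds for the shasum fallback and for the constant string.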

The Confirmation Gate

Level 4 requires typing the exact phrase:

level_4() {
    echo ""
    echo "WARNING: AGENT BROKER LEVEL 4 SHUTDOWN"
    echo "This will stop ALL services on the production server and Tesseract."
    echo "Agent auth tokens will be revoked."
    echo ""
    printf 'Type "SHUTDOWN TRADING" to confirm: '
    read -r confirm
    if [[ "$confirm" != "SHUTDOWN TRADING" ]]; then
        echo "Cancelled."
        exit 0
    fi
    # ...
}

The phrase matches the Python kill switch. In a 2am emergency, having the same phrase across both interfaces reduces cognitive load. You don't have to remember "which system uses which phrase" — it is always "SHUTDOWN TRADING."
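
The gate can also be written as a small function, which makes the exact-match rule easy to test. This is a sketch with a hypothetical function name. It differs from the snippet above in two labeled ways: it treats end-of-input as a refusal, and it uses IFS= so that surrounding whitespace is not trimmed before the comparison.

```shell
#!/usr/bin/env bash
# Sketch of the confirmation gate as a reusable function. The function name
# is hypothetical; the phrase is the one used by the script.

confirm_phrase() {
    local expected="SHUTDOWN TRADING" reply=""
    # Prompt on stderr so stdout stays clean for callers.
    printf 'Type "%s" to confirm: ' "$expected" >&2
    # read fails at end of input (closed stdin, Ctrl-D): treat as refusal.
    IFS= read -r reply || return 1
    # Exact, case-sensitive match; anything else cancels.
    [[ "$reply" == "$expected" ]]
}
```

A caller would use it as `confirm_phrase || { echo "Cancelled."; exit 0; }` before starting any destructive step.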

Tesseract Fallback Instructions

If Tesseract is unreachable during Level 4:

if ssh -o ConnectTimeout=5 -o BatchMode=yes \
       "${TESSERACT_USER}@${TESSERACT_IP}" \
       "$tesseract_cmds" 2>/dev/null; then
    log "  Tesseract daemons stopped."
else
    log "  WARN: Tesseract unreachable or SSH failed. Manual halt required."
    log "  SSH to ${TESSERACT_IP} and run: ${tesseract_cmds}"
fi

The script logs the exact command string needed to halt Tesseract manually. When you get the warning, you open a new terminal, SSH to Tesseract yourself, and run what the log tells you to run. No guessing required.
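
The article does not show how tesseract_cmds is assembled. One plausible way is sketched below, purely as an assumption: the TESSERACT_DAEMONS array, its placeholder labels, and build_remote_cmds are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: assemble the remote command string from a list of labels.
# TESSERACT_DAEMONS, its contents, and build_remote_cmds are hypothetical.

TESSERACT_DAEMONS=("com.example.remote-a" "com.example.remote-b")

build_remote_cmds() {
    local cmds="" d
    for d in "$@"; do
        # ';' rather than '&&' so one failed stop does not skip the rest
        cmds+="launchctl stop ${d}; "
    done
    # Drop the trailing separator
    printf '%s' "${cmds%; }"
}

tesseract_cmds=$(build_remote_cmds "${TESSERACT_DAEMONS[@]}")
echo "$tesseract_cmds"
```

Joining with ';' has one side effect: the exit status ssh reports is that of the last command only, so a success message does not prove that every remote daemon stopped.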

The Log File

Everything written through log() goes to /tmp/broker-halt.log via tee, so it is visible on screen in real time and simultaneously written to disk. If you're triggering this from a phone screen with poor visibility, the log file is there when you get back to a proper terminal.

LOG=/tmp/broker-halt.log

log() {
    local msg="[$(ts)] $*"
    echo "$msg" | tee -a "$LOG"
}

The timestamp format is ISO 8601 UTC (date -u "+%Y-%m-%dT%H:%M:%SZ"). Every log line is timestamped. The post-incident review will have a precise timeline without any ambiguity about local vs. UTC time.
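
A fixed-width UTC timestamp has one more practical benefit: lexicographic order equals chronological order, so the run log and the audit log can be merged with plain sort. The helper below is a hypothetical sketch for post-incident use, not part of broker-halt.

```shell
#!/usr/bin/env bash
# Sketch: merge the run log and the audit log into one timeline.
# merge_timeline is a hypothetical helper, not part of broker-halt.

merge_timeline() {
    local run_log="$1" audit_log="$2"
    {
        # "[<ts>] msg"         -> "<ts> run   msg"
        sed -E 's/^\[([^]]+)\] /\1 run   /' "$run_log"
        # "<ts> | LEVEL n ..." -> "<ts> audit LEVEL n ..."
        sed -E 's/^([^ ]+) \| /\1 audit /' "$audit_log"
    } | LC_ALL=C sort -s -k1,1   # stable sort on the timestamp field only
}
```

A typical call would be `merge_timeline /tmp/broker-halt.log "$HOME/.gateway/shutdown_audit.log"`, using the two paths the script writes to.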

Why This Exists Separately

The CLI fallback would be unnecessary if the broker were perfectly reliable. But reliability is not the right frame. The frame is: under what conditions will you need to halt the system, and are those conditions correlated with the broker being available?

If the broker has a bug that causes it to authorize bad trades, the broker is the problem. If the broker is consuming 100% CPU and hanging, the broker is unavailable. If the broker is getting DoS'd by a misbehaving agent flooding the message queue, the broker might not respond.

In all of these scenarios, the answer to "can you still halt the system?" needs to be yes. broker-halt is that yes.

Zero-dependency safety infrastructure is not over-engineering. It is the only kind that works when it actually needs to.

The CLI Fallback