Recovery Protocol
principal-start enforces a pre-restart checklist before any service comes back up. The recovery sequence, the precondition gates, and why quarterly drills are non-negotiable.
Stopping a system is the easy part. Any script can kill processes. The hard part is knowing when it is safe to restart — and starting in the right order without re-introducing the problem you just stopped.
principal-start encodes the recovery protocol. It enforces preconditions before any service comes back up. It starts services in the correct order. It tells you what to verify after startup. And it explicitly tells you that trading bots should run in dry-run mode for 30 minutes before live trading is re-enabled.
This lesson walks through every decision in that script.
The Precondition Architecture
The script's first job is not to start anything. It is to verify the environment is ready to accept services.
# Counters are initialized before any check runs
PASS=0
FAIL=0

run_check() {
    local name="$1"
    local fn="$2"
    if $fn; then
        PASS=$((PASS + 1))
    else
        FAIL=$((FAIL + 1))
    fi
}
run_check "NATS" check_nats
run_check "Watchdog" check_watchdog
run_check "Memory Service" check_memory_service
run_check "Trading Server" check_trading_server
Four checks, each with its own function. The results are collected and evaluated before any startup proceeds.
if [[ $FAIL -gt 0 ]]; then
    echo "ERROR: ${FAIL} precondition(s) failed. Fix them before proceeding."
    echo "See above for remediation steps."
    log "Preconditions failed (${FAIL} failures). Aborting startup."
    exit 1
fi
If any check fails, nothing starts. The operator sees exactly which checks failed and follows the remediation steps embedded in the check output.
The Four Preconditions
NATS
check_nats() {
    if launchctl list 2>/dev/null | grep -q "com.host.nats"; then
        pass "NATS is loaded (com.host.nats)"
        return 0
    else
        fail "NATS not found in launchctl list. Start it first: launchctl start com.host.nats"
        return 1
    fi
}
NATS is a blocking precondition. If NATS is not loaded, the broker cannot start (it subscribes to NATS on init), the agents cannot receive messages, and the routing layer cannot function. Starting services without NATS running is not a degraded startup — it is a broken startup.
The check looks for the daemon in launchctl list, not for an open port. This is intentional: a port check captures only a single instant. If NATS loaded and then crashed immediately, a port check could pass at that moment and fail seconds later. The launchctl check tells you whether the process-management layer knows about the service at all.
If this check fails, the operator runs launchctl start com.host.nats directly — the exact command is in the error output — and then re-runs principal-start.
Watchdog
check_watchdog() {
    if launchctl list 2>/dev/null | grep -q "com.host.watchdog"; then
        pass "Watchdog service is loaded (com.host.watchdog)"
        return 0
    else
        fail "Watchdog service not found in launchctl list. Start it first: launchctl start com.host.watchdog"
        return 1
    fi
}
The watchdog service is a blocking precondition. After recovery, the watchdog must be running before services start, so that any service that crashes immediately after startup is detected and handled. Starting services first leaves an unmonitored window between service startup and watchdog coverage.
The watchdog service and NATS are protected daemons that survive all halt levels. If they are not in launchctl after a halt, something outside the normal halt sequence happened — hardware restart, manual daemon removal, OS update. The check surfaces this before it becomes a silent problem.
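Both protected daemons can be confirmed in one pass. A minimal sketch, assuming the launchctl labels quoted above; the helper name is illustrative and takes the launchctl list output as an argument so it can be exercised without launchd:

```shell
# Sketch: confirm both protected daemons appear in `launchctl list` output.
# The labels (com.host.nats, com.host.watchdog) come from the checks above;
# the function itself is not part of principal-start.
protected_daemons_loaded() {
    local listing="$1"   # output of `launchctl list`
    local missing=0
    local label
    for label in com.host.nats com.host.watchdog; do
        if ! grep -q "$label" <<<"$listing"; then
            echo "MISSING: $label"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: protected_daemons_loaded "$(launchctl list)" || echo "investigate before starting services"
```

If either label is missing after a halt, the safe move is to investigate why before starting anything, since nothing in the normal halt sequence removes them.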
Semantic Memory Layer
check_memory_service() {
    local status
    status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 http://localhost:8002/health 2>/dev/null || echo "000")
    if [[ "$status" == "200" ]]; then
        pass "Semantic memory layer is responsive (HTTP 200)"
        return 0
    else
        fail "Semantic memory layer not responsive at http://localhost:8002/health (got: HTTP ${status})"
        warn "Check Docker: docker ps | grep memory"
        return 1
    fi
}
The semantic memory layer is a blocking precondition. Agents load boot context from it on startup. A broker that starts without the memory layer available will fail to assemble boot context packages, and agents that start will have empty memory — they will not know their current state, their recent history, or the context they need to operate correctly.
The check is a curl against the health endpoint with a 3-second max timeout. The || echo "000" fallback handles cases where curl itself fails (Docker not running, port not bound). The warn "Check Docker:" line tells the operator the most common cause and the first diagnostic command.
Trading Server
check_trading_server() {
    if ssh -o ConnectTimeout=3 -o BatchMode=yes \
        "${TRADING_SERVER_USER}@${TRADING_SERVER_IP}" echo ok 2>/dev/null | grep -q "ok"; then
        pass "Trading server is reachable (${TRADING_SERVER_IP})"
        return 0
    else
        warn "Trading server not reachable at ${TRADING_SERVER_IP} — trading daemons cannot be verified"
        warn "Proceeding without trading server confirmation. Verify manually before enabling live trading."
        return 0  # Non-blocking: we can start primary host services even if trading server is unreachable
    fi
}
The trading server is a non-blocking precondition. Primary host services can start even if the trading server is unreachable — it hosts the trading bots but not the primary infrastructure. The operator is warned that trading server state is unverified and must not enable live trading until it is confirmed.
This non-blocking decision reflects operational reality: the trading server might be temporarily unreachable due to a network issue that has nothing to do with the halt. You should not hold up an entire recovery because one machine is temporarily off the local network.
The Startup Order
After all preconditions pass, the script prompts for confirmation and then starts services in a specific order:
# 1. Broker first
log "Starting broker..."
start_daemon "$BROKER_DAEMON"

# Give broker a moment to initialize before dependent services connect
sleep 2

# 2. Trading daemons
log "Starting trading daemons..."
for daemon in "${TRADING_DAEMONS[@]}"; do
    start_daemon "$daemon"
done

# 3. Content/infra daemons
log "Starting content/infra daemons..."
for daemon in "${CONTENT_DAEMONS[@]}"; do
    start_daemon "$daemon"
done
The broker starts first. It initializes the registry, establishes NATS subscriptions, and sets up the authority enforcement layer. The 2-second sleep gives it time to initialize before dependent services connect.
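A fixed sleep works, but a bounded readiness poll removes the guesswork. A sketch, assuming the broker health endpoint shown in the verification section; the 10-attempt budget and function name are illustrative:

```shell
# Sketch: poll the broker health endpoint instead of sleeping a fixed 2s.
# The URL matches the verification section; the attempt budget is an assumption.
wait_for_broker() {
    local url="${1:-http://127.0.0.1:8400/health}"
    local attempts="${2:-10}"
    local i code
    for ((i = 1; i <= attempts; i++)); do
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 1 "$url" 2>/dev/null || echo "000")
        [[ "$code" == "200" ]] && return 0
        sleep 1
    done
    return 1
}

# Usage: start_daemon "$BROKER_DAEMON"; wait_for_broker || log "broker not healthy, aborting"
```

The advantage over a fixed sleep is that a slow broker gets up to ten seconds, while a fast one delays the dependent daemons by at most one polling interval.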
Trading daemons start second because they are the highest-priority services. After a halt that stopped trading, the primary recovery objective is restoring trading capability (after verification).
Content and infrastructure daemons start last. They are lower priority and have no dependencies on trading daemons.
Post-Startup Verification
The script provides explicit verification commands:
echo "Services started. Verify health:"
echo " curl http://localhost:8400/health # broker"
echo " curl http://127.0.0.1:8080/health # InDecision Engine"
echo " launchctl list | grep com.host # all daemons"
These are not hints. They are instructions. The operator runs all three before declaring recovery complete.
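The HTTP checks can be wrapped in a convenience pass that prints a one-line verdict per service. A sketch, not part of principal-start itself; the endpoint list mirrors the instructions above:

```shell
# Sketch: check a health endpoint and report OK/FAIL on one line.
verify_endpoint() {
    local name="$1" url="$2"
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$url" 2>/dev/null || echo "000")
    if [[ "$code" == "200" ]]; then
        echo "OK   ${name}"
    else
        echo "FAIL ${name} (HTTP ${code})"
        return 1
    fi
}

# Usage (run all three before declaring recovery complete):
#   verify_endpoint "broker"            http://127.0.0.1:8400/health
#   verify_endpoint "InDecision Engine" http://127.0.0.1:8080/health
#   launchctl list | grep com.host
```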
The InDecision Engine check uses 127.0.0.1 explicitly, not localhost. On macOS with some network configurations, localhost can resolve to the IPv6 address ::1, which does not match a service bound only to IPv4. Always use 127.0.0.1 for local service health checks.
The Dry-Run Mandate
echo "NOTE: Trading bots should run in DRY_RUN mode for 30 min before live."
echo "NOTE: Enable live trading explicitly via Principal UI after verification."
This is not a suggestion. After any halt at Level 3 or Level 4, trading bots must run in DRY_RUN mode for at least 30 minutes before live trading is re-enabled.
The reason is not primarily about the bots being misconfigured — it is about the market state having changed during the halt. A trading bot that resumes live trading immediately after a halt will enter markets with stale conviction signals, potentially degraded open positions, and no recent history to calibrate against. DRY_RUN gives it 30 minutes to catch up before real money is at risk.
The 30-minute window also serves as a monitoring period. If the system is going to misbehave, it will usually manifest within the first few minutes of operation. Observing DRY_RUN logs for half an hour before going live provides a meaningful safety check.
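The 30-minute window can be enforced rather than remembered. A sketch under stated assumptions: the marker-file path and both function names are hypothetical, not something principal-start writes today:

```shell
# Sketch: gate live trading on 30 minutes of elapsed DRY_RUN time.
# The marker file path is a hypothetical convention for this example.
DRY_RUN_MARKER="${DRY_RUN_MARKER:-/tmp/principal-dry-run-start}"

mark_dry_run_start() {
    date +%s > "$DRY_RUN_MARKER"
}

dry_run_window_elapsed() {
    local min_seconds=$((30 * 60))
    [[ -f "$DRY_RUN_MARKER" ]] || return 1   # no marker: dry run never started
    local started now
    started=$(cat "$DRY_RUN_MARKER")
    now=$(date +%s)
    (( now - started >= min_seconds ))
}

# Usage: dry_run_window_elapsed || { echo "still in DRY_RUN window"; exit 1; }
```

A gate like this turns "should run for 30 minutes" from an operator habit into a check that the live-enable path can refuse to bypass.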
Quarterly Drills
A kill switch that has never been tested is not a kill switch you can trust. Quarterly drills are the operational mandate for this system.
A drill proceeds as follows:
- Announce the drill on Discord to the operator's personal channel (not the bots' logs channel)
- Trigger a Level 2 halt via principal-halt 2
- Verify all four trading daemons stopped via launchctl list | grep com.host
- Verify the watchdog service and NATS are still running
- Run principal-start and confirm all preconditions pass
- Resume services and verify health endpoints
- Confirm DRY_RUN mode is active for trading bots
- After 30 minutes, re-enable live trading
- Log the drill in the shutdown audit log
The drill reveals:
- Whether the daemon lists in the script are still accurate (new services may have been added)
- Whether the SSH configuration to the trading server still works
- Whether the recovery time meets acceptable operational thresholds
- Whether anyone on the team (human or agent) would know what to do in a real incident
The first drill always finds something. The second drill usually validates the fix from the first. By the fourth drill, the process is routine.
When Recovery is Not Straightforward
Not every halt is clean. Some scenarios require additional steps before principal-start can succeed:
Token re-issuance. After a Level 4 halt, all agent auth tokens are revoked. Before agents can reconnect to the broker, their tokens must be re-issued. This is done via the broker's admin API after the broker starts.
Memory layer Docker restart. If the halt involved a machine restart, Docker services including the semantic memory layer must be brought up explicitly: docker compose -f ~/Documents/Dev/docker-compose.yml up -d memory-service
Trading server manual verification. If the Level 4 SSH halt failed, SSH to the trading server manually and run the stop commands before restarting primary host services that depend on trading server state.
Env file permissions. After Level 4, env files are set to chmod 000. They must be restored to 600 before any service that reads them can start: chmod 600 ~/.principal/.env ~/.config/openclaw/.env
These are not edge cases. They are the expected cleanup for Level 4 shutdowns. The quarterly drill should exercise them.
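The env-file permission restore in particular benefits from failing loudly per file instead of stopping partway. A small sketch; the helper name is illustrative and the paths are the ones quoted above:

```shell
# Sketch: restore env file permissions after a Level 4 halt, reporting any
# file that cannot be fixed rather than failing silently on the first error.
restore_env_perms() {
    local failed=0 f
    for f in "$@"; do
        if ! chmod 600 "$f" 2>/dev/null; then
            echo "could not restore permissions on: $f"
            failed=1
        fi
    done
    return "$failed"
}

# Usage: restore_env_perms ~/.principal/.env ~/.config/openclaw/.env
```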