Recovery Protocol
principal-start enforces a pre-restart checklist before any service comes back up. The recovery sequence, the precondition gates, and why quarterly drills are non-negotiable.
Stopping a system is the easy part. Any script can kill processes. The hard part is knowing when it is safe to restart — and starting in the right order without re-introducing the problem you just stopped.
principal-start encodes the recovery protocol. It enforces preconditions before any service comes back up. It starts services in the correct order. It tells you what to verify after startup. And it explicitly tells you that trading bots should run in dry-run mode for 30 minutes before live trading is re-enabled.
This lesson walks through every decision in that script.
The Precondition Architecture
The script's first job is not to start anything. It is to verify the environment is ready to accept services.
# Counters are initialized before any check runs
PASS=0
FAIL=0

run_check() {
    local name="$1"
    local fn="$2"
    if $fn; then
        PASS=$((PASS + 1))
    else
        FAIL=$((FAIL + 1))
    fi
}
run_check "NATS" check_nats
run_check "Watchdog" check_watchdog
run_check "Memory Service" check_memory_service
run_check "Trading Server" check_trading_server
Four checks, each with its own function. The results are collected and evaluated before any startup proceeds.
if [[ $FAIL -gt 0 ]]; then
    echo "ERROR: ${FAIL} precondition(s) failed. Fix them before proceeding."
    echo "See above for remediation steps."
    log "Preconditions failed (${FAIL} failures). Aborting startup."
    exit 1
fi
If any check fails, nothing starts. The operator sees exactly which checks failed and follows the remediation steps embedded in the check output.
The Four Preconditions
NATS
check_nats() {
    if launchctl list 2>/dev/null | grep -q "com.host.nats"; then
        pass "NATS is loaded (com.host.nats)"
        return 0
    else
        fail "NATS not found in launchctl list. Start it first: launchctl start com.host.nats"
        return 1
    fi
}
NATS is a blocking precondition. If NATS is not loaded, the broker cannot start (it subscribes to NATS on init), the agents cannot receive messages, and the routing layer cannot function. Starting services without NATS running is not a degraded startup — it is a broken startup.
The check looks for the daemon in launchctl list, not for an open port. This is intentional: a port check captures only a single instant. If NATS loaded and then crashed immediately, a port check could pass at that moment and fail seconds later. The launchctl check tells you whether the process-management layer knows about the service at all.
If this check fails, the operator runs launchctl start com.host.nats directly — the exact command is in the error output — and then re-runs principal-start.
Watchdog
check_watchdog() {
    if launchctl list 2>/dev/null | grep -q "com.host.watchdog"; then
        pass "Watchdog service is loaded (com.host.watchdog)"
        return 0
    else
        fail "Watchdog service not found in launchctl list. Start it first: launchctl start com.host.watchdog"
        return 1
    fi
}
The watchdog service is a blocking precondition. After recovery, the watchdog must be running before services start, so that any service that crashes immediately after startup is detected and handled. Starting services first leaves an unmonitored window between service startup and watchdog coverage.
The watchdog service and NATS are protected daemons that survive all halt levels. If they are not in launchctl after a halt, something outside the normal halt sequence happened — hardware restart, manual daemon removal, OS update. The check surfaces this before it becomes a silent problem.
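Both protected daemons can be confirmed in one pass. A minimal sketch, assuming the launchctl labels quoted above; the helper name is illustrative and takes the launchctl list output as an argument so it can be exercised without launchd:

```shell
# Sketch: confirm both protected daemons appear in `launchctl list` output.
# The labels (com.host.nats, com.host.watchdog) come from the checks above;
# the function itself is not part of principal-start.
protected_daemons_loaded() {
    local listing="$1"   # output of `launchctl list`
    local missing=0
    local label
    for label in com.host.nats com.host.watchdog; do
        if ! grep -q "$label" <<<"$listing"; then
            echo "MISSING: $label"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: protected_daemons_loaded "$(launchctl list)" || echo "investigate before starting services"
```

If either label is missing after a halt, the safe move is to investigate why before starting anything, since nothing in the normal halt sequence removes them.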
Semantic Memory Layer
check_memory_service() {
    local status
    status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 http://localhost:8002/health 2>/dev/null || echo "000")
    if [[ "$status" == "200" ]]; then
        pass "Semantic memory layer is responsive (HTTP 200)"
        return 0
    else
        fail "Semantic memory layer not responsive at http://localhost:8002/health (got: HTTP ${status})"
        warn "Check Docker: docker ps | grep memory"
        return 1
    fi
}
The semantic memory layer is a blocking precondition. Agents load boot context from it on startup. A broker that starts without the memory layer available will fail to assemble boot context packages, and agents that start will have empty memory — they will not know their current state, their recent history, or the context they need to operate correctly.
The check is a curl against the health endpoint with a 3-second max timeout. The || echo "000" fallback handles cases where curl itself fails (Docker not running, port not bound). The warn "Check Docker:" line tells the operator the most common cause and the first diagnostic command.
Trading Server
check_trading_server() {
    if ssh -o ConnectTimeout=3 -o BatchMode=yes \
        "${TRADING_SERVER_USER}@${TRADING_SERVER_IP}" echo ok 2>/dev/null | grep -q "ok"; then
        pass "Trading server is reachable (${TRADING_SERVER_IP})"
        return 0
    else
        warn "Trading server not reachable at ${TRADING_SERVER_IP} — trading daemons cannot be verified"
        warn "Proceeding without trading server confirmation. Verify manually before enabling live trading."
        return 0  # Non-blocking: we can start primary host services even if trading server is unreachable
    fi
}
The trading server is a non-blocking precondition. Primary host services can start even if the trading server is unreachable — it hosts the trading bots but not the primary infrastructure. The operator is warned that trading server state is unverified and must not enable live trading until it is confirmed.
This non-blocking decision reflects operational reality: the trading server might be temporarily unreachable due to a network issue that has nothing to do with the halt. You should not hold up an entire recovery because one machine is temporarily off the local network.
The Startup Order
After all preconditions pass, the script prompts for confirmation and then starts services in a specific order:
# 1. Broker first
log "Starting broker..."
start_daemon "$BROKER_DAEMON"

# Give broker a moment to initialize before dependent services connect
sleep 2

# 2. Trading daemons
log "Starting trading daemons..."
for daemon in "${TRADING_DAEMONS[@]}"; do
    start_daemon "$daemon"
done

# 3. Content/infra daemons
log "Starting content/infra daemons..."
for daemon in "${CONTENT_DAEMONS[@]}"; do
    start_daemon "$daemon"
done
The broker starts first. It initializes the registry, establishes NATS subscriptions, and sets up the authority enforcement layer. The 2-second sleep gives it time to initialize before dependent services connect.
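A fixed sleep works, but a bounded readiness poll removes the guesswork. A sketch, assuming the broker health endpoint shown in the verification section; the 10-attempt budget and function name are illustrative:

```shell
# Sketch: poll the broker health endpoint instead of sleeping a fixed 2s.
# The URL matches the verification section; the attempt budget is an assumption.
wait_for_broker() {
    local url="${1:-http://127.0.0.1:8400/health}"
    local attempts="${2:-10}"
    local i code
    for ((i = 1; i <= attempts; i++)); do
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 1 "$url" 2>/dev/null || echo "000")
        [[ "$code" == "200" ]] && return 0
        sleep 1
    done
    return 1
}

# Usage: start_daemon "$BROKER_DAEMON"; wait_for_broker || log "broker not healthy, aborting"
```

The advantage over a fixed sleep is that a slow broker gets up to ten seconds, while a fast one delays the dependent daemons by at most one polling interval.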
Trading daemons start second because they are the highest-priority services. After a halt that stopped trading, the primary recovery objective is restoring trading capability (after verification).
Content and infrastructure daemons start last. They are lower priority and have no dependencies on trading daemons.
Post-Startup Verification
The script provides explicit verification commands:
echo "Services started. Verify health:"
echo " curl http://localhost:8400/health # broker"
echo " curl http://127.0.0.1:8080/health # InDecision Engine"
echo " launchctl list | grep com.host # all daemons"
These are not hints. They are instructions. The operator runs all three before declaring recovery complete.
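The HTTP checks can be wrapped in a convenience pass that prints a one-line verdict per service. A sketch, not part of principal-start itself; the endpoint list mirrors the instructions above:

```shell
# Sketch: check a health endpoint and report OK/FAIL on one line.
verify_endpoint() {
    local name="$1" url="$2"
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$url" 2>/dev/null || echo "000")
    if [[ "$code" == "200" ]]; then
        echo "OK   ${name}"
    else
        echo "FAIL ${name} (HTTP ${code})"
        return 1
    fi
}

# Usage (run all three before declaring recovery complete):
#   verify_endpoint "broker"            http://127.0.0.1:8400/health
#   verify_endpoint "InDecision Engine" http://127.0.0.1:8080/health
#   launchctl list | grep com.host
```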
The InDecision Engine check uses 127.0.0.1 explicitly, not localhost. On macOS with some network configurations, localhost can resolve to the IPv6 address ::1, which does not match a service bound only to IPv4. Always use 127.0.0.1 for local service health checks.
The Dry-Run Mandate
echo "NOTE: Trading bots should run in DRY_RUN mode for 30 min before live."
echo "NOTE: Enable live trading explicitly via Principal UI after verification."
This is not a suggestion. After any halt at Level 3 or Level 4, trading bots must run in DRY_RUN mode for at least 30 minutes before live trading is re-enabled.
The reason is not primarily about the bots being misconfigured — it is about the market state having changed during the halt. A trading bot that resumes live trading immediately after a halt will enter markets with stale conviction signals, potentially degraded open positions, and no recent history to calibrate against. DRY_RUN gives it 30 minutes to catch up before real money is at risk.
The 30-minute window also serves as a monitoring period. If the system is going to misbehave, it will usually manifest within the first few minutes of operation. Observing DRY_RUN logs for half an hour before going live provides a meaningful safety check.
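The 30-minute window can be enforced rather than remembered. A sketch under stated assumptions: the marker-file path and both function names are hypothetical, not something principal-start writes today:

```shell
# Sketch: gate live trading on 30 minutes of elapsed DRY_RUN time.
# The marker file path is a hypothetical convention for this example.
DRY_RUN_MARKER="${DRY_RUN_MARKER:-/tmp/principal-dry-run-start}"

mark_dry_run_start() {
    date +%s > "$DRY_RUN_MARKER"
}

dry_run_window_elapsed() {
    local min_seconds=$((30 * 60))
    [[ -f "$DRY_RUN_MARKER" ]] || return 1   # no marker: dry run never started
    local started now
    started=$(cat "$DRY_RUN_MARKER")
    now=$(date +%s)
    (( now - started >= min_seconds ))
}

# Usage: dry_run_window_elapsed || { echo "still in DRY_RUN window"; exit 1; }
```

A gate like this turns "should run for 30 minutes" from an operator habit into a check that the live-enable path can refuse to bypass.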
Quarterly Drills
A kill switch that has never been tested is not a kill switch you can trust. Quarterly drills are the operational mandate for this system.
A drill proceeds as follows:
- Announce the drill on Discord to the operator's personal channel (not the bots' logs channel)
- Trigger a Level 2 halt via principal-halt 2
- Verify all four trading daemons stopped via launchctl list | grep com.host
- Verify the watchdog service and NATS are still running
- Run principal-start and confirm all preconditions pass
- Resume services and verify health endpoints
- Confirm DRY_RUN mode is active for trading bots
- After 30 minutes, re-enable live trading
- Log the drill in the shutdown audit log
The drill reveals:
- Whether the daemon lists in the script are still accurate (new services may have been added)
- Whether the SSH configuration to the trading server still works
- Whether the recovery time meets acceptable operational thresholds
- Whether anyone on the team (human or agent) would know what to do in a real incident
The first drill always finds something. The second drill usually validates the fix from the first. By the fourth drill, the process is routine.
When Recovery is Not Straightforward
Not every halt is clean. Some scenarios require additional steps before principal-start can succeed:
Token re-issuance. After a Level 4 halt, all agent auth tokens are revoked. Before agents can reconnect to the broker, their tokens must be re-issued. This is done via the broker's admin API after the broker starts.
Memory layer Docker restart. If the halt involved a machine restart, Docker services including the semantic memory layer must be brought up explicitly: docker compose -f ~/Documents/Dev/docker-compose.yml up -d memory-service
Trading server manual verification. If the Level 4 SSH halt failed, SSH to the trading server manually and run the stop commands before restarting primary host services that depend on trading server state.
Env file permissions. After Level 4, env files are set to chmod 000. They must be restored to 600 before any service that reads them can start: chmod 600 ~/.principal/.env ~/.config/openclaw/.env
These are not edge cases. They are the expected cleanup for Level 4 shutdowns. The quarterly drill should exercise them.
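The env-file permission restore in particular benefits from failing loudly per file instead of stopping partway. A small sketch; the helper name is illustrative and the paths are the ones quoted above:

```shell
# Sketch: restore env file permissions after a Level 4 halt, reporting any
# file that cannot be fixed rather than failing silently on the first error.
restore_env_perms() {
    local failed=0 f
    for f in "$@"; do
        if ! chmod 600 "$f" 2>/dev/null; then
            echo "could not restore permissions on: $f"
            failed=1
        fi
    done
    return "$failed"
}

# Usage: restore_env_perms ~/.principal/.env ~/.config/openclaw/.env
```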