Building a Drift Detection System
From manual SSH checks to automated deployment sync. How to build a system that catches drift before your users do.
The Capstone: From Incident to System
The March 29, 2026 incident exposed a cluster of independent failures:
- A Docker service ran stale code for an unknown period (Lesson 233)
- Testing via localhost masked a broken Tailscale-interface instance (Lesson 237)
- A --no-cache rebuild was required three times because cached COPY layers hid permission fixes (Lesson 238)
- Two Shiva instances ran simultaneously against the same wallet for hours (Lesson 239)
- Confirming the deployed version required SSH + docker exec + grep (Lesson 240)
None of these failures were catastrophic in isolation. Together, they represent a systemic problem: no automated system was watching the gap between what should be running and what was actually running.
This lesson builds that system from scratch.
The Evolution: How Drift Detection Matures
Infrastructure monitoring typically evolves through four stages. Most homelab and small-team setups stall at stage two.
Stage 1 — Manual SSH audits. "Let me SSH in and check." Reactive, slow, error-prone, only happens when something is already wrong. The March 29 state.
Stage 2 — Ad-hoc scripts. A bash script lives in ~/scripts/check-services.sh. It runs when you remember to run it. Better than nothing, but still reactive.
Stage 3 — Scheduled crons. The script runs on a timer — nightly, or after every deployment. Findings land in a log file or Discord channel. This catches drift within hours rather than days.
Stage 4 — Skill-based automation with remediation guardrails. An agent dispatches parallel probes to every machine, synthesizes findings with severity tiers, posts structured reports, and offers (but does not automatically execute) remediation for high-risk services. This is what the deployment-sync skill implements.
What to Detect: The Drift Surface Area
A complete drift detection system monitors across five categories. Miss any one and you have a blind spot.
1. Git Drift
Is the running code behind main? By how many commits? Which files changed?
Detection method: compare the running service's git SHA (from /health endpoint, see Lesson 240) against git rev-parse HEAD on the main branch of the canonical repository.
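As a minimal sketch of that comparison, assuming a local clone of the canonical repo and that /health may report a truncated SHA (latest_main_sha and shas_match are hypothetical helpers, not part of the skill):

```python
import subprocess

def latest_main_sha(repo_path: str) -> str:
    """Full SHA of main in a local checkout of the canonical repository."""
    return subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "main"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def shas_match(running: str, expected: str) -> bool:
    """Tolerate short SHAs from /health: either value may be a prefix of the other."""
    if not running or running == "unknown":
        return False
    return expected.startswith(running) or running.startswith(expected)
```

Prefix matching matters because health endpoints often bake in the seven-character short SHA rather than the full 40 characters.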
2. Docker Image Age
When was the running Docker image built? Is it based on a commit that has since been superseded?
Detection method: docker inspect <container> | grep Created. Compare against the last successful CI build timestamp.
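A hedged sketch of the age computation, assuming the RFC 3339 nanosecond-precision timestamp that `docker inspect -f '{{.Created}}'` emits (image_age_days is an illustrative helper):

```python
import re
from datetime import datetime, timezone
from typing import Optional

def image_age_days(created_iso: str, now: Optional[datetime] = None) -> float:
    """Age in days of a Docker image, from its Created field.

    Docker reports nanosecond precision, but datetime.fromisoformat handles
    at most microseconds, so the fraction is trimmed first.
    """
    trimmed = re.sub(r"\.(\d{1,6})\d*", r".\1", created_iso).replace("Z", "+00:00")
    created = datetime.fromisoformat(trimmed)
    now = now or datetime.now(timezone.utc)
    return (now - created).total_seconds() / 86400
```

Anything older than the last successful CI build timestamp for that repo is a stale-image finding.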
3. Permission and Configuration Anomalies
Are any service files world-writable? Are config files readable by the wrong user? Are secrets files accidentally set to 644 instead of 600?
Detection method: find /app -type f -perm /o+w inside each container. Flag deviations from expected permission posture.
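For illustration, the same posture check as a local-filesystem walk (a sketch; the .env/.key suffixes and the 600-only rule for secrets are assumptions about the expected posture):

```python
import os
import stat

def permission_findings(root: str) -> list:
    """Flag world-writable files and secrets files readable by group/other."""
    findings = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if mode & stat.S_IWOTH:
                findings.append(f"world-writable: {path}")
            # Assumed convention: secrets end in .env/.key and must be 600
            if name.endswith((".env", ".key")) and (mode & 0o077):
                findings.append(f"secrets file too open ({oct(mode & 0o777)}): {path}")
    return findings
```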
4. Zombie and Duplicate Processes
Are any services running multiple instances? Are there orphaned processes from previous sessions?
Detection method: pgrep -c <service-name> — a count greater than 1 is a duplicate. Cross-reference against known PIDs in lockfiles.
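The cross-reference logic can be sketched as a pure function over the parsed pgrep output and the lockfile PID (state names here are assumptions, not the skill's schema):

```python
from typing import List, Optional

def classify_instances(live_pids: List[int], lockfile_pid: Optional[int]) -> str:
    """Cross-reference live PIDs (parsed from `pgrep -f <name>`) against the
    PID recorded in the service's lockfile."""
    if len(live_pids) > 1:
        return "duplicate"      # the Lesson 239 failure mode
    if not live_pids:
        return "not_running"
    if lockfile_pid is not None and live_pids[0] != lockfile_pid:
        return "orphan"         # running process is not the lockfile owner
    return "ok"
```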
5. Failed and Degraded Services
Are launchd jobs in a failed state? Are health endpoints returning non-200?
Detection method: launchctl list | grep com.knox to check job status. Health endpoint polling for HTTP 200.
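A sketch of parsing that launchctl output, relying on its standard three-column layout (PID, last exit status, label); the helper name is illustrative:

```python
from typing import List

def failed_launchd_jobs(listing: str, prefix: str = "com.knox") -> List[str]:
    """Return labels under `prefix` whose last exit status was nonzero.

    `listing` is raw `launchctl list` output: PID ("-" if not running),
    last exit status, and label, whitespace-separated.
    """
    failed = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[2].startswith(prefix):
            continue
        _pid, status, label = parts
        try:
            if int(status) != 0:
                failed.append(label)
        except ValueError:
            continue  # header row or malformed line
    return failed
```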
Architecture: The Four-Phase Model
A drift detection system has four phases. Each feeds the next.
Inventory → Probes → Comparison → Alerting → [Optional Remediation]
Inventory — The known state. A manifest of every service, where it runs, what version it should be, and what its health endpoint URL is. This is your ground truth. Without it, you cannot define "drift."
Probes — Active queries against each machine and service. SSH commands, HTTP health checks, docker inspect calls. Probes gather the actual current state.
Comparison — Diff the inventory against the probe results. Every deviation is a finding. Each finding gets a severity tier: CRITICAL (trading service, immediate action), HIGH (data service, urgent), LOW (informational, non-urgent).
Alerting — Deliver findings to the operator with enough context to act. Not just "drift detected" but "akashic-records on knox-mac-mini is running commit a3f8c21 (2026-03-27), main is b9d1e44 (2026-03-29). 4 commits behind. Last changed: indexer.py — reindex logic updated."
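Rendering a finding in that style can be sketched as follows (the field names are assumptions, not the skill's actual schema):

```python
def format_drift_alert(finding: dict) -> str:
    """Render a git-drift finding with enough context to act on."""
    return (
        f"{finding['service']} on {finding['machine']} is running commit "
        f"{finding['running_sha']} ({finding['running_date']}), main is "
        f"{finding['main_sha']} ({finding['main_date']}). "
        f"{finding['commits_behind']} commits behind. "
        f"Last changed: {finding['last_changed']}"
    )
```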
The deployment-sync Skill: Real Implementation
After the March 29 incident, the deployment-sync skill was built to automate this process. It dispatches parallel audit agents to Mac Mini and Tesseract, gathers their state reports, and synthesizes a drift report.
The skill's high-level execution flow:
1. Load service manifest (inventory)
2. Dispatch Agent A → Mac Mini (SSH + health checks)
Dispatch Agent B → Tesseract (SSH + health checks) [parallel]
3. Both agents return structured state JSON
4. Synthesizer: diff state vs manifest → findings list
5. Severity-tier findings: CRITICAL / HIGH / LOW
6. Format drift report
7. Post to Discord #logs channel
8. For CRITICAL findings: page operator via OpenClaw event
9. For LOW/HIGH: await operator instruction before remediation
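Step 4 of that flow — the synthesizer — can be sketched as a diff of probe results against the manifest (a hedged sketch; field names and the probe_missing finding type are assumptions):

```python
from typing import List

def synthesize(manifest: dict, probe_results: List[dict]) -> dict:
    """Bucket probe findings by severity tier; flag services with no result."""
    report = {"CRITICAL": [], "HIGH": [], "LOW": []}
    probed = {r["name"]: r for r in probe_results}
    for svc in manifest["services"]:
        result = probed.get(svc["name"])
        if result is None:
            # A missing probe result is itself a critical finding
            report["CRITICAL"].append(
                {"service": svc["name"], "type": "probe_missing",
                 "detail": "no state returned"}
            )
            continue
        for finding in result.get("findings", []):
            tier = finding.get("severity", "LOW")
            report.get(tier, report["LOW"]).append(
                {"service": svc["name"], **finding}
            )
    return report
```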
The manifest (inventory file):
```json
{
  "services": [
    {
      "name": "akashic-records",
      "machine": "knox-mac-mini",
      "host": "100.91.193.23",
      "health_url": "http://100.91.193.23:8002/health",
      "repo": "Invictus-Labs/akashic-records",
      "launchd_label": null,
      "docker_service": "akashic",
      "risk_tier": "HIGH"
    },
    {
      "name": "shiva",
      "machine": "tesseract",
      "host": "192.168.1.150",
      "health_url": "http://192.168.1.150:8003/health",
      "repo": "Invictus-Labs/shiva",
      "launchd_label": "com.knox.shiva",
      "docker_service": null,
      "risk_tier": "CRITICAL"
    }
  ]
}
```
The probe agent (simplified):
```python
import httpx

# Helpers like get_latest_main_sha, count_commits_behind, and ssh_exec
# are assumed to be defined elsewhere in the skill.

async def probe_service(service: dict) -> dict:
    result = {
        "name": service["name"],
        "machine": service["machine"],
        "findings": [],
    }

    # Health check
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.get(service["health_url"], timeout=5)
        health = resp.json()
        result["running_sha"] = health.get("version", "unknown")
        result["pid"] = health.get("pid")
        result["uptime_seconds"] = health.get("uptime_seconds")
    except Exception as e:
        result["findings"].append({
            "severity": "CRITICAL",
            "type": "health_check_failed",
            "detail": str(e),
        })
        return result  # unreachable service: nothing further to diff

    # Git drift check
    expected_sha = get_latest_main_sha(service["repo"])
    if result["running_sha"] != expected_sha:
        commits_behind = count_commits_behind(result["running_sha"], expected_sha)
        result["findings"].append({
            "severity": "CRITICAL" if service["risk_tier"] == "CRITICAL" else "HIGH",
            "type": "git_drift",
            "detail": f"Running {result['running_sha']}, main is {expected_sha} "
                      f"({commits_behind} commits behind)",
        })

    # Duplicate process check (via SSH)
    if service.get("launchd_label"):
        count = ssh_exec(service["machine"], f"pgrep -c -f {service['name']}")
        if int(count.strip()) > 1:
            result["findings"].append({
                "severity": "CRITICAL",
                "type": "duplicate_process",
                "detail": f"{count.strip()} instances running simultaneously",
            })

    return result
```
Severity Tiers and Escalation
Not all drift is equal. A knowledge indexer running 2 commits behind is inconvenient. A trading bot running 3 days behind with duplicate instances is a financial emergency.
CRITICAL — Immediate human action required. Service manages money, external API state, or has duplicate processes. Examples:
- Any trading bot (Shiva, Foresight, Leverage) with any drift
- Any service with duplicate process detection
- Any health check that returns non-200
HIGH — Urgent but not immediate. Service is stale but not actively dangerous. Examples:
- Knowledge services (Akashic) running >1 day behind main
- Docker images older than 7 days on actively maintained repos
- Permission anomalies on config files
LOW — Monitor and schedule. Informational, no active risk. Examples:
- Services running 1-2 commits behind on low-frequency repos
- Services with uptime > 30 days (potential for accumulated memory leaks)
- Missing health endpoints on non-critical services
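The rules above can be encoded in one tiering function. A sketch under stated assumptions: the trading-service names come from the examples, and the finding fields (days_behind, image_age_days) are illustrative, not the skill's schema:

```python
TRADING_SERVICES = {"shiva", "foresight", "leverage"}  # assumed names

def assign_tier(service: dict, finding: dict) -> str:
    """Map a (service, finding) pair to CRITICAL / HIGH / LOW per the rules above."""
    if service["name"] in TRADING_SERVICES:
        return "CRITICAL"                       # any drift on a trading bot
    if finding["type"] in ("duplicate_process", "health_check_failed"):
        return "CRITICAL"
    if finding["type"] == "git_drift" and finding.get("days_behind", 0) > 1:
        return "HIGH"                           # knowledge service >1 day behind
    if finding["type"] == "stale_image" and finding.get("image_age_days", 0) > 7:
        return "HIGH"
    if finding["type"] == "permission_anomaly":
        return "HIGH"
    return "LOW"
```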
Scheduling: When to Run
Drift detection is most valuable when it runs proactively, not reactively.
Weekly scheduled scan: Run a full audit every Sunday evening before the trading week starts. Any accumulated drift from the prior week is caught before it affects live trading.
Post-merge trigger: After any PR merges to main in a service repo, trigger a deployment-sync run for that service specifically. This catches the "PR merged but deployment never happened" failure mode within minutes.
Post-incident: Any time a service behaves unexpectedly, run deployment-sync before investigating the code. Rule out infrastructure state before debugging logic.
```yaml
# GitHub Actions trigger for post-merge audit
on:
  push:
    branches: [main]

jobs:
  deployment-sync:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger drift check
        run: |
          curl -X POST http://100.91.193.23:18789/tools/invoke \
            -H "Authorization: Bearer ${{ secrets.OPENCLAW_TOKEN }}" \
            -d '{"tool": "deployment_sync", "args": {"service": "${{ github.repository }}"}}'
```
Tying It All Together: The Lesson Map
Every lesson in this track contributed a component to the drift detection system:
| Lesson | Topic | Component |
|---|---|---|
| 223 | Health checks as contracts | /health endpoint standard |
| 224 | Service inventory | Manifest format, ground truth |
| 225 | SSH probe patterns | Remote state gathering |
| 226 | Docker inspect | Image age detection |
| 227 | Two instances, one port | Interface-aware health checks |
| 228 | Docker cache | --no-cache and build verification |
| 229 | Duplicate processes | PID lockfile + pgrep -c check |
| 230 | Version observability | Git SHA baking, version surfaces |
| 231 | Drift detection system | Everything assembled |
The March 29 incident would have been caught before it affected Claude Code sessions if the drift detection system had been in place. The stale Akashic instance would have shown up as HIGH severity in the Sunday scan. The Shiva duplicate would have triggered a CRITICAL alert within minutes of the second process starting.
Guardrails: What Not to Automate
The temptation after building a drift detection system is to add auto-remediation: detect drift, auto-pull, auto-restart, done. Resist this for anything stateful.
Safe to automate:
- Alerting (Discord, OpenClaw events)
- Report generation
- Stale lockfile cleanup when the associated process is confirmed dead
- Restarting stateless services (log shippers, metrics collectors) with no active state
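The stale-lockfile case is the clearest candidate, since liveness can be confirmed before deleting anything. A sketch (helper names are illustrative; signal 0 probes a PID without affecting it):

```python
import os

def pid_alive(pid: int) -> bool:
    """True if a process with this PID exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # exists, owned by another user
    return True

def remove_stale_lockfile(path: str) -> bool:
    """Delete the lockfile only when its recorded PID is confirmed dead.

    Returns True if removed. Missing or unparsable lockfiles are left for a human.
    """
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False
    if pid_alive(pid):
        return False
    os.remove(path)
    return True
```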
Requires human authorization:
- Restarting any trading bot
- Deploying code updates to services managing external API sessions
- Any remediation that could interrupt a transaction or data write in progress
- Deleting or archiving data that might be needed for forensics
Key Takeaways
- Drift detection evolves through four stages: manual SSH audits → ad-hoc scripts → scheduled crons → agent-dispatched automated audits. Most teams stall at stage two; the goal is stage four.
- The four-phase architecture (Inventory → Probes → Comparison → Alerting) is universal — it applies whether you have 2 services or 200.
- Severity tiers (CRITICAL / HIGH / LOW) prevent alert fatigue while ensuring trading services always get immediate human attention.
- Scheduling matters as much as the system itself: post-merge triggers catch deployment failures within minutes; weekly scans clear accumulated drift before it compounds.
- Auto-remediation is safe for stateless services and forbidden for trading bots — the system detects and humans decide.
What's Next
You have completed the Infrastructure Drift track. The nine lessons covered the full spectrum: from a single broken health check to a multi-machine automated audit system. The next track goes deeper into operational reliability, exploring how to build self-healing services, instrument distributed traces across multiple machines, and design for graceful degradation when individual components fail.