ASK KNOX
LESSON 241

Building a Drift Detection System

From manual SSH checks to automated deployment sync. How to build a system that catches drift before your users do.

12 min read

The Capstone: From Incident to System

The March 29, 2026 incident exposed a cluster of independent failures:

  • A Docker service ran stale code for an unknown period (Lesson 233)
  • Testing via localhost masked a broken Tailscale-interface instance (Lesson 237)
  • A --no-cache rebuild was required three times because cached COPY layers hid permission fixes (Lesson 238)
  • Two Shiva instances ran simultaneously against the same wallet for hours (Lesson 239)
  • Confirming the deployed version required SSH + docker exec + grep (Lesson 240)

None of these failures were catastrophic in isolation. Together, they represent a systemic problem: no automated system was watching the gap between what should be running and what was actually running.

This lesson builds that system from scratch.


The Evolution: How Drift Detection Matures

Infrastructure monitoring typically evolves through four stages. Most homelab and small-team setups stall at stage two.

Stage 1 — Manual SSH audits. "Let me SSH in and check." Reactive, slow, error-prone, only happens when something is already wrong. The March 29 state.

Stage 2 — Ad-hoc scripts. A bash script lives in ~/scripts/check-services.sh. It runs when you remember to run it. Better than nothing, but still reactive.

Stage 3 — Scheduled crons. The script runs on a timer — nightly, or after every deployment. Findings land in a log file or Discord channel. This catches drift within hours rather than days.

Stage 4 — Skill-based automation with remediation guardrails. An agent dispatches parallel probes to every machine, synthesizes findings with severity tiers, posts structured reports, and offers (but does not automatically execute) remediation for high-risk services. This is what the deployment-sync skill implements.


What to Detect: The Drift Surface Area

A complete drift detection system monitors across five categories. Miss any one and you have a blind spot.

1. Git Drift

Is the running code behind main? By how many commits? Which files changed?

Detection method: compare the running service's git SHA (from /health endpoint, see Lesson 240) against git rev-parse HEAD on the main branch of the canonical repository.
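The comparison side can be sketched with two git plumbing calls, assuming a local checkout of the canonical repository (the directory path and function names here are illustrative, not part of the skill):

```python
import subprocess

def rev_parse(repo_dir: str, ref: str = "HEAD") -> str:
    """Resolve a ref to a full SHA in a local checkout of the canonical repo."""
    return subprocess.check_output(
        ["git", "-C", repo_dir, "rev-parse", ref], text=True
    ).strip()

def commits_behind(repo_dir: str, running_sha: str, main_sha: str) -> int:
    """Count commits reachable from main but not from the running SHA."""
    if running_sha == main_sha:
        return 0
    out = subprocess.check_output(
        ["git", "-C", repo_dir, "rev-list", "--count", f"{running_sha}..{main_sha}"],
        text=True,
    )
    return int(out.strip())
```

`rev-list --count A..B` is the cheap way to answer "by how many commits?" without walking history yourself.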

2. Docker Image Age

When was the running Docker image built? Is it based on a commit that has since been superseded?

Detection method: docker inspect --format '{{.Created}}' <container>. Compare against the last successful CI build timestamp.

3. Permission and Configuration Anomalies

Are any service files world-writable? Are config files readable by the wrong user? Are secrets files accidentally set to 644 instead of 600?

Detection method: find /app -type f -perm /o+w inside each container. Flag deviations from expected permission posture.
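The posture check itself is a pure function over a path and its mode bits, which keeps it testable without a container. A sketch (the finding strings are illustrative):

```python
import stat

def permission_findings(path: str, mode: int, is_secret: bool = False) -> list[str]:
    """Flag deviations from the expected permission posture for one file.
    `mode` is the permission bits of st_mode, e.g. 0o644."""
    findings = []
    if mode & stat.S_IWOTH:  # same condition `find /app -type f -perm /o+w` matches
        findings.append(f"{path}: world-writable ({oct(mode & 0o777)})")
    if is_secret and mode & 0o077:  # secrets must be 600, owner-only
        findings.append(f"{path}: secret readable beyond owner ({oct(mode & 0o777)}, expected 0o600)")
    return findings
```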

4. Zombie and Duplicate Processes

Are any services running multiple instances? Are there orphaned processes from previous sessions?

Detection method: pgrep -c <service-name> — a count greater than 1 is a duplicate. Cross-reference against known PIDs in lockfiles.

5. Failed and Degraded Services

Are launchd jobs in a failed state? Are health endpoints returning non-200?

Detection method: launchctl list | grep com.knox to check job status. Health endpoint polling for HTTP 200.
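`launchctl list` prints three columns — PID, last exit status, label — with "-" for jobs not currently running. A small parser (a sketch; the com.knox prefix is from the manifest convention) extracts the failed ones:

```python
def failed_launchd_jobs(listing: str, prefix: str = "com.knox") -> list[tuple[str, int]]:
    """Parse `launchctl list` output and return (label, exit_status)
    for jobs under `prefix` whose last exit status was non-zero."""
    failed = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[2].startswith(prefix):
            continue  # header line or job outside our namespace
        _pid, status, label = parts
        if status.lstrip("-").isdigit() and int(status) != 0:
            failed.append((label, int(status)))
    return failed
```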


Architecture: The Four-Phase Model

A drift detection system has four phases. Each feeds the next.

Inventory → Probes → Comparison → Alerting → [Optional Remediation]

Inventory — The known state. A manifest of every service, where it runs, what version it should be, and what its health endpoint URL is. This is your ground truth. Without it, you cannot define "drift."

Probes — Active queries against each machine and service. SSH commands, HTTP health checks, docker inspect calls. Probes gather the actual current state.

Comparison — Diff the inventory against the probe results. Every deviation is a finding. Each finding gets a severity tier: CRITICAL (trading service, immediate action), HIGH (data service, urgent), LOW (informational, non-urgent).

Alerting — Deliver findings to the operator with enough context to act. Not just "drift detected" but "akashic-records on knox-mac-mini is running commit a3f8c21 (2026-03-27), main is b9d1e44 (2026-03-29). 4 commits behind. Last changed: indexer.py — reindex logic updated."


The deployment-sync Skill: Real Implementation

After the March 29 incident, the deployment-sync skill was built to automate this process. It dispatches parallel audit agents to Mac Mini and Tesseract, gathers their state reports, and synthesizes a drift report.

The skill's high-level execution flow:

1. Load service manifest (inventory)
2. Dispatch Agent A → Mac Mini (SSH + health checks)
   Dispatch Agent B → Tesseract (SSH + health checks)  [parallel]
3. Both agents return structured state JSON
4. Synthesizer: diff state vs manifest → findings list
5. Severity-tier findings: CRITICAL / HIGH / LOW
6. Format drift report
7. Post to Discord #logs channel
8. For CRITICAL findings: page operator via OpenClaw event
9. For LOW/HIGH: await operator instruction before remediation
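The dispatch-and-synthesize core of steps 2 through 5 reduces to an asyncio.gather over the manifest. A sketch, where `probe` stands in for whatever probe callable the skill dispatches:

```python
import asyncio
from typing import Awaitable, Callable

async def run_audit(manifest: dict, probe: Callable[[dict], Awaitable[dict]]) -> dict:
    """Probe every service in the manifest in parallel, then bucket
    each finding by severity tier for the drift report."""
    results = await asyncio.gather(*(probe(s) for s in manifest["services"]))
    report = {"CRITICAL": [], "HIGH": [], "LOW": []}
    for r in results:
        for f in r["findings"]:
            report[f["severity"]].append({"service": r["name"], **f})
    return report
```

Because gather runs the probes concurrently, audit wall-time is bounded by the slowest machine, not the sum of all of them.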

The manifest (inventory file):

{
  "services": [
    {
      "name": "akashic-records",
      "machine": "knox-mac-mini",
      "host": "100.91.193.23",
      "health_url": "http://100.91.193.23:8002/health",
      "repo": "Invictus-Labs/akashic-records",
      "launchd_label": null,
      "docker_service": "akashic",
      "risk_tier": "HIGH"
    },
    {
      "name": "shiva",
      "machine": "tesseract",
      "host": "192.168.1.150",
      "health_url": "http://192.168.1.150:8003/health",
      "repo": "Invictus-Labs/shiva",
      "launchd_label": "com.knox.shiva",
      "docker_service": null,
      "risk_tier": "CRITICAL"
    }
  ]
}

The probe agent (simplified):

# Simplified probe agent. httpx is assumed for HTTP; get_latest_main_sha,
# count_commits_behind, and ssh_exec are helpers defined elsewhere in the skill.
import httpx

async def probe_service(service: dict) -> dict:
    result = {
        "name": service["name"],
        "machine": service["machine"],
        "findings": [],
    }

    # Health check. httpx.get() is synchronous, so an AsyncClient is needed
    # for an awaitable request.
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.get(service["health_url"])
            resp.raise_for_status()
        health = resp.json()
        result["running_sha"] = health.get("version", "unknown")
        result["pid"] = health.get("pid")
        result["uptime_seconds"] = health.get("uptime_seconds")
    except Exception as e:
        # A dead health endpoint is itself a CRITICAL finding; stop probing here
        result["findings"].append({
            "severity": "CRITICAL",
            "type": "health_check_failed",
            "detail": str(e),
        })
        return result

    # Git drift check
    expected_sha = get_latest_main_sha(service["repo"])
    if result["running_sha"] != expected_sha:
        commits_behind = count_commits_behind(result["running_sha"], expected_sha)
        result["findings"].append({
            "severity": "CRITICAL" if service["risk_tier"] == "CRITICAL" else "HIGH",
            "type": "git_drift",
            "detail": f"Running {result['running_sha']}, main is {expected_sha} ({commits_behind} commits behind)",
        })

    # Duplicate process check (via SSH), for launchd-managed services
    if service.get("launchd_label"):
        count = ssh_exec(service["machine"], f"pgrep -c -f {service['name']}")
        if int(count.strip()) > 1:
            result["findings"].append({
                "severity": "CRITICAL",
                "type": "duplicate_process",
                "detail": f"{count.strip()} instances running simultaneously",
            })

    return result

Severity Tiers and Escalation

Not all drift is equal. A knowledge indexer running 2 commits behind is inconvenient. A trading bot running 3 days behind with duplicate instances is a financial emergency.

CRITICAL — Immediate human action required. Service manages money, external API state, or has duplicate processes. Examples:

  • Any trading bot (Shiva, Foresight, Leverage) with any drift
  • Any service with duplicate process detection
  • Any health check that returns non-200

HIGH — Urgent but not immediate. Service is stale but not actively dangerous. Examples:

  • Knowledge services (Akashic) running >1 day behind main
  • Docker images older than 7 days on actively maintained repos
  • Permission anomalies on config files

LOW — Monitor and schedule. Informational, no active risk. Examples:

  • Services running 1-2 commits behind on low-frequency repos
  • Services with uptime > 30 days (potential for accumulated memory leaks)
  • Missing health endpoints on non-critical services
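The escalation rules above collapse into a small lookup. A sketch — the first two finding-type names match the probe code earlier, the others are illustrative:

```python
def assign_severity(risk_tier: str, finding_type: str) -> str:
    """Map one finding to a tier per the escalation rules above."""
    # Duplicates and dead health checks page a human regardless of tier
    if finding_type in ("duplicate_process", "health_check_failed"):
        return "CRITICAL"
    # Any drift at all on a trading service is a financial emergency
    if risk_tier == "CRITICAL":
        return "CRITICAL"
    if finding_type in ("git_drift", "stale_image", "permission_anomaly"):
        return "HIGH"
    return "LOW"
```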

Scheduling: When to Run

Drift detection is most valuable when it runs proactively, not reactively.

Weekly scheduled scan: Run a full audit every Sunday evening before the trading week starts. Any accumulated drift from the prior week is caught before it affects live trading.

Post-merge trigger: After any PR merges to main in a service repo, trigger a deployment-sync run for that service specifically. This catches the "PR merged but deployment never happened" failure mode within minutes.

Post-incident: Any time a service behaves unexpectedly, run deployment-sync before investigating the code. Rule out infrastructure state before debugging logic.

# GitHub Actions trigger for post-merge audit
on:
  push:
    branches: [main]

jobs:
  deployment-sync:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger drift check
        run: |
          curl -X POST http://100.91.193.23:18789/tools/invoke \
            -H "Authorization: Bearer ${{ secrets.OPENCLAW_TOKEN }}" \
            -d '{"tool": "deployment_sync", "args": {"service": "${{ github.repository }}"}}'

Tying It All Together: The Lesson Map

Every lesson in this track contributed a component to the drift detection system:

Lesson | Topic | Component
233 | Health checks as contracts | /health endpoint standard
234 | Service inventory | Manifest format, ground truth
235 | SSH probe patterns | Remote state gathering
236 | Docker inspect | Image age detection
237 | Two instances, one port | Interface-aware health checks
238 | Docker cache | --no-cache and build verification
239 | Duplicate processes | PID lockfile + pgrep -c check
240 | Version observability | Git SHA baking, version surfaces
241 | Drift detection system | Everything assembled

The March 29 incident would have been caught before it affected Claude Code sessions if the drift detection system had been in place. The stale Akashic instance would have shown up as HIGH severity in the Sunday scan. The Shiva duplicate would have triggered a CRITICAL alert within minutes of the second process starting.


Guardrails: What Not to Automate

The temptation after building a drift detection system is to add auto-remediation: detect drift, auto-pull, auto-restart, done. Resist this for anything stateful.

Safe to automate:

  • Alerting (Discord, OpenClaw events)
  • Report generation
  • Stale lockfile cleanup when the associated process is confirmed dead
  • Restarting stateless services (log shippers, metrics collectors) with no active state

Requires human authorization:

  • Restarting any trading bot
  • Deploying code updates to services managing external API sessions
  • Any remediation that could interrupt a transaction or data write in progress
  • Deleting or archiving data that might be needed for forensics
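These two lists become an explicit gate in code. A guardrail sketch (the action names and SAFE_ACTIONS set are illustrative, not part of the skill):

```python
# Stateless, non-destructive actions that may run without a human
SAFE_ACTIONS = {"send_alert", "write_report", "clean_stale_lockfile", "restart_stateless"}

def remediation_allowed(action: str, service: dict) -> bool:
    """Guardrail: CRITICAL-tier (trading) services only ever get alerted on;
    everything else is limited to the stateless safe-action set."""
    if service.get("risk_tier") == "CRITICAL":
        return action in {"send_alert", "write_report"}  # detect and page only
    return action in SAFE_ACTIONS
```

The point of writing the gate down is that the default answer is "no": any action not on the list waits for a human.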

Key Takeaways

  • Drift detection evolves through four stages: manual SSH audits → ad-hoc scripts → scheduled crons → skill-based automated audits. Most teams stall at stage two; the goal is stage four.
  • The four-phase architecture (Inventory → Probes → Comparison → Alerting) is universal — it applies whether you have 2 services or 200.
  • Severity tiers (CRITICAL / HIGH / LOW) prevent alert fatigue while ensuring trading services always get immediate human attention.
  • Scheduling matters as much as the system itself: post-merge triggers catch deployment failures within minutes; weekly scans clear accumulated drift before it compounds.
  • Auto-remediation is safe for stateless services and forbidden for trading bots — the system detects and humans decide.

What's Next

You have completed the Infrastructure Drift track. The nine lessons covered the full spectrum: from a single broken health check to a multi-machine automated audit system. The next track goes deeper into operational reliability, exploring how to build self-healing services, instrument distributed traces across multiple machines, and design for graceful degradation when individual components fail.