ASK KNOX
LESSON 241

Building a Drift Detection System

From manual SSH checks to automated deployment sync. How to build a system that catches drift before your users do.

12 min read

The Capstone: From Incident to System

The March 29, 2026 incident exposed a cluster of independent failures:

  • A Docker service ran stale code for an unknown period (Lesson 233)
  • Testing via localhost masked a broken Tailscale-interface instance (Lesson 237)
  • A --no-cache rebuild was required three times because cached COPY layers hid permission fixes (Lesson 238)
  • Two Shiva instances ran simultaneously against the same wallet for hours (Lesson 239)
  • Confirming the deployed version required SSH + docker exec + grep (Lesson 240)

None of these failures were catastrophic in isolation. Together, they represent a systemic problem: no automated system was watching the gap between what should be running and what was actually running.

This lesson builds that system from scratch.


The Evolution: How Drift Detection Matures

Infrastructure monitoring typically evolves through four stages. Most homelab and small-team setups stall at stage two.

Stage 1 — Manual SSH audits. "Let me SSH in and check." Reactive, slow, error-prone, only happens when something is already wrong. The March 29 state.

Stage 2 — Ad-hoc scripts. A bash script lives in ~/scripts/check-services.sh. It runs when you remember to run it. Better than nothing, but still reactive.

Stage 3 — Scheduled crons. The script runs on a timer — nightly, or after every deployment. Findings land in a log file or Discord channel. This catches drift within hours rather than days.

Stage 4 — Skill-based automation with remediation guardrails. An agent dispatches parallel probes to every machine, synthesizes findings with severity tiers, posts structured reports, and offers (but does not automatically execute) remediation for high-risk services. This is what the deployment-sync skill implements.


What to Detect: The Drift Surface Area

A complete drift detection system monitors across five categories. Miss any one and you have a blind spot.

1. Git Drift

Is the running code behind main? By how many commits? Which files changed?

Detection method: compare the running service's git SHA (from /health endpoint, see Lesson 240) against git rev-parse HEAD on the main branch of the canonical repository.
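The comparison side can be sketched with two git plumbing calls, assuming a local checkout of the canonical repository (the directory path and function names here are illustrative, not part of the skill):

```python
import subprocess

def rev_parse(repo_dir: str, ref: str = "HEAD") -> str:
    """Resolve a ref to a full SHA in a local checkout of the canonical repo."""
    return subprocess.check_output(
        ["git", "-C", repo_dir, "rev-parse", ref], text=True
    ).strip()

def commits_behind(repo_dir: str, running_sha: str, main_sha: str) -> int:
    """Count commits reachable from main but not from the running SHA."""
    if running_sha == main_sha:
        return 0
    out = subprocess.check_output(
        ["git", "-C", repo_dir, "rev-list", "--count", f"{running_sha}..{main_sha}"],
        text=True,
    )
    return int(out.strip())
```

`rev-list --count A..B` is the cheap way to answer "by how many commits?" without walking history yourself.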

2. Docker Image Age

When was the running Docker image built? Is it based on a commit that has since been superseded?

Detection method: docker inspect --format '{{.Created}}' <container>. Compare against the last successful CI build timestamp.

3. Permission and Configuration Anomalies

Are any service files world-writable? Are config files readable by the wrong user? Are secrets files accidentally set to 644 instead of 600?

Detection method: find /app -type f -perm /o+w inside each container. Flag deviations from expected permission posture.
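The posture check itself is a pure function over a path and its mode bits, which keeps it testable without a container. A sketch (the finding strings are illustrative):

```python
import stat

def permission_findings(path: str, mode: int, is_secret: bool = False) -> list[str]:
    """Flag deviations from the expected permission posture for one file.
    `mode` is the permission bits of st_mode, e.g. 0o644."""
    findings = []
    if mode & stat.S_IWOTH:  # same condition `find /app -type f -perm /o+w` matches
        findings.append(f"{path}: world-writable ({oct(mode & 0o777)})")
    if is_secret and mode & 0o077:  # secrets must be 600, owner-only
        findings.append(f"{path}: secret readable beyond owner ({oct(mode & 0o777)}, expected 0o600)")
    return findings
```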

4. Zombie and Duplicate Processes

Are any services running multiple instances? Are there orphaned processes from previous sessions?

Detection method: pgrep -c <service-name> — a count greater than 1 is a duplicate. Cross-reference against known PIDs in lockfiles.

5. Failed and Degraded Services

Are launchd jobs in a failed state? Are health endpoints returning non-200?

Detection method: launchctl list | grep com.knox to check job status. Health endpoint polling for HTTP 200.
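`launchctl list` prints three columns — PID, last exit status, label — with "-" for jobs not currently running. A small parser (a sketch; the com.knox prefix is from the manifest convention) extracts the failed ones:

```python
def failed_launchd_jobs(listing: str, prefix: str = "com.knox") -> list[tuple[str, int]]:
    """Parse `launchctl list` output and return (label, exit_status)
    for jobs under `prefix` whose last exit status was non-zero."""
    failed = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[2].startswith(prefix):
            continue  # header line or job outside our namespace
        _pid, status, label = parts
        if status.lstrip("-").isdigit() and int(status) != 0:
            failed.append((label, int(status)))
    return failed
```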


Architecture: The Four-Phase Model

A drift detection system has four phases. Each feeds the next.

Inventory → Probes → Comparison → Alerting → [Optional Remediation]

Inventory — The known state. A manifest of every service, where it runs, what version it should be, and what its health endpoint URL is. This is your ground truth. Without it, you cannot define "drift."

Probes — Active queries against each machine and service. SSH commands, HTTP health checks, docker inspect calls. Probes gather the actual current state.

Comparison — Diff the inventory against the probe results. Every deviation is a finding. Each finding gets a severity tier: CRITICAL (trading service, immediate action), HIGH (data service, urgent), LOW (informational, non-urgent).

Alerting — Deliver findings to the operator with enough context to act. Not just "drift detected" but "akashic-records on knox-mac-mini is running commit a3f8c21 (2026-03-27), main is b9d1e44 (2026-03-29). 4 commits behind. Last changed: indexer.py — reindex logic updated."


The deployment-sync Skill: Real Implementation

After the March 29 incident, the deployment-sync skill was built to automate this process. It dispatches parallel audit agents to Mac Mini and Tesseract, gathers their state reports, and synthesizes a drift report.

The skill's high-level execution flow:

1. Load service manifest (inventory)
2. Dispatch Agent A → Mac Mini (SSH + health checks)
   Dispatch Agent B → Tesseract (SSH + health checks)  [parallel]
3. Both agents return structured state JSON
4. Synthesizer: diff state vs manifest → findings list
5. Severity-tier findings: CRITICAL / HIGH / LOW
6. Format drift report
7. Post to Discord #logs channel
8. For CRITICAL findings: page operator via OpenClaw event
9. For LOW/HIGH: await operator instruction before remediation
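The dispatch-and-synthesize core of steps 2 through 5 reduces to an asyncio.gather over the manifest. A sketch, where `probe` stands in for whatever probe callable the skill dispatches:

```python
import asyncio
from typing import Awaitable, Callable

async def run_audit(manifest: dict, probe: Callable[[dict], Awaitable[dict]]) -> dict:
    """Probe every service in the manifest in parallel, then bucket
    each finding by severity tier for the drift report."""
    results = await asyncio.gather(*(probe(s) for s in manifest["services"]))
    report = {"CRITICAL": [], "HIGH": [], "LOW": []}
    for r in results:
        for f in r["findings"]:
            report[f["severity"]].append({"service": r["name"], **f})
    return report
```

Because gather runs the probes concurrently, audit wall-time is bounded by the slowest machine, not the sum of all of them.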

The manifest (inventory file):

{
  "services": [
    {
      "name": "akashic-records",
      "machine": "knox-mac-mini",
      "host": "100.91.193.23",
      "health_url": "http://100.91.193.23:8002/health",
      "repo": "Invictus-Labs/akashic-records",
      "launchd_label": null,
      "docker_service": "akashic",
      "risk_tier": "HIGH"
    },
    {
      "name": "shiva",
      "machine": "tesseract",
      "host": "192.168.1.150",
      "health_url": "http://192.168.1.150:8003/health",
      "repo": "Invictus-Labs/shiva",
      "launchd_label": "com.knox.shiva",
      "docker_service": null,
      "risk_tier": "CRITICAL"
    }
  ]
}

The probe agent (simplified):

# Simplified probe agent. httpx is assumed for HTTP; get_latest_main_sha,
# count_commits_behind, and ssh_exec are helpers defined elsewhere in the skill.
import httpx

async def probe_service(service: dict) -> dict:
    result = {
        "name": service["name"],
        "machine": service["machine"],
        "findings": [],
    }

    # Health check. httpx.get() is synchronous, so an AsyncClient is needed
    # for an awaitable request.
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.get(service["health_url"])
            resp.raise_for_status()
        health = resp.json()
        result["running_sha"] = health.get("version", "unknown")
        result["pid"] = health.get("pid")
        result["uptime_seconds"] = health.get("uptime_seconds")
    except Exception as e:
        # A dead health endpoint is itself a CRITICAL finding; stop probing here
        result["findings"].append({
            "severity": "CRITICAL",
            "type": "health_check_failed",
            "detail": str(e),
        })
        return result

    # Git drift check
    expected_sha = get_latest_main_sha(service["repo"])
    if result["running_sha"] != expected_sha:
        commits_behind = count_commits_behind(result["running_sha"], expected_sha)
        result["findings"].append({
            "severity": "CRITICAL" if service["risk_tier"] == "CRITICAL" else "HIGH",
            "type": "git_drift",
            "detail": f"Running {result['running_sha']}, main is {expected_sha} ({commits_behind} commits behind)",
        })

    # Duplicate process check (via SSH), for launchd-managed services
    if service.get("launchd_label"):
        count = ssh_exec(service["machine"], f"pgrep -c -f {service['name']}")
        if int(count.strip()) > 1:
            result["findings"].append({
                "severity": "CRITICAL",
                "type": "duplicate_process",
                "detail": f"{count.strip()} instances running simultaneously",
            })

    return result

Severity Tiers and Escalation

Not all drift is equal. A knowledge indexer running 2 commits behind is inconvenient. A trading bot running 3 days behind with duplicate instances is a financial emergency.

CRITICAL — Immediate human action required. Service manages money, external API state, or has duplicate processes. Examples:

  • Any trading bot (Shiva, Foresight, Leverage) with any drift
  • Any service with duplicate process detection
  • Any health check that returns non-200

HIGH — Urgent but not immediate. Service is stale but not actively dangerous. Examples:

  • Knowledge services (Akashic) running >1 day behind main
  • Docker images older than 7 days on actively maintained repos
  • Permission anomalies on config files

LOW — Monitor and schedule. Informational, no active risk. Examples:

  • Services running 1-2 commits behind on low-frequency repos
  • Services with uptime > 30 days (potential for accumulated memory leaks)
  • Missing health endpoints on non-critical services
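The escalation rules above collapse into a small lookup. A sketch — the first two finding-type names match the probe code earlier, the others are illustrative:

```python
def assign_severity(risk_tier: str, finding_type: str) -> str:
    """Map one finding to a tier per the escalation rules above."""
    # Duplicates and dead health checks page a human regardless of tier
    if finding_type in ("duplicate_process", "health_check_failed"):
        return "CRITICAL"
    # Any drift at all on a trading service is a financial emergency
    if risk_tier == "CRITICAL":
        return "CRITICAL"
    if finding_type in ("git_drift", "stale_image", "permission_anomaly"):
        return "HIGH"
    return "LOW"
```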

Scheduling: When to Run

Drift detection is most valuable when it runs proactively, not reactively.

Weekly scheduled scan: Run a full audit every Sunday evening before the trading week starts. Any accumulated drift from the prior week is caught before it affects live trading.

Post-merge trigger: After any PR merges to main in a service repo, trigger a deployment-sync run for that service specifically. This catches the "PR merged but deployment never happened" failure mode within minutes.

Post-incident: Any time a service behaves unexpectedly, run deployment-sync before investigating the code. Rule out infrastructure state before debugging logic.

# GitHub Actions trigger for post-merge audit
on:
  push:
    branches: [main]

jobs:
  deployment-sync:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger drift check
        run: |
          curl -X POST http://100.91.193.23:18789/tools/invoke \
            -H "Authorization: Bearer ${{ secrets.OPENCLAW_TOKEN }}" \
            -d '{"tool": "deployment_sync", "args": {"service": "${{ github.repository }}"}}'

Tying It All Together: The Lesson Map

Every lesson in this track contributed a component to the drift detection system:

Lesson | Topic | Component
233 | Health checks as contracts | /health endpoint standard
234 | Service inventory | Manifest format, ground truth
235 | SSH probe patterns | Remote state gathering
236 | Docker inspect | Image age detection
237 | Two instances, one port | Interface-aware health checks
238 | Docker cache | --no-cache and build verification
239 | Duplicate processes | PID lockfile + pgrep -c check
240 | Version observability | Git SHA baking, version surfaces
241 | Drift detection system | Everything assembled

The March 29 incident would have been caught before it affected Claude Code sessions if the drift detection system had been in place. The stale Akashic instance would have shown up as HIGH severity in the Sunday scan. The Shiva duplicate would have triggered a CRITICAL alert within minutes of the second process starting.


Guardrails: What Not to Automate

The temptation after building a drift detection system is to add auto-remediation: detect drift, auto-pull, auto-restart, done. Resist this for anything stateful.

Safe to automate:

  • Alerting (Discord, OpenClaw events)
  • Report generation
  • Stale lockfile cleanup when the associated process is confirmed dead
  • Restarting stateless services (log shippers, metrics collectors) with no active state

Requires human authorization:

  • Restarting any trading bot
  • Deploying code updates to services managing external API sessions
  • Any remediation that could interrupt a transaction or data write in progress
  • Deleting or archiving data that might be needed for forensics
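These two lists become an explicit gate in code. A guardrail sketch (the action names and SAFE_ACTIONS set are illustrative, not part of the skill):

```python
# Stateless, non-destructive actions that may run without a human
SAFE_ACTIONS = {"send_alert", "write_report", "clean_stale_lockfile", "restart_stateless"}

def remediation_allowed(action: str, service: dict) -> bool:
    """Guardrail: CRITICAL-tier (trading) services only ever get alerted on;
    everything else is limited to the stateless safe-action set."""
    if service.get("risk_tier") == "CRITICAL":
        return action in {"send_alert", "write_report"}  # detect and page only
    return action in SAFE_ACTIONS
```

The point of writing the gate down is that the default answer is "no": any action not on the list waits for a human.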

Key Takeaways

  • Drift detection evolves through four stages: manual SSH audits → ad-hoc scripts → scheduled crons → skill-based automated audits. Most teams stall at stage two; the goal is stage four.
  • The four-phase architecture (Inventory → Probes → Comparison → Alerting) is universal — it applies whether you have 2 services or 200.
  • Severity tiers (CRITICAL / HIGH / LOW) prevent alert fatigue while ensuring trading services always get immediate human attention.
  • Scheduling matters as much as the system itself: post-merge triggers catch deployment failures within minutes; weekly scans clear accumulated drift before it compounds.
  • Auto-remediation is safe for stateless services and forbidden for trading bots — the system detects and humans decide.

What's Next

You have completed the Infrastructure Drift track. The nine lessons covered the full spectrum: from a single broken health check to a multi-machine automated audit system. The next track goes deeper into operational reliability, exploring how to build self-healing services, instrument distributed traces across multiple machines, and design for graceful degradation when individual components fail.