ASK KNOX
LESSON 234

Drift Is Exponential

One skipped deploy becomes two, becomes 'everything is 5 commits behind.' Drift compounds silently across machines and repos.

9 min read · Agent Harness Engineering

Drift does not announce itself. It does not file a ticket or send a Slack message. It accumulates in the background — one skipped deploy, then another, then ten — until the gap between what is running and what is written is large enough that fixing it feels like a project.

The March 2026 infrastructure audit found eleven repos drifted simultaneously across two machines. None had triggered an alert. All were running services that appeared operational.

Why One Skip Becomes Ten

The first skipped deploy is usually innocent. You merge a small fix. You mean to rebuild the container. Something interrupts you. You log off. The service keeps running. The feature works — it is a small fix. You note it mentally and move on.

Twenty-four hours later, two more commits have merged. The gap is now three commits. Three commits still feels worth a rebuild, but now you want a few minutes to review them first, to be safe. You are in the middle of something else. You defer again.

A week later, the gap is twelve commits. Now you are not sure what all twelve commits contain. A rebuild feels risky. What if one of them changed an environment variable? What if there is a migration? You should review the diff first. That takes time you do not have right now.

This is not a discipline failure. It is a system design failure. When deploy is a manual act with no scheduled checkpoint, drift is the default outcome. The only question is how much accumulates before someone audits.

The March 2026 Audit: By the Numbers

The full picture, machine by machine:

Mac Mini (Knox):

  • mission-control — 2 commits behind
  • horus — 6+ commits behind (the watchdog, not watching itself)
  • apollo — 5+ commits behind
  • akashic-records — PR #19 stalled for 8 days (the mind_search incident)
  • 3 additional services with 1-3 commit gaps

Tesseract:

  • shiva — 3 commits behind (live trading bot on Polygon)
  • leverage — 5 commits behind (Phemex perps bot)
  • hermes — 2 commits behind (market data feed)
  • 1 additional service with a minor gap

Eleven repos. Two machines. Zero alerts. The systems appeared operational because liveness checks passed. The code running was not the code written.

The trading bots on Tesseract are the most alarming. Shiva and Leverage were both behind while actively placing orders. The stale code was executing against live markets.

The "Works on My Machine" Network Variant

The classic "works on my machine" problem refers to environment differences between a developer's laptop and production. The infrastructure drift variant is different: the code works on the machine where it was developed (the MacBook, where git is current), but the machine running the service (Mac Mini or Tesseract) never received the update.

The developer's mental model has the new code deployed. Every test they run locally passes. The feature works when they test it. When a user (or another service) hits the remote endpoint, they are hitting the old code. The developer cannot reproduce the problem because their environment is not drifted.

This creates a diagnostic trap: "It works when I run it locally" is true. The running service is broken. Both statements are simultaneously accurate.

Compound Effects of Drift

A single stale commit is annoying. Multiple stale repos compounding over time create second-order failures.

Stale code + stale dependencies: The running container has pinned packages that have not been updated. A dependency in the newer code assumes a newer version of a library that is not present in the old image. The error is cryptic — it manifests as a runtime import error or attribute error, not a "you are behind by 5 commits" message.

Stale code + stale configs: A new feature requires an environment variable that was added to .env.example in a recent commit. The running container does not have it. The feature silently falls back to a default, or crashes at the callsite, depending on how the config loading was written.
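Which of those two failure modes you get depends entirely on how the config loading was written. A minimal sketch of both styles (the variable name FEATURE_API_KEY is hypothetical):

```python
import os

def load_config_lenient() -> str:
    # Silent-fallback style: a missing variable degrades quietly. In a
    # stale container that never received the new variable, the feature
    # simply appears "off" and nothing points back at the missed deploy.
    return os.getenv("FEATURE_API_KEY", "")

def load_config_strict() -> str:
    # Fail-fast style: the same gap crashes at startup instead, which is
    # louder but traces directly to the missing environment variable.
    return os.environ["FEATURE_API_KEY"]  # raises KeyError when drifted
```

The lenient style is the dangerous one for drift: the service stays "up" while quietly running without the feature.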

Stale code + stale schemas: A database migration ran in development. The running service on the Mac Mini is querying columns that do not exist in the old schema, or the old service is not writing columns that the new schema requires. Data integrity issues compound over time.

Cross-service drift: Service A depends on an API contract from Service B. Service B's contract changed in a recent PR. Service A's running version expects the old contract. Service B's running version is also stale. When both are finally updated, the behavior is unexpected because neither was updated in order.

Drift Detection: The Deployment-Sync Pattern

The deployment-sync pattern is a scheduled job that compares what is running against what is current.

Step 1 — Collect running SHAs. Each service's health endpoint embeds its git_sha (as described in Lesson 233). The sync script calls each health endpoint and collects the map of service → running_sha.

Step 2 — Collect remote HEAD SHAs. For each repo, run git ls-remote origin HEAD to get the current remote HEAD without requiring a local clone.

Step 3 — Compare and alert. Any service where running_sha != remote_head has drifted. If the gap exceeds a threshold (configurable per service — trading bots get a tighter threshold), fire an alert to the Discord logs channel.

import subprocess

import httpx

SERVICES = {
    "akashic": {
        "url": "http://localhost:8002/api/health",
        "repo": "Invictus-Labs/akashic-records",
        "max_commits_behind": 2,
    },
    "shiva": {
        "url": "http://tesseract:8080/health",
        "repo": "Invictus-Labs/shiva",
        "max_commits_behind": 1,  # trading bot — tighter threshold
    },
}

async def get_health(url: str) -> dict:
    # Step 1: ask the service what it is actually running.
    async with httpx.AsyncClient(timeout=5.0) as client:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.json()

def get_remote_head(repo: str) -> str:
    # Step 2: resolve the remote HEAD SHA without a local clone.
    out = subprocess.run(
        ["git", "ls-remote", f"https://github.com/{repo}.git", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[0]

def count_commits_between(running_sha: str, remote_sha: str, repo: str) -> int:
    # Commit distance needs more than ls-remote gives; the GitHub
    # compare API reports it without requiring a clone either.
    resp = httpx.get(
        f"https://api.github.com/repos/{repo}/compare/{running_sha}...{remote_sha}",
        timeout=10.0,
    )
    resp.raise_for_status()
    return resp.json()["ahead_by"]

async def check_drift() -> list[str]:
    # Step 3: compare, and report anything over its threshold.
    alerts = []
    for name, config in SERVICES.items():
        health = await get_health(config["url"])
        running_sha = health.get("git_sha", "unknown")
        remote_sha = get_remote_head(config["repo"])
        if running_sha != remote_sha:
            distance = count_commits_between(running_sha, remote_sha, config["repo"])
            if distance > config["max_commits_behind"]:
                alerts.append(
                    f"{name}: {distance} commits behind "
                    f"({running_sha[:7]}..{remote_sha[:7]})"
                )
    return alerts

This job runs on a cron schedule. The output goes to Discord. When alerts accumulate, they are visible before they become incidents.
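The delivery step can be as small as a webhook POST. A sketch using only the standard library (the webhook URL is whatever the Discord channel's webhook settings provide; the alert strings are the ones check_drift returns):

```python
import json
import urllib.request

def format_alert_message(alerts: list[str]) -> str:
    # One line per drifted service, prefixed so it stands out in the channel.
    return "\n".join(f"⚠️ drift: {a}" for a in alerts)

def post_alerts(alerts: list[str], webhook_url: str) -> None:
    # Discord webhooks accept a JSON body with a "content" field; the URL
    # itself is a placeholder here, taken from the channel's settings.
    if not alerts:
        return  # quiet runs post nothing at all
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"content": format_alert_message(alerts)}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Posting nothing on quiet runs matters: an alert channel that only speaks when something is wrong stays readable.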

The Scheduled Drift Audit

Beyond continuous monitoring, a weekly manual drift audit catches what automation misses. The checklist:

# For each machine in the fleet
ssh knox-mini

# For each service running as a Docker container
docker ps --format "{{.Names}}" | while read name; do
    image=$(docker inspect "$name" --format '{{.Config.Image}}')
    created=$(docker inspect "$name" --format '{{.Created}}')
    echo "$name: image=$image created=$created"
done

# Compare against expected versions
# Flag anything older than 7 days for review

For services without Docker (launchd Python services), compare the running process's reported version against git -C ~/Documents/Dev/<repo> log -1 --format="%H".

The audit does not need to be automated to be effective. A 15-minute weekly review catches the accumulation before it reaches the critical threshold where closing the gap feels risky.

The One-Way Ratchet

Drift cannot be resolved by ignoring it. It can only be resolved by deploying. The accumulation is a one-way ratchet: gaps can grow but they cannot shrink on their own.

The corrective discipline is: every merged PR ends with a deployment. Not a note. Not a task. A deployment. If the deployment cannot happen immediately because of a dependency or timing concern, the PR description gets a note: "Deploy blocked pending X" and a ticket is opened.

The deployment-is-done rule closes the loop that drift exploits. When merge and deploy are mentally decoupled, drift is possible. When they are treated as a single operation with two steps, drift stops accumulating.
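One way to enforce the single-operation framing is a small wrapper that refuses to do the pull without the deploy. A sketch (the compose invocation in the example is an assumption, not the real setup; substitute each service's actual deploy command):

```python
import subprocess

def merge_and_deploy(repo_dir: str, deploy_cmd: list[str]) -> None:
    # Step 1: bring this machine's checkout up to the merged HEAD.
    # --ff-only turns unexpected divergence into a loud failure.
    subprocess.run(["git", "-C", repo_dir, "pull", "--ff-only"], check=True)
    # Step 2: deploy immediately. check=True means a failed deploy stops
    # here and raises, rather than being silently skipped and becoming drift.
    subprocess.run(deploy_cmd, check=True)

# Example (hypothetical paths and command):
# merge_and_deploy("/Users/knox/Documents/Dev/akashic-records",
#                  ["docker", "compose", "up", "-d", "--build"])
```

Because both steps use check=True, the wrapper cannot half-succeed quietly: either the checkout and the running service both advance, or you get a traceback at the moment the gap would otherwise have opened.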

Key Takeaways

  • Drift is self-reinforcing: small gaps are easy to defer, and deferring creates larger gaps that are even easier to defer.
  • The March 2026 audit found 11 repos drifted across 2 machines with no alerts, including active trading bots running stale code.
  • Compound drift creates second-order failures: stale code + stale deps + stale configs + stale schemas interact in ways that are harder to diagnose than any individual gap.
  • The deployment-sync pattern — scheduled comparison of running SHAs against remote HEAD — provides continuous drift detection without manual effort.
  • The only resolution to drift is deployment. Noting it, scheduling it, or deferring it does not close the gap.

What's Next

Not all drift is caused by skipped deploys. In Lesson 235, we examine permission drift — how AI agent sessions create files with restrictive permissions that silently survive into Docker containers and cause crash loops only when a different user reads them.