ASK KNOX
beta
LESSON 240

Version Observability

If you cannot tell what version is running without SSH and git log, you have no observability. Embed version info everywhere.

8 min read

The Incident

March 29, 2026. The Akashic Records service was suspected to be running stale code. How long had it been stale? Which version was actually deployed? Was the Docker image on Mac Mini the same as what had last been built?

Answering any of these questions required:

  1. SSH into the Mac Mini
  2. docker exec -it akashic /bin/bash
  3. Navigate to the source directory
  4. git log --oneline -5 to check commit history
  5. Compare against the remote main branch manually

The health endpoint existed. It returned 200 OK. It was completely useless for the purpose it needed to serve.


The Observability Gap in Homelab Infrastructure

Production systems at scale embed version metadata everywhere — log aggregators display build tags, APM tools track deployment markers, dashboards show git SHAs next to error rates. Homelab and small-team infrastructure almost never does this, because it feels like over-engineering until the first time you cannot answer "what is running?"

The question "what version is deployed?" comes up in exactly the moments you can least afford to investigate manually:

  • Something is broken and you need to know if the fix was deployed
  • A new feature is not behaving as expected and you need to confirm the new code is in the container
  • You are on a different machine (MacBook traveling, not at the Mac Mini) and SSH would require VPN or Tailscale
  • An incident is in progress and every minute of investigation is a minute of potential financial or data loss

What Good Version Observability Looks Like

A well-instrumented service answers the following questions via its health endpoint in under 100ms:

  1. What commit is this? — git SHA at build time
  2. When was this built? — ISO 8601 build timestamp
  3. How long has it been running? — uptime in seconds (detects silent restarts)
  4. Where is it running? — hostname (detects which machine you are talking to)
  5. What process is serving this? — PID (detects duplicate instances, see Lesson 239)
  6. What does this service know about itself? — any domain-specific health metrics

Here is what a complete health endpoint looks like for a Python FastAPI service:

import os
import socket
import time
from datetime import datetime, timezone

from fastapi import FastAPI

app = FastAPI()
START_TIME = time.time()  # recorded once, at process start

@app.get("/health")
def health():
    return {
        "status": "ok",
        "version": os.getenv("GIT_SHA", "unknown"),
        "build_time": os.getenv("BUILD_TIME", "unknown"),
        "uptime_seconds": round(time.time() - START_TIME),
        "hostname": socket.gethostname(),
        "pid": os.getpid(),
        "service": "akashic-records",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

Sample response:

{
  "status": "ok",
  "version": "a3f8c21",
  "build_time": "2026-03-28T14:22:11Z",
  "uptime_seconds": 86421,
  "hostname": "knox-mac-mini",
  "pid": 54771,
  "service": "akashic-records",
  "timestamp": "2026-03-29T09:15:44Z"
}

From this single response you can immediately determine: the commit a3f8c21 was deployed, it has been running for about 24 hours, it is on knox-mac-mini, and it is process 54771. If your expected commit is b9d1e44 and this returns a3f8c21, you know the deployment did not land.
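That final comparison is easy to script. A minimal sketch, using the hypothetical SHAs from the example above:

```python
def check_drift(health: dict, expected_sha: str) -> str:
    """Compare the running SHA in a health payload against the expected one."""
    running = health.get("version", "unknown")
    if running == expected_sha:
        return f"OK running={running}"
    return f"DRIFT running={running} expected={expected_sha}"

# The sample response above, checked against a newer expected commit
health = {"status": "ok", "version": "a3f8c21"}
print(check_drift(health, "b9d1e44"))  # DRIFT running=a3f8c21 expected=b9d1e44
```

The same function works unchanged whether the payload comes from a live health endpoint or a saved incident snapshot.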


Baking the Git SHA Into Docker Images

The git SHA must be injected at build time, not at runtime. If it is read from a .git directory at startup, it will fail inside containers (which typically contain no .git directory or git binary), and with a bind-mounted source tree it would reflect the host's current git state, not the code that was actually built.
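For contrast, the runtime approach looks like this — a sketch of the anti-pattern, shown only to illustrate the failure mode:

```python
import subprocess

def runtime_sha() -> str:
    # Anti-pattern: shelling out to git at startup. Inside a container
    # image there is usually no git binary and no .git directory, so this
    # silently degrades to "unknown" exactly when you need the answer.
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, OSError):
        return "unknown"
```

On a developer machine this returns a SHA; in the deployed container it quietly returns "unknown", which is why the value must be frozen in at build time instead.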

The correct pattern uses Docker build arguments:

# In your Dockerfile
ARG GIT_SHA=unknown
ARG BUILD_TIME=unknown
ENV GIT_SHA=${GIT_SHA}
ENV BUILD_TIME=${BUILD_TIME}

Pass the values at build time:

docker build \
  --build-arg GIT_SHA=$(git rev-parse --short HEAD) \
  --build-arg BUILD_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  -t akashic:latest .

Or in docker-compose.yml:

services:
  akashic:
    build:
      context: .
      args:
        GIT_SHA: "${GIT_SHA:-unknown}"
        BUILD_TIME: "${BUILD_TIME:-unknown}"

Then in your CI script or Makefile:

export GIT_SHA=$(git rev-parse --short HEAD)
export BUILD_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
docker compose build akashic

The SHA is now frozen inside the image layer. Every rebuild from a new commit bakes in that commit's SHA, so a container still reporting an old one is immediately identifiable as stale.
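Before wiring this into CI, it is worth a local sanity check that the two values are well-formed. A sketch mirroring the commands above (no Docker required):

```shell
GIT_SHA=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)
BUILD_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

# BUILD_TIME must be strict ISO 8601 UTC: YYYY-MM-DDTHH:MM:SSZ
echo "$BUILD_TIME" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$' \
  && echo "BUILD_TIME format ok"
echo "GIT_SHA=$GIT_SHA"
```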


Propagating Version to Other Surfaces

The health endpoint is the primary surface, but version information should appear in at least two more places.

Startup Log Lines

Log the version at service startup. This creates a permanent record in your log files and makes it trivially easy to grep for when a service last restarted:

import logging
import os

logger = logging.getLogger(__name__)

def main():
    logger.info(
        "Starting akashic-records version=%s build_time=%s pid=%d",
        os.getenv("GIT_SHA", "unknown"),
        os.getenv("BUILD_TIME", "unknown"),
        os.getpid(),
    )

Log output:

2026-03-29 09:15:44 INFO Starting akashic-records version=a3f8c21 build_time=2026-03-28T14:22:11Z pid=54771
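Because the startup line uses key=value fields, the version can be pulled back out with standard tools. A sketch against the sample line above:

```shell
# Sample startup line from the log output above
LOG_LINE='2026-03-29 09:15:44 INFO Starting akashic-records version=a3f8c21 build_time=2026-03-28T14:22:11Z pid=54771'

# Extract the value of the version= field
VERSION=$(echo "$LOG_LINE" | grep -o 'version=[^ ]*' | cut -d= -f2)
echo "$VERSION"   # a3f8c21
```

Against a real log file, the same pipeline after a `grep "Starting akashic-records" ... | tail -1` answers "when did this last restart, and at what version?" in one command.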

HTTP Response Headers

For services with external API consumers, add the version as a response header. This lets any HTTP client — including curl — see the version without hitting a dedicated endpoint:

import os

from fastapi import Request

@app.middleware("http")
async def add_version_header(request: Request, call_next):
    response = await call_next(request)
    response.headers["X-Service-Version"] = os.getenv("GIT_SHA", "unknown")
    return response

Check it from the command line:

curl -I http://100.91.193.23:8002/discover
# X-Service-Version: a3f8c21

Cross-Machine Version Comparison

Once every service exposes its git SHA, you can automate drift detection across machines. A simple script queries every service's health endpoint and compares the running SHA against the latest commit on main:

#!/bin/bash
# check-versions.sh

SERVICES=(
  "akashic|http://100.91.193.23:8002/health"
  "shiva|http://192.168.1.150:8003/health"
)

EXPECTED_SHA=$(git -C ~/Documents/Dev/akashic-records rev-parse --short HEAD)

for entry in "${SERVICES[@]}"; do
  name="${entry%%|*}"
  url="${entry##*|}"
  running_sha=$(curl -sf "$url" | python3 -c "import sys,json; print(json.load(sys.stdin).get('version','error'))" 2>/dev/null || echo "unreachable")
  if [ "$running_sha" = "$EXPECTED_SHA" ]; then
    echo "OK    $name: $running_sha"
  else
    echo "DRIFT $name: running=$running_sha expected=$EXPECTED_SHA"
  fi
done

Run this after every deployment. Wire it into a cron job for continuous monitoring. Alert when drift is detected.
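As a cron job, that might look like the following (paths and schedule are hypothetical; adjust to your layout):

```
# Every 10 minutes: run the drift check and append any DRIFT lines to a log
*/10 * * * * /Users/knox/bin/check-versions.sh | grep DRIFT >> /Users/knox/logs/version-drift.log
```

An empty log means no drift; any line in it is an alertable event.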


The "Phone Home" Pattern

For homelab services that are not always actively monitored, a lightweight "phone home" pattern augments health endpoints with push-based version reporting.

On startup, the service sends a single notification to a central channel (Discord, OpenClaw event bus, etc.) with its version info:

import asyncio
import os
import socket

import httpx

# Webhook URL comes from configuration, e.g. an environment variable
DISCORD_WEBHOOK_URL = os.environ["DISCORD_WEBHOOK_URL"]

async def phone_home():
    text = (
        f"akashic-records started — version={os.getenv('GIT_SHA', 'unknown')} "
        f"host={socket.gethostname()} pid={os.getpid()}"
    )
    async with httpx.AsyncClient() as client:
        await client.post(DISCORD_WEBHOOK_URL, json={"content": text})

asyncio.run(phone_home())

When something goes wrong, your Discord history shows exactly when each service last started and which version it was. This is passive observability that requires no active polling.


Key Takeaways

  • A health endpoint returning a hardcoded version string is not observability — it is theater. The only reliable version signal is a git SHA baked in at Docker build time via build arguments.
  • Version information should appear in at least three surfaces: the /health endpoint, startup log lines, and HTTP response headers (X-Service-Version).
  • Cross-machine drift detection becomes automatable once every service exposes its running SHA — a comparison script turns a 4-minute SSH investigation into a sub-second check.
  • Uptime in health responses reveals silent restarts that would otherwise go unnoticed; a service that "restarted overnight" without a logged reason is a security and reliability signal.
  • The "phone home" pattern provides free deployment history in your notification channel, requiring no active monitoring infrastructure.

What's Next

You now have the individual components: health checks, version baking, PID lockfiles, Docker cache discipline, and interface-aware testing. In Lesson 241, we close the track by assembling these pieces into a complete drift detection system — automated, multi-machine, and capable of catching the March 29 class of incidents before they affect running services.