ASK KNOX
LESSON 233

Health Checks Lie

A container can report 'healthy' while missing critical API endpoints. Liveness is not correctness.

8 min read·Agent Harness Engineering

The Docker Compose health check ran every 30 seconds. It hit /api/health. The response was {"status": "ok"}. Docker marked the container healthy. The dashboard showed green.

Meanwhile, /api/query/discover — the endpoint powering mind_search — returned HTTP 404.

For eight days, both of these things were simultaneously true.

The Health Check Pyramid

Health checks exist on a spectrum. Most systems implement only the bottom layer and treat it as sufficient. It is not.

Layer 5 — Data integrity:     query returns expected schema
Layer 4 — Smoke query:        critical path executes end-to-end
Layer 3 — Endpoint contract:  all expected API routes exist
Layer 2 — Port responsive:    HTTP listener is accepting connections
Layer 1 — Process alive:      PID exists, not zombie

The standard Docker HEALTHCHECK implements Layer 1 or Layer 2. Most /api/health endpoints implement Layer 2. The layers above that — contract, smoke, integrity — require deliberate engineering.

The Akashic incident exposed a Layer 2 health check protecting a Layer 3+ failure. The process was alive. The port was responsive. But the API contract — the set of endpoints that callers depended on — was incomplete because the container was running stale code.

The Three Health Check Types

Liveness

Question answered: Is the process still running?

Failure mode: Crash loops, zombie states, OOM kills.

Docker implementation:

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8002/api/health || exit 1

What it misses: Everything above Layer 2. A process can respond to /api/health while being functionally broken in every meaningful way.

Readiness

Question answered: Is the process ready to serve traffic?

Failure mode: Process is alive but still initializing — database connections not established, models not loaded, cache not warmed.

Kubernetes implementation: A separate readiness probe that fails until startup is complete. Docker Compose does not have a native readiness concept — you approximate it with depends_on: condition: service_healthy.

What it misses: API contract correctness. Readiness answers "can I accept a request?" not "will the request succeed?"
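The Compose-side approximation mentioned above can be sketched like this — a hypothetical fragment assuming a service named akashic and a dependent caller service:

```yaml
services:
  akashic:
    build: .
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8002/api/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
  caller:
    depends_on:
      akashic:
        # Compose blocks caller startup until akashic's healthcheck passes —
        # the closest Compose gets to a Kubernetes readiness probe.
        condition: service_healthy
```

Note that this only gates startup ordering; once the caller is running, nothing re-checks readiness on its behalf.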

Correctness

Question answered: Is the API contract fulfilled?

Failure mode: Stale code with missing endpoints, breaking changes deployed without callers knowing, data schema drift.

Implementation: A health endpoint that enumerates registered routes, embeds build metadata, and optionally executes a low-cost smoke query.

@app.get("/api/health")
async def health():
    registered_routes = [route.path for route in app.routes]
    return {
        "status": "ok",
        "version": "2.1.0",
        "git_sha": os.environ.get("GIT_SHA", "unknown"),
        "built_at": os.environ.get("BUILD_TIMESTAMP", "unknown"),
        "endpoints": registered_routes,
    }

What it catches: The Akashic incident. A caller checking this response would have seen /api/query/discover absent from the endpoints list and known the container was stale.

The Akashic Failure in Detail

The health check configuration in the Akashic Compose file was minimal:

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8002/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

This is a standard, reasonable health check for confirming a container has started successfully. It is completely inadequate for detecting a stale image.

The /api/health endpoint in the old image:

@app.get("/api/health")
async def health():
    return {"status": "ok"}

Two words. Always returns 200. Contains no information about what version of the code is running, what endpoints are available, or whether any feature is functional.

The new /api/query/discover endpoint was added in PR #19. It existed in the merged code on GitHub. It did not exist in the running container. The health check had no mechanism to detect this.

Building a Contract-Aware Health Endpoint

A contract-aware health endpoint embeds enough information for an external observer to verify that the correct version of the code is running with the expected API surface.

Minimum viable contract health endpoint:

import os
from fastapi import FastAPI

app = FastAPI()

# Injected at build time via Docker ARG/ENV
GIT_SHA = os.environ.get("GIT_SHA", "unknown")
API_VERSION = os.environ.get("API_VERSION", "unknown")
BUILD_AT = os.environ.get("BUILD_TIMESTAMP", "unknown")

@app.get("/api/health")
async def health():
    routes = [
        route.path
        for route in app.routes
        if hasattr(route, "path")
    ]
    return {
        "status": "ok",
        "version": API_VERSION,
        "git_sha": GIT_SHA,
        "built_at": BUILD_AT,
        "endpoints": sorted(routes),
    }

Dockerfile wiring:

ARG GIT_SHA=unknown
ARG BUILD_TIMESTAMP=unknown
ARG API_VERSION=unknown

ENV GIT_SHA=$GIT_SHA
ENV BUILD_TIMESTAMP=$BUILD_TIMESTAMP
ENV API_VERSION=$API_VERSION

Build command:

docker compose build \
  --build-arg GIT_SHA=$(git rev-parse --short HEAD) \
  --build-arg BUILD_TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --build-arg API_VERSION=2.1.0 \
  akashic

After deployment, a single curl confirms the correct code is running:

curl -s http://localhost:8002/api/health | jq '{sha: .git_sha, endpoints: .endpoints}'

If /api/query/discover is absent from the endpoints list, the container is stale.
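That comparison is easy to script. A minimal sketch — verify_contract is a hypothetical helper name, taking the parsed /api/health JSON and a list of endpoints the caller depends on:

```python
def verify_contract(health: dict, expected_endpoints: list[str]) -> list[str]:
    """Return the expected endpoints missing from a /api/health response.

    An empty return value means the running container exposes every
    endpoint the caller depends on; anything else means the image is stale.
    """
    present = set(health.get("endpoints", []))
    return sorted(e for e in expected_endpoints if e not in present)


# Example: a stale container that predates PR #19
stale = {"status": "ok", "endpoints": ["/api/health", "/api/query/search"]}
print(verify_contract(stale, ["/api/query/discover", "/api/query/search"]))
```

Run after every deploy, a non-empty result is an immediate signal to rebuild rather than an eight-day mystery.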

Smoke Query Layer

Beyond contract checks, a smoke query layer exercises the actual critical path. This goes beyond confirming an endpoint exists — it confirms the endpoint works.

from fastapi.responses import JSONResponse

@app.get("/api/health/deep")
async def deep_health():
    """
    Executes a minimal smoke query to verify the full stack is functional.
    Only use this endpoint for deployment verification, not routine liveness.
    """
    try:
        result = await run_smoke_query("__health_check__")
        return {
            "status": "ok",
            "smoke_query": "passed",
            "latency_ms": result.latency_ms,
        }
    except Exception as e:
        return JSONResponse(
            status_code=503,
            content={"status": "degraded", "smoke_query": "failed", "error": str(e)},
        )

The /api/health/deep endpoint is not for Docker's routine health check — the 30-second poll would add unnecessary load. It is for deployment verification: run it once after every rebuild to confirm the full stack is functional before marking the deployment complete.
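The deploy-side decision can be kept as a small pure function — a sketch (deployment_verdict and the latency threshold are illustrative, not part of the Akashic codebase) that a deploy script would feed the status code and parsed body of the /api/health/deep response:

```python
def deployment_verdict(status_code: int, body: dict,
                       max_latency_ms: float = 2000.0) -> str:
    """Classify a /api/health/deep response for a deploy script.

    "fail" — smoke query did not pass; roll back or investigate.
    "slow" — functional, but the critical path exceeded the latency budget.
    "pass" — safe to mark the deployment complete.
    """
    if status_code != 200 or body.get("smoke_query") != "passed":
        return "fail"
    if body.get("latency_ms", 0) > max_latency_ms:
        return "slow"
    return "pass"


print(deployment_verdict(200, {"smoke_query": "passed", "latency_ms": 120}))
```

Keeping the verdict logic separate from the HTTP call makes it trivial to unit-test the rollback policy without a running container.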

Monitoring Health Over Time

A health check that runs at deploy time is better than nothing. A health check that runs continuously surfaces regressions before they become incidents.

The Horus watchdog on the Mac Mini monitors HTTP endpoints on a schedule. Extending Horus to monitor /api/health and alert on unexpected changes to git_sha or missing expected endpoints closes the detection window from "8 days when someone notices" to "minutes when Horus fires."

# Horus monitor config
{
    "name": "akashic-contract",
    "url": "http://localhost:8002/api/health",
    "interval_seconds": 300,
    "checks": [
        {"field": "status", "expected": "ok"},
        {"field": "endpoints", "contains": "/api/query/discover"},
        {"field": "endpoints", "contains": "/api/query/search"},
    ],
    "alert_channel": "discord_logs"
}

Key Takeaways

  • Docker's Up (healthy) status means the process is alive and responding to the health check endpoint. It does not mean the API contract is fulfilled.
  • The Akashic container reported healthy for 8 days while the /api/query/discover endpoint was missing — because the health check only tested process liveness.
  • A contract-aware health endpoint embeds git_sha, api_version, build_timestamp, and the list of registered routes. This turns a 30-second curl into a complete deployment verification.
  • Smoke query endpoints (/api/health/deep) go further — they exercise the actual critical path and confirm the stack is functional end-to-end.
  • Horus-style continuous monitoring with contract checks closes the detection window from days to minutes.

What's Next

The Akashic gap was one of eleven repos that drifted simultaneously. In Lesson 234, we examine why drift does not stay linear — one skipped deploy creates conditions that make the next deploy even easier to skip, and the compounding effect of drift across a fleet.