ASK KNOX
LESSON 222

The Zombie Tax: OOM Prevention and Process Hygiene for Persistent AI Platforms

204 orphaned Node.js processes. ~200% aggregate CPU drain. Two hours before anyone noticed. This is what process hygiene debt looks like when it comes due — and how to prevent it from ever accumulating.

Why Your Trading Bot Doesn't Trade

The terminal said it was using 3.15 GB of RAM. A single Chrome process had been running inline in the session for one hour and forty-nine minutes. The OS killed it. Then it killed everything else.

That was incident two of three in a two-hour firefight that started with a watchdog daemon leaking 204 orphaned Node.js processes — each one accumulating CPU time, each one invisible until ps aux returned a count so high it looked like a bug in the command.

This lesson is the post-mortem. Three distinct failure modes, three fixes, one underlying pattern: processes that escape their intended lifetime drain resources until the system collapses.

The Three Failure Modes

Failure Mode 1: The Full Test Suite Hook

The first OOM wasn't dramatic. It was incremental. Every time a file was saved in a project with 1,970 tests, a PostToolUse hook fired and ran the full test suite in the background.

On a small project, this is fine. On a project with 1,970 pytest tests and coverage collection enabled, it is a 200–400 MB allocation per save event. In a long coding session with dozens of saves, runs overlap: memory from one run has not returned to the OS before the next one starts. Resident memory climbs. Eventually the OOM killer arrives.

The fix is scope discipline. You don't need the full suite to know if your change broke something. You need the related test file.

# Find the test file related to the file being edited
find_related_test() {
  local stem
  stem=$(basename "$1" | sed 's/\.[^.]*$//')

  # Python (test_<stem>.py) or TypeScript (<stem>.test.ts): first match wins,
  # so the function emits at most one path
  find . \( -name "test_${stem}.py" -o -name "${stem}.test.ts" \) \
    -not -path "*/node_modules/*" 2>/dev/null | head -1
}

If no related test file exists, skip. Don't fall back to the full suite. The full suite runs in CI. The hook runs on save.
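The same scoping rule can be sketched in Python for a hook runner. related_test is a hypothetical helper, not part of any hook API, assuming the test_&lt;stem&gt;.py and &lt;stem&gt;.test.ts conventions above:

```python
from pathlib import Path
from typing import Optional

def related_test(edited_file: str, root: str = ".") -> Optional[Path]:
    """Return the test file related to an edited source file, or None.

    None means skip. The hook never falls back to the full suite.
    """
    stem = Path(edited_file).stem
    for pattern in (f"test_{stem}.py", f"{stem}.test.ts"):
        for hit in sorted(Path(root).rglob(pattern)):
            if "node_modules" not in hit.parts:
                return hit
    return None
```

When this returns None, the hook exits 0 and the full suite stays where it belongs, in CI.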

Additional guards that matter:

# Skip if RAM is low — the last thing a pressure situation needs is more test processes
# (vm_stat reports its page size on its first line; hardcoding 4096 is only correct
# on Intel Macs, Apple Silicon uses 16 KB pages)
PAGE_SIZE=$(vm_stat | sed -n 's/.*page size of \([0-9]*\) bytes.*/\1/p')
FREE_GB=$(vm_stat | awk -v ps="${PAGE_SIZE:-4096}" '/Pages free/{f=$3} /Pages inactive/{i=$3} END{print int((f+i)*ps/1073741824)}')
if [ "$FREE_GB" -lt 4 ]; then exit 0; fi

# Always suppress coverage — it doubles memory and you don't need it on save
pytest path/to/test_file.py --no-cov -p no:cacheprovider

Failure Mode 2: The Chrome Process That Outlived Its Session

The second OOM was visible in Activity Monitor: Terminal was using 3.15 GB. A Chrome instance launched by a browser-automation tool had been running for one hour and forty-nine minutes inside an interactive session. The session ended. The Chrome process did not.

This failure mode is common in AI platforms that use browser automation. The tool spawns Chrome, does its work, and exits normally. But Chrome — especially launched with --remote-debugging-port and a persistent profile directory — is engineered to stay resident. It doesn't exit when the controlling process does.

In a platform where sessions can run for hours and multiple sessions may run concurrently, every one of those Chrome instances is a background RAM drain waiting to become a foreground crisis.

The fix is structural: Chrome must not outlive a session. Kill it on stop.

# In session-retro.sh (Stop hook) — always runs on session end
pkill -9 -f "chrome-cdp-profile" 2>/dev/null || true
pkill -9 -f "remote-debugging-port=9242" 2>/dev/null || true

And as a RAM pressure guard, kill it before it triggers the OOM killer:

# In context-window-monitor.sh (PostToolUse hook) — runs on every tool call
FREE_PAGES=$(vm_stat 2>/dev/null | awk '/Pages free/{gsub(/\./,"",$3);print $3}')
INACTIVE_PAGES=$(vm_stat 2>/dev/null | awk '/Pages inactive/{gsub(/\./,"",$3);print $3}')
PAGE_SIZE=$(vm_stat 2>/dev/null | sed -n 's/.*page size of \([0-9]*\) bytes.*/\1/p')
FREE_GB=$(( (FREE_PAGES + INACTIVE_PAGES) * ${PAGE_SIZE:-4096} / 1024 / 1024 / 1024 ))
if [ "$FREE_GB" -lt 2 ]; then
  pkill -9 -f "chrome-cdp-profile" 2>/dev/null || true
fi

Failure Mode 3: The Zombie Tax

This was the one that required a ps aux | wc -l to find. The watchdog daemon (Horus) calls notify_openclaw() to fire a system event when a service is unhealthy. The notification call looked like this:

def notify_openclaw(message: str) -> None:
    try:
        subprocess.run(
            ["openclaw", "system", "event", "--text", message, "--mode", "now"],
            timeout=10, capture_output=True
        )
    except Exception:
        pass

This is a pattern you will find in virtually every codebase that calls external CLI tools. It looks correct. The timeout=10 parameter appears to handle the case where the tool hangs. The except Exception: pass makes it non-fatal.

It is not correct.

When the timeout expires, subprocess.run() raises TimeoutExpired, but its cleanup is incomplete: it kills only its immediate child, so anything that child has spawned keeps running, and the lower-level Popen.wait() and Popen.communicate() do not kill the child at all. The except Exception: pass swallows the exception. The parent moves on. Each surviving process is now an orphan with PPID=1, accumulating CPU time, burning memory, and contributing nothing.

The watchdog was checking 15 monitors on a 30-second interval. Several were failing. The openclaw-gateway service it notified through was itself unhealthy — every notification call was hanging until timeout. Over two hours, 204 orphaned Node.js processes accumulated, each consuming only a sliver of CPU on its own but collectively burning ~200% CPU (two full cores) continuously.

Diagnosing this required one command:

ps -eo pid,ppid,etime,command | grep openclaw | sort -k3

Every process showed PPID=1 — orphaned. Runtimes spanning from four minutes to two hours and thirty-eight minutes. The signature of accumulated zombies is unmistakable once you know what to look for: dozens of identical processes, all reparented to init, all with growing elapsed times.

The fix is explicit cleanup on timeout:

import os
import signal
import subprocess

def notify_openclaw(message: str) -> None:
    try:
        proc = subprocess.Popen(
            ["openclaw", "system", "event", "--text", message, "--mode", "now"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
            start_new_session=True,  # ← child leads its own process group
        )
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            try:
                os.killpg(proc.pid, signal.SIGKILL)  # ← the cleanup the timeout never did
            except ProcessLookupError:
                pass  # group already exited
            proc.wait()  # ← reap the zombie
    except Exception:
        pass

os.killpg() sends SIGKILL to the child and to everything it spawned: start_new_session=True makes the child a process-group leader, so its helpers land in the same group and die with it. proc.wait() reaps the process table entry. Every process this call created is guaranteed to be gone within 10 seconds regardless of what the downstream service is doing.
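The semantics are easy to verify in isolation. A minimal, self-contained demonstration that Popen.wait(timeout=...) leaves the child alive, and that an explicit kill-and-wait reclaims it:

```python
import subprocess

# Spawn a child that would outlive us, let the timeout fire, then clean up.
proc = subprocess.Popen(["sleep", "60"])
try:
    proc.wait(timeout=0.2)
except subprocess.TimeoutExpired:
    assert proc.poll() is None   # the timeout fired, but the child is still running
    proc.kill()                  # SIGKILL: wait() never did this for us
    proc.wait()                  # reap the process table entry

assert proc.returncode == -9     # killed by SIGKILL, and reaped: no zombie left
```

Run it once and the two asserts tell the whole story: the timeout alone changed nothing about the child's lifetime.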

The Diagnostic Toolkit

When CPU is elevated and you can't find the cause in the usual places, this is the sequence:

# 1. Count processes by name — is anything abnormally numerous?
ps aux | awk '{print $11}' | sort | uniq -c | sort -rn | head -20

# 2. Find orphaned processes (PPID=1 that shouldn't be system daemons)
ps -eo pid,ppid,etime,command | awk '$2 == 1' | grep -v "launchd\|kernel\|/sbin\|/usr"

# 3. Sum CPU by process name (the [p] trick keeps grep from counting itself)
ps aux | grep "[p]rocess-name" | awk '{sum += $3} END {print "Total CPU%:", sum}'

# 4. Check RAM pressure breakdown
vm_stat | grep -E "Pages free|Pages inactive|Pages active|Pages wired"

The runtimes in column 3 of ps -eo tell the story. A legitimate daemon has a runtime matching system uptime. An orphan spawned two hours ago by a cron job that finished in 30 seconds has a two-hour runtime. That's the fingerprint.
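Once the manual sweep has turned something up, the same check can run on a schedule. A sketch in Python (find_orphans is our name, not an existing tool) parsing the same ps output:

```python
import subprocess
from typing import List, Tuple

def find_orphans(name_fragment: str) -> List[Tuple[int, str, str]]:
    """Return (pid, etime, command) for PPID=1 processes whose command
    contains name_fragment: the signature of an accumulating zombie tax."""
    # Trailing '=' on each column suppresses the header row.
    out = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,etime=,command="],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = []
    for line in out.splitlines():
        parts = line.split(None, 3)  # pid, ppid, etime, full command
        if len(parts) == 4 and parts[1] == "1" and name_fragment in parts[3]:
            hits.append((int(parts[0]), parts[2], parts[3]))
    return hits
```

A watchdog that already has a 30-second loop can call this with its own spawn names and alert when the count crosses a threshold, instead of waiting for a human to notice the process table.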

Rate-Limiting Notifications in Watchdog Daemons

The zombie storm was made possible by a second issue: the watchdog was firing notifications on every heal attempt with no rate limit applied to that path. With 15 monitors and a 30-second check interval, a failing service cluster could generate dozens of notification calls per minute.

Every notification call that hangs becomes a zombie. Rate-limiting the notification path is not just a courtesy to your notification channel — it is a memory safety mechanism.

The pattern:

# Only alert if enough time has passed since the last alert
if time.time() - state.last_alert_ts > 300:  # 5-minute cooldown
    _notify(config, name, message)
    state.last_alert_ts = time.time()

Apply this to every notification path: initial failure, heal attempt, heal failure, cooldown period, circuit open. Without it, a single failing service with a fast check interval can spawn hundreds of notification processes before the circuit breaker engages.
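That cooldown can be factored into a small reusable gate. AlertGate and its per-monitor keying are illustrative, not the watchdog's actual API; taking now as a parameter (pass time.monotonic() in production) keeps it deterministic to test:

```python
class AlertGate:
    """Allow at most one alert per monitor per cooldown window.

    Every alert dropped here is one subprocess never spawned, so the
    rate limit is memory safety, not just channel etiquette.
    """

    def __init__(self, cooldown_s: float = 300.0) -> None:
        self.cooldown_s = cooldown_s
        self._last_sent: dict[str, float] = {}

    def allow(self, monitor: str, now: float) -> bool:
        last = self._last_sent.get(monitor)
        if last is not None and now - last < self.cooldown_s:
            return False  # still cooling down: drop the alert
        self._last_sent[monitor] = now
        return True
```

One gate instance, shared by every notification path in the daemon, caps the spawn rate at one process per monitor per window no matter how fast the check loop runs.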

The Circuit Breaker Pattern

After three consecutive failed heal attempts, a well-designed watchdog should open its circuit breaker: stop attempting to heal, escalate to the on-call channel, and wait for manual intervention. This is not pessimism — it is the recognition that if three automated heal attempts failed, the fix requires human judgment.

if state.heal_failures >= 3:
    state.heal_circuit_open = True
    _notify(config, name, "CIRCUIT OPEN — manual intervention required")
    # From here: alert every 5 minutes, do not attempt further heals

The circuit breaker serves double duty. It prevents the heal loop from compounding the problem (a restart loop is often worse than staying down), and it caps the notification rate automatically once engaged.
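The same logic as a minimal state object. Field and method names here are illustrative sketches, not the daemon's actual fields:

```python
from dataclasses import dataclass

@dataclass
class HealCircuit:
    """Open after max_failures consecutive failed heals, then alert-only."""

    max_failures: int = 3
    failures: int = 0
    is_open: bool = False

    def record_heal(self, succeeded: bool) -> None:
        if succeeded:
            self.failures = 0
            self.is_open = False      # service recovered: close the circuit
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.is_open = True   # stop healing; escalate to a human

    def may_heal(self) -> bool:
        return not self.is_open
```

The key design choice is that the counter tracks consecutive failures: one successful heal resets it, so a flaky-but-recovering service never trips the breaker.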

The INPUT=$(cat) Anti-Pattern

Unrelated to process zombies but part of the same class of memory debt: shell hooks that load their entire stdin into a variable.

# This loads the full JSON payload into memory for the hook's lifetime
INPUT=$(cat)
SESSION_ID=$(echo "$INPUT" | jq -r '.session_id')

In a PostToolUse hook that fires on every tool call, this variable is allocated and released thousands of times per session. The payload can be hundreds of kilobytes. The fix is to stream directly through jq:

# Read once, parse once, no variable holding the full payload
SESSION_ID=$(jq -r '.session_id // ""' 2>/dev/null)

This is a small optimization per call, but at thousands of calls per session across concurrent agents, it compounds.

Lesson 222 Drill

Audit the subprocess calls in your platform:

  1. Find every place you call subprocess.run() or subprocess.call() with a timeout parameter.
  2. For each one, ask: if this call hangs and the timeout fires, is the child killed?
  3. If the answer is no, convert it to the Popen + proc.kill() pattern.

Then run:

ps -eo pid,ppid,etime,command | awk '$2 == 1' | grep -v "launchd\|kernel\|/sbin\|/usr" | sort -k3 -r | head -20

If anything shows up with a long elapsed time that shouldn't be there, you have a zombie you didn't know about.

Bottom Line

Processes that escape their intended lifetime are not a performance issue. They are a correctness issue. A test hook that runs the full suite is wrong. A Chrome process that outlives its session is wrong. A subprocess that hangs past its timeout without being killed is wrong.

The zombie tax compounds silently. It doesn't declare itself as an OOM event until the bill is due. By then, you have 204 processes, a terminal using 3.15 GB, and an OOM dialog interrupting a live session.

The rules:

  1. Scope test hooks to the related file. Never fall back to the full suite.
  2. Every browser process needs a structural kill guarantee, not best-effort cleanup.
  3. A timeout alone does not clean up: Popen.wait(timeout=N) never kills the child, and subprocess.run() kills only its immediate child. Kill the process group explicitly.
  4. Rate-limit every notification path. The rate limit is memory safety, not just politeness.
  5. When you see elevated CPU and can't explain it, run ps -eo pid,ppid,etime,command and look for PPID=1 orphans with long runtimes.

The platform runs 24/7. It has to. That means it has to manage its own resource lifecycle with the same discipline it applies to its outputs.