ASK KNOX
LESSON 236

Ghost Processes Accumulate

Zombie processes, orphaned test runners, and duplicate instances pile up without active cleanup. They are invisible until they aren't.


The process table on Tesseract, March 29, 2026:

  • PID 70240: python foresight/main.py — started 7 days ago. Stale code. The launchd instance was also running.
  • Two Shiva instances: one from launchd, one from a TTY session someone had left open. Both trading on the same Polygon wallet.
  • Nineteen pytest processes from agent coding sessions — none had cleaned up after themselves.

None of these had triggered any alert. All were consuming resources. Two of them were placing trades.

The Ghost Process Taxonomy

Ghost processes are processes that should not be running — either because they have finished their purpose, because they are duplicates of a managed service, or because they represent abandoned work that was never cleaned up.

Zombie (defunct): A process that has completed execution but whose exit status has never been collected. The process occupies a slot in the process table but consumes no CPU. Zombies accumulate when a parent process spawns children and fails to call wait() after they exit. Visible as Z in ps status. Individually harmless; in bulk they indicate a parent with broken lifecycle management.
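The zombie mechanism is easy to reproduce. A minimal sketch, assuming a Unix system (os.fork is unavailable on Windows): the parent forks a child that exits immediately, and only a later waitpid() removes the defunct entry from the process table.

```python
import os
import time

# Fork a child that exits immediately. Until the parent calls waitpid(),
# the child's exit status is uncollected and it shows as "Z" in ps.
pid = os.fork()
if pid == 0:
    os._exit(0)          # child: exit without the parent having waited

time.sleep(1)            # child is now a zombie (state Z / <defunct>)

# Collecting the exit status reaps the zombie from the process table
reaped_pid, status = os.waitpid(pid, 0)
```

While the sleep is in progress, inspecting the child with ps shows state Z; after waitpid() the entry is gone.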

Orphan: A running process whose parent has exited. The orphan is adopted by PID 1 (init/launchd). Unlike zombies, orphans are still executing and consuming resources. Common in agent sessions: an agent spawns a subprocess, the agent session ends, the subprocess continues running indefinitely.
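The adoption step can be reproduced directly. A minimal sketch, assuming a Unix system: an intermediate shell backgrounds a sleep and exits, leaving the sleep orphaned, the same shape as an agent session that spawns pytest and then ends.

```python
import os
import subprocess
import time

# An intermediate shell backgrounds a sleep and exits immediately.
# The sleep loses its parent and is adopted by PID 1 (or the nearest
# subreaper on systems that designate one).
result = subprocess.run(
    ["sh", "-c", "sleep 30 & echo $!"],
    capture_output=True, text=True, check=True,
)
orphan_pid = int(result.stdout.strip())

time.sleep(0.5)  # let the shell exit and the kernel re-parent the sleep

new_parent = subprocess.run(
    ["ps", "-o", "ppid=", "-p", str(orphan_pid)],
    capture_output=True, text=True,
).stdout.strip()

# Clean up the demo orphan — the point of this whole lesson
os.kill(orphan_pid, 15)
```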

Duplicate: Two or more instances of the same service running simultaneously. Occurs when a service is started manually (TTY session, script, test run) while a managed instance (launchd, Docker) is already running. For stateless services, duplicates cause resource waste. For stateful services — especially trading bots — duplicates cause correctness failures.

Stale: A long-running process that is still alive but running outdated code. The service was started from a git checkout that has since been superseded. No crash, no error — just wrong behavior that diverges from the current codebase. The Foresight PID 70240 was this type.

The March 2026 Ghost Census

Foresight PID 70240

Seven days prior, someone had started a Foresight instance manually — likely for a debugging session or a one-off strategy test. The session ended. The process did not. Launchd was also running a Foresight instance from the managed plist. Two Foresight processes ran in parallel for seven days.

Foresight is read-heavy — it monitors Polymarket markets and places bets based on probability signals. Two instances running in parallel means two sets of polling calls, potentially two sets of position evaluations, and depending on the state management, possible double-execution of trade signals.

The stale instance was running a checkout that was 7 days and approximately 12 commits behind the launchd instance. Both were analyzing the same markets with different logic.

Duplicate Shiva

Shiva is a sports Polymarket bot trading on Polygon using a real wallet (0x39A1525f1CeD5d4Ee334d6357758a0daf0F4e75b). The launchd instance (com.knox.shiva) was the managed service. A second Shiva instance was running in a TTY session — started during a debugging session, never killed.

Both instances shared the same wallet. Both were executing the same trading logic. Both were reading the same position state file. Order submissions from one instance were not visible to the other's in-memory state until the next state file write. This created a window where both instances could attempt to enter the same position independently.

19 Orphaned pytest Processes

Agent coding sessions spawn pytest to run tests. When the agent session ends — whether by timeout, completion, or interruption — the pytest process continues running if it was not explicitly terminated. Over multiple sessions, these accumulate.

19 pytest processes were running simultaneously on Tesseract. They were collectively consuming approximately 8GB of memory and elevating CPU utilization enough to affect the trading bots' latency. None were doing useful work — they were orphaned test runs from sessions that had long since ended.

Why Ghost Processes Accumulate

The accumulation is structural. Three failure modes:

No cleanup on agent exit. AI coding agents spawn subprocesses (test runners, linters, build tools) and do not register cleanup handlers. When the agent session ends, those subprocesses continue. The agent spawning process exits; the children are adopted by PID 1.

Manual testing left open. A developer connects via SSH, starts a service to test something, finishes, disconnects. The service keeps running. Terminal multiplexers (tmux, screen) make this worse — sessions persist across disconnects.

Launchd and manual coexist. A service has a launchd plist managing it. The developer also starts it manually from a script or command line. Both run. The developer thinks they are testing the manual instance; launchd is also running and may be doing conflicting work.

Detection

Ghost processes do not announce themselves. Detection requires active inspection.

Find long-running processes:

# Show elapsed time; the ELAPSED column contains "-" once a process
# has been running for more than 24 hours (format: dd-hh:mm:ss)
ps -eo pid,etime,comm,args | awk 'NR==1 || $2 ~ /-/'

# Find processes by name, show elapsed time
ps -eo pid,etime,comm,args | grep -E "foresight|shiva|pytest|python" | grep -v grep

Find duplicate service instances:

# Count instances of each service
pgrep -f "foresight/main.py" | wc -l   # should be 1
pgrep -f "shiva/main.py" | wc -l       # should be 1

# If count > 1, identify which is managed (check launchd)
launchctl list | grep com.knox

Find zombie processes:

ps aux | awk '$8 ~ /^Z/'

Find orphaned pytest:

# pytest processes not attached to a terminal (TTY column shows "?" or "??")
ps aux | grep "[p]ytest" | awk '$7 ~ /\?/ {print $1, $2, $7, $11}'

Prevention: PID Lockfiles

A PID lockfile prevents duplicate instances at the application level. The pattern:

import os
import sys
import atexit

LOCKFILE = "/tmp/shiva.pid"

def acquire_lock():
    if os.path.exists(LOCKFILE):
        with open(LOCKFILE) as f:
            existing_pid = int(f.read().strip())
        try:
            os.kill(existing_pid, 0)  # signal 0: check if the process exists
            print(f"Shiva already running (PID {existing_pid}). Exiting.")
            sys.exit(1)
        except ProcessLookupError:
            # Stale lockfile — previous instance did not clean up
            os.remove(LOCKFILE)
        except PermissionError:
            # PID exists but belongs to another user; treat it as running
            print(f"Shiva already running (PID {existing_pid}). Exiting.")
            sys.exit(1)

    with open(LOCKFILE, "w") as f:
        f.write(str(os.getpid()))

    atexit.register(lambda: os.path.exists(LOCKFILE) and os.remove(LOCKFILE))

if __name__ == "__main__":
    acquire_lock()
    main()

With this pattern, a second manual start attempt exits immediately with a clear message. The trading bot cannot run as a duplicate regardless of how it is launched.
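The check-then-write sequence above has a small race: two processes starting at the same moment can both see no lockfile and both write one. An alternative sketch using an advisory flock, which the kernel releases automatically when the process exits (the path and function name are illustrative):

```python
import fcntl
import os
import sys

LOCKFILE = "/tmp/shiva.lock"   # illustrative path

def acquire_flock():
    # Hold an exclusive, non-blocking lock for the life of the process.
    # The kernel drops it on exit, so there is no stale-lockfile case.
    fd = os.open(LOCKFILE, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print("Shiva already running. Exiting.")
        sys.exit(1)
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())   # PID is informational only
    return fd   # keep the descriptor open; closing it releases the lock
```

Because the lock dies with the process, the stale-lockfile branch disappears entirely; the PID written into the file is for humans inspecting the system, not for the locking logic.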

Prevention: trap EXIT in Shell Scripts

For scripts that spawn background processes:

#!/bin/bash
# test-runner.sh

# Register cleanup for all child processes
cleanup() {
    trap - EXIT INT TERM     # avoid re-entering cleanup from our own kill
    echo "Cleaning up..."
    kill -- -$$ 2>/dev/null  # kill entire process group
    wait
}
trap cleanup EXIT INT TERM

# Run tests — any background jobs are killed when script exits
pytest tests/ &
TEST_PID=$!
wait $TEST_PID

The trap cleanup EXIT ensures that when the script exits — for any reason, including the parent agent session ending — all child processes are killed. The process group kill (kill -- -$$) catches grandchildren as well.

Prevention: Process Group Cleanup in Agent Sessions

For agent-spawned processes, the agent should register a cleanup handler that kills its process group:

# At the start of any agent session that will spawn subprocesses
export AGENT_SESSION_PID=$$

# At the end of the session (or in a post-command hook)
kill -- -$AGENT_SESSION_PID 2>/dev/null

Claude Code's hook system can run a cleanup command on session exit, ensuring that any pytest or tool processes spawned during the session are terminated when the session ends.

The Weekly Process Audit

Beyond automated prevention, a weekly process audit catches accumulation before it affects performance or correctness:

#!/bin/bash
# scripts/process-audit.sh

echo "=== Long-Running Processes (>24h) ==="
ps -eo pid,etime,user,comm | awk 'NR==1 || $2 ~ /-/' | head -20

echo ""
echo "=== Duplicate Service Instances ==="
for service in foresight shiva leverage hermes apollo; do
    count=$(pgrep -f "$service" | wc -l | tr -d ' ')
    if [ "$count" -gt 1 ]; then
        echo "WARNING: $service has $count instances"
        pgrep -f -l "$service"
    fi
done

echo ""
echo "=== Zombie Processes ==="
ps aux | awk '$8 ~ /^Z/' | head -10

echo ""
echo "=== Orphaned pytest (no terminal) ==="
ps aux | grep "[p]ytest" | awk '$7 ~ /\?/ {print $1, $2, $7, $11}'

Run this script weekly. Pipe output to the Discord logs channel. Accumulation is visible before it becomes an incident.
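Piping the output to Discord can be done with a small wrapper. A hedged sketch assuming a standard Discord webhook: WEBHOOK_URL is a placeholder, and the message body is truncated to fit Discord's 2,000-character message limit.

```python
import json
import subprocess
import urllib.request

WEBHOOK_URL = "https://discord.com/api/webhooks/..."  # placeholder

def format_audit_message(audit_output: str, limit: int = 2000) -> str:
    # Wrap the audit in a code block, leaving room for the fence markers
    body = audit_output[: limit - 8]
    return f"```\n{body}\n```"

def post_audit() -> None:
    # Run the audit script and post its output to the logs channel
    audit = subprocess.run(
        ["bash", "scripts/process-audit.sh"],
        capture_output=True, text=True,
    ).stdout
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"content": format_audit_message(audit)}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Schedule post_audit() from a weekly cron entry or launchd job alongside the audit script.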

Key Takeaways

  • Ghost processes accumulate in four types: zombie (finished, uncollected), orphan (running without supervision), duplicate (multiple instances of the same service), and stale (running outdated code).
  • The March 2026 audit found a duplicate Shiva trading bot (two instances against the same wallet), a 7-day-old stale Foresight instance, and 19 orphaned pytest processes collectively consuming 8GB of memory.
  • PID lockfiles prevent duplicate service instances at the application level — mandatory for any service that manages money or shared state.
  • trap cleanup EXIT in shell scripts and agent session cleanup hooks prevent orphan accumulation from test runners and tooling.
  • A weekly process audit script, piped to a monitoring channel, surfaces accumulation before it becomes a resource or correctness incident.

What's Next

This completes the Infrastructure Drift track. You now have the full picture: code gaps, lying health checks, exponential drift, permission bombs, and ghost processes. The next track applies these lessons to building a deployment verification system that catches all five failure modes automatically before they reach production.