Duplicate Services Equal Financial Risk
For trading bots and stateful services, duplicate instances are not waste — they are an active danger. Singleton patterns are non-negotiable.
The Incident
On March 29, 2026, an audit of the Shiva sports trading bot revealed two live instances running simultaneously on Tesseract.
Both instances were connected to the same Polygon wallet. Both were monitoring the same markets. Either instance could submit a trade. Neither instance knew the other existed.
The financial exposure: the same position could be opened twice, spending double the intended capital. A market that should have been skipped (already traded by Instance 1) could be re-entered by Instance 2. Stop-loss logic in each instance was tracking its own trades, not the combined position. The wallet balance could be depleted faster than any configured limit accounted for.
Why Duplicates Happen
Duplicate service instances are not a random occurrence. They follow a predictable pattern that repeats across teams and projects:
The dev-left-it-running pattern:
- Engineer SSHs into the server to debug or test something.
- Starts the service manually: `python shiva/main.py`.
- Confirms it looks OK, disconnects from SSH.
- The terminal session closes. The nohup'd or backgrounded process keeps running.
- That evening, launchd's `KeepAlive = true` fires and starts a second instance.
- Both instances run indefinitely until someone explicitly audits process lists.
The restart-without-stop pattern:
- CI deploys a new version and runs `launchctl kickstart` without first running `launchctl stop`.
- The old process is still alive. The new one starts.
- For a brief window — or indefinitely, if kickstart does not kill the old one — both run.
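One way to close that window is kickstart's `-k` flag, which kills any running instance before starting the new one. A minimal sketch, assuming a hypothetical launchd label `com.example.shiva` (adjust to your actual job):

```shell
#!/bin/sh
# Restart a launchd service without leaving the old instance running.
# "com.example.shiva" is a hypothetical label for illustration.
LABEL="com.example.shiva"

if command -v launchctl >/dev/null 2>&1; then
    # -k kills any currently running instance before starting the new one
    launchctl kickstart -k "system/${LABEL}"
else
    echo "launchctl not found; nothing to restart"
fi
```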
The failover-without-fencing pattern:
- A primary instance crashes, launchd restarts it.
- The crash left a lock or queue entry that the new instance picks up.
- An external watchdog also detected the crash and spawned its own instance.
- Two "recovery" instances are now both processing the same queue.
The PID Lockfile Pattern
The industry-standard solution for preventing duplicate processes is the PID lockfile: a small file that records the PID of the running instance. The pattern is simple and robust:
At startup:
```python
import os
import sys

LOCKFILE = "/tmp/shiva.pid"

def acquire_lock():
    if os.path.exists(LOCKFILE):
        with open(LOCKFILE) as f:
            old_pid = int(f.read().strip())
        # Check if that process is still alive
        try:
            os.kill(old_pid, 0)  # signal 0 = existence check, no actual signal sent
            print(f"[ABORT] Shiva is already running as PID {old_pid}. Exiting.")
            sys.exit(1)
        except ProcessLookupError:
            # PID file is stale — previous process is gone
            print(f"[INFO] Stale lockfile found (PID {old_pid} is dead). Removing.")
            os.remove(LOCKFILE)
    # Write our PID
    with open(LOCKFILE, "w") as f:
        f.write(str(os.getpid()))
    print(f"[INFO] Lock acquired: PID {os.getpid()}")

def release_lock():
    if os.path.exists(LOCKFILE):
        os.remove(LOCKFILE)
```
At shutdown (via atexit or signal handlers):
```python
import atexit
import signal

atexit.register(release_lock)

def handle_signal(signum, frame):
    release_lock()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_signal)
signal.signal(signal.SIGINT, handle_signal)
```
The critical detail is os.kill(pid, 0). This sends signal 0, which does not affect the target process at all — it only checks whether the process exists. If ProcessLookupError is raised, the process is gone, and the lockfile is stale and safe to delete. If PermissionError is raised instead, the process exists but belongs to another user; treat that as already running, not as a stale lock.
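One caveat: the check-then-write sequence above has a small race window in which two processes can both see no lockfile and both write one. A variant (a sketch, not part of the original pattern above) closes that window by creating the file atomically with O_CREAT | O_EXCL, so exactly one process can win:

```python
import os
import sys

LOCKFILE = "/tmp/shiva.pid"

def acquire_lock_atomic():
    """Atomically create the lockfile; exactly one process can succeed."""
    try:
        # O_EXCL makes creation fail if the file already exists —
        # the existence check and the write happen as one atomic operation.
        fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        print("[ABORT] Lockfile exists — another instance may be running.")
        sys.exit(1)
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    print(f"[INFO] Lock acquired: PID {os.getpid()}")
```

Stale-file handling still needs the os.kill(pid, 0) liveness check from above before deleting and retrying.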
Why Trading Services Are the Highest-Risk Case
For general-purpose services, a duplicate instance might cause log noise or slightly elevated CPU. For trading services, the failure modes are financial:
Double-spending: Both instances see the same market opportunity, pass the same signal threshold, and each submits a full-sized position. Capital exposure doubles instantly.
Conflicting stop-losses: Instance 1 opens a position. Instance 2 does not know about it (position is tracked in Instance 1's in-memory state or a shared DB that neither is locking). When price moves adversely, Instance 1's stop-loss fires and closes. Instance 2, seeing a fresh market, opens again. You are now permanently in the position despite a stop.
Budget exhaustion: Each instance tracks its own daily loss limit. Two instances at 50% daily limit each are actually at 100% combined. The budget guardrail is bypassed.
Race conditions on shared USDC.e balance: Polygon transactions are not synchronous. Both instances read the same wallet balance, both calculate they have sufficient margin, both submit. One succeeds; one fails with an on-chain error or, worse, partially fills.
Alternative Singleton Patterns
PID lockfiles work well for Python processes and long-running services. Other patterns are appropriate in different contexts.
Unix Socket Exclusivity
A process that listens on a Unix domain socket achieves natural singleton behavior: the second instance trying to bind to the same socket path fails immediately with Address already in use. No explicit lockfile management needed:
```python
import os
import socket
import sys

SOCKET_PATH = "/tmp/shiva.sock"

def bind_singleton_socket():
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(SOCKET_PATH)
    except OSError:
        print(f"[ABORT] Socket {SOCKET_PATH} already bound — another instance is running.")
        sys.exit(1)
    return sock
```
Clean up on exit with os.unlink(SOCKET_PATH). Note that a stale socket file left behind by a crash will also block bind(), so remove it at startup if no process is actually listening on it.
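On Linux specifically, the stale-file problem can be avoided entirely with an abstract-namespace socket, which has no filesystem entry and vanishes when the process dies. A sketch, assuming a Linux host; the name "shiva-singleton" is an illustrative choice:

```python
import socket
import sys

# Linux-only: a leading NUL byte places the socket in the abstract
# namespace, so there is no file on disk and nothing stale survives
# a crash. "shiva-singleton" is a hypothetical name.
ABSTRACT_NAME = "\0shiva-singleton"

def bind_abstract_singleton():
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(ABSTRACT_NAME)
    except OSError:
        print("[ABORT] Another instance is running.")
        sys.exit(1)
    return sock
```

This does not apply on macOS, which is why the lockfile and flock() patterns remain the portable defaults.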
flock() on a File
The fcntl.flock() system call provides an advisory lock that the OS automatically releases when the process dies — no cleanup code needed:
```python
import fcntl
import sys

# Keep the file object at module scope so the descriptor (and the lock)
# stays open for the lifetime of the process.
LOCKFILE = open("/tmp/shiva.lock", "w")
try:
    fcntl.flock(LOCKFILE, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    print("[ABORT] Another instance holds the lock.")
    sys.exit(1)
```
LOCK_NB (non-blocking) makes the lock attempt fail immediately rather than waiting. The lock releases automatically when the file descriptor is closed — even if the process crashes.
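For more structured code, the same flock() idiom can be wrapped in a context manager. A minimal sketch; the lock path is illustrative:

```python
import fcntl
import sys
from contextlib import contextmanager

@contextmanager
def singleton_lock(path="/tmp/shiva.lock"):
    # The file object is held open for the duration of the context,
    # which is what keeps the advisory lock alive.
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print("[ABORT] Another instance holds the lock.")
        sys.exit(1)
    try:
        yield f
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
```

Usage is then simply `with singleton_lock(): run_service()`, and the lock releases on any exit path, including exceptions.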
systemd / launchd Singleton Enforcement
For services managed by launchd, KeepAlive = true combined with ProcessType = Background gives the OS responsibility for lifecycle management. As long as only one launchd job owns the service, only one instance runs. The danger is exactly the March 29 scenario: a manually started process outside of launchd coexists with the launchd-managed one.
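For reference, a minimal job definition for this setup might look like the following. The label and paths here are assumptions for illustration, not the actual Shiva configuration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Hypothetical label and paths; adjust to your deployment -->
    <key>Label</key>
    <string>com.example.shiva</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python3</string>
        <string>/opt/shiva/main.py</string>
    </array>
    <key>KeepAlive</key>
    <true/>
    <key>ProcessType</key>
    <string>Background</string>
</dict>
</plist>
```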
The Audit Command
Catching duplicate processes is straightforward if you look:
```shell
# Check for Shiva duplicates
pgrep -fa shiva

# Check for any Python service duplicates
ps aux | grep python | grep -v grep

# Check for multiple instances of a named service
ps aux | awk '{print $11}' | sort | uniq -c | sort -rn | head -20
```
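The audit above can be turned into a pass/fail post-deploy check. A sketch, using the hypothetical entry point shiva/main.py from this lesson as the match pattern:

```shell
#!/bin/sh
# Fail loudly if more than one instance of the service is running.
# "shiva/main.py" is the hypothetical entry point from this lesson.
PATTERN="shiva/main.py"

# pgrep -fc counts matches against the full command line
count=$(pgrep -fc "$PATTERN" || true)
if [ "${count:-0}" -gt 1 ]; then
    echo "[FAIL] $count instances matching '$PATTERN'"
    exit 1
fi
echo "[OK] ${count:-0} instance(s) matching '$PATTERN'"
```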
Add this check to your regular deployment runbook and to any monitoring script that runs after deployments.
Key Takeaways
- Duplicate stateful service instances are financially dangerous for trading bots — they bypass budget limits, double positions, and create conflicting state in a shared wallet.
- The typical cause is a manually started process left running when launchd also manages the same service, not a systemic architecture failure.
- PID lockfiles with os.kill(pid, 0) liveness checking are the standard defense; flock() is cleaner because the OS handles cleanup on process death.
- Never manually start a launchd-managed service without first stopping the launchd job — confirm with pgrep before proceeding.
- Duplicate detection should be part of every deployment runbook and post-deploy health check, not a forensic activity after something goes wrong.
What's Next
The March 29 audit found the duplicates — but only because someone manually SSH'd in and looked. In Lesson 240, we tackle the root cause of that reactive posture: what it means to have real version observability, and how to build health endpoints that tell you what is running without requiring SSH.