Blast Radius: Keeping Parallel Agents Safe
Parallel agents amplify speed. They also amplify mistakes. A buggy agent touching a shared repository at scale can corrupt everything. Isolation layers, worktree strategies, rollback protocols, and the 2-failure rule for fleets — keeping blast radius minimal.
The same property that makes parallel agents powerful makes them dangerous: they all act simultaneously.
A single agent that writes to the wrong file produces a single corrupted file. Three parallel agents that all write to the wrong file produce three corrupted files — plus the merge conflict when they try to reconcile. Five parallel agents executing a flawed strategy on a live trading system can place five simultaneous bad positions before any single failure is caught.
Parallel execution amplifies throughput. It also amplifies damage. The answer is not to stop parallelizing. The answer is isolation.
The Parallel Risk Surface
Three categories of blast radius risk in parallel agent systems:
Shared file corruption. Two or more agents write to the same file without locking. Each reads the current state, modifies it independently, and writes back — the last write silently overwrites the others (the classic lost update). The state is now wrong, and no error is raised. This is the most common production failure in multi-agent systems.
Cascading failures. Agent A fails and writes partial output to the state layer. Agent B reads that partial output and produces incorrect results. Agent C reads B's output and propagates the error further. By the time the failure is visible, it has touched three stages of the pipeline and the damage is not localized to any one agent.
Resource conflicts. Two agents touch the same external resource — same git branch, same API endpoint, same database row — simultaneously. The resource's state becomes undefined.
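The first category, the lost update, is preventable with an advisory file lock around the read-modify-write cycle. A minimal sketch using POSIX `fcntl.flock` (the `state.json` filename is hypothetical):

```python
import fcntl
import json

STATE = "state.json"  # hypothetical shared state file

def update_state(key, value):
    """Read-modify-write under an exclusive advisory lock, so two
    agents cannot interleave and silently lose each other's writes."""
    with open(STATE, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is free
        state = json.load(f)
        state[key] = value
        f.seek(0)
        json.dump(state, f)
        f.truncate()
        # lock is released automatically when the file is closed

if __name__ == "__main__":
    with open(STATE, "w") as f:
        json.dump({}, f)
    update_state("agent_a", "done")
    update_state("agent_b", "done")
```

With the lock in place, the second writer waits instead of clobbering the first; both updates land. Advisory locks only work if every agent honors the protocol, which is why the pre-spawn checklist below verifies it.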
None of these are hypothetical. The agent-fleet work in February 2026 exposed all three in a single week before the isolation layer was properly designed.
Isolation Layer 1: Git Worktrees
For code work, git worktrees are the primary isolation mechanism. A worktree is a separate working directory linked to the same git repository but checking out a different branch. Each agent gets its own worktree, its own branch, and its own filesystem path. An agent editing files in worktrees/agent-a/ has no interaction with worktrees/agent-b/.
# Create isolated worktrees for parallel agents
git worktree add worktrees/agent-a feature/task-a
git worktree add worktrees/agent-b feature/task-b
git worktree add worktrees/agent-c feature/task-c
# Each agent operates in its own worktree
agent_a --workdir worktrees/agent-a
agent_b --workdir worktrees/agent-b
agent_c --workdir worktrees/agent-c
# After review: merge worktrees back in order
cd worktrees/agent-a && git add -A && git commit -m "..."
gh pr create --head feature/task-a --base main
The isolation guarantee: Agent C failing and corrupting its worktree affects only worktrees/agent-c. Agents A and B continue running. The main branch is untouched until each PR is reviewed and merged.
Blast radius: one feature branch. Not the entire repository.
Isolation Layer 2: Separate Directories
For non-code work — content generation, data processing, image creation — worktrees are overkill. Separate directories with a consistent naming convention achieve the same isolation.
pipeline_run_2026031109/
  agents/
    researcher/
      output/research.json
      logs/agent.log
    writer/
      output/article.md
      logs/agent.log
    image_gen/
      output/thumbnail.png
      logs/agent.log
Each agent writes exclusively to its own directory. The orchestrator reads from those directories to assemble the final output. No agent can corrupt another agent's directory because no agent has write access to anything outside its own path.
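Setting up the layout and enforcing the write boundary is a few lines of code. A sketch (function names and the path-check approach are illustrative, not a fixed API):

```python
from pathlib import Path

def make_agent_dirs(run_id: str, agents: list[str]) -> dict[str, Path]:
    """Create one isolated output/ and logs/ directory per agent
    under a run root, matching the layout above."""
    root = Path(f"pipeline_run_{run_id}") / "agents"
    paths = {}
    for name in agents:
        agent_dir = root / name
        (agent_dir / "output").mkdir(parents=True, exist_ok=True)
        (agent_dir / "logs").mkdir(parents=True, exist_ok=True)
        paths[name] = agent_dir
    return paths

def safe_write(agent_dir: Path, relpath: str, data: str) -> Path:
    """Refuse any write that resolves outside the agent's own directory."""
    target = (agent_dir / relpath).resolve()
    if not target.is_relative_to(agent_dir.resolve()):
        raise PermissionError(f"write outside {agent_dir} blocked: {relpath}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(data)
    return target
```

Routing every agent write through a guard like `safe_write` turns the naming convention into an enforced boundary: a path-traversal attempt (`../writer/...`) raises instead of corrupting a sibling agent's directory.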
Isolation Layer 3: Containers
For agents executing code against live systems — running arbitrary scripts, interacting with databases, deploying to staging environments — containers add a full OS-level isolation boundary. A containerized agent cannot touch the host filesystem, cannot access network resources outside its defined network, and cannot affect any other container.
A production trading fleet uses this for strategy validation. When a new signal strategy is being evaluated, the evaluation agent runs in an isolated container with access to historical data but no write path to the live position state. A bug in the evaluation agent — even a destructive one — cannot touch the live system.
Container overhead is real. For most content pipeline agents, it is not worth the operational cost. For any agent that interacts with live financial systems, user data, or production infrastructure, it is not optional.
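The container boundary for the evaluation-agent pattern above can be sketched as a helper that assembles the docker invocation: no network, read-only root filesystem, historical data mounted read-only. The image name and paths are hypothetical; the flags are standard `docker run` options:

```python
def docker_cmd(image: str, data_dir: str) -> list[str]:
    """Build a docker run invocation for an isolated evaluation agent:
    read-only historical data, no network, no write path to the host."""
    return [
        "docker", "run", "--rm",
        "--network", "none",           # no network egress at all
        "--read-only",                 # root filesystem is read-only
        "-v", f"{data_dir}:/data:ro",  # historical data, mounted read-only
        "--tmpfs", "/tmp",             # scratch space that dies with the container
        image,
    ]
```

A destructive bug inside this container can write only to `/tmp`, which is discarded when the container exits.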
The 2-Failure Rule for Fleets
The stop-and-replan rule from Lesson 13 applies at the fleet level with one modification: when an agent in a fleet fails twice, halt that specific branch — not the entire fleet.
The protocol:
- Agent fails on first attempt → retry with original spec
- Agent fails on second attempt → halt this branch, log the failure, surface to orchestrator
- Orchestrator halts downstream agents that depend on this branch (but not independent branches)
- Orchestrator replans the failed subtask: new spec, possibly a different approach
- All independent agents continue running
- Re-spawn the failed branch with the corrected spec
This is the surgical version of stop-and-replan. A single branch failure does not collapse the fleet. It only stops the branches that were downstream of the failure.
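The branch-halting logic can be sketched as a small orchestrator class, assuming a dependency map of agent name to upstream agents (the class and method names are illustrative):

```python
from collections import defaultdict

MAX_ATTEMPTS = 2  # the 2-failure rule

class FleetOrchestrator:
    """Minimal sketch: halt only the failed branch and its transitive
    downstream dependents; independent branches keep running."""

    def __init__(self, deps: dict[str, set[str]]):
        self.deps = deps  # agent -> set of upstream agents it reads from
        self.attempts = defaultdict(int)
        self.halted = set()

    def downstream_of(self, agent: str) -> set[str]:
        """All agents that transitively depend on `agent`."""
        out = {agent}
        changed = True
        while changed:
            changed = False
            for a, ups in self.deps.items():
                if a not in out and ups & out:
                    out.add(a)
                    changed = True
        out.discard(agent)
        return out

    def report_failure(self, agent: str) -> str:
        self.attempts[agent] += 1
        if self.attempts[agent] < MAX_ATTEMPTS:
            return "retry"  # first failure: retry with the original spec
        # second failure: halt this branch and everything downstream of it
        self.halted |= {agent} | self.downstream_of(agent)
        return "halt-branch"

    def is_running(self, agent: str) -> bool:
        return agent not in self.halted
```

If B reads A's output and C reads B's, a second failure in A halts A, B, and C, while an independent agent D keeps running until the orchestrator re-spawns the branch with a corrected spec.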
Rollback Strategies
When isolation is implemented correctly, rollback is simple:
Worktree rollback: Delete the worktree and the branch. The main branch is untouched. Re-cut the branch from main, re-spawn the agent.
git worktree remove worktrees/agent-c --force
git branch -D feature/task-c
git worktree add worktrees/agent-c feature/task-c-v2
# Re-spawn with corrected spec
Directory rollback: Delete the agent's output directory. Re-run the agent with a corrected spec. The run directory for other agents is untouched.
State layer rollback: If the state.json was corrupted (stale lock, partial write), restore from the event log. Replay the events that preceded the corruption. This is why the event log must be append-only — it is your rollback source of truth.
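A replay over an append-only event log is a short fold. A sketch, assuming one JSON event per line with a monotonic `seq` field (the event shape is an assumption, not a fixed schema):

```python
import json

def replay(event_log_path, upto=None):
    """Rebuild the state dict from the append-only event log.
    Pass `upto` to stop before the sequence number of the
    corrupting write, rolling state back to just before it."""
    state = {}
    with open(event_log_path) as f:
        for line in f:
            event = json.loads(line)
            if upto is not None and event["seq"] >= upto:
                break
            state[event["key"]] = event["value"]
    return state
```

Replaying with `upto` set to the bad event's sequence number yields the last good state, which can then be written back to state.json under the lock protocol.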
The discipline that preserves the fleet is knowing when to retreat a single branch rather than fighting through a broken execution path.
Pre-Spawn Safety Checklist
Before spawning any parallel agents on a production system:
- Each agent has a dedicated worktree or directory
- No agent has write access outside its own path
- Shared state files have the lock protocol implemented
- The event log is initialized for this run
- The 2-failure rule is configured in the orchestrator
- Rollback procedure is documented (delete worktree path + recut command)
- All agents have been tested in isolation before parallel dispatch
Skipping the checklist is how you find out what your blast radius actually is, in production, at the worst possible time.
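The mechanical items on the checklist can be verified in code rather than by eye. A sketch of a pre-spawn gate (function name and path arguments are hypothetical; the review items stay manual):

```python
import os

def pre_spawn_check(agent_dirs: dict, event_log_path: str) -> list[str]:
    """Return a list of problems; an empty list means the mechanical
    checks pass. Verifies each agent has its own existing directory,
    no two agents share a path, and the event log is initialized."""
    problems = []
    seen = {}
    for name, d in agent_dirs.items():
        real = os.path.realpath(d)
        if not os.path.isdir(real):
            problems.append(f"{name}: missing directory {d}")
        if real in seen:
            problems.append(f"{name}: shares a path with {seen[real]}")
        seen[real] = name
    if not os.path.exists(event_log_path):
        problems.append("event log not initialized for this run")
    return problems
```

Wiring this into the orchestrator so a non-empty result refuses to dispatch makes "skipped the checklist" impossible rather than merely discouraged.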
Lesson 69 Drill
For a parallel agent system you are building or planning:
- Identify every shared resource your agents can touch (files, git branches, external APIs, databases)
- For each shared resource, define the isolation boundary: worktree, directory, container
- Write the rollback procedure for each agent: "if this agent fails on attempt 2, I will..."
- Implement the 2-failure rule in your orchestrator logic before first deploy
The isolation layer takes 2 hours to design and implement correctly. It saves you from an incident that takes 6 hours to diagnose and fix.
Bottom Line
Parallelism amplifies everything — speed and damage equally.
Isolation is the control mechanism. Worktrees for code. Directories for files. Containers for high-risk execution. The 2-failure rule for fleet-level stop-and-replan.
Design the blast radius before you spawn the first agent. Once an incident happens in production, you will design it anyway — but under pressure, with actual damage to clean up.
Design it first.