ASK KNOX
LESSON 277

When Agents Doom-Loop

The session retro pattern: three doom-loops on signal_engine.py in one session. Each time the agent produced a 'fix' that broke a different thing. How to detect doom-loops early, intervene, and recover without burning another two hours.

7 min read

Three times in one session, an agent got stuck on signal_engine.py. Each time, the pattern was the same: the agent would produce a fix, CI would fail, the agent would fix the new failure, CI would fail on something else, and the loop would continue until intervention.

This is the doom-loop. It is not a model problem. It is a spec problem, a data problem, or an environmental problem that the agent cannot see — and the agent is locally optimizing against symptoms that keep shifting because the underlying cause is never addressed.

The Signature

A doom-loop has a few telltale signs:

  • Each fix produces a different error. If the errors were converging toward a clean build, that would be progress. When they keep shifting to new categories, the agent is chasing symptoms.
  • The diffs grow larger over time. The agent accumulates changes to work around the new errors, producing a bloated diff that increasingly diverges from the original intent.
  • The agent starts making assumptions out loud. "I think this might be because..." is a flag that the agent is guessing. Guesses compound.
  • Build time increases. More files touched, more rebuild time, slower feedback.

Any one of these on its own is okay. Two together is a caution. Three together is a stop signal.
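The scoring above can be sketched in a few lines. This is a minimal illustration, not a tool from the session: the `Attempt` record, its field names, and the strict "every attempt is worse than the last" checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One fix attempt. All fields are hypothetical, chosen for illustration."""
    error_category: str   # e.g. "ImportError", "AssertionError"
    diff_lines: int       # size of the cumulative diff after this attempt
    hedged_guess: bool    # agent said something like "I think this might be..."
    build_seconds: float  # duration of the build that followed the attempt


def count_signals(attempts: list[Attempt]) -> int:
    """Count how many of the four telltale signs show up across the attempts."""
    if len(attempts) < 2:
        return 0  # a single attempt cannot show a trend
    categories = [a.error_category for a in attempts]
    diffs = [a.diff_lines for a in attempts]
    builds = [a.build_seconds for a in attempts]
    signals = [
        # Each fix produces a different error: no category ever repeats.
        len(set(categories)) == len(categories),
        # The diff grows with every attempt.
        all(later > earlier for earlier, later in zip(diffs, diffs[1:])),
        # The agent guessed out loud at least once.
        any(a.hedged_guess for a in attempts),
        # Build time increases with every attempt.
        all(later > earlier for earlier, later in zip(builds, builds[1:])),
    ]
    return sum(signals)


def verdict(signal_count: int) -> str:
    """Map the count to the scale above: one is okay, two is caution, three is stop."""
    if signal_count >= 3:
        return "stop"
    if signal_count == 2:
        return "caution"
    return "ok"
```

Feeding it one `Attempt` per CI run and checking `verdict(count_signals(attempts))` after each failure is enough to turn the checklist into a tripwire.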

The Two-Attempt Rule

The rule: after two failed attempts at the same task, stop and replan. It exists because humans (and agents) have a bias to keep pushing. "Just one more try" is the instinct. The instinct is wrong. If two attempts failed, the approach has a problem: the spec is ambiguous, the codebase has an undocumented assumption, the environment has drifted, or the task is genuinely harder than it appeared. None of those get better by repeating the attempt.
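One way to make the rule hard to ignore is to enforce it in whatever loop dispatches the agent. The sketch below assumes a hypothetical `attempt_fix` callback that dispatches the agent once, runs CI, and returns True on a clean build; neither the callback nor the `StopAndReplan` exception comes from a real framework.

```python
from typing import Callable

MAX_ATTEMPTS = 2  # the two-attempt rule


class StopAndReplan(Exception):
    """Raised when the attempt budget is spent without a passing result."""


def run_with_budget(
    attempt_fix: Callable[[int], bool],
    max_attempts: int = MAX_ATTEMPTS,
) -> int:
    """Call attempt_fix until it passes, and raise once the budget is spent.

    attempt_fix receives the 1-based attempt number and returns True if the
    build came back clean. The return value is the attempt that passed.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_fix(attempt):
            return attempt
    # No third try: the caller has to handle this and replan.
    raise StopAndReplan(
        f"{max_attempts} attempts failed; stop and re-examine assumptions"
    )
```

Raising an exception rather than returning False is deliberate: a return value can be ignored by the next "just one more try", while an unhandled exception halts the run.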

Inline Diagram: Doom-Loop Anatomy

[Diagram: doom-loop pattern. Stop at attempt 2: two failures → stop and replan, re-examine assumptions.]

The Intervention Playbook

When a doom-loop is detected, follow these steps in order:

  1. Halt the agent. No more instructions until the situation is understood.
  2. Pull fresh data. What does git status show? What is the current state of the file the agent is editing? What environment variables are set? The agent may have been operating on stale assumptions about the code or the environment.
  3. Re-read the spec. Is the spec specific enough? Does it match the current codebase (line numbers may have shifted)? Does it make assumptions about behavior that the code does not exhibit?
  4. Identify the real blocker. Often the real problem is something the agent cannot observe — a stale config file, a missing dependency, a test fixture that does not exist. Find it before re-dispatching.
  5. Rewrite the spec or pivot. Either the spec needs more precision, or the approach needs to change entirely. Never re-dispatch the same spec unchanged after two failures.
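Step 2 is the easiest one to script. The following is a rough sketch of a "fresh data" snapshot, assuming git is the version control system in use; the `snapshot` function, its return shape, and the choice to hash the target file are illustrative assumptions, not part of the original session.

```python
import hashlib
import os
import subprocess
from pathlib import Path


def snapshot(repo: Path, target: Path, env_names: list[str]) -> dict:
    """Collect the facts step 2 asks about, read fresh instead of remembered."""
    try:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo, capture_output=True, text=True, check=False,
        )
        # A non-zero exit code usually means repo is not a git working tree.
        git_status = result.stdout if result.returncode == 0 else None
    except OSError:
        git_status = None  # git is not on PATH, or the repo path is unusable
    # Hash the file so two snapshots can be compared without diffing by eye.
    file_hash = (
        hashlib.sha256(target.read_bytes()).hexdigest()
        if target.is_file() else None
    )
    return {
        "git_status": git_status,
        "file_sha256": file_hash,
        # Unset variables are recorded as None rather than skipped.
        "env": {name: os.environ.get(name) for name in env_names},
    }
```

Taking one snapshot before re-dispatching and comparing it with what the spec assumes (a clean tree, a particular file state, a set variable) is a cheap way to surface the blocker the agent could not see.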

The Rule

Two failures is the trigger. Stop-and-replan is the response. Find the real blocker before dispatching again. Agents doom-loop when specs are ambiguous or environments have drifted — and giving the same spec a third time just burns more hours on the same wall.