LESSON 71

A Production Multi-Agent System: Architecture Review

Every concept from this track — roles, decomposition, state, communication, isolation — assembled into one production architecture. The content pipeline as a live case study: triggers, agents, state layer, failure modes, and fleet health monitoring.

11 min read·Multi-Agent Orchestration

This lesson is a synthesis. Every concept from this track — role taxonomy, task decomposition, shared state, communication patterns, blast radius isolation — assembled into one production architecture that runs 24/7.

The case study: the blog-autopilot content pipeline. Not a theoretical exercise. A system that has run hundreds of times, failed in specific ways, been debugged and hardened, and now delivers article drafts to a GitHub PR at 9 AM every other morning without human intervention.

Walking through this architecture will show you how the pieces fit together — and what the failure modes look like in practice.

Production Fleet Architecture — Content Pipeline

Anatomy of the Fleet

Trigger layer. A launchd job (macOS's cron equivalent) fires every other day at 9 AM. It does one thing: call the orchestrator with the intent "generate a blog post." It does not execute any work. It fires the starting gun.

Orchestrator. The orchestrator reads the current state of the pipeline from state.json. It determines whether a run is already in progress (if so, it exits — one run at a time). It logs the start event to events.jsonl. It spawns the researcher agent.
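In Python terms, the orchestrator's entry logic can be sketched roughly as follows. The file layout matches this lesson; the function names and the commented-out researcher spawn are hypothetical:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

RUNS = Path("runs")
STAGES = ("research", "write", "image", "metadata", "publish")

def run_in_progress() -> bool:
    """One run at a time: scan existing run dirs for an unfinished run."""
    for state_file in RUNS.glob("*/state.json"):
        if json.loads(state_file.read_text()).get("status") == "in_progress":
            return True
    return False

def start_run():
    if run_in_progress():
        return None  # duplicate trigger: exit immediately, no second run
    run_id = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M")
    run_dir = RUNS / run_id
    run_dir.mkdir(parents=True)
    (run_dir / "state.json").write_text(json.dumps({
        "run_id": run_id,
        "status": "in_progress",
        "stages": {stage: "pending" for stage in STAGES},
        "started_at": datetime.now(timezone.utc).isoformat(),
        "errors": [],
    }, indent=2))
    with (run_dir / "events.jsonl").open("a") as log:
        log.write(json.dumps({"event": "run_started", "run_id": run_id}) + "\n")
    # here the real orchestrator would spawn the researcher agent
    return run_dir
```

Note that the in-progress check doubles as the duplicate-run guard described under failure modes.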

Researcher agent. Receives the intent plus the current topics_backlog.json (a list of pending topics maintained across runs). Selects the highest-priority unaddressed topic. Runs web search, pulls relevant data, synthesizes a research brief. Writes research.json to the run directory. Writes research.done trigger file. Terminates.

Writer agent. The file watcher detects research.done. The orchestrator spawns the writer with research.json as input context. Writer produces the full article in MDX format with proper frontmatter. Writes article.mdx to the run directory. Writes writer.done trigger. Terminates.

Image agent and metadata agent (parallel). Both are spawned simultaneously when writer.done is detected. The image agent reads the article title and excerpt, generates a hero image through the provider fallback chain (Gemini, then Leonardo, then DALL-E), and writes it to public/images/blog-autopilot/{slug}.png. The metadata agent extracts tags, category, and read time, and validates frontmatter completeness. Both write their own trigger files when done.

Publisher agent. Spawns when both image.done and metadata.done are present. Commits all files to a feature branch. Opens a PR. Posts a Discord notification with the PR link. Writes publish.done. Run complete.
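The fan-in before the publisher is just a conjunction over trigger files. A minimal sketch of the gate the orchestrator evaluates when its file watcher sees either trigger appear (function name hypothetical):

```python
from pathlib import Path

def ready_to_publish(run_dir: Path) -> bool:
    """Publisher spawns only when both parallel branches have signalled done."""
    return ((run_dir / "image.done").exists()
            and (run_dir / "metadata.done").exists())
```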

The State Layer in Practice

Each run creates a directory: runs/YYYY-MM-DDTHH-MM/. Inside:

runs/2026-03-11T09-00/
  state.json          # run status, stage tracking
  research.json       # researcher output
  article.mdx         # writer output
  images/             # image agent output
  metadata.json       # metadata agent output
  events.jsonl        # append-only event log
  research.done       # trigger files
  writer.done
  image.done
  metadata.done
  publish.done

The state.json structure:

{
  "run_id": "2026-03-11T09-00",
  "status": "in_progress",
  "topic": "Multi-agent orchestration patterns",
  "stages": {
    "research": "complete",
    "write": "in_progress",
    "image": "pending",
    "metadata": "pending",
    "publish": "pending"
  },
  "started_at": "2026-03-11T09:00:00Z",
  "errors": []
}

The lock file lives at runs/2026-03-11T09-00/state.json.lock while any agent is writing to state.json. Every agent checks for the lock before writing. TTL: 30 seconds.
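One way to implement that protocol (atomic lock creation via O_EXCL, plus the 30-second TTL for reclaiming stale locks) might look like the following sketch; the helper names are hypothetical, not the production script's API:

```python
import os
import time
from pathlib import Path

LOCK_TTL_S = 30  # a lock older than this is presumed orphaned by a crash

def acquire_lock(state_path: Path, timeout_s: float = 10.0) -> Path:
    """Create state.json.lock atomically; reclaim it if it has gone stale."""
    lock = state_path.with_name(state_path.name + ".lock")
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # O_CREAT | O_EXCL is atomic: exactly one agent can win the race
            os.close(os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
            return lock
        except FileExistsError:
            try:
                if time.time() - lock.stat().st_mtime > LOCK_TTL_S:
                    lock.unlink(missing_ok=True)  # stale: holder crashed mid-write
                    continue
            except FileNotFoundError:
                continue  # holder released between our attempt and the stat
            time.sleep(0.2)
    raise TimeoutError(f"could not acquire {lock} within {timeout_s}s")

def release_lock(lock: Path) -> None:
    lock.unlink(missing_ok=True)
```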

Failure Modes in Production

Five failure modes this pipeline has actually encountered:

1. Researcher API rate-limited. The web search API returns a 429. First failure: wait 60 seconds, then retry. Second failure: halt the researcher, write ERROR: research_failed to state.json, send a Discord alert, exit. The run is marked failed. The topics backlog entry is not consumed; it will be selected on the next scheduled run.
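That 2-failure rule reduces to a few lines. A sketch under the assumption that the search call raises an exception on a 429 (function and parameter names are hypothetical):

```python
import time

def run_with_retries(task, max_attempts: int = 2, backoff_s: float = 60.0):
    """Attempt `task` up to max_attempts times; halt loudly on the last failure."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:  # e.g. the search client raising on HTTP 429
            errors.append(f"attempt {attempt}: {exc}")
            if attempt < max_attempts:
                time.sleep(backoff_s)
    # final failure: the caller records ERROR: research_failed and alerts Discord
    raise RuntimeError(f"research_failed: {errors}")
```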

2. Writer produces a blank article. This is the silent failure mode. The writer runs successfully, produces a file, writes the trigger — but the article is empty or malformed. The image agent and metadata agent both run and produce valid output. The publisher opens a PR with a blank article. Discovery happens at PR review, not at pipeline runtime.

The fix: the metadata agent validates article length and presence of required frontmatter fields. If validation fails, it writes ERROR: validation_failed instead of metadata.done. The publisher never spawns. The Discord notification reports the validation failure with specifics.
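A sketch of that validation step. The required frontmatter fields and the length threshold shown here are assumptions, not the pipeline's actual values:

```python
REQUIRED_FIELDS = ("title", "date", "excerpt", "tags")  # assumed field set
MIN_BODY_CHARS = 500  # assumed minimum for a non-blank article

def validate_article(mdx: str) -> list[str]:
    """Return validation failures; an empty list means the article passes."""
    if not mdx.startswith("---"):
        return ["missing frontmatter block"]
    parts = mdx.split("---", 2)
    if len(parts) < 3:
        return ["unterminated frontmatter block"]
    _, frontmatter, body = parts
    problems = [f"missing frontmatter field: {field}"
                for field in REQUIRED_FIELDS if f"{field}:" not in frontmatter]
    if len(body.strip()) < MIN_BODY_CHARS:
        problems.append(f"body under {MIN_BODY_CHARS} characters")
    return problems
```

If the returned list is non-empty, the metadata agent writes the failure details instead of metadata.done, and the publisher never spawns.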

3. Image generation fails. Gemini rate-limited, Leonardo credits exhausted, API timeout. The image agent implements the fallback chain: Gemini → Leonardo → OpenAI DALL-E. If all three fail, it writes a default placeholder image path to state.json and proceeds. The publisher opens a PR with a placeholder image — better than no PR.
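The chain itself is provider-agnostic: an ordered list of callables, tried in sequence. A sketch in which the provider callables and the placeholder path are stand-ins for the real integrations:

```python
PLACEHOLDER = "public/images/blog-autopilot/placeholder.png"  # assumed path

def generate_hero_image(prompt: str, providers) -> str:
    """Try each (name, generate) pair in order; degrade to a placeholder."""
    for name, generate in providers:
        try:
            return generate(prompt)  # returns the written image path
        except Exception:
            continue  # rate limit, exhausted credits, timeout: next provider
    # every provider failed: ship a placeholder so the PR still opens
    return PLACEHOLDER
```

In production the list would hold the Gemini, Leonardo, and DALL-E clients in that order, so the degradation path is data, not control flow.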

4. Stale lock file. Agent crashes mid-write, lock file remains. The TTL check runs every 30 seconds; any lock older than 30 seconds is automatically cleared with a warning event logged. This has triggered four times in production, always from the same cause: the orchestrator running out of memory mid-write during periods of high system load on the Mac Mini.

5. Duplicate run triggered. The cron fires correctly, but a previous run is still in progress. The orchestrator checks for a run directory with state.json containing "status": "in_progress". If found, the new invocation exits immediately and logs a warning. No duplicate run occurs.

Fleet Health Monitoring

A watchdog service monitors this fleet continuously:

Log staleness. If events.jsonl has not had a new entry in more than 30 minutes during a scheduled pipeline window, the watchdog fires a Discord alert: "Blog autopilot: no activity detected." This catches hung agents before the next scheduled run.

Pipeline duration. If a run has been in progress for more than 90 minutes (the run should complete in under 30), the watchdog fires an alert and kills the hung process.

Daily completion check. Every morning at 10 AM, the watchdog checks whether a PR was opened in the last 24 hours for the expected repository. If not, it sends a summary Discord notification: "Blog autopilot: no PR detected since last check."

This is the minimum viable fleet health monitoring setup. It does not require a separate monitoring service, a metrics database, or a dashboard. It is a 200-line Python script running on a launchd timer.

The principle: production readiness is not hoping the pipeline succeeds. It is building the monitoring layer that detects failure immediately, the rollback path that recovers cleanly, and the failure protocols that contain damage to one stage rather than letting it propagate.

What the Architecture Looks Like Complete

When all seven lessons of this track come together in one system:

  • Lesson 64 (Mental Model): The fleet exists because a single agent cannot hold the full content pipeline in one context window with consistent quality.
  • Lesson 65 (Roles): Four distinct agents with focused system prompts and explicit constraints. No agent does two jobs.
  • Lesson 66 (Decomposition): The dependency graph is mapped. Image and metadata are parallel (no dependency). Research → write → publish is sequential.
  • Lesson 67 (State): File-based state layer per run. Lock protocol on state.json. Append-only event log.
  • Lesson 68 (Communication): Event-driven throughout. File watchers detect trigger files. Zero polling.
  • Lesson 69 (Blast Radius): Isolated run directories. Publisher only acts after metadata validation passes. 2-failure rule on researcher and writer.
  • Lesson 70 (Platforms): No framework. Shell scripts + Python + Claude Code. Minimalist orchestrator.

This is not a theoretical architecture. It ran this morning.

Lesson 71 Drill

Pick a workflow you currently do manually (or want to automate). Apply the full architecture review:

  1. What is the dependency graph? Map it.
  2. Which stages can run in parallel?
  3. What goes in the state layer? What schema?
  4. What communication pattern connects each stage?
  5. What is the blast radius of each agent? How is it bounded?
  6. What are the three most likely failure modes? How does each fail — loudly or silently?
  7. What is the minimum viable health monitoring for this fleet?

Write those answers before you write the first line of code. That document is your architecture. The code implements it.

Bottom Line

A production multi-agent system is not a clever collection of prompts. It is a system: explicit state, defined roles, mapped dependencies, event-driven coordination, bounded blast radius, observable health, documented failure protocols.

The individual components — state files, trigger files, worktrees, lock protocols, event logs — are each simple. The discipline is assembling them correctly, in the right order, with explicit design decisions at every layer.

That is the difference between a demo that works once and a system that runs at 9 AM every other morning, reliably, without you watching it.

Build the system. Not the demo.