Orchestration Platforms: Build vs. Buy
LangGraph, CrewAI, AutoGen — or shell scripts and file state? The build-vs-buy decision for orchestration platforms is not about capability. It is about the coordination requirements your specific workflow actually has. Most workflows do not need what platforms provide. Some do.
There is one decision that destroys more multi-agent projects than any other architectural mistake: adopting a platform before understanding whether its capabilities match what the workflow actually needs.
LangGraph is a capable framework. CrewAI is accessible. AutoGen has compelling research behind it. None of that matters if your content pipeline has four stages, a linear dependency chain, and zero requirements for graph-state checkpointing, multi-consumer fan-out, or cross-run tracing. For that pipeline, you need a Python script, four functions, and three files. Adding LangGraph adds a dependency, an abstraction layer, an upgrade risk, and a debugging surface. It does not add any capability you would actually use.
The Minimalist Orchestrator
The minimalist orchestrator is not a framework. It is a pattern:
- A shell script or Python function as the entry point
- Subprocess calls or background agent spawns for parallel work
- File-based state for coordination (JSON files, trigger files, event log)
- Direct Claude invocations via the API or Claude Code CLI
- A Discord notification at the end
That is OpenClaw's blog-autopilot. That is the signal-drop pipeline. That is most of what runs on this system. Every cron job that fires, executes, and delivers to Discord is a minimalist orchestrator.
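A skeleton of that pattern, using only the standard library. The `echo` commands stand in for real Claude invocations, and the Discord webhook call is omitted; the file name and function names are illustrative, not any specific pipeline's actual code:

```python
import json
import subprocess
from pathlib import Path

STATE = Path("pipeline_state.json")

def run_stage(cmd: list[str]) -> str:
    """Run one pipeline stage as a subprocess and return its stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

def pipeline() -> dict:
    """Sequential stages coordinated through a JSON state file."""
    state = {}
    state["draft"] = run_stage(["echo", "draft text"])  # stand-in for a Claude call
    state["edit"] = run_stage(["echo", state["draft"] + " (edited)"])
    STATE.write_text(json.dumps(state, indent=2))       # file-based state
    # A Discord webhook POST would close the loop here.
    return state

if __name__ == "__main__":
    print(pipeline())
```

Everything is inspectable with `cat` and restartable with one command, which is the whole point.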
The minimalist pattern handles:
- Sequential pipelines (run A, then B, then C)
- Parallel fan-out (run A, B, C simultaneously, wait for all three)
- Fan-in (collect A, B, C results, synthesize)
- Error handling (catch exceptions, log to event log, notify)
- Retry logic (attempt twice, then halt and alert)
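The fan-out, fan-in, and retry pieces above fit in a few lines of standard-library Python; this is a sketch with illustrative names, not a prescribed implementation:

```python
import concurrent.futures
import time

def with_retry(fn, attempts: int = 2, delay: float = 0.1):
    """Attempt fn up to `attempts` times, then re-raise so the caller can halt and alert."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # caller logs to the event log and notifies
            time.sleep(delay)

def fan_out_fan_in(tasks: dict):
    """Run independent tasks in parallel, wait for all, collect results by name."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(with_retry, fn) for name, fn in tasks.items()}
        # .result() blocks until each task finishes (fan-in) and re-raises failures.
        return {name: f.result() for name, f in futures.items()}
```

A synthesis step then consumes the returned dict, and sequential pipelines are just ordinary function calls in order.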
That covers 80% of production multi-agent workflows. The remaining 20% have requirements that justify a platform.
When to Consider a Platform
LangGraph — When your workflow is a stateful graph where nodes can loop, branch conditionally on intermediate results, and need checkpoint-and-resume capability. If a workflow runs for 45 minutes and fails at step 8, can you resume from step 8 without rerunning steps 1-7? LangGraph makes this straightforward. Without it, you build the checkpointing yourself (which you can, but it takes time).
Best for: complex multi-step workflows with conditional branching, long-running processes that need fault tolerance, teams that benefit from LangGraph's observability tooling.
Avoid when: your workflow is linear, your steps are short, or you are on the LangChain ecosystem upgrade treadmill.
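If you stay minimalist and build resume capability yourself, a file-based checkpoint covers linear pipelines; this sketch assumes a hypothetical `checkpoint.json` path and step-name convention:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")

def run_with_checkpoints(steps):
    """Run named steps in order, skipping any already recorded as complete."""
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    for name, fn in steps:
        if name in done:
            continue  # resume: this step finished on a previous run
        done[name] = fn()
        CHECKPOINT.write_text(json.dumps(done))  # persist after every step
    return done
```

This is what a failed 45-minute run restarting at step 8 looks like without a framework: the first seven steps are already in the checkpoint file, so the loop skips them. What LangGraph adds beyond this is conditional branching, loops, and tooling around the same idea.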
CrewAI — When you want to define agents and tasks declaratively in YAML and prototype quickly. Good for demos, PoCs, and use cases where the role taxonomy maps cleanly to CrewAI's Agent/Task/Crew abstractions.
Avoid in production: CrewAI's state management is limited, its error handling requires significant custom work to production-harden, and its abstraction layer makes debugging difficult when something goes wrong at 3 AM.
AutoGen — When you need multi-agent conversation loops where agents negotiate, debate, and iterate toward a solution. Code generation with a coder agent and a critic agent that exchange feedback until the code passes tests — AutoGen's conversational model fits this pattern well.
Avoid when: your workflow is not conversational. Forcing a sequential pipeline into AutoGen's conversation loop adds complexity without benefit.
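The coder/critic shape can be sketched framework-free to see what AutoGen formalizes; `propose` and `review` here are hypothetical stand-ins for agent calls, not AutoGen APIs:

```python
def converge(propose, review, max_rounds: int = 5):
    """Alternate proposal and critique until the critic accepts or rounds run out."""
    feedback = None
    for _ in range(max_rounds):
        draft = propose(feedback)       # e.g. a coder agent, given prior critique
        accepted, feedback = review(draft)  # e.g. a critic agent running tests
        if accepted:
            return draft
    raise RuntimeError("no convergence within max_rounds")
```

If your workflow fits this loop, AutoGen's conversational model adds value; if it does not, you are paying its complexity for nothing.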
The OpenClaw-Style Orchestrator
The most sophisticated orchestration pattern I run does not use any of the above frameworks. It uses Discord as the input channel, a persistent Python daemon as the orchestrator, and Claude Code as the execution engine.
The architecture:
- A user sends a message to a Discord channel ("generate a market analysis for BTC this week")
- OpenClaw receives the message, classifies intent, identifies the appropriate skill
- The skill script receives the intent, decomposes it into subtasks, spawns Claude Code agents
- Agents execute with git worktrees for isolation, write results to the state layer, emit trigger files
- The next stage fires on the trigger, executes, writes its results
- Final output goes to Discord: the user gets a notification with the finished artifact
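The trigger-file mechanism in the steps above can be sketched as a small polling loop. The file name, poll interval, and timeout here are illustrative, not OpenClaw's actual implementation:

```python
import time
from pathlib import Path

def watch_trigger(trigger: Path, handler, poll_secs: float = 0.05, timeout: float = 5.0):
    """Fire `handler` once when the trigger file appears, then consume the file."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if trigger.exists():
            payload = trigger.read_text()
            trigger.unlink()  # consume the trigger so the stage fires exactly once
            return handler(payload)
        time.sleep(poll_secs)
    raise TimeoutError(f"no trigger appeared at {trigger}")
```

Each pipeline stage runs one of these loops (or an OS-level file watcher); the previous stage's last act is writing the trigger file, which is what makes the stages decoupled and individually restartable.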
No LangGraph nodes. No CrewAI crews. A Python daemon, a bash script per skill, and the Claude Code CLI.
The Decision Framework
Before adopting any platform, answer these questions:
Does your workflow need graph-state checkpointing? If a 45-minute run fails at minute 40, do you need to resume from the failure point rather than start over? If yes, consider LangGraph or build checkpointing into your state layer.
Does your workflow need multi-consumer fan-out with guaranteed delivery? If yes, you need a message queue (Redis, RabbitMQ, a managed service). No framework adds this — it is infrastructure you provision.
Does your workflow need team-level observability? If multiple engineers need to inspect runs, trace failures, and understand agent behavior through a shared UI, LangGraph's LangSmith integration adds real value.
Is your workflow conversational or sequential? Conversational (agents negotiate) → AutoGen. Sequential (output of A feeds B) → shell scripts.
If you answered "no" to all four, you are in the 80%. Use the minimalist orchestrator.
The minimalist orchestrator appears weak — it is just files and shell scripts. But it is strong where it matters: debuggable, portable, zero external dependencies, restartable on any machine with Python and Claude Code installed. Platform complexity is power only when the problem requires it.
The Upgrade Path
A well-designed minimalist orchestrator can be upgraded to a platform incrementally if requirements grow. The state layer is already files — wrap it in an event store. The trigger files are already events — route them to a queue. The agent spawns are already decoupled — add graph structure around them.
This is the opposite of the failure mode: adopting a platform first because it looks comprehensive, then fighting its abstractions for every simple use case.
Start simple. Design for extensibility. Upgrade when requirements demand it.
Lesson 70 Drill
For your current or next multi-agent project:
- Answer the four decision questions above for your specific workflow
- Map which capabilities you actually need versus which ones a platform provides
- Choose the simplest option that covers your actual requirements
- Write down the specific requirements that would cause you to upgrade to a more capable platform in the future
That last item is your exit criterion. When those requirements materialize, upgrade. Until then, stay minimal.
Bottom Line
The orchestration platform market is full of compelling options. Most production workflows do not need any of them.
Shell scripts, file state, and Claude together handle the majority of multi-agent coordination requirements with zero framework overhead, full debuggability, and no upgrade risk.
The platforms earn their cost when you hit specific requirements: graph-state checkpointing, guaranteed delivery, conversational agent loops, team observability. Before you hit those requirements, they are complexity you are carrying for no return.
Survey what you actually need. Build the simplest thing that provides it. Upgrade when requirements exceed it. That sequence produces production systems that are reliable and maintainable.