ASK KNOX
beta
LESSON 68

Inter-Agent Communication Patterns: Polling, Events, and Queues

How do agents in a fleet know when to act? Polling wastes cycles. File watchers and trigger files are free. Queue-based messaging scales. The communication pattern you choose determines your fleet's efficiency, latency, and operational complexity.

9 min read·Multi-Agent Orchestration

The content pipeline runs at 9 AM. The researcher finishes at 9:04. How does the writer know the researcher is done?

This question — how does one agent signal readiness to the next? — is the inter-agent communication problem. The answer you choose determines your fleet's efficiency, latency, debuggability, and operational complexity. Three patterns dominate: polling, event-driven triggers, and queue-based messaging. Each has a domain where it belongs.

Inter-Agent Communication Patterns — Polling vs Event-Driven

Pattern 1: Polling

The simplest pattern. Agent A runs. Agent B wants to know when A is done. B checks a status indicator at regular intervals until it sees "done."

import json
import time

def wait_for_research(state_file, max_wait=300, interval=5):
    """Poll the state file until research_status is 'complete' or we time out."""
    waited = 0
    while waited < max_wait:
        with open(state_file) as f:
            state = json.load(f)
        if state.get("research_status") == "complete":
            return True
        time.sleep(interval)
        waited += interval
    raise TimeoutError("Research did not complete in time")

Polling works. It is dead-simple to implement. Every engineer understands it instantly.

The cost: wasted compute. If Agent A takes 8 minutes and Agent B polls every 5 seconds, B makes roughly 96 poll attempts before it gets a "done" signal. 95 of those were wasted work. In a fleet with six polling agents, you have 570 wasted poll attempts per run.

More importantly: if B is blocking on a poll, it is not doing anything else. It is consuming a process slot and compute time to check a file repeatedly. In a 24/7 system with dozens of pipeline runs per day, this accumulates.

When to use polling:

  • Wait time is under 30 seconds (polling overhead is negligible)
  • You need a quick prototype and will refactor later
  • You expect fewer than 10 polling iterations before completion

Pattern 2: Event-Driven with File Watchers and Trigger Files

The production-grade pattern for most multi-agent workflows. Agent A completes its task and writes a trigger file (research.done). A file watcher — running as a lightweight daemon — detects the new file and fires the downstream agent.

No polling. No wasted cycles. The consuming agent sleeps until the trigger file appears, then wakes immediately.

The full pattern:

Agent A (producer):

# Do the work
result = run_research()
write_results_file(result)
# Signal completion
open("research.done", "w").close()

File watcher (coordinator):

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class PipelineHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.src_path.endswith("research.done"):
            spawn_writer_agent()

observer = Observer()
observer.schedule(PipelineHandler(), path="./pipeline/", recursive=False)
observer.start()
try:
    observer.join()  # keep the daemon alive; the observer thread handles events
except KeyboardInterrupt:
    observer.stop()
    observer.join()

Agent B (consumer): Spawned on-demand when the trigger fires. Reads the result file. Executes. Writes its own trigger when done.
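The watcher's `spawn_writer_agent()` can be as simple as launching the consumer as a detached process. A minimal sketch, assuming a hypothetical `writer_agent.py` entry point and `--input` flag (neither is defined in the pipeline above):

```python
import subprocess
import sys

def spawn_writer_agent(result_file="research_results.json"):
    """Fire the writer as a separate process and return immediately.

    'writer_agent.py' and its CLI are illustrative assumptions; substitute
    your own agent entry point. The watcher does not block on the child.
    """
    return subprocess.Popen(
        [sys.executable, "writer_agent.py", "--input", result_file],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Because `Popen` returns immediately, the watcher thread is free to catch the next trigger while the writer runs.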

The blog-autopilot pipeline uses this pattern end-to-end. Each stage writes a trigger file. The next stage spawns on the trigger. No stage polls. No stage blocks. The entire pipeline is event-driven from research to Discord notification.

When to use event-driven:

  • Wait time is variable or long (>30 seconds)
  • Tasks are long-running (content generation, video synthesis, large file processing)
  • You have a production pipeline running on schedule
  • Polling would create unacceptable compute waste

Pattern 3: Queue-Based Messaging

For fleets with more complex coordination requirements — multiple consumers, fan-out, prioritization, failure handling, guaranteed delivery — a message queue adds significant capability at the cost of operational complexity.

A queue-based setup:

  • Agent A pushes a message to a queue when it completes: {"task": "write_article", "research_id": "abc123"}
  • One or more consumer agents subscribe to the queue
  • Each message is delivered to exactly one consumer (work queue semantics) or all consumers (pub/sub semantics)
  • Failed messages go to a dead-letter queue for inspection
  • Priority messages can jump ahead of the existing backlog

In a trading signal fleet where six market researchers are running in parallel and a single signal synthesizer needs to process each result as it arrives — in priority order, with failure handling — a queue is the right tool. The synthesizer does not poll. It does not use file watchers. It subscribes to the queue and processes messages as they arrive.
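The semantics above can be sketched in-process with Python's standard-library `queue.PriorityQueue`; a real fleet would use a broker (Redis, RabbitMQ, SQS), but the shape is the same. The names `publish`, `consume_all`, and `dead_letter` are illustrative, not part of any fleet described here:

```python
import queue

signal_queue = queue.PriorityQueue()  # lower number = higher priority
dead_letter = []                      # failed messages land here for manual review

def publish(priority, message):
    signal_queue.put((priority, message))

def consume_all(handler):
    """Work-queue semantics: each message is delivered to exactly one handler call."""
    while not signal_queue.empty():
        priority, message = signal_queue.get()
        try:
            handler(message)
        except Exception as exc:
            dead_letter.append({"message": message, "error": str(exc)})

# Two researchers push results; the synthesizer drains them in priority order.
publish(2, {"task": "write_article", "research_id": "abc123"})
publish(1, {"task": "write_article", "research_id": "urgent01"})
processed = []
consume_all(processed.append)  # urgent01 is handled before abc123
```

The priority ordering and the dead-letter list are exactly the two capabilities that file watchers cannot give you without reinventing a queue.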

When to use queues:

  • You need fan-out to multiple consumers
  • You need work prioritization across tasks
  • You need dead-letter handling for failed tasks
  • You need guaranteed delivery with acknowledgment semantics
  • Your fleet has more than 10 agents and file watcher management becomes unwieldy

The Decision Framework

Is wait time < 30 seconds AND poll count < 10?
  → Polling (simple, no infrastructure needed)

Is wait time variable or long, with one producer and one consumer?
  → File watcher + trigger file (no infrastructure needed)

Do you need fan-out, prioritization, or guaranteed delivery?
  → Message queue (accept the operational complexity)
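The framework above collapses into a small helper. This is a sketch with illustrative names and the thresholds taken from the framework, not a library API:

```python
def choose_pattern(wait_seconds, expected_polls=0,
                   needs_fanout=False, needs_priority=False,
                   needs_guaranteed_delivery=False):
    """Map the decision framework onto code (thresholds from the text above)."""
    # Queue requirements trump everything else.
    if needs_fanout or needs_priority or needs_guaranteed_delivery:
        return "message queue"
    # Short, bounded waits: polling overhead is negligible.
    if wait_seconds < 30 and expected_polls < 10:
        return "polling"
    # Default for one producer, one consumer, variable wait.
    return "file watcher + trigger file"
```

Run it mentally against your own pipeline stages before writing any agent code.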

A content pipeline driven by OpenClaw: file watchers and trigger files throughout. Six pipeline runs per day, each with four stages, zero polling. The file watcher daemon uses 0.1% CPU and has never caused an incident.

The Foresight trading signal fleet running on a dedicated trading server: queue-based for signal aggregation. Six market researchers push results to a queue. The synthesizer processes them in priority order. Dead-letter queue captures failed signals for manual review. The queue is justified — the fleet needs exactly what queues provide.

Async Handoffs and "Fire and Forget"

One pattern worth naming explicitly: the orchestrator fires an agent and does not wait for the result. The agent runs, writes to the state layer, and the orchestrator reads the state layer at a later checkpoint.

This is the async handoff. The orchestrator does other work between dispatching agents and collecting their results. It is not blocking on any single agent. It is managing a fleet of concurrent work.

# Orchestrator: fire and move on
spawn_agent("researcher", spec=research_spec, run_id=run_id)
spawn_agent("analyst", spec=analyst_spec, run_id=run_id)  # parallel

# Do other orchestration work...

# Later checkpoint: collect results
wait_for_trigger(f"research.done.{run_id}", timeout=300)
wait_for_trigger(f"analyst.done.{run_id}", timeout=300)
synthesize(results=[read_state(run_id, "research"), read_state(run_id, "analyst")])

This is the pattern that converts serial execution time into parallel execution time in practice.
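The orchestrator snippet calls a `wait_for_trigger` helper that the text leaves undefined. A minimal standard-library sketch follows; it checks for the trigger file's appearance with a short existence loop, which is fine at a single orchestrator checkpoint (in a fully event-driven setup you would hook this into the watchdog observer instead). The signature is an assumption inferred from the call sites above:

```python
import os
import time

def wait_for_trigger(trigger_path, timeout=300, interval=0.1):
    """Block until trigger_path exists, or raise TimeoutError.

    Intended for the orchestrator's collect-results checkpoint, where
    blocking briefly is acceptable.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(trigger_path):
            return True
        time.sleep(interval)
    raise TimeoutError(f"No trigger at {trigger_path} within {timeout}s")
```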

Lesson 68 Drill

For your next multi-agent workflow, before writing a single agent:

  1. Map every producer-consumer pair: which agent produces output that another agent needs?
  2. For each pair: what is the expected wait time between production and consumption?
  3. Apply the decision framework: polling, file watcher, or queue?
  4. Design the signal mechanism: what file is written, what event fires, what message is queued?

Document this as your communication architecture. It should take 30 minutes. It will save you hours of debugging inconsistent behavior from agents that are either polling inefficiently or not being triggered reliably.

Bottom Line

The communication pattern is not a detail. It is the nervous system of the fleet.

Polling is simple and wasteful. File watchers are free and reliable. Queues are powerful and expensive to operate. Match the pattern to the requirement. The simplest pattern that meets the latency and reliability bar is the right pattern.

Design the communication layer before you build the agents. It is much harder to retrofit than to design upfront.