Ask Knox

The fleet has a physical split: persistent services run on the production server, and trading bots run on Tesseract (the dedicated trading machine at 192.0.2.10). These are two separate machines on the same local network.

This creates a transport problem. When Foresight on Tesseract publishes a trade.executed message, how does it reach the Agent Broker on the production server? And when the broker wants to deliver a directive to Foresight, how does it cross the network?

The answer is NATS, and the binding configuration is one of five architectural decisions that are non-negotiable in the Agent Broker codebase.

Why NATS

The transport requirements for the Agent Broker were:

Pub/sub fan-out — InDecision Engine publishes one signal; Foresight, Sports Prediction Agent (sports-bot), Signal Engine (signal-engine), and Perp Bot (perp-bot) all receive it simultaneously
Topic-based routing — Each agent has its own inbox (agents.foresight.inbox) and outbox (agents.foresight.outbox) without configuration per-pair
Low latency — The broker sits on the hot path of trade execution; transport overhead needs to be sub-millisecond locally
Simple operations — One binary, no cluster required for single-node, no separate ZooKeeper or consensus layer
Cross-machine — The same NATS instance must be reachable from both the production server and Tesseract

NATS satisfies all five. Kafka is operationally heavier and designed for persistent log-based workflows. RabbitMQ requires broker setup and queue management. Redis pub/sub does fan-out, but it is fire-and-forget — no message persistence, no delivery guarantees, and no hierarchical subject wildcards — and it drags in a Redis server dependency for a messaging-only job. NATS is a single binary that runs as a launchd daemon, starts in under a second, and requires zero configuration for the single-node case.

The Binding Rule

The most operationally critical NATS decision is documented in ARCHITECTURE-DECISIONS.md:

The NATS server must bind to 0.0.0.0:4222, not 127.0.0.1:4222. Tesseract agents communicate with the broker's NATS over the local network. If NATS is only listening locally, Foresight cannot reach the broker.

This is the kind of detail that causes a production outage the first time someone regenerates the NATS config. If NATS binds only to localhost, everything works perfectly in local testing. The moment Foresight on Tesseract tries to connect, it gets a connection refused. The bots lose their communication channel with no error message that makes the cause obvious.

The binding rule is in the CLAUDE.md for the project, in the architectural decisions document, and will be in the operator runbook. It is documented three times because it will be forgotten and it will cause an outage if it is.

Topic Conventions

Every agent in the fleet follows the same NATS topic structure:

agents.{agent_id}.outbox   — messages FROM the agent TO the broker
agents.{agent_id}.inbox    — messages FROM the broker TO the agent
agents.{agent_id}.heartbeat — periodic health pulses from the agent

The broker subscribes to two wildcard patterns on startup:

await nc.subscribe("agents.*.outbox", cb=_make_message_handler(app))
await nc.subscribe("agents.*.heartbeat", cb=_make_heartbeat_handler(app))

The *. wildcard matches any agent ID. A new agent that publishes to agents.newbot.outbox is handled automatically without any broker configuration change.

Dispatching uses the inbox pattern:

topic = f"agents.{agent_id}.inbox"
await self.nats.publish(topic, payload)

Each agent subscribes to its own inbox. An agent that is offline simply does not have a subscriber on its inbox topic. The broker publishes to the topic regardless — the message is not buffered by NATS by default. This is the fire-and-forget model: the broker's job is to publish; the agent's job is to be online to receive.

For agents that need offline resilience, the SDK handles local buffering (covered in the “The SDK Pattern” lesson).

Pub/Sub vs. Request/Reply

NATS supports two communication patterns: pub/sub and request/reply. The Agent Broker uses pub/sub for all agent-to-broker and broker-to-agent communication.

Pub/sub means the publisher sends a message to a topic and any number of subscribers receive it. The publisher does not wait for a response. This is the right model for event-driven agent communication — Foresight does not need to know that VP Trading received its trade.executed message; it just needs to know the message was published.

Request/reply means the sender publishes to a topic and waits for a response on a reply topic. NATS supports this natively with inbox subscriptions. The Agent Broker exposes HTTP REST endpoints (/v1/messages, /v1/directives) for synchronous request patterns where a caller needs an immediate response.

The choice of pub/sub as the primary pattern means agents are decoupled from each other by default. Foresight does not need to know that InDecision Engine is online to publish a signal query. The broker handles delivery (or detects unavailability through heartbeat monitoring).

Fan-Out in Practice

Rule 4 illustrates NATS fan-out in practice. When InDecision Engine publishes a signal.published message:

trading_agents = self._get_trading_agents()
return RouteDecision(
    primary_recipients=trading_agents,
    fan_out=[],
    store_in_memory=False,
    priority_override="high",
)

The dispatcher iterates over ["foresight", "sports-bot", "signal-engine", "perp-bot"] and publishes to each inbox in sequence:

for agent_id in all_recipients:
    topic = f"agents.{agent_id}.inbox"
    await self.nats.publish(topic, payload)

These four publishes happen within a single async iteration — no waiting for acknowledgment between them. NATS publishes are non-blocking in the async model. The four messages hit the NATS server in rapid succession, and all four agents receive them effectively simultaneously.

This is the fan-out advantage over polling. In a polling model, each bot checks for signals on its own schedule. In the NATS pub/sub model, all bots receive the signal within milliseconds of publication.

Heartbeat Monitoring

The heartbeat subscription provides passive health monitoring without requiring active probing:

async def handle_heartbeat(msg):
    agent_id = msg.subject.split(".")[1]
    updated = app.state.registry.update_last_seen(agent_id)
    if not updated:
        logger.debug(f"Heartbeat from unknown agent: {agent_id}")

Each agent using the SDK publishes to agents.{agent_id}.heartbeat every 30 seconds:

async def _heartbeat_loop(self):
    while True:
        await asyncio.sleep(HEARTBEAT_INTERVAL)
        if self._nc and self._nc.is_connected:
            topic = f"agents.{self.agent_id}.heartbeat"
            payload = json.dumps({
                "agent_id": self.agent_id,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "session_id": self._session_id,
            }).encode()
            await self._nc.publish(topic, payload)

The broker's background heartbeat monitor checks every 30 seconds for agents whose last_seen timestamp has exceeded their configured timeout. When a timeout is detected, the registry updates the agent's status and the broker can fire an escalation.

For InDecision Engine — the highest-priority shared service — a heartbeat timeout triggers an immediate escalation to VP Engineering with fan-out to Agent Gateway. The 30-second heartbeat interval means the broker detects a failure within 60 seconds of it occurring (one missed beat triggers the timeout check on the next monitor cycle).

NATS Connection Configuration

The broker connects to NATS during the startup lifespan with a minimal config:

nc = await nats_lib.connect(
    config.nats_url,
    connect_timeout=2,
    max_reconnect_attempts=0,
)

max_reconnect_attempts=0 means the broker does not automatically reconnect if NATS goes offline. This is intentional: if NATS goes offline, the broker should fail loudly and let Watchdog Service restart it in a clean state rather than attempting reconnection with potentially stale subscription state.

The SDK, by contrast, uses aggressive reconnection:

self._nc = await nats.connect(
    self.nats_url,
    connect_timeout=5,
    max_reconnect_attempts=10,
    reconnect_time_wait=2,
)

Agents should reconnect automatically — they have local buffers to drain on reconnection. The broker should fail and restart — it has the audit log and restart logic to ensure clean recovery.

Operational Notes

The NATS monitoring endpoint at http://localhost:8222/varz provides connection counts, message rates, and subject subscription lists. This is the first place to check when debugging why an agent's messages are not reaching the broker — if the topic does not appear in the subscription list, the broker's wildcard subscription has not registered for it, which indicates the broker is not connected or the topic naming convention is wrong.

When NATS Is Unavailable

The broker handles NATS unavailability gracefully. During startup, if NATS connection fails, the broker logs a warning and enters API-only mode:

try:
    nc = await nats_lib.connect(config.nats_url, connect_timeout=2, ...)
    app.state.nats = nc
except Exception as e:
    logger.warning(f"NATS unavailable — running API-only mode: {e}")
    app.state.nats = None

In API-only mode, the HTTP REST endpoints (/v1/messages, /v1/agents, /v1/audit) remain fully functional. Agents can submit messages via HTTP POST. The broker routes, audits, and stores them normally. The only capability lost is NATS-based delivery to agent inboxes.

This degraded-but-functional behavior is important for the development workflow — the broker can be run and tested without a NATS server for most operations. It is also the behavior during NATS maintenance windows, where critical REST-based operations need to continue.

NATS as Internal Transport