The Complete Platform
End-to-end walkthrough of a production agent operations platform: how expertise, team architecture, org wiring, authority delegation, and behavioral monitoring connect into a running system — and what to build next.
The previous six lessons built the components. This lesson assembles them — an end-to-end walkthrough of a production agent operations platform processing a real task, from the moment a directive enters the system to the moment the result is delivered.
Then it answers the question every builder asks after assembling the first working system: what do I build next?
The Platform Architecture
Every component has been covered. Here is how they connect:
External World
↓
Bridge Layer (Discord, cron, webhooks, HTTP)
↓
Principal Broker
├─ Agent Card Registry
├─ Routing Rules Engine (9 deterministic rules)
├─ Audit Log
└─ Offline Queues
↓
CEO Triage Engine
├─ Structured Report Parser
├─ Triage Rules Engine (12 rules)
└─ Authority Checker
↓
Team Skills / Specialist Dispatch
├─ Team Skill Definitions (YAML)
├─ Phase Execution (parallel + sequential)
└─ Territory Enforcement
↓
Specialist Agents
├─ Boot Protocol (seed load + memory hydration)
├─ Model Routing (task-type based)
├─ Domain Execution
└─ Shutdown Protocol (memory flush)
↑
Memory Layer (Akashic Records)
├─ Per-agent namespaces
└─ Shared org knowledge
↑
Health Monitor (separate process)
├─ Stale Execution Detector
├─ Doom Spiral Detector
├─ Hallucination Validator
└─ Circuit Breakers
End-to-End Walkthrough: Feature Request
Let's trace a feature request from Discord message to an open PR with passing tests.
T=0: Discord message arrives
Knox types in the #agent-tasks Discord channel:
!agent task: Add rate limiting to the /api/signals endpoint.
Max 100 requests per minute per user.
T=0.1: Bridge script translates
# discord_bridge.py processes the message
directive = Directive(
    id="dir-a7f3c9",
    source="discord",
    sender_id="human-operator",
    type="task",
    domain="coding",
    description="Add rate limiting to /api/signals endpoint. Max 100 req/min/user.",
    priority="normal",
    created_at=datetime.utcnow(),
)
await broker.route(directive)
T=0.2: Broker routes
Routing rule R2 (domain match) fires: "coding" domain → coding-agent-01 is ready → route.
The directive transitions from pending to acknowledged as the broker delivers to the coding agent.
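Rule R2 is one of the broker's nine deterministic rules. A minimal sketch of how such a rule can be modeled as a predicate plus a target lookup; the `RoutingRule` shape and the domain-to-agent mapping are illustrative assumptions, not the broker's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoutingRule:
    """One deterministic routing rule: a match predicate plus a target lookup."""
    name: str
    matches: Callable[[dict], bool]
    target: Callable[[dict], Optional[str]]


def route(directive: dict, rules: list[RoutingRule], ready: set[str]) -> Optional[str]:
    """Walk rules in order; the first rule whose target agent is ready wins."""
    for rule in rules:
        if rule.matches(directive):
            agent = rule.target(directive)
            if agent in ready:
                return agent
    return None  # no ready match: falls through to the offline queue


# Rule R2: route by declared domain (mapping here is an assumption)
r2 = RoutingRule(
    name="R2-domain-match",
    matches=lambda d: "domain" in d,
    target=lambda d: {"coding": "coding-agent-01"}.get(d["domain"]),
)

agent = route({"domain": "coding"}, [r2], ready={"coding-agent-01"})
```

Because the rules are plain predicates evaluated in a fixed order, routing decisions are reproducible from the audit log alone.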
T=0.3: CEO triage processes
The directive passes through triage: type "new task", confidence N/A (it is not a report), auto_resolvable=True for standard feature tasks, blast radius "single-repo". Rule R5 (standard auto-resolvable task, single-repo blast radius) fires, and the triage engine dispatches to the feature-team skill.
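A sketch of the triage decision at this point, reduced to the fields the lesson names (type, auto_resolvable, blast radius). The outcomes below are illustrative, not the engine's real 12-rule table:

```python
from dataclasses import dataclass


@dataclass
class TriageInput:
    kind: str               # "task" | "report"
    auto_resolvable: bool
    blast_radius: str       # "single-repo" | "multi-repo" | "org-wide"


def triage(item: TriageInput) -> str:
    """Illustrative subset of the triage rules: dispatch vs escalate."""
    if item.kind == "task" and item.auto_resolvable and item.blast_radius == "single-repo":
        return "dispatch:feature-team"   # standard feature work
    if item.blast_radius == "org-wide":
        return "escalate:human"          # too wide to auto-resolve
    return "escalate:internal"


decision = triage(TriageInput(kind="task", auto_resolvable=True, blast_radius="single-repo"))
```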
T=0.5: Team skill initialized
The feature-team skill activates two specialists for this task: Backend Developer and QA Engineer. The frontend dev is not needed for an API-only change.
skill = TeamSkill.load("feature-team")
session = await skill.execute(
    directive=directive,
    participants=["backend-dev-01", "qa-engineer-01"],
)
T=1: Backend Dev boots
# Backend agent boot protocol
seed = load_seed_file("agents/backend-dev/seed.md")
context = await mind_query(
    query="rate limiting patterns, signals endpoint, recent API changes",
    namespace="backend-dev-01",
    limit=10,
)
# Memory returns:
# - Prior rate limiting implementation on /api/trades (2026-02-15)
# - Known pattern: use Redis sliding window with lua script
# - Signals endpoint structure from last PR review
The agent starts with the prior rate limiting implementation already in context. No re-explanation needed.
T=3: Backend Dev implements
Model routing: the task is standard implementation → Sonnet.
# backend/middleware/rate_limit.py
import time
import uuid

import redis


class SlidingWindowRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.lua_script = self.redis.register_script("""
            local key = KEYS[1]
            local now = tonumber(ARGV[1])
            local window = tonumber(ARGV[2])
            local limit = tonumber(ARGV[3])
            -- Remove entries outside window
            redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
            -- Count current entries
            local count = redis.call('ZCARD', key)
            if count < limit then
                -- Unique member per request, so two requests in the
                -- same millisecond are not collapsed into one entry
                redis.call('ZADD', key, now, now .. '-' .. ARGV[4])
                -- Window is in milliseconds, so PEXPIRE, not EXPIRE
                redis.call('PEXPIRE', key, window)
                return 1 -- allowed
            end
            return 0 -- denied
        """)

    def check(
        self,
        user_id: str,
        endpoint: str,
        limit: int = 100,
        window_seconds: int = 60,
    ) -> bool:
        key = f"rate_limit:{endpoint}:{user_id}"
        now = int(time.time() * 1000)  # milliseconds
        result = self.lua_script(
            keys=[key],
            args=[now, window_seconds * 1000, limit, uuid.uuid4().hex],
        )
        return bool(result)
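The sliding-window semantics are easy to sanity-check in plain Python with an in-memory equivalent. This is a testing aid only; production stays on Redis so the window survives restarts and is shared across processes:

```python
from collections import defaultdict


class InMemorySlidingWindow:
    """Same sliding-window semantics as the Redis Lua script,
    using a per-key list of timestamps instead of a sorted set."""

    def __init__(self):
        self.entries = defaultdict(list)

    def check(self, key: str, now_ms: int, limit: int, window_ms: int) -> bool:
        # Drop entries outside the window (ZREMRANGEBYSCORE equivalent)
        self.entries[key] = [t for t in self.entries[key] if t > now_ms - window_ms]
        if len(self.entries[key]) < limit:
            self.entries[key].append(now_ms)
            return True
        return False


rl = InMemorySlidingWindow()
# Five requests within one 1-second window, limit of 3:
allowed = [rl.check("u1", t, limit=3, window_ms=1000) for t in range(5)]
# -> [True, True, True, False, False]
```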
T=8: Backend Dev files completion report
report = AgentReport(
    agent_id="backend-dev-01",
    report_type="completion",
    headline="Rate limiting implemented on /api/signals",
    status="ok",
    confidence=0.95,
    findings=[
        Finding(severity="info", description="Used sliding window Redis pattern from prior implementation"),
        Finding(severity="info", description="Added 3 unit tests: allow, deny, window reset"),
    ],
    recommendation="QA to run integration tests",
    auto_resolvable=True,
    blast_radius="single-repo",
)
T=8.5: QA Engineer activates
The team skill's phase gate passes (backend complete), and QA activates while the backend agent runs its shutdown protocol.
T=12: QA Engineer completes
Integration tests pass. QA files a completion report. Confidence: 0.94.
T=13: PR created
The backend agent's shutdown protocol runs:
# Store the implementation pattern to memory
await mind_remember(
    content="Rate limiting on /api/signals: SlidingWindowRateLimiter "
            "with Redis ZADD/ZREMRANGEBYSCORE lua script. "
            "100 req/min per user. Key format: rate_limit:{endpoint}:{user_id}. "
            "Tests in tests/backend/test_rate_limit.py. PR: #247.",
    category="coding",
    tags=["rate-limiting", "redis", "signals-endpoint", "implementation"],
    type="episodic",
)
PR #247 is created. CI runs. Tests pass. The broker transitions the directive to completed.
T=13.5: Discord notification
The bridge script reports back to the #agent-tasks channel:
Directive dir-a7f3c9: COMPLETED
PR #247 opened: Add rate limiting to /api/signals
Tests: 3 unit + 2 integration — all pass
Duration: 13 minutes
The total wall-clock time from Discord message to open PR with passing tests: 13 minutes. No human wrote a line of code.
Startup Sequence
Getting the platform running the first time requires a specific startup order:
#!/bin/bash
# start-platform.sh
# 1. Memory system first — agents need it on boot
docker compose up -d akashic-records
sleep 5 # Wait for Akashic to be ready
# 2. Broker — agents register with it on boot
python -m broker.main &
sleep 3
# 3. Agents — each runs boot protocol on start
python -m agents.backend_dev &
python -m agents.qa_engineer &
python -m agents.trading_agent &
python -m agents.content_agent &
sleep 5
# 4. Health monitor — needs agents to be running
python -m health.monitor &
# 5. Bridges — start accepting external input last
python -m bridges.discord_bridge &
python -m bridges.cron_bridge &
echo "Platform running"
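The fixed `sleep`s are the fragile part of this script. A readiness poll is more robust. This sketch assumes each component exposes some cheap liveness check (a `/health` endpoint, a ping); the platform does not prescribe one:

```python
import time
from typing import Callable


def wait_for(check: Callable[[], bool], timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll check() until it returns True or the timeout expires.
    Replaces the fixed sleeps in start-platform.sh: the next stage
    starts as soon as the previous one is actually ready."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False


# e.g. wait_for(akashic_is_ready) before launching the broker,
# where akashic_is_ready is whatever liveness probe you expose
```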
The Daily Operating Pattern
Once running, the platform operates with minimal human input. The human's daily interaction with the platform:
Morning: Read the digest
Daily Digest — 2026-03-30
Directives: 47 total
├─ Auto-resolved: 43
├─ Escalated (internal): 3
└─ Escalated (human): 1 ← review needed
Cost: $2.14 (budget: $5.00/day)
Health: All agents green
Open PRs: 3 (2 in CI, 1 awaiting review)
1 Human Escalation:
trading-agent: "New market pattern not in seed knowledge"
→ Needs: updated strategy params or explicit guidance
As needed: Review the one escalation, provide guidance.
Weekly: Review the Akashic memory store for each agent — look for patterns in what's being stored, update seed files as operational knowledge matures.
Monthly: Update authority tiers based on demonstrated agent reliability, refine triage rules based on false escalations, retire stale memory entries.
What to Build Next
After the core platform is running and processing real work, four extensions matter most.
1. Mission Control Dashboard
Before expanding the fleet, build visibility. A dashboard showing:
- All agents: status, current directive, circuit breaker state
- Directive queue: pending, in-progress, completed, failed
- Cost per agent per day with budget burn rate
- Health monitor alerts, active and resolved
- Memory growth per agent namespace
Without visibility, you manage the fleet blind. Build this before adding agents.
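As a starting point, the dashboard can poll a single snapshot structure. The field names below are assumptions about what the broker and health monitor could expose, not an existing API:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentStatus:
    agent_id: str
    state: str                        # "idle" | "executing" | "tripped"
    current_directive: Optional[str]  # directive id, if any
    cost_today_usd: float


@dataclass
class PlatformSnapshot:
    """One poll of the fleet: agent states plus aggregate burn rate."""
    agents: list[AgentStatus] = field(default_factory=list)

    @property
    def burn_rate(self) -> float:
        return sum(a.cost_today_usd for a in self.agents)


snap = PlatformSnapshot(agents=[
    AgentStatus("backend-dev-01", "executing", "dir-a7f3c9", 0.42),
    AgentStatus("qa-engineer-01", "idle", None, 0.11),
])
```

Serving this as a single JSON endpoint keeps the dashboard decoupled from broker internals.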
2. Automated Knowledge Curation
The Akashic memory store grows over time. Some entries are still relevant; others are outdated. Build an automated curation process:
async def curate_memory_namespace(agent_id: str, namespace: str) -> CurationResult:
    """
    Weekly: review memory entries, promote stable ones to seed files,
    archive stale ones, surface knowledge gaps.
    """
    entries = await mind_query(namespace=namespace, limit=100)
    for entry in entries:
        age_days = (datetime.utcnow() - entry.created_at).days
        access_count = entry.access_count
        if age_days > 90 and access_count == 0:
            await mind_forget(entry.id)  # stale, unused
        elif age_days > 30 and access_count > 10:
            # Frequently accessed, stable — promote to seed file
            await promote_to_seed(entry, agent_id)
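The `promote_to_seed` helper is doing the interesting work here. A minimal sketch, assuming the seed layout from the boot protocol (`agents/<agent_id>/seed.md`) and appending promoted entries as dated sections:

```python
from pathlib import Path


async def promote_to_seed(entry, agent_id: str, seed_dir: Path = Path("agents")) -> Path:
    """Append a stable memory entry to the agent's seed file so it is
    loaded at boot instead of re-queried every session. The seed path
    layout is an assumption based on the boot protocol."""
    seed_path = seed_dir / agent_id / "seed.md"
    seed_path.parent.mkdir(parents=True, exist_ok=True)
    with seed_path.open("a") as f:
        f.write(f"\n## Promoted {entry.created_at:%Y-%m-%d}\n{entry.content}\n")
    return seed_path
```

A follow-up `mind_forget` on the promoted entry avoids double-loading the same knowledge from both seed and memory.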
3. Cross-Agent Knowledge Sharing
Some discoveries should propagate across agents. When the coding agent discovers a new API pattern, the QA agent should know. Build a knowledge propagation system:
# When an agent stores a memory entry with propagation flag
await mind_remember(
    content="Supabase RLS policy bug: policies are not evaluated for service_role key...",
    category="coding",
    tags=["supabase", "rls", "bug", "org-wide-knowledge"],
    propagate_to=["qa-engineer", "content-agent"],  # who else needs to know
)
4. Incident Replay
When things go wrong, you need to reconstruct what happened. Build an incident replay system from the audit log:
async def replay_incident(
    incident_start: datetime,
    incident_end: datetime,
) -> IncidentTimeline:
    """
    Reconstruct the sequence of events from the audit log.
    """
    events = await audit_log.query(
        from_time=incident_start,
        to_time=incident_end,
        include_directives=True,
        include_agent_state=True,
        include_health_events=True,
    )
    return build_timeline(events)
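`build_timeline` can start as nothing more than a chronological merge of the mixed event types. A sketch, assuming a flat `Event` record rather than the richer `IncidentTimeline` structure the real system would carry:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    at: datetime
    kind: str      # "directive" | "agent_state" | "health"
    detail: str


def build_timeline(events: list[Event]) -> list[str]:
    """Order mixed audit events into one chronological narrative."""
    return [
        f"{e.at:%H:%M:%S} [{e.kind}] {e.detail}"
        for e in sorted(events, key=lambda e: e.at)
    ]


lines = build_timeline([
    Event(datetime(2026, 3, 30, 9, 5), "health", "doom spiral detected"),
    Event(datetime(2026, 3, 30, 9, 1), "directive", "dir-a7f3c9 dispatched"),
])
```

Even this flat form answers the first incident question: what happened, in what order, across which components.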
The Compounding Return
A one-agent system has linear returns: more work requires proportionally more human time.
A platform has compounding returns: the work grows, but the human time required does not. Agents accumulate expertise. Triage rules improve with each false positive fixed. Memory grows richer with each session. The platform gets better at being the platform.
The first month feels like infrastructure investment. By month six, it feels like leverage.
The final measurement that matters: how much work did the platform process this week that would have required your direct attention? Track that number. Watch it grow. That is the return on the infrastructure you built.
Summary
- The platform components connect in a specific order: memory → broker → agents → health monitor → bridges
- The startup sequence matters: agents depend on Akashic being ready; health monitor depends on agents running
- A real task flows in ~13 minutes from Discord message to an open PR with passing tests
- The daily operating pattern centers on a digest — not a firehose
- Build Mission Control before expanding the fleet — visibility enables management
- Automated memory curation, cross-agent knowledge sharing, and incident replay are the highest-value extensions
- The compounding return is the whole point: the platform gets better at being the platform
Track Complete
You now have the complete blueprint: from the problem with stateless generalists to a running production platform. The concepts are transferable — the seed file pattern, the boot/shutdown protocol, deterministic routing, authority ceilings, circuit breakers. These apply whether you are building on OpenClaw, on raw Claude Code sessions, or on any other agent execution environment.
The platform that runs well at three agents scales to thirty. Build it right once.