The Audit Swarm — Four Agents, One Codebase, Zero Blind Spots
Running a single AI reviewer over a codebase gives you one perspective. Running four specialized domain agents in parallel gives you a comprehensive audit in the same amount of time — with findings categorized, prioritized, and ready to act on.
Most code reviews are sequential. One reviewer reads the codebase, takes notes, produces a report. The problem with that model is not the quality of the reviewer — it is the cognitive cost of context switching. The same person cannot simultaneously think like a security auditor probing for injection vectors, an architect analyzing coupling patterns, a performance engineer measuring hot paths, and a testing engineer identifying coverage gaps. Switching between these mental models degrades each one.
The audit swarm solves this with specialization and parallelism. Instead of one reviewer context-switching, you run four agents simultaneously, each carrying a single deep domain model, each writing to its own output file. You get four expert perspectives in the same wall-clock time it would take one reviewer to produce one perspective.
This lesson breaks down exactly how the swarm works, what each agent looks for, and what the output structure looks like. The next lessons cover prioritization and dispatch — but the foundation is the swarm architecture itself.
The Four Domain Agents
The audit swarm dispatches four agents. Each receives the same codebase path and a domain-specific system prompt. Each writes an independent findings file. None of them wait for the others.
Agent 1 — Security Auditor
The security agent operates with a threat-modeling mindset. It reads authentication and authorization code first, then follows the data flow from external input to persistence. It looks for:
- Auth bypass patterns: middleware that accepts any non-empty token, endpoints that skip authorization checks, hardcoded credentials
- Injection vectors: SQL query construction via f-strings, shell commands with interpolated arguments, deserialization of untrusted input
- Token and secret handling: raw tokens logged at INFO level, secrets stored in audit tables, tokens returned in response bodies without access control
- Privilege escalation: any agent rotating another agent's token, unregistered agents granted full authority, synthetic test endpoints exposed in production
The security agent's output format is deliberate: every finding includes the exact file path and line numbers, a description of the vulnerability class, and the potential impact. This is not soft feedback — it is a bug report.
In the Principal Broker audit, the security agent found 16 findings across P0 to P2. The two P0 findings it surfaced were deployment blockers: an auth middleware that accepted any non-empty bearer token (meaning the entire broker ran unauthenticated), and a SQL injection vector built via f-string interpolation. Both were fixed within hours of the audit completing.
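The f-string injection pattern is worth seeing concretely. A minimal sketch (table and column names are hypothetical, not the broker's actual schema):

```python
import sqlite3

def get_metrics_unsafe(conn: sqlite3.Connection, agent_id: str):
    # VULNERABLE: the pattern the security agent flags. A crafted agent_id
    # like "x' OR '1'='1" rewrites the query and returns every row.
    return conn.execute(
        f"SELECT * FROM metrics WHERE agent_id = '{agent_id}'"
    ).fetchall()

def get_metrics_safe(conn: sqlite3.Connection, agent_id: str):
    # FIXED: parameterized query; the driver treats the value as data,
    # never as SQL.
    return conn.execute(
        "SELECT * FROM metrics WHERE agent_id = ?", (agent_id,)
    ).fetchall()
```

The fix is mechanical once found, which is why findings at this level of specificity (file, line, vulnerability class) resolve in hours rather than days.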
Agent 2 — Architecture Analyst
The architecture agent reads for structure rather than behavior. It is looking for coupling violations, dead code, and systemic design problems that accumulate technical debt. It looks for:
- Dead code and missing wiring: fully implemented subsystems that are never called from the main pipeline
- Layer violations: business logic inside API handlers, infrastructure concerns leaking into domain models
- Fragile path derivation: DB paths computed by string manipulation on other paths, which fails silently when the derivation assumption breaks
- Stub implementations returning success: safety-critical functions (`_revoke_all_tokens`, `_lock_env_files`) that return `True` while doing nothing — the system reports success on operations that never happen
The architectural findings from the Principal Broker audit included three P0 discoveries that were more dangerous than they appeared. The finops subsystem (cost tracking, budget enforcement, loop detection) was fully implemented — 400+ lines of production code — and never wired into the main message pipeline. It could not execute. Same for the feedback protocols. These were not TODO comments or stubs. They were complete implementations that were never connected.
Agent 3 — Performance Engineer
The performance agent looks for patterns that degrade throughput under load. It is particularly sensitive to operations on hot paths — the code that runs on every message, every request, every event. It looks for:
- Connection-per-call anti-patterns: SQLite connections created and destroyed on every call in a crisis path, HTTP clients instantiated per-message instead of shared
- Sync operations blocking the event loop: synchronous SQLite commits inside async handlers, fsync barriers on every message
- Redundant computation: the same query called twice consecutively at startup, no caching on hot-path lookups
- Resource lifecycle problems: resources created inside handler functions that should live in application state
In the Principal Broker audit, the performance agent found 28 findings across P1 to P3. The most impactful was a connection-per-call pattern in the halt store — the crisis-path code that needed to be fast was creating a new SQLite connection on every call. Fixed: the connection is now persistent, opened once at startup, reused for the lifetime of the process.
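The before/after shape of that fix can be sketched as follows (the `halts` table and class names are hypothetical stand-ins for the halt store):

```python
import sqlite3

class HaltStorePerCall:
    """Anti-pattern: a new connection per call, on the crisis path."""

    def __init__(self, path: str):
        self.path = path

    def is_halted(self, agent_id: str) -> bool:
        conn = sqlite3.connect(self.path)  # connect + teardown every call
        try:
            row = conn.execute(
                "SELECT 1 FROM halts WHERE agent_id = ?", (agent_id,)
            ).fetchone()
            return row is not None
        finally:
            conn.close()

class HaltStorePersistent:
    """Fixed: one connection, opened at startup, reused for the process."""

    def __init__(self, path: str):
        self._conn = sqlite3.connect(path)

    def is_halted(self, agent_id: str) -> bool:
        row = self._conn.execute(
            "SELECT 1 FROM halts WHERE agent_id = ?", (agent_id,)
        ).fetchone()
        return row is not None
```

The query is identical in both versions; only the connection lifecycle changes, which is exactly why this class of finding is invisible unless someone is reading specifically for resource lifetimes.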
Agent 4 — Testing Engineer
The testing agent reads coverage reports alongside source code. It does not just count uncovered lines — it evaluates the risk posture of the gaps. A missing test on a utility function is different from a missing test on the composition of a safety pipeline. It looks for:
- Composition gaps: individual components tested in isolation but their wiring in the actual runtime handler untested
- Placeholder tests: test files that exist solely to prevent `pytest` exit code 5, containing only `pass` or trivial truthiness assertions
- Coverage below floor: the 90% floor is not arbitrary — the agent looks for which files are dragging coverage down and quantifies the ROI of covering them
- Missing regression tests: bug classes that are known to exist but have no test preventing re-introduction
The Principal Broker testing audit's most dangerous finding was a composition gap: every safety component (hard blocks, authority ceilings, audit-before-dispatch) had unit tests in isolation, but the 7-step message pipeline that composed them in `_make_message_handler` had zero tests. A wiring bug — wrong order, missing await, exception swallowing — would pass all existing tests and only surface in production.
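A composition test closes that gap by exercising the wiring rather than the steps. A minimal sketch, with a hypothetical handler factory and step names standing in for the real pipeline:

```python
import asyncio

def make_message_handler(steps):
    # Hypothetical stand-in for the real handler factory: composes the
    # pipeline steps in order. The test below checks the composition.
    async def handler(msg):
        for step in steps:  # order matters; a wiring bug breaks here
            msg = await step(msg)
        return msg
    return handler

def test_pipeline_composition():
    calls = []

    def record(name):
        async def step(msg):
            calls.append(name)
            return msg
        return step

    handler = make_message_handler(
        [record("hard_block"), record("authority"),
         record("audit"), record("dispatch")]
    )
    asyncio.run(handler({"body": "hi"}))
    # Wrong order, a missing await, or a swallowed exception fails this
    # assertion in the test suite, not in production.
    assert calls == ["hard_block", "authority", "audit", "dispatch"]
```

Each `record` step here is trivial on purpose: the unit tests already cover step behavior, so the composition test only needs to prove that every step runs, in order, exactly once.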
The Parallel Execution Model
The four agents run concurrently. Each one:
- Receives the codebase path and its domain-specific prompt
- Reads files relevant to its domain
- Writes findings to its output file: `security.md`, `architecture.md`, `performance.md`, or `testing.md`
- Completes independently
Because each agent writes to its own file, there is no coordination overhead and no merge conflicts. The outputs are additive.
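The dispatch loop itself is small. A minimal sketch assuming each agent is an async function that writes its own report file (the prompts are abbreviated placeholders and `run_agent` stands in for the real model call):

```python
import asyncio
from pathlib import Path

# Abbreviated placeholder prompts, one per domain.
DOMAINS = {
    "security": "You are a security auditor. Threat-model the codebase...",
    "architecture": "You are an architecture analyst. Read for structure...",
    "performance": "You are a performance engineer. Find hot-path costs...",
    "testing": "You are a testing engineer. Evaluate coverage risk...",
}

async def run_agent(domain: str, prompt: str,
                    codebase: Path, out_dir: Path) -> Path:
    # Stand-in for the real model call. Each agent writes only its own
    # file, so there is nothing to coordinate and nothing to merge.
    report = out_dir / f"{domain}.md"
    report.write_text(f"# {domain.title()} findings for {codebase}\n")
    return report

async def run_swarm(codebase: Path, out_dir: Path) -> list:
    # All four agents start at once: wall-clock time is the duration
    # of the slowest agent, not the sum of all four.
    tasks = [run_agent(d, p, codebase, out_dir)
             for d, p in DOMAINS.items()]
    return await asyncio.gather(*tasks)
```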
After all four complete, a synthesis step aggregates findings into `MASTER-SUMMARY.md`. This file:
- Consolidates all findings into a unified priority table (P0 through P3)
- Identifies systemic issues that span multiple domains
- Produces the ordered fix list for the resolution phase
The synthesis step takes minutes. The domain agents take 10-15 minutes each. Running them in parallel means the total audit time is the duration of the slowest agent, not the sum of all agents.
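The aggregation half of synthesis is mechanical. A sketch, with findings represented as plain dicts (a hypothetical shape, not the swarm's actual data model):

```python
from collections import Counter

def synthesize(findings):
    # Roll all domain findings up into the MASTER-SUMMARY priority table.
    counts = Counter(f["priority"] for f in findings)
    lines = ["| Priority | Count | Status |",
             "|----------|-------|--------|"]
    for p in ("P0", "P1", "P2", "P3"):
        fixed = sum(1 for f in findings
                    if f["priority"] == p and f["status"] == "fixed")
        lines.append(f"| {p} | {counts.get(p, 0)} | {fixed} fixed |")
    return "\n".join(lines)
```

The counting is trivial; the valuable part of synthesis is the judgment layered on top, identifying the cross-domain systemic issues and ordering the fix list.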
What the Output Looks Like
Each domain report follows a consistent schema:
| Priority | Category | Location | Issue | Status |
|----------|----------|----------|-------|--------|
Every finding has an exact location (file path, line numbers), a category (injection, auth bypass, connection anti-pattern, coverage gap, etc.), and a clear statement of impact. No vague feedback. No "consider refactoring." Specific, addressable findings.
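One way to carry that schema in code (a hypothetical sketch, not the swarm's actual data model):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    priority: str   # P0 (deployment blocker) through P3 (minor)
    category: str   # e.g. "injection", "auth bypass", "coverage gap"
    location: str   # exact file path and line numbers
    issue: str      # specific, addressable statement of impact
    status: str     # "open" or "fixed"

    def as_row(self) -> str:
        # Renders one row of the domain report's markdown table.
        return (f"| {self.priority} | {self.category} | {self.location} "
                f"| {self.issue} | {self.status} |")
```

Forcing every finding through a fixed schema is what keeps "consider refactoring" out of the reports: a row with an empty location or a vague issue field is visibly incomplete.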
The `MASTER-SUMMARY.md` aggregates these into a priority table:
| Priority | Count | Status |
|----------|-------|--------|
| P0 | 7 | 7 fixed |
| P1 | 26 | 26 fixed |
| P2 | 45 | 45 fixed |
| P3 | 6 | 6 fixed |
| Total | 84 | 84 fixed, 0 open |
84 findings across four domains. Produced in parallel. Ready to prioritize and resolve.
Why Specialization Compounds
The value of domain specialization is not additive — it compounds. A security agent running through auth code without the cognitive overhead of simultaneously evaluating test coverage produces a different quality of security finding. It can hold the entire threat model in its context window. It can trace the full auth bypass path from entry point to exploitation. It can identify patterns that a general reviewer would miss because the general reviewer's attention is divided.
Multiply that effect across four domains and you have a compound advantage: not four audits added together, but four deep audits each operating at full capacity.
In the Principal Broker case, this produced 84 findings in a single session. Many of those findings would not have surfaced in a standard code review. The SQL injection via f-string in `okr_sentinel_metrics.py` is exactly the kind of finding that gets missed when the reviewer is also thinking about test coverage and coupling patterns. The security agent found it because that is all it was thinking about.
Lesson 217 Drill
Design your own audit swarm for a codebase you own. Map out:
- What four domains matter most for your system?
- What are the top 5 things each domain agent should look for, specific to your tech stack?
- What output format would make findings most actionable for your team?
- Where does your current review process produce shallow findings because of context switching?
You do not need to run the swarm today. The design exercise builds the mental model. When you are ready to run it, the structure is already there.
Bottom Line
The audit swarm is a force multiplier. Four specialized agents running in parallel produce better findings than one general reviewer running sequentially — not because any single agent is smarter, but because specialization eliminates context switching and parallelism eliminates serial wait time. The output is a structured, prioritized findings document ready for the resolution phase covered in Lesson 218.