ASK KNOX
LESSON 295

The Audit Swarm Pattern

Five agents, 277 lessons, one pass. How to architect a multi-agent audit that covers what no human reviewer can — and why the Fact-Checker is the only thing standing between your swarm and a report full of false positives.

9 min read

On April 13, 2026, Knox ran a five-agent audit swarm across the entire academy — 277 lessons in a single pass. The swarm produced 341 findings. No human could have done this in the same timeframe, let alone with the same coverage.

This is the audit swarm pattern: a coordinated multi-agent architecture that makes AI-scale content auditing tractable. The pattern applies to any corpus that is too large for human review but too important to go unreviewed — code repositories, documentation libraries, training datasets, policy documents.

Why AI-Built Systems Need AI-Powered Audits

The problem is asymmetry. A human reviewer can meaningfully read and verify perhaps 20-30 lessons per day with focus — checking technical accuracy, validating code examples, verifying API references, and assessing pedagogical quality. At that rate, a 277-lesson corpus takes two to three weeks of sustained review.

An audit swarm running five parallel agents can cover the same corpus in a single session. Not because the agents work faster on any individual lesson, but because they work simultaneously. While one Auditor evaluates lessons 1-5, another evaluates lessons 50-54, another 100-104, and so on.
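The fan-out is, at its core, a plain parallel map over chunks. The sketch below illustrates that shape only: `run_auditor` is a hypothetical stand-in for invoking one Auditor agent on one chunk, not a real agent call.

```python
# Sketch of the swarm's parallelism: chunks of ~5 lessons dispatched
# concurrently. run_auditor is a placeholder for a real agent invocation.
from concurrent.futures import ThreadPoolExecutor

def run_auditor(chunk):
    # Placeholder: a real swarm would call an LLM agent here with the
    # chunk's lesson content and the rubric.
    return [f"finding for lesson {n}" for n in chunk]

lessons = list(range(1, 278))  # 277 lessons
chunks = [lessons[i:i + 5] for i in range(0, len(lessons), 5)]

with ThreadPoolExecutor(max_workers=8) as pool:
    findings = [f for result in pool.map(run_auditor, chunks) for f in result]
```

The point is structural: total wall-clock time is bounded by the slowest chunk, not the sum of all chunks.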

The human reviewer's role shifts from execution to architecture and governance: design the rubric, review the Fact-Checker's output, make judgment calls on ambiguous findings, and decide which systemic patterns to address first.

The Five-Agent Architecture

The academy audit used five specialized roles, each with a distinct function:

Agent 1 — The Registrar

The Registrar's job is enumeration. It catalogs every piece of content in the corpus: every lesson number, every track, every slug, every frontmatter field. It produces a structured manifest that the Navigator uses to divide work.

The Registrar runs first and runs alone. Its output is the foundation for everything else. A manifest error — a missing lesson, a wrong slug, a corrupted frontmatter parse — propagates into every Auditor's assignment. Validate the Registrar's output before proceeding.
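A minimal sketch of that validation gate, using illustrative field names (`lesson`, `track`, `slug`) rather than the academy's actual manifest schema:

```python
# Fail fast before the Navigator runs: any manifest error here would
# otherwise propagate into every Auditor's assignment.
def validate_manifest(manifest, expected_count):
    errors = []
    numbers = [entry["lesson"] for entry in manifest]
    if len(manifest) != expected_count:
        errors.append(f"expected {expected_count} lessons, found {len(manifest)}")
    missing = set(range(1, expected_count + 1)) - set(numbers)
    if missing:
        errors.append(f"missing lesson numbers: {sorted(missing)}")
    slugs = [entry["slug"] for entry in manifest]
    if len(slugs) != len(set(slugs)):
        errors.append("duplicate slugs detected")
    return errors

manifest = [{"lesson": n, "track": "core", "slug": f"lesson-{n}"}
            for n in range(1, 278)]
errors = validate_manifest(manifest, 277)  # halt the swarm if non-empty
```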

Agent 2 — The Navigator

The Navigator takes the Registrar's manifest and divides it into chunks of approximately five lessons each. It assigns each chunk to an Auditor instance along with the full rubric.

The Navigator makes two key decisions: chunk size and assignment strategy. Five lessons per Auditor is the calibrated size for this corpus and context-window budget. Larger chunks risk context overflow mid-audit. Smaller chunks increase overhead and reduce the cross-lesson pattern detection that comes from seeing multiple lessons simultaneously.
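The Navigator's output might look like the following sketch; the assignment fields and the placeholder rubric are assumptions for illustration:

```python
# One assignment per Auditor: a chunk of ~5 lessons paired with the full
# rubric and the required output format.
CHUNK_SIZE = 5  # calibrated for this corpus and context-window budget

def make_assignments(manifest, rubric, chunk_size=CHUNK_SIZE):
    assignments = []
    for i in range(0, len(manifest), chunk_size):
        assignments.append({
            "lessons": manifest[i:i + chunk_size],
            "rubric": rubric,
            "output_format": "structured_json",
        })
    return assignments

manifest = [{"lesson": n} for n in range(1, 278)]
assignments = make_assignments(manifest, {"criteria": ["..."]})
```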

Agents 3-N — The Auditors (parallel)

Each Auditor receives:

  • A list of 5 lesson numbers
  • The full lesson content for each
  • A structured rubric with explicit criteria per severity level
  • A required output format (structured JSON, not prose)

Auditors run in parallel. Each produces a list of findings for its assigned lessons. The rubric must be explicit — vague criteria produce vague findings. "Technical accuracy" is not a rubric criterion. "API endpoint matches current documentation" is.
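An illustrative finding record in that structured format — the field names and claim text are assumptions, not the academy's actual schema:

```python
# One finding, one record. Structured fields make findings checkable by
# the Fact-Checker; free-form prose does not.
finding = {
    "lesson": 142,
    "criterion": "API endpoint matches current documentation",
    "severity": "CRITICAL",
    "location": "code block 2",
    "claim": "code example references a deprecated endpoint",
    "evidence": "current documentation lists a different path",
}

REQUIRED_FIELDS = {"lesson", "criterion", "severity", "claim", "evidence"}
is_valid = REQUIRED_FIELDS <= finding.keys()  # reject free-form findings
```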

Agent N+1 — The Fact-Checker

The Fact-Checker is the quality gate. It receives all findings from all Auditors and validates each one against authoritative sources before it reaches the final report.

Without the Fact-Checker, the swarm's output is raw — a mix of real issues and false positives, with no mechanism to distinguish them. The Fact-Checker eliminates the false positives. This is not optional. An unverified finding that reaches the final report gets actioned. An incorrect finding that gets actioned wastes engineering time and may introduce new errors.
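A minimal sketch of the gate: `verify` below is a hypothetical callback standing in for the actual check against authoritative sources.

```python
# Every finding must pass verification before it reaches the Compiler;
# rejected findings are kept for inspection, not silently dropped.
def fact_check(findings, verify):
    verified, rejected = [], []
    for f in findings:
        (verified if verify(f) else rejected).append(f)
    return verified, rejected

findings = [
    {"claim": "model id is outdated", "valid": True},
    {"claim": "import is missing", "valid": False},  # false positive
]
verified, rejected = fact_check(findings, lambda f: f["valid"])
```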

Final Agent — The Compiler

The Compiler aggregates verified findings into the final report. It applies severity classification, identifies systemic patterns, and produces a prioritized action list. The Compiler's output is what the operator acts on.
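A minimal Compiler sketch, assuming the four severity labels used in the framework that follows and sorting verified findings into a prioritized list:

```python
# Severity rank drives priority: CRITICAL first, LOW last.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def compile_report(verified_findings):
    ordered = sorted(verified_findings,
                     key=lambda f: SEVERITY_RANK[f["severity"]])
    counts = {}
    for f in ordered:
        counts[f["severity"]] = counts.get(f["severity"], 0) + 1
    return {"prioritized": ordered, "counts": counts}

report = compile_report([
    {"lesson": 12, "severity": "LOW"},
    {"lesson": 7, "severity": "CRITICAL"},
    {"lesson": 30, "severity": "HIGH"},
])
```

Pattern identification (covered below) would sit alongside this ordering step in a real Compiler.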

Severity Framework

The academy audit used four severity levels. These map to priority and action type:

A CRITICAL finding requires immediate action before the lesson is assigned to a student. A HIGH finding is addressed in the current sprint. MEDIUM and LOW are batched and addressed in maintenance windows.

The key property of CRITICAL: it is not about the importance of the subject matter. It is about whether the finding causes the student to build something that does not work. A wrong API credential format is CRITICAL. An outdated screenshot is LOW, regardless of how important the underlying feature is.
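The mapping can be written down as a lookup table — the dict form is illustrative, but the actions are the ones just described:

```python
# Severity determines action type, not just priority.
ACTION = {
    "CRITICAL": "fix before the lesson is assigned to a student",
    "HIGH": "address in the current sprint",
    "MEDIUM": "batch for a maintenance window",
    "LOW": "batch for a maintenance window",
}
```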

Systemic Patterns: The Leverage Point

The academy audit produced 341 individual findings. But the Compiler identified 10 systemic patterns that accounted for the majority of them.

Systemic patterns are not just "many instances of the same error." They are structural: a single root cause that produces dozens of downstream findings. Examples from the academy audit:

  • Outdated model IDs: every lesson that referenced a specific model ID was potentially wrong. Fixing the pattern (updating the reference list and propagating it) eliminated dozens of individual findings simultaneously.
  • Missing quiz answer variety: many lessons had quiz answers clustered at position 0 or 1. One rubric change in the generation template fixes this going forward.
  • Stale component imports: lessons generated before MDX import rules changed still had top-level import statements.

Fixing instance by instance instead of fixing the source is the most common mistake when acting on audit output. It costs more time and leaves the source in place to generate new instances.
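One way to work at the right level is to collapse verified findings by a root-cause key before fixing anything, so the largest groups surface first. The `root_cause` field and example keys below are illustrative:

```python
# Group findings by root cause; the biggest groups are the
# highest-leverage fixes.
from collections import defaultdict

def group_by_root_cause(findings):
    patterns = defaultdict(list)
    for f in findings:
        patterns[f["root_cause"]].append(f)
    return sorted(patterns.items(), key=lambda kv: len(kv[1]), reverse=True)

findings = (
    [{"lesson": n, "root_cause": "outdated-model-id"} for n in range(1, 40)]
    + [{"lesson": n, "root_cause": "stale-mdx-import"} for n in range(1, 12)]
    + [{"lesson": 5, "root_cause": "typo"}]
)
patterns = group_by_root_cause(findings)
```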

What to Delegate vs. Review Personally

Not every finding is agent-safe to fix. The rule of thumb:

Agent-safe: string replacements, numbering corrections, formatting fixes, clearly wrong imports, definitive factual errors with a known correct value.

Human review required: quiz answer corrections (a wrong answer may be intentionally wrong to test misconceptions), security-related content (an agent rewriting security advice may introduce subtle vulnerabilities), architectural decisions (choosing between two valid approaches requires product judgment), and anything where the "correct" answer requires understanding Knox's specific operational context.

When you act on audit output, categorize findings by this rule before dispatching fix agents. Dispatching an agent to fix quiz answers that require judgment about the lesson's pedagogical intent is worse than leaving them for human review.
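A sketch of that triage step — the category labels are assumptions, and anything unrecognized defaults to human review rather than to a fix agent:

```python
# Route findings either to a fix agent or to a human review queue.
AGENT_SAFE = {"string-replacement", "numbering", "formatting",
              "wrong-import", "known-value-error"}

def triage(findings):
    dispatch, review = [], []
    for f in findings:
        if f["category"] in AGENT_SAFE:
            dispatch.append(f)
        else:
            review.append(f)  # default to human review when unsure
    return dispatch, review

dispatch, review = triage([
    {"id": 1, "category": "formatting"},
    {"id": 2, "category": "quiz-answer"},
    {"id": 3, "category": "unknown"},
])
```

Defaulting to the human queue is the conservative choice: a misrouted agent-safe fix costs a review cycle, while a misrouted judgment call can corrupt content.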

Applying the Pattern to Your Own Systems

The audit swarm pattern generalizes beyond content. Any corpus where:

  • The scope is too large for human review at the required cadence
  • Quality matters enough to warrant systematic checking
  • Errors have structured types that a rubric can enumerate

...is a candidate for an audit swarm.

Code review across a monorepo. Documentation consistency across a large site. Configuration validation across a multi-service deployment. The five-agent architecture scales to any of these with the same structure: Registrar, Navigator, Auditors, Fact-Checker, Compiler.

The implementation cost is mostly in rubric design. The rubric is the hardest part — not because it is technically complex, but because it requires you to enumerate the failure modes of your system precisely enough that an LLM can apply them consistently.

Spend twice as long on the rubric as you think you need to. The audit is only as good as the rubric it runs against.