// Track

Testing AI-Adjacent Systems

Evaluation, Audit, and Quality Assurance for AI Pipelines

Design evaluations for agent outputs, run audit swarms, handle knowledge cutoff as a testing concern, and build LLM-as-judge systems for automated quality scoring. Drawn from real audit runs across Knox's fleet — including the SP-001 false positive incident and the Autoresearch prompt quality system.

Recommended: Complete "Tests Pass ≠ System Works" first

6 lessons · ~55 min total

Lessons are shown in recommended order. Complete them in sequence for the best experience — or jump to any lesson.

Lesson 295·9 min read

The Audit Swarm Pattern

Five roles, 277 lessons, one pass — with N Auditor instances running in parallel. How to architect a multi-agent audit that covers what no human reviewer can — and why the Fact-Checker is the only thing standing between your swarm and a report full of false positives.

Lesson 296·8 min read

Knowledge Cutoff as a Testing Concern

The SP-001 incident: an audit swarm flagged 25 valid model IDs as CRITICAL errors because its training data predated the Claude 4.6 release. How grounding documents prevent AI systems from confidently invalidating their own outputs.
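As a rough illustration of the grounding idea, the sketch below has an auditor check model IDs against a supplied grounding list instead of its own recollection. The function name `audit_model_id`, the severity labels, and the placeholder ID are hypothetical and not taken from Knox's system. The only point shown is that an ID can be marked CRITICAL when a grounding document exists and omits it, never because the auditor does not recognize it.

```python
from typing import Iterable, Optional


def audit_model_id(model_id: str, grounded_ids: Optional[Iterable[str]]) -> str:
    """Return a severity label for one model ID found during an audit.

    grounded_ids is the list of valid IDs read from a grounding document
    (hypothetical format: one ID per entry). None means no document was given.
    """
    if grounded_ids is None:
        # Without grounding, the auditor only has training data that may
        # predate the ID, so it declines to escalate.
        return "UNVERIFIED"
    allowed = set(grounded_ids)
    return "OK" if model_id in allowed else "CRITICAL"


if __name__ == "__main__":
    # "example-model-2" is a placeholder, not a real model ID.
    print(audit_model_id("example-model-2", None))
    print(audit_model_id("example-model-2", ["example-model-2"]))
```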

Lesson 297·9 min read

LLM-as-Judge: Automated Quality Scoring for Prompts

How Knox built a system that scores, rewrites, and auto-applies improvements to its own skill library — the five-dimension rubric, the delta-gate, the overflow-reject behavior, and why you need empirical calibration before trusting any judge score.

Lesson 548·9 min read

Regression-Testing Prompt Pipelines

Prompts are code. When you change a prompt, something downstream changes too — and without a regression harness, you will not know until a user notices. Golden-set fixtures, output diffing, and CI gates that block prompt regressions the same way they block code regressions.
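A minimal sketch of a golden-set check with output diffing, assuming the pipeline can be called as a plain function from input text to output text. The names `run_golden_set` and `gate_exit_code` are illustrative, and exact-match diffing is a simplifying assumption: nondeterministic model output would likely need normalization or a looser comparison.

```python
import difflib
from typing import Callable, Dict, List


def diff_output(expected: str, actual: str) -> List[str]:
    """Unified diff between a golden output and the current output."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="golden",
            tofile="current",
            lineterm="",
        )
    )


def run_golden_set(
    pipeline: Callable[[str], str], golden: Dict[str, str]
) -> Dict[str, List[str]]:
    """Run every golden input through the pipeline and collect non-empty diffs."""
    failures: Dict[str, List[str]] = {}
    for prompt_input, expected in golden.items():
        diff = diff_output(expected, pipeline(prompt_input))
        if diff:
            failures[prompt_input] = diff
    return failures


def gate_exit_code(failures: Dict[str, List[str]]) -> int:
    """Exit code for a CI step: non-zero when any golden case regressed."""
    return 1 if failures else 0
```

In a CI job, a script could call `sys.exit(gate_exit_code(run_golden_set(pipeline, golden)))` so that a prompt change that alters a golden output blocks the merge until the fixture is reviewed and updated.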

Lesson 549·10 min read

Adversarial Judge-Gaming

LLM-as-judge evaluations look objective but are gameable in at least five documented ways. Sycophancy, length bias, format bias, self-preference, and position effects all inflate scores on content that should not pass. Here is how to detect each bias and build a judge that resists them.
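One of the listed biases, position effects, can be probed with a swap test: show each pair to the judge in both orders and count how often the preferred answer changes. The sketch below assumes a pairwise judge exposed as a function returning "A" (first shown wins) or "B" (second shown wins); that interface and the name `position_flip_rate` are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: returns "A" if the first-shown answer wins,
# "B" if the second-shown answer wins.
Judge = Callable[[str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs where the verdict depends on presentation order."""
    flips = 0
    for first, second in pairs:
        forward = judge(first, second)
        backward = judge(second, first)
        # A position-independent judge picks the same underlying answer,
        # so the label must switch when the order is swapped.
        consistent = {forward, backward} == {"A", "B"}
        if not consistent:
            flips += 1
    return flips / len(pairs)
```

A judge that always prefers whichever answer it sees first scores 1.0 on this check, while a judge whose choice depends only on content scores 0.0. The same swap-and-compare structure could be adapted to the other biases, for example by padding an answer to test length bias.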

Lesson 550·10 min read

Building a Calibration Harness

Confidence scores are useless until you measure whether they predict anything. Reliability curves, Brier scores, bucketed accuracy, and the ECE gate tell you whether your model's stated confidence corresponds to actual accuracy — and when it does not, how far off it is.
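The metrics named above can be sketched in a few lines of plain Python. This version assumes binary outcomes (1 for correct, 0 for incorrect), equal-width confidence buckets, and an ECE threshold of 0.1; the threshold and all function names are illustrative assumptions, not values from the lesson.

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def bucketed_accuracy(
    confidences: Sequence[float], outcomes: Sequence[int], n_buckets: int = 10
) -> List[Tuple[float, float, int]]:
    """(mean confidence, accuracy, count) for each non-empty bucket.

    These rows are also the points of a reliability curve.
    """
    buckets: List[List[Tuple[float, int]]] = [[] for _ in range(n_buckets)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_buckets), n_buckets - 1)  # confidence 1.0 goes in the top bucket
        buckets[idx].append((c, o))
    rows = []
    for bucket in buckets:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(o for _, o in bucket) / len(bucket)
            rows.append((mean_conf, accuracy, len(bucket)))
    return rows


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_buckets: int = 10
) -> float:
    """Count-weighted average of |confidence - accuracy| across buckets."""
    total = len(confidences)
    rows = bucketed_accuracy(confidences, outcomes, n_buckets)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)


def passes_ece_gate(
    confidences: Sequence[float], outcomes: Sequence[int], threshold: float = 0.1
) -> bool:
    """Gate check: True when ECE is at or below the (assumed) threshold."""
    return expected_calibration_error(confidences, outcomes) <= threshold
```

For example, four predictions at 0.9 confidence with three correct, plus two at 0.1 confidence with none correct, give an ECE of about 0.13, which would fail a 0.1 gate even though the confidences look plausible at a glance.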