ASK KNOX
LESSON 113

Validation Agents: AI That Checks AI

The validator pattern: a second agent whose only job is to distrust the first. When you cannot trust the primary agent's confidence, you build a second agent whose system prompt is adversarial. Here's how to design validation that actually catches failures.


A validation agent is not a second opinion. It is an adversary.

The distinction matters. A second opinion asks: does this look right to you? An adversary asks: how could this be wrong? These produce fundamentally different review behaviors. The second opinion agent will often agree with the primary, because the output is plausible and there is no explicit instruction to distrust it. The adversary agent, given the explicit mandate to find failure, will probe the logic, check the facts, look for edge cases, and flag the things a collegial reviewer would wave through.

If your validation agent is agreeing with the primary agent 95% of the time, your validation agent is not doing its job. It is performing validation theater.


The Validator Pattern

The architecture is structurally simple. The primary agent executes the task and produces output. That output is passed to the validation agent before any downstream action is taken. The validation agent runs its review. If the review passes, the output proceeds. If it fails, the output is rejected and the system routes to retry, replan, or escalation depending on the failure category.

The complexity is not in the structure. It is in the validator's system prompt and the failure routing logic. Get those right and the pattern is highly effective. Get them wrong and you have the theater problem.
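The structure can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: `primary` and `validator` stand in for real agent calls, and the retry and escalation routing is simplified to the two paths described above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    passed: bool
    issues: list[str]  # specific problems found; empty on pass

def run_pipeline(
    task: str,
    primary: Callable[[str, list[str]], str],    # primary agent (stub)
    validator: Callable[[str, str], Verdict],    # adversarial reviewer (stub)
    max_retries: int = 2,
) -> tuple[str, str]:
    """Primary produces output; the validator reviews it before any
    downstream action. Failed reviews route to retry, then escalation."""
    feedback: list[str] = []
    for _ in range(max_retries + 1):
        output = primary(task, feedback)
        verdict = validator(task, output)
        if verdict.passed:
            return ("pass", output)
        feedback = verdict.issues        # retry with explicit failure feedback
    return ("escalate", "; ".join(feedback))  # retries exhausted: human review
```

Note that the validator sits between the primary and every downstream action: nothing proceeds until the review returns a verdict.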

The validator system prompt must be adversarial. Explicitly. Not: "review this output for quality." That is the second opinion prompt. The adversarial prompt reads: "Your job is to find what is wrong with this output. Assume the primary agent made an error. Your task is to identify it. If you cannot find an error after rigorous review, only then conclude the output is acceptable."

This instruction produces meaningfully different behavior. It activates the model's capacity for skepticism rather than its capacity for charitable interpretation.

The Four Types of Validation

Not all validation is the same. Different task types require different validation focus. Building a validation agent means choosing what kind of failure it is optimized to catch.

Type 1: Factual accuracy validation. The validator checks whether claims in the primary output are true. This requires the validator to have access to authoritative sources — either retrieved at validation time or embedded in its context. The validator compares each factual claim against sources and flags discrepancies. This is the most expensive type of validation but the most critical for research synthesis and factual content tasks.

Type 2: Format compliance validation. The validator checks whether the output conforms to the required structure, schema, or format. This is structurally simpler and often amenable to schema validation (JSON schema, Pydantic models) rather than full LLM review. For tasks with strict output requirements — API responses, database records, structured reports — format validation catches the class of errors that deterministic tools can handle without burning LLM tokens.
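A deterministic format check can be as simple as the following sketch. It uses the standard library as a stand-in for a Pydantic model or JSON schema; the `REQUIRED_FIELDS` table is a hypothetical output contract, not a real one.

```python
import json

# Stand-in for a Pydantic model: required fields and their expected types
# for a hypothetical structured-report output.
REQUIRED_FIELDS = {"title": str, "score": float, "sources": list}

def check_format(raw: str) -> list[str]:
    """Deterministic structural validation; runs before any LLM review."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    issues = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            issues.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            issues.append(f"wrong type for {field}: expected {expected.__name__}")
    return issues
```

Every error this function catches is one the LLM validator never has to see.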

Type 3: Logic consistency validation. The validator checks whether the reasoning in the output is internally coherent. Does the conclusion follow from the premises? Are there contradictions between claims? This is the hardest type to automate and the type where adversarial prompting is most important. The validator must not just read the output — it must reason about whether the reasoning is valid.

Type 4: Safety validation. The validator checks whether the output contains content or instructions that could cause harm — to users, to the system, or to third parties. For agents with write access to production systems, safety validation is the gate before any destructive or irreversible action.

Validation Prompts vs. Validation Schemas

A common architectural decision point: when does validation belong to an LLM prompt, and when does it belong to a schema or deterministic validator?

The answer depends on what kind of validation you need.

Use schemas for structural validation. If the output must be valid JSON, parse it as JSON. If it must conform to a Pydantic model, validate it with Pydantic. These validations are fast, free, and deterministic — they always give the same result for the same input. They catch the class of errors where the output is structurally malformed. They do not catch the class where the output is structurally correct but semantically wrong.

Use LLM prompts for semantic validation. When the question is "is this reasoning sound?" or "are these facts accurate?" or "would this action be harmful?", you need an LLM to answer it. Schema validation cannot catch a factual hallucination that is valid JSON. Prompt-based validation can.

The optimal architecture layers both. Schema validation runs first — it is cheap and fast. Outputs that fail schema validation are rejected before burning LLM tokens on semantic review. Outputs that pass schema validation proceed to LLM-based semantic validation.
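The layering can be expressed as a small gate function. Both checks are passed in as callables here so the sketch stays self-contained; in a real system `semantic_check` would wrap an LLM call.

```python
from typing import Callable

def validate_layered(
    raw_output: str,
    schema_check: Callable[[str], list[str]],    # deterministic, e.g. JSON/Pydantic
    semantic_check: Callable[[str], list[str]],  # LLM-backed review (stubbed here)
) -> tuple[str, list[str]]:
    """Run cheap structural validation first; spend LLM tokens only on
    outputs that are already well-formed."""
    issues = schema_check(raw_output)
    if issues:
        return ("fail:format", issues)           # rejected before any LLM call
    issues = semantic_check(raw_output)
    if issues:
        return ("fail:semantic", issues)
    return ("pass", [])
```

The ordering is the point: the expensive check never runs on output the cheap check has already rejected.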

Building the Adversarial Validator

The practical construction of an adversarial validation prompt has several components that determine its effectiveness.

Explicit failure-seeking instruction. The single most important element. The prompt must tell the model its job is to find what is wrong. "Assume the primary agent made an error" is more effective than "check for errors" because it sets the prior.

Specific failure taxonomy. List the failure types you are looking for. A factual accuracy validator that is told to check for: incorrect dates, cited sources that do not exist, quantitative claims without evidence, and logical non-sequiturs will perform better than one told generically to "check accuracy." Specificity gives the model a search agenda.

Independent context. The validation agent should not share the primary agent's full conversation history. Sharing context risks the validator being primed toward the primary agent's framing. The validator should see only the task specification and the output — not the reasoning chain that produced the output. Reasoning chains are persuasive. You want an adversary, not a sympathetic reader.
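The three elements so far, the failure-seeking instruction, a specific taxonomy, and independent context, can be combined into a single prompt builder. The taxonomy below is the hypothetical factual-accuracy list from above; note that the builder takes only the task specification and the output, never the primary's reasoning chain.

```python
# Hypothetical failure taxonomy for a factual-accuracy validator.
FAILURE_TAXONOMY = [
    "incorrect dates",
    "cited sources that do not exist",
    "quantitative claims without evidence",
    "logical non-sequiturs",
]

def build_validator_prompt(task_spec: str, output: str) -> str:
    """Assemble an adversarial prompt from only the task spec and the
    output, deliberately omitting the primary agent's reasoning chain."""
    checks = "\n".join(f"- {item}" for item in FAILURE_TAXONOMY)
    return (
        "Your job is to find what is wrong with this output. "
        "Assume the primary agent made an error; your task is to identify it.\n"
        f"Check specifically for:\n{checks}\n\n"
        f"Task specification:\n{task_spec}\n\n"
        f"Output under review:\n{output}\n\n"
        "If you cannot find an error after rigorous review, only then "
        "conclude the output is acceptable."
    )
```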

Structured output with required evidence. Require the validator to produce structured output: a verdict (pass/fail), a list of specific issues with supporting evidence, and a confidence score for its own judgment. Structured output makes the validation machine-readable and parseable for routing logic.
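One way to type that structured output, assuming the validator is instructed to reply in JSON, is a pair of dataclasses plus a parser. The field names here are illustrative, not a fixed contract.

```python
import json
from dataclasses import dataclass

@dataclass
class Issue:
    category: str     # e.g. "factual", "logic", "format", "safety"
    description: str
    evidence: str     # quote or source supporting the finding

@dataclass
class ValidationResult:
    verdict: str      # "pass" or "fail"
    issues: list[Issue]
    confidence: float # validator's confidence in its own judgment

def parse_validation(raw_json: str) -> ValidationResult:
    """Parse the validator's structured JSON reply into a typed result
    that the routing logic can act on."""
    data = json.loads(raw_json)
    return ValidationResult(
        verdict=data["verdict"],
        issues=[Issue(**i) for i in data["issues"]],
        confidence=float(data["confidence"]),
    )
```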

The validator is your general-in-chief. Its value is its ability to receive the primary agent's output without being impressed by it. Cool, systematic, looking for failure. Not excited by a well-formatted answer.

When the Validator Catches What the Primary Missed

The validation agent earns its cost on the day it catches an error that would have propagated without it.

In production systems I have run, validation agents catch failures in four patterns:

The confident wrong answer. Primary agent produces a confident, well-structured response that contains a factual error. The validator, prompted to check sources, finds the discrepancy and flags it. Without the validator, the error would have been delivered as fact.

The format-compliant logic failure. Primary agent produces JSON that passes schema validation but contains reasoning errors in string fields. The LLM validator, reading the semantic content, catches the logic problem. Schema validation alone would have missed it.

The edge case scope creep. Primary agent, given a task that is slightly outside its training distribution, extends the task beyond what was specified and takes an additional action that was not requested. The safety validator flags the unauthorized action before it executes.

The context drift error. Primary agent applies a pattern from a previous context to a current task where the pattern is no longer valid. The validator, reviewing the specific current task specification, flags the mismatch.

Failure Routing After Validation

The validation agent produces a verdict. The routing logic determines what happens next.

Pass → proceed. The output meets validation criteria. Route to the next stage of the pipeline. Log the validation result for the confidence ledger.

Fail → retry. The validation agent identified a specific, correctable error. Route back to the primary agent with the validation feedback as additional context. The primary agent has a second attempt with explicit knowledge of what it got wrong. Set a maximum retry count — typically two — before escalating to the next path.

Fail → escalate. Either the maximum retry count has been reached, the failure is in the safety category (never retry safety failures; escalate immediately), or the failure pattern is novel and the system does not know how to correct it automatically. Route to human review.

Fail → reject. The task cannot be completed reliably. Reject with a detailed failure report. This is appropriate when the validation failure rate on a specific task type exceeds a threshold that indicates the primary agent is systematically unable to perform it correctly.
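The four routes can be collapsed into one decision function. This is a sketch under stated assumptions: `task_failure_rate` is a historical per-task-type failure rate tracked elsewhere, and the `reject_threshold` of 0.5 is a hypothetical cutoff to be tuned per system.

```python
def route_verdict(
    verdict: str,                     # "pass" or "fail" from the validator
    category: str,                    # failure category, e.g. "safety", "factual"
    attempt: int,                     # retries already used on this task
    task_failure_rate: float = 0.0,   # historical fail rate for this task type
    max_retries: int = 2,
    reject_threshold: float = 0.5,    # hypothetical cutoff; tune per system
) -> str:
    """Map a validation verdict onto one of the four routing paths."""
    if verdict == "pass":
        return "proceed"              # log the result for the confidence ledger
    if category == "safety":
        return "escalate"             # never retry safety failures
    if task_failure_rate > reject_threshold:
        return "reject"               # primary is systematically failing here
    if attempt < max_retries:
        return "retry"                # feed validator issues back to the primary
    return "escalate"
```

The ordering encodes the priorities above: safety failures jump straight to escalation, and systematic failure on a task type overrides the retry budget.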

Lesson 113 Drill

Build an adversarial validator for one existing task in your agent system.

Steps:

  1. Identify the task type with the highest consequence for errors
  2. Write a validation prompt with explicit failure-seeking instruction and a specific taxonomy of 4-6 failure types relevant to that task
  3. Construct 10 evaluation cases: 7 with correct primary agent output, 3 with deliberate errors you planted
  4. Run the validator against all 10 cases and measure its detection rate
  5. Refine the prompt until it catches all three planted errors without false positives on the seven correct outputs

If you cannot get the validator to catch your planted errors, the validator is not adversarial enough. Strengthen the instruction. A validation agent that cannot be made to catch planted errors will not catch real errors.
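Steps 4 and 5 of the drill can be scored with a small harness. This is a sketch: `validator` here is any callable that returns True when it flags a problem, and each case is an (output, has_planted_error) pair.

```python
def score_validator(validator, cases):
    """Score a validator against labeled cases.

    validator: callable returning True when it flags a problem.
    cases: list of (output, has_planted_error) pairs.
    """
    caught = missed = false_positives = 0
    for output, has_error in cases:
        flagged = validator(output)
        if has_error and flagged:
            caught += 1
        elif has_error:
            missed += 1
        elif flagged:
            false_positives += 1
    planted = caught + missed
    return {
        "detection_rate": caught / planted if planted else 1.0,
        "false_positives": false_positives,
        "missed": missed,
    }
```

The refinement loop in step 5 ends when `detection_rate` is 1.0 and `false_positives` is 0 on your ten cases.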

Bottom Line

The primary agent's job is to produce output. The validation agent's job is to distrust it.

Build the validator with that separation explicit and intact. Do not soften the adversarial stance for the sake of pass rates. The validator that agrees with everything is failing at its only job. The validator that is genuinely skeptical and catches real errors is earning its operational cost every day it runs.

Trust in autonomous systems is not granted. It is earned through verified performance. The validation agent is the mechanism that does the verifying.