LLM-as-Judge: Automated Quality Scoring for Prompts
How Knox built a system that scores, rewrites, and auto-applies improvements to its own skill library — the five-dimension rubric, the delta-gate, the overflow-reject behavior, and why you need empirical calibration before trusting any judge score.
The Autoresearch session on April 8, 2026 produced a system that does something most prompt engineers do manually: it evaluates every skill in Knox's library using a structured rubric, generates improved rewrites, and applies the improvements automatically when they clear a set of quality gates.
This is LLM-as-judge applied to the prompt library itself. The system does not just evaluate prompts — it rewrites and ships them. The human's role is calibration and edge case review.
What LLM-as-Judge Actually Means
LLM-as-judge is a pattern where a language model evaluates text — often other AI-generated text — against a structured rubric and produces a score. The score can be used to rank outputs, filter low-quality generations, or in this case, decide whether to apply a rewrite.
The pattern is well-established in alignment research (RLHF uses human judgments as a training signal; RLAIF uses AI judgments). Applied to operational prompt management, it becomes a continuous quality improvement loop: prompts are evaluated, improved, and validated automatically on a cadence, without requiring a human to review each one.
The Five-Dimension Rubric
The Autoresearch system uses five scoring dimensions, each evaluated independently:
Dimension 1 — Precision
Is the instruction specific enough to produce a deterministic output? A prompt that says "write a summary" is low precision. A prompt that says "write a three-sentence summary in past tense, starting with the most important finding" is high precision.
Precision scores low when the prompt allows too many valid interpretations. High precision prompts produce consistent outputs across different runs and different models.
Dimension 2 — Constraint Coverage
Does the skill specify what NOT to do, as well as what to do? Constraints are often the difference between a prompt that works every time and a prompt that works only most of the time.
A skill that says "generate a commit message" but does not say "do not include implementation details, do not exceed 72 characters on the subject line, do not use present tense" will produce correct output sometimes and incorrect output the rest of the time. Constraint coverage scores what percentage of the common failure modes are explicitly ruled out.
Dimension 3 — Output Format Guidance
Does the skill specify the expected output shape? JSON, markdown, prose, structured YAML, a specific number of bullet points — the format specification is the contract between the skill and its consumer.
Skills without output format guidance produce outputs with the right content but variable shape. Downstream consumers that parse the output will encounter errors when the format shifts between runs. High output format guidance scores mean the skill specifies the exact structure, not just the content.
Dimension 4 — Task Alignment
Does the skill's stated intent match the task it is actually invoked for? This dimension catches skill drift — skills that were written for one purpose and are now invoked for a related but different purpose.
A skill originally written to "summarize a meeting transcript" invoked for "generate action items from a meeting transcript" has low task alignment. The intent and the use case have diverged. High task alignment scores mean the skill's description, instructions, and examples all point at the same task.
Dimension 5 — Concision
Is every sentence earning its token cost? This is the hardest dimension to score objectively and the most important for cost management. Long prompts cost more, fill context windows faster, and often perform no better than shorter ones.
Concision scores low when a prompt contains:
- Restatements of the same instruction in different words
- Preamble that does not constrain behavior
- Examples that duplicate rather than extend the instruction
- Hedging language that does not add information
Threshold Calibration
The Autoresearch system uses a delta threshold to decide whether a rewrite is worth applying. A delta of 0.10 means the rewrite must score at least 0.10 higher than the original (on the aggregate score) to be auto-applied.
The initial threshold was more conservative. After the first run, Knox adjusted it to 0.10 because too many valid improvements were landing in the "pending review" queue. The default threshold was leaving the system under-active — generating good rewrites and not applying them.
This calibration step is not optional. The correct threshold depends on:
- The quality distribution of your existing skill library (low-quality libraries tolerate aggressive thresholds; high-quality libraries need conservative ones)
- Your tolerance for false positives (applied rewrites that are actually worse)
- Your capacity to review the pending queue (if nobody reviews it, set the threshold lower)
Set the threshold too high and the system produces improvements it never applies. Set it too low and the system applies changes that are marginally better on the rubric but actually worse in production.
The Auto-Apply vs. Pending-Review Split
Not every rewrite that clears the delta threshold gets auto-applied. The system applies a second-level filter:
Auto-apply conditions (ALL must be true):
- Score improvement >= delta threshold (0.10)
- The rewrite is the same length or shorter than the original
- The rewrite does not introduce claims that were not in the original
- The rewrite does not structurally change the skill (same sections, same flow)
Pending-review triggers (ANY is sufficient):
- The rewrite introduces new claims or factual assertions
- The rewrite expands the scope of the skill's task
- The rewrite makes structural changes (adds or removes major sections)
- The rewrite is longer than the original
The rationale: auto-apply is safe when the rewrite is a tightening of the existing content. When the rewrite adds new material, expands scope, or changes structure, a human needs to verify that the expansion is correct and appropriate.
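The full decision, delta-gate plus second-level filter, can be sketched as a single function. The boolean flags (new claims, scope expansion, structural change) would in practice come from the judge's structured output; their names here are hypothetical:

```python
def apply_decision(delta: float,
                   rewrite_len: int, original_len: int,
                   new_claims: bool, scope_expanded: bool,
                   structure_changed: bool,
                   threshold: float = 0.10) -> str:
    """Auto-apply only when ALL safety conditions hold; any risky
    property routes the rewrite to human review instead."""
    if delta < 0:
        # Overflow-reject: never ship a rewrite that scores worse.
        return "REJECTED"
    if delta < threshold:
        # Real but below-noise improvement: queue rather than apply.
        return "PENDING_REVIEW"
    risky = (new_claims or scope_expanded or structure_changed
             or rewrite_len > original_len)
    return "PENDING_REVIEW" if risky else "AUTO_APPLY"

# A pure tightening clears every gate:
print(apply_decision(0.15, 420, 500, False, False, False))  # AUTO_APPLY
# A big improvement that adds new claims still needs a human:
print(apply_decision(0.30, 600, 500, True, False, False))   # PENDING_REVIEW
```

Note the ordering: the worse-than-original check runs first, so the overflow-reject floor holds no matter what the other flags say.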
The Overflow-Reject Behavior
The overflow-reject behavior handles the case where the rewrite is worse than the original on the aggregate score. If the judge produces a rewrite that scores lower than the original, the system rejects it and keeps the original.
This sounds obvious but is easy to omit. Without overflow-reject, a malfunctioning rewrite step (model error, context overflow, edge case in the rubric) can degrade your skill library. The overflow-reject is a floor: no matter what happens in the rewrite pipeline, the current skill is the minimum quality bar.
The implementation is a single comparison before the apply decision:
if rewrite_score < original_score:
    result = "REJECTED"
    action = "keep_original"
    log(f"Rewrite for {skill_id} rejected: {rewrite_score:.3f} < {original_score:.3f}")
    continue
One comparison, one branch. Do not omit it.
The Delta-Gate in Practice
The delta-gate handles a subtle noise problem: trivially improved rewrites that don't justify the risk of applying a change.
A skill that scores 0.97 getting a rewrite that scores 0.98 is a 0.01 delta. At 0.10 threshold, this does not auto-apply. The improvement is real but below the noise floor. Applying a 0.01 delta improvement requires a code change, a git commit, a diff review, and a deployment. The improvement is not worth the overhead.
The delta-gate eliminates this noise. Only improvements with meaningful signal get applied. The threshold of 0.10 means "this rewrite is meaningfully better, not just marginally scored differently."
Connecting to Calibration: The Critical Missing Step
The most important caveat about LLM-as-judge is that a judge without calibration is producing opinion, not measurement.
A rubric that scores "concision" highly will reward shorter prompts. But shorter prompts do not always perform better. A rubric that scores "precision" highly will reward more specific instructions. But over-specified prompts can perform worse on tasks that require model judgment.
Calibration requires empirical data: a set of skills where you know the downstream outcome. For Knox's fleet, calibration means measuring whether agents that received auto-applied rewrites performed better on their tasks — fewer errors, better outputs, lower retry rates — than agents running the original skills.
Without this calibration data, the judge scores are proxies, not measurements. They correlate with quality by design, but correlation is not causation. A rubric that produces high scores for prompts that perform poorly is a rubric that needs revision.
The calibration loop is:
- Apply rewrites via auto-apply
- Measure downstream agent performance (task success rate, error rate, output quality)
- Compare performance for rewrites with high delta vs. low delta
- Adjust rubric weights based on what predicts downstream improvement
This loop closes the feedback between the judge's opinion and the system's actual performance. Without it, you have a system that is measuring something. With it, you have a system that is measuring the right thing.
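One way to close that loop in code: group auto-applied rewrites by judge delta and compare their measured downstream success rates. This is a sketch under assumed inputs: the record shape, the 0.20 high/low split, and the function name are all illustrative, and `success_rate` stands in for whatever downstream metric you actually collect.

```python
from statistics import mean

def calibration_report(records: list[dict]) -> dict[str, float]:
    """Compare downstream task success for high-delta vs low-delta
    auto-applied rewrites. Each record carries the judge's delta and
    an empirically measured success rate for agents on the rewrite."""
    high = [r["success_rate"] for r in records if r["delta"] >= 0.20]
    low = [r["success_rate"] for r in records if r["delta"] < 0.20]
    return {
        "high_delta_success": mean(high) if high else float("nan"),
        "low_delta_success": mean(low) if low else float("nan"),
    }

records = [
    {"delta": 0.25, "success_rate": 0.91},
    {"delta": 0.30, "success_rate": 0.88},
    {"delta": 0.12, "success_rate": 0.84},
    {"delta": 0.11, "success_rate": 0.86},
]
# If high-delta rewrites do not outperform low-delta ones downstream,
# the rubric is measuring the wrong thing and needs reweighting.
print(calibration_report(records))
```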
Building Your Own Judge
Starting a prompt quality system from scratch, the minimum viable configuration is:
- Three dimensions (Precision, Constraint Coverage, Concision) with a 0-1 score each
- A conservative delta threshold (0.20 initially — tighten after calibration)
- Auto-apply only for same-or-shorter rewrites with no new claims
- Pending-review queue for everything else
- Overflow-reject for rewrites that score lower than the original
Run the system for two weeks. Review the pending queue weekly. Calibrate the threshold based on the auto-applied rewrites' downstream performance. Then add dimensions 4 and 5 when you have enough data to weight them correctly.
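The minimum viable configuration above fits in a single dictionary. All key names here are hypothetical; only the values come from the article's recommendations:

```python
# Starting configuration for a prompt-quality judge. Key names are
# illustrative; the values follow the conservative defaults above.
JUDGE_CONFIG = {
    "dimensions": ["precision", "constraint_coverage", "concision"],
    "delta_threshold": 0.20,   # conservative start; tighten after calibration
    "auto_apply_requires": {
        "same_or_shorter": True,   # rewrite must not grow the skill
        "no_new_claims": True,     # no assertions absent from the original
    },
    "overflow_reject": True,       # never apply a rewrite that scores lower
    "review_cadence_days": 7,      # weekly pending-queue review
    "calibration_window_days": 14, # run length before adjusting thresholds
}
```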
The system improves itself as it runs. That is the point.