LLM-as-Judge: Automated Quality Scoring for Prompts
How Knox built a system that scores, rewrites, and auto-applies improvements to its own skill library — the five-dimension rubric, the delta-gate, the overflow-reject behavior, and why you need empirical calibration before trusting any judge score.
The Autoresearch session on April 8, 2026 produced a system that does something most prompt engineers do manually: it evaluates every skill in Knox's library using a structured rubric, generates improved rewrites, and applies the improvements automatically when they clear a set of quality gates.
This is LLM-as-judge applied to the prompt library itself. The system does not just evaluate prompts — it rewrites and ships them. The human's role is calibration and edge case review.
What LLM-as-Judge Actually Means
LLM-as-judge is a pattern where a language model evaluates text — often other AI-generated text — against a structured rubric and produces a score. The score can be used to rank outputs, filter low-quality generations, or in this case, decide whether to apply a rewrite.
The pattern is well-established in alignment research (RLHF uses human judgments as a training signal; RLAIF uses AI judgments). Applied to operational prompt management, it becomes a continuous quality improvement loop: prompts are evaluated, improved, and validated automatically on a cadence, without requiring a human to review each one.
The Five-Dimension Rubric
The Autoresearch system uses five scoring dimensions, each evaluated independently:
Dimension 1 — Precision
Is the instruction specific enough to produce a deterministic output? A prompt that says "write a summary" is low precision. A prompt that says "write a three-sentence summary in past tense, starting with the most important finding" is high precision.
Precision scores low when the prompt allows too many valid interpretations. High precision prompts produce consistent outputs across different runs and different models.
Dimension 2 — Constraint Coverage
Does the skill specify what NOT to do, as well as what to do? Constraints are often the difference between a prompt that works every time and a prompt that works only most of the time.
A skill that says "generate a commit message" but does not say "do not include implementation details, do not exceed 72 characters on the subject line, do not use present tense" will produce correct output sometimes and incorrect output the rest of the time. Constraint coverage scores what percentage of the common failure modes are explicitly ruled out.
Dimension 3 — Output Format Guidance
Does the skill specify the expected output shape? JSON, markdown, prose, structured YAML, a specific number of bullet points — the format specification is the contract between the skill and its consumer.
Skills without output format guidance produce outputs with the right content but variable shape. Downstream consumers that parse the output will encounter errors when the format shifts between runs. High output format guidance scores mean the skill specifies the exact structure, not just the content.
Dimension 4 — Task Alignment
Does the skill's stated intent match the task it is actually invoked for? This dimension catches skill drift — skills that were written for one purpose and are now invoked for a related but different purpose.
A skill originally written to "summarize a meeting transcript" invoked for "generate action items from a meeting transcript" has low task alignment. The intent and the use case have diverged. High task alignment scores mean the skill's description, instructions, and examples all point at the same task.
Dimension 5 — Concision
Is every sentence earning its token cost? This is the hardest dimension to score objectively and the most important for cost management. Long prompts cost more, fill context windows faster, and often perform no better than shorter ones.
Concision scores low when a prompt contains:
- Restatements of the same instruction in different words
- Preamble that does not constrain behavior
- Examples that duplicate rather than extend the instruction
- Hedging language that does not add information
Threshold Calibration
The Autoresearch system uses a delta threshold to decide whether a rewrite is worth applying. A delta of 0.10 means the rewrite must score at least 0.10 higher than the original (on the aggregate score) to be auto-applied.
The initial threshold was more conservative. After the first run, Knox adjusted it to 0.10 because too many valid improvements were landing in the "pending review" queue. The default threshold was leaving the system under-active — generating good rewrites and not applying them.
This calibration step is not optional. The correct threshold depends on:
- The quality distribution of your existing skill library (low-quality libraries tolerate aggressive thresholds; high-quality libraries need conservative ones)
- Your tolerance for false positives (applied rewrites that are actually worse)
- Your capacity to review the pending queue (if nobody reviews it, set the threshold lower)
Set the threshold too high and the system produces improvements it never applies. Set it too low and the system applies changes that are marginally better on the rubric but actually worse in production.
The Auto-Apply vs. Pending-Review Split
Not every rewrite that clears the delta threshold gets auto-applied. The system applies a second-level filter:
Auto-apply conditions (ALL must be true):
- Score improvement >= delta threshold (0.10)
- The rewrite is the same length or shorter than the original
- The rewrite does not introduce claims that were not in the original
- The rewrite does not structurally change the skill (same sections, same flow)
Pending-review triggers (ANY is sufficient):
- The rewrite introduces new claims or factual assertions
- The rewrite expands the scope of the skill's task
- The rewrite makes structural changes (adds or removes major sections)
- The rewrite is longer than the original
The rationale: auto-apply is safe when the rewrite is a tightening of the existing content. When the rewrite adds new material, expands scope, or changes structure, a human needs to verify that the expansion is correct and appropriate.
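The full decision, delta-gate plus second-level filter, can be sketched as a single function. The boolean flags (new claims, scope expansion, structural change) would in practice come from the judge's structured output; their names here are hypothetical:

```python
def apply_decision(delta: float,
                   rewrite_len: int, original_len: int,
                   new_claims: bool, scope_expanded: bool,
                   structure_changed: bool,
                   threshold: float = 0.10) -> str:
    """Auto-apply only when ALL safety conditions hold; any risky
    property routes the rewrite to human review instead."""
    if delta < 0:
        # Overflow-reject: never ship a rewrite that scores worse.
        return "REJECTED"
    if delta < threshold:
        # Real but below-noise improvement: queue rather than apply.
        return "PENDING_REVIEW"
    risky = (new_claims or scope_expanded or structure_changed
             or rewrite_len > original_len)
    return "PENDING_REVIEW" if risky else "AUTO_APPLY"

# A pure tightening clears every gate:
print(apply_decision(0.15, 420, 500, False, False, False))  # AUTO_APPLY
# A big improvement that adds new claims still needs a human:
print(apply_decision(0.30, 600, 500, True, False, False))   # PENDING_REVIEW
```

Note the ordering: the worse-than-original check runs first, so the overflow-reject floor holds no matter what the other flags say.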
The Overflow-Reject Behavior
The overflow-reject behavior handles the case where the rewrite is worse than the original on the aggregate score. If the judge produces a rewrite that scores lower than the original, the system rejects it and keeps the original.
This sounds obvious but is easy to omit. Without overflow-reject, a malfunctioning rewrite step (model error, context overflow, edge case in the rubric) can degrade your skill library. The overflow-reject is a floor: no matter what happens in the rewrite pipeline, the current skill is the minimum quality bar.
The implementation is a single comparison before the apply decision:
if rewrite_score < original_score:
    result = "REJECTED"
    action = "keep_original"
    log(f"Rewrite for {skill_id} rejected: {rewrite_score:.3f} < {original_score:.3f}")
    continue
One comparison, one branch. Do not omit it.
The Delta-Gate in Practice
The delta-gate handles a subtle noise problem: trivially improved rewrites that don't justify the risk of applying a change.
A skill that scores 0.97 getting a rewrite that scores 0.98 is a 0.01 delta. At 0.10 threshold, this does not auto-apply. The improvement is real but below the noise floor. Applying a 0.01 delta improvement requires a code change, a git commit, a diff review, and a deployment. The improvement is not worth the overhead.
The delta-gate eliminates this noise. Only improvements with meaningful signal get applied. The threshold of 0.10 means "this rewrite is meaningfully better, not just marginally scored differently."
Connecting to Calibration: The Critical Missing Step
The most important caveat about LLM-as-judge is that a judge without calibration is producing opinion, not measurement.
A rubric that scores "concision" highly will reward shorter prompts. But shorter prompts do not always perform better. A rubric that scores "precision" highly will reward more specific instructions. But over-specified prompts can perform worse on tasks that require model judgment.
Calibration requires empirical data: a set of skills where you know the downstream outcome. For Knox's fleet, calibration means measuring whether agents that received auto-applied rewrites performed better on their tasks — fewer errors, better outputs, lower retry rates — than agents running the original skills.
Without this calibration data, the judge scores are proxies, not measurements. They correlate with quality by design, but correlation is not causation. A rubric that produces high scores for prompts that perform poorly is a rubric that needs revision.
The calibration loop is:
- Apply rewrites via auto-apply
- Measure downstream agent performance (task success rate, error rate, output quality)
- Compare performance for rewrites with high delta vs. low delta
- Adjust rubric weights based on what predicts downstream improvement
This loop closes the feedback between the judge's opinion and the system's actual performance. Without it, you have a system that is measuring something. With it, you have a system that is measuring the right thing.
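One way to close that loop in code: group auto-applied rewrites by judge delta and compare their measured downstream success rates. This is a sketch under assumed inputs: the record shape, the 0.20 high/low split, and the function name are all illustrative, and `success_rate` stands in for whatever downstream metric you actually collect.

```python
from statistics import mean

def calibration_report(records: list[dict]) -> dict[str, float]:
    """Compare downstream task success for high-delta vs low-delta
    auto-applied rewrites. Each record carries the judge's delta and
    an empirically measured success rate for agents on the rewrite."""
    high = [r["success_rate"] for r in records if r["delta"] >= 0.20]
    low = [r["success_rate"] for r in records if r["delta"] < 0.20]
    return {
        "high_delta_success": mean(high) if high else float("nan"),
        "low_delta_success": mean(low) if low else float("nan"),
    }

records = [
    {"delta": 0.25, "success_rate": 0.91},
    {"delta": 0.30, "success_rate": 0.88},
    {"delta": 0.12, "success_rate": 0.84},
    {"delta": 0.11, "success_rate": 0.86},
]
# If high-delta rewrites do not outperform low-delta ones downstream,
# the rubric is measuring the wrong thing and needs reweighting.
print(calibration_report(records))
```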
Building Your Own Judge
Starting a prompt quality system from scratch, the minimum viable configuration is:
- Three dimensions (Precision, Constraint Coverage, Concision) with a 0-1 score each
- A conservative delta threshold (0.20 initially — tighten after calibration)
- Auto-apply only for same-or-shorter rewrites with no new claims
- Pending-review queue for everything else
- Overflow-reject for rewrites that score lower than the original
Run the system for two weeks. Review the pending queue weekly. Calibrate the threshold based on the auto-applied rewrites' downstream performance. Then add dimensions 4 and 5 when you have enough data to weight them correctly.
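The minimum viable configuration above fits in a single dictionary. All key names here are hypothetical; only the values come from the article's recommendations:

```python
# Starting configuration for a prompt-quality judge. Key names are
# illustrative; the values follow the conservative defaults above.
JUDGE_CONFIG = {
    "dimensions": ["precision", "constraint_coverage", "concision"],
    "delta_threshold": 0.20,   # conservative start; tighten after calibration
    "auto_apply_requires": {
        "same_or_shorter": True,   # rewrite must not grow the skill
        "no_new_claims": True,     # no assertions absent from the original
    },
    "overflow_reject": True,       # never apply a rewrite that scores lower
    "review_cadence_days": 7,      # weekly pending-queue review
    "calibration_window_days": 14, # run length before adjusting thresholds
}
```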
The system improves itself as it runs. That is the point.