ASK KNOX
LESSON 273

Anatomy of a Gold-Standard Spec

The Hermes PR #28 spec breakdown. File paths, line numbers, exact scaling factors, acceptance criteria, backtest method. Zero clarification round-trips. Why specificity beats narrative when dispatching to sub-agents — and what a template looks like.

9 min read

The Hermes PR #28 spec was dispatched to a coding sub-agent with no follow-up needed. The agent returned a working PR that passed CI, cleared CodeRabbit review with one math-boundary fix, and merged on the first attempt. Zero clarification round-trips.

This lesson is the spec in full, with annotations explaining why each section is there.

The Spec (with annotations)


Title: Rebalance Hermes scoring weights so Grok+Perplexity can clear threshold 70 without calibrator

Motivation: A recent query shows calibrator_score = 0 on 89% of the last 64 signals. With the current weights (Grok 30, Perplexity 30, News 15, Calibrator 25), the effective ceiling without the calibrator is 75, which leaves only 5 points of headroom above threshold 70. Real signals are not clearing. We need to restructure the weights so the effective ceiling without the calibrator rises to 100+.

This section exists so the agent can sanity-check its work against a stated goal. Without it, there is no way to verify the diff actually addresses the problem.
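The headroom arithmetic in the motivation can be checked directly. A minimal sketch, assuming each component's maximum contribution equals its weight (the weight names and values come from the spec; the helper function is hypothetical):

```python
# Weights from the spec: old vs. proposed.
OLD = {"grok": 30, "perplexity": 30, "news": 15, "calibrator": 25}
NEW = {"grok": 35, "perplexity": 50, "news": 15, "calibrator": 20}
THRESHOLD = 70

def ceiling_without_calibrator(weights: dict) -> int:
    # Best possible composite when the calibrator contributes 0.
    return sum(v for name, v in weights.items() if name != "calibrator")

print(ceiling_without_calibrator(OLD))  # 75: only 5 points above threshold 70
print(ceiling_without_calibrator(NEW))  # 100: threshold clearable without calibrator
```

Under the old weights a perfect signal with a dead calibrator scores at most 75; under the new weights it scores up to 100, restoring real headroom.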

Exact changes:

  1. src/scoring/grok_feed.py line 274:

    • Before: return deviation * 30.0 * multiplier
    • After: return deviation * 35.0 * multiplier
  2. src/scoring/perplexity_feed.py line 198:

    • Before: return confidence * 30.0
    • After: return confidence * 50.0
  3. src/scoring/calibrator.py line 142:

    • Before: return agreement_score * 25.0
    • After: return agreement_score * 20.0
  4. src/scoring/composite.py line 88:

    • Update the docstring to reflect the new weights: Grok 35, Perplexity 50, News 15, Calibrator 20.

Each change names the file, the line number, and the exact before/after. The agent does not need to search the codebase or interpret intent.
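For illustration, the three scaling-factor changes can be sketched as minimal stand-ins for the real functions. These are hypothetical reductions, not the actual contents of the files in src/scoring/:

```python
# Hypothetical minimal versions of the three functions, showing the
# before -> after scaling-factor change named in the spec.

def grok_score(deviation: float, multiplier: float) -> float:
    return deviation * 35.0 * multiplier   # was: deviation * 30.0 * multiplier

def perplexity_score(confidence: float) -> float:
    return confidence * 50.0               # was: confidence * 30.0

def calibrator_score(agreement_score: float) -> float:
    return agreement_score * 20.0          # was: agreement_score * 25.0
```

The point of the before/after format is that a diff like this is mechanical: the agent applies it without interpreting intent.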

Acceptance criteria:

  • All four files compile and pass existing unit tests without modification.
  • The backtest query (see below) produces a count of cleared signals in the range 3-12 for the last 14 days.
  • No existing signal that previously cleared threshold 70 should fail to clear under the new weights (regression check).
  • Docstring changes are consistent with the code changes.

Concrete, measurable, binary-evaluable conditions. The agent knows exactly what "done" means.

Backtest methodology:

Run this SQL against the Hermes database after applying changes:

SELECT COUNT(*) FROM (
  SELECT
    (grok_score * 35.0 / 30.0)
    + (perplexity_score * 50.0 / 30.0)
    + news_score
    + (calibration_score * 20.0 / 25.0) AS new_composite
  FROM hermes_signals
  WHERE created_at > NOW() - INTERVAL '14 days'
) AS scored
WHERE new_composite > 70;

Expected result: between 3 and 12. If outside that range, stop and report before opening the PR.

Explicit validation path, including the range that counts as "success." The agent cannot ship a result that is wildly off without tripping the guard.
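The guard can be enforced mechanically once the count is in hand. A sketch, assuming the cleared-signal count has already been fetched from the database (the function name and bounds defaults are illustrative; the 3-12 range is the spec's):

```python
def check_backtest(cleared_count: int, low: int = 3, high: int = 12) -> None:
    # Per the spec: if the count falls outside the expected range,
    # stop and report instead of opening the PR.
    if not low <= cleared_count <= high:
        raise SystemExit(
            f"Backtest out of range: {cleared_count} cleared signals "
            f"(expected {low}-{high}). Stopping; report before opening the PR."
        )

check_backtest(7)  # in range: proceeds silently
```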

Regression check:

Run the backtest query above with the OLD weights (30/30/15/25) filtered to signals that cleared 70 historically. Every one of those signals must still clear 70 under the new weights. Zero regressions allowed.

Prevents the fix from being a silent breakage of previously-working behavior.
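The regression check can also be sketched in code: recompute both composites per signal row and flag any signal that cleared under the old weights but fails under the new ones. Column names follow the backtest query above; in practice the rows would come from hermes_signals, and the helper names here are hypothetical:

```python
def composites(row: dict) -> tuple:
    # Stored scores are assumed to be on the old weight scale,
    # so the old composite is their plain sum.
    old = (row["grok_score"] + row["perplexity_score"]
           + row["news_score"] + row["calibration_score"])
    # New composite rescales each component, as in the backtest query.
    new = (row["grok_score"] * 35.0 / 30.0
           + row["perplexity_score"] * 50.0 / 30.0
           + row["news_score"]
           + row["calibration_score"] * 20.0 / 25.0)
    return old, new

def regressions(rows: list, threshold: float = 70.0) -> list:
    # Signals that cleared before but no longer clear. Must be empty.
    return [r for r in rows
            if composites(r)[0] > threshold and composites(r)[1] <= threshold]
```

Because only the calibrator factor shrinks (20/25), a regression can occur only when a signal's clearance depended heavily on its calibrator contribution, which is exactly the case worth catching.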


Inline Diagram — Spec Anatomy

[Diagram: Spec Anatomy]
SPEC ANATOMY: 5 MANDATORY SECTIONS
RESULT: ZERO CLARIFICATION ROUND-TRIPS, FIRST-ATTEMPT MERGE
Specificity is what makes delegation work.

Why Specificity Beats Narrative

A narrative spec — "restructure the scoring so the dead calibrator does not block signals" — forces the agent to make decisions. Which weights? Which files? Which backtest? Every decision is a place where the agent can diverge from your intent, and every divergence produces a round-trip.

A specific spec — exact file paths, line numbers, values, and success criteria — leaves only the mechanical work of applying the diff and running the validation. The agent cannot diverge because there is no discretion to diverge into.

The Rule

Five sections. Every one has a purpose. Motivation for sanity. Exact changes for clarity. Acceptance criteria for "done." Backtest method for validation. Regression check for safety. The Hermes PR #28 spec is the template — copy it, adapt it, and use it for every sub-agent dispatch that touches production-critical code.