ASK KNOX
beta
LESSON 276

The Review Loop — CodeRabbit Catches What Agents Miss

The calibrator continuity bug from PR #28 is a real example. The code-writing agent did not catch the 18.4 to 17.6 discontinuity at the 0.15 boundary. CodeRabbit did. When to trust the agent, when to verify manually, and why automated reviewers are non-optional for math-heavy changes.

7 min read

PR #28 was written by a coding sub-agent. The diff looked right. Tests passed. CI was green. The author was ready to merge.

Then CodeRabbit's review showed up with a comment on calibrator.py:

Discontinuity at branch boundary p = 0.15. Lower branch yields 18.4 at p = 0.15; upper branch yields 17.6 at the same point. The score drops as input quality increases slightly, which appears to violate the intended shape of this scoring function.

The code-writing agent had not caught it. The unit tests had not caught it. The human author had not caught it. The automated reviewer did — in seconds — and the PR was held until the formula was fixed.
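The shape of the bug is easy to reproduce. A minimal sketch of a piecewise scorer with exactly this cliff — the real branch formulas are not shown in the PR, so the linear coefficients below are invented purely to hit the reported 18.4 and 17.6 values at p = 0.15:

```python
def calibrate(p: float) -> float:
    """Hypothetical piecewise scoring function, NOT the real calibrator.
    Coefficients are chosen only to reproduce the reported mismatch:
    the lower branch reaches 18.4 at p = 0.15, the upper starts at 17.6."""
    if p < 0.15:
        return 16.0 + 16.0 * p   # lower branch: 18.4 as p -> 0.15
    return 20.0 - 16.0 * p       # upper branch: 17.6 at p = 0.15

# Score DROPS as input quality crosses the boundary upward:
print(calibrate(0.14999))  # just below the boundary
print(calibrate(0.15001))  # just above it -- noticeably lower
```

Both branches look plausible in isolation; only evaluating them at the shared boundary exposes the 0.8-point cliff.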

Why This Class of Bug Exists

Code-writing agents are trained to produce code that looks right. "Looks right" for a piecewise scoring function means symmetrical branches, consistent variable names, and plausible formulas. The agent has no strong prior that branch boundaries should be continuous; that is a mathematical property of the function, not a stylistic property of the code. The result is functions that compile, pass tests, and have subtle arithmetic cliffs at every branch point.

Unit tests miss these bugs because tests usually pick round numbers like 0.10 or 0.20 — not 0.14999 and 0.15001. The boundary is never exercised. The bug is invisible until a production signal happens to land exactly on it.
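A test that does exercise the boundary is short. Here is a sketch of a generic continuity probe — `is_continuous_at` and the two stand-in functions are hypothetical names for illustration, not part of any real test suite:

```python
import math

def is_continuous_at(fn, boundary: float, eps: float = 1e-6,
                     tol: float = 1e-3) -> bool:
    """Probe just below and just above `boundary` instead of round
    inputs like 0.10 or 0.20, and compare the two outputs."""
    return math.isclose(fn(boundary - eps), fn(boundary + eps), abs_tol=tol)

# Continuous stand-in: the two pieces agree at p = 0.15.
smooth = lambda p: 16.0 + 16.0 * min(p, 0.15) - 16.0 * max(p - 0.15, 0.0)
# Discontinuous stand-in reproducing the 18.4 -> 17.6 cliff.
buggy = lambda p: 16.0 + 16.0 * p if p < 0.15 else 20.0 - 16.0 * p
```

One assertion per branch boundary is enough; the probe fails loudly on the buggy version and passes on the smooth one.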

Why CodeRabbit Catches Them

Reviewer LLMs are tuned for a different task: finding anomalies in code by careful reading. They do not run the code; they analyze it as a mathematical object. For a piecewise function, that means:

  • Listing every branch boundary.
  • Substituting the boundary value into both the lower and upper branches.
  • Checking whether the outputs match.

This is a mechanical task that humans find tedious and skim past. LLMs do not skim. They run the analysis systematically on every branch boundary and flag any discontinuity.
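The checklist above is mechanical enough to script yourself, independent of any reviewer. A sketch, assuming branches are represented as (upper-bound, function) pairs — a hypothetical representation for illustration, not CodeRabbit's internals:

```python
def find_cliffs(branches, tol: float = 1e-9):
    """For each boundary between adjacent branches, evaluate the branch
    below and the branch above at that exact point and compare.
    `branches` is a sorted list of (upper_bound, fn) pairs; the last
    bound may be float('inf'). Returns (boundary, lower, upper) triples
    where the outputs disagree."""
    cliffs = []
    for (bound, lower_fn), (_, upper_fn) in zip(branches, branches[1:]):
        lo, hi = lower_fn(bound), upper_fn(bound)
        if abs(lo - hi) > tol:
            cliffs.append((bound, lo, hi))
    return cliffs

# Hypothetical branches matching the reported values at p = 0.15:
cliffs = find_cliffs([
    (0.15, lambda p: 16.0 + 16.0 * p),
    (float("inf"), lambda p: 20.0 - 16.0 * p),
])
```

Run against the stand-in branches, this flags the single boundary at 0.15 with its mismatched 18.4 and 17.6 outputs — the same finding CodeRabbit produced by reading the code.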

Inline Diagram — Three-Layer Review

THREE-LAYER REVIEW — AGENT, REVIEWER LLM, HUMAN
each layer catches what the others miss — skip none

When to Trust the Agent

Trust the agent for:

  • Mechanical refactors — rename, extract, move, reformat.
  • Targeted value changes — exactly the scenario PR #28 was dispatched into.
  • Boilerplate generation — scaffolds, test stubs, config files.

Do not trust the agent without review for:

  • Scoring math and piecewise functions — the exact class that produces continuity bugs.
  • Numerical edge cases — sign conventions, off-by-one, floating-point comparisons.
  • Security-sensitive logic — auth, access control, input validation.
  • Concurrency primitives — locks, retries, race conditions.

The dividing line is whether the agent's output has a mechanical verification path. Mechanical changes can be diff-checked. Mathematical changes need a reviewer that understands the math.

The Rule

Every PR involving math, boundaries, or numerical logic goes through the three-layer review: agent writes, CodeRabbit analyzes, human validates. Each layer is mandatory. The calibrator cliff is the cautionary tale — and the exact reason the review loop exists.