ASK KNOX
beta
LESSON 164

Quality Gate Theater

CI says 92% coverage. The PR passes. No one questions it. The 92% is built on stub files with zero real test functions. This lesson shows you how to find them, what to do with them, and why deleting them is sometimes the right call.

10 min read·Repo Hygiene & Cost Discipline

CI says 92% coverage.

The PR review takes 4 minutes. Two approvals. Merge.

Everyone moves on. The system is trusted. The 92% confers confidence — this codebase is tested, quality gates are working, bugs will be caught.

Except the 92% includes 11 stub test files. Empty bodies. pass statements. Functions defined but never implemented. The import is there. The class is there. The test count is zero.

This is quality gate theater: the appearance of rigor without the substance. The scoreboard says you are winning a game that is not being played.

What Counts as a Stub File

A stub file is any test file that exists in your test suite and imports real modules but contains zero real test functions.

In Python, the canonical diagnostic is:

grep -cE "^\s*(async )?def test_" test_file.py

If this returns 0, the file is a stub. Note the \s* — naive patterns like ^def test_ miss class-based tests where the method is indented under a TestCase subclass. The (async )? handles async test functions that are increasingly common in async-first codebases.
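To see the diagnostic in action, here is a short Python sketch applying the same pattern to two hypothetical file contents (the module and class names are invented for illustration): a class-based stub, and a real class-based test that a naive `^def test_` pattern would miss.

```python
import re

# Hypothetical stub: the import and the class exist, but no test methods do.
stub_source = """
import unittest
from app.user import UserService

class TestUserService(unittest.TestCase):
    pass  # methods were never written
"""

# Hypothetical real test: the method is indented under the TestCase subclass,
# so only a pattern with \\s* will count it.
real_source = """
import unittest
from app.user import UserService

class TestUserService(unittest.TestCase):
    def test_create_user(self):
        self.assertTrue(UserService)
"""

pattern = re.compile(r"^\s*(async )?def test_", re.MULTILINE)

print(len(pattern.findall(stub_source)))  # 0 -> stub
print(len(pattern.findall(real_source)))  # 1 -> real test, found thanks to \s*
```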

In TypeScript with Jest or Vitest, stubs look different:

describe("UserService", () => {
  it.todo("should create a user")
  it.todo("should validate email")
  xit("should handle duplicate registration", () => {
    // TODO: implement
  })
})

Every test is it.todo, xit, or xdescribe. The test runner counts the file. The coverage tool counts the imports. Zero assertions are made.

Finding Stubs at Scale

Manual inspection does not scale. This script finds every Python stub in a project:

for f in $(find . -path "*/tests/test_*.py" -not -path "*/node_modules/*"); do
  count=$(grep -cE "^\s*(async )?def test_" "$f" 2>/dev/null) || count=0
  if [ "$count" -eq 0 ]; then
    echo "STUB: $f (0 tests)"
  fi
done

Run this in any Python project root. Every file it outputs is costing you coverage points without providing coverage value. In a project with 40 test files, it is not unusual to find 5–8 stubs. That is 12–20% of the test suite contributing nothing except a falsely elevated coverage number.

For TypeScript:

for f in $(find . -name "*.test.ts" -o -name "*.spec.ts" | grep -v node_modules); do
  real_count=$(grep -cE "^\s*(it|test)\s*\(" "$f" 2>/dev/null) || real_count=0
  skip_count=$(grep -cE "^\s*(it\.todo|xit|xtest)\s*\(" "$f" 2>/dev/null) || skip_count=0
  if [ "$real_count" -eq 0 ] && [ "$skip_count" -gt 0 ]; then
    echo "ALL_SKIPPED: $f ($skip_count skipped)"
  elif [ "$real_count" -eq 0 ]; then
    echo "STUB: $f (0 real tests)"
  fi
done

The Resolution Rule

Finding a stub is the beginning, not the end. Before acting, ask one question: is this functionality tested somewhere else?

# Find test coverage of a specific module
grep -r "import UserService\|from.*UserService" tests/
grep -r "UserService\|create_user\|validate_email" tests/ --include="*.py"

This matters because stubs sometimes survive refactors where the tests moved to a different file or the module was merged into a larger integration test. The stub is orphaned overhead — it is not covering a gap, it is just taking up space and inflating the count.

If the functionality is covered elsewhere: Delete the stub. It is not a gap — it is a coverage scam. Deleting it is more honest than keeping it, because at least the coverage number will reflect reality.

If the functionality is genuinely untested: The stub becomes a real obligation. Write at minimum three test functions:

  1. Happy path — the function does what it is supposed to do under normal conditions.
  2. Error path — the function handles a failure condition correctly (bad input, missing data, network error).
  3. Edge case — the boundary value, the empty input, the concurrent call, or whatever the highest-risk scenario is for this specific module.

Three is the floor. It forces you to think about failure modes instead of just proving the function runs once.
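As a sketch of what that floor looks like in practice, here is a minimal pytest file for a hypothetical `validate_email` function (the function and its behavior are invented for illustration, and inlined so the sketch is self-contained):

```python
import pytest

# Hypothetical module under test, inlined here for illustration.
def validate_email(address):
    if not isinstance(address, str):
        raise TypeError("address must be a string")
    return "@" in address and "." in address.split("@")[-1]

def test_validate_email_happy_path():
    # 1. Happy path: a normal address validates.
    assert validate_email("user@example.com") is True

def test_validate_email_error_path():
    # 2. Error path: non-string input raises instead of failing silently.
    with pytest.raises(TypeError):
        validate_email(None)

def test_validate_email_edge_case():
    # 3. Edge case: the empty string is the boundary value.
    assert validate_email("") is False
```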

Coverage Threshold Theater

Stubs are not the only way quality gates lie. Configuration mismatches are equally dangerous and easier to miss.

# pyproject.toml
[tool.pytest.ini_options]
addopts = "--cov=src --cov-fail-under=85"
# CLAUDE.md
Testing Mandate: 90% coverage floor. Non-negotiable.

These two files exist in the same repo. CI runs pytest. The pipeline passes at 85%. The CLAUDE.md says 90%. Nobody notices the discrepancy because the pipeline is green.

The lower number wins silently. Always.

To find mismatches across a project:

# Check configured threshold
grep -r "cov-fail-under\|coverageThreshold\|branches.*[0-9]" \
  pyproject.toml pytest.ini setup.cfg .nycrc vitest.config.ts jest.config.ts 2>/dev/null

# Check documented policy
grep -i "coverage\|90%\|85%\|floor" CLAUDE.md README.md docs/CONTRIBUTING.md 2>/dev/null

If these produce different numbers, the configured number is your real policy regardless of what the docs say. Fix the configuration to match the documented standard — not the other way around.
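The comparison can be automated. A minimal Python sketch, using the two hypothetical file contents from the example above (a real check would read pyproject.toml and CLAUDE.md from disk):

```python
import re

# Hypothetical file contents, matching the example above.
pyproject = 'addopts = "--cov=src --cov-fail-under=85"'
claude_md = "Testing Mandate: 90% coverage floor. Non-negotiable."

configured = int(re.search(r"cov-fail-under=(\d+)", pyproject).group(1))
documented = int(re.search(r"(\d+)%", claude_md).group(1))

if configured < documented:
    # The configured number is what CI actually enforces.
    print(f"MISMATCH: CI enforces {configured}%, docs promise {documented}%")
```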

Why Stubs Happen

Stubs are not malicious. They are the artifact of reasonable workflows that were never completed.

The refactor case: tests are moved to a new location. The old file is left behind as a placeholder. Nobody cleans it up because the tests run and the coverage is fine.

The TDD stub case: a developer creates test file structure before implementation. The implementation ships. The tests get postponed. The postponement becomes permanent.

The agent case: a coding agent is tasked with "add test coverage for the trading module." It creates test files, defines test functions with descriptive names, and writes bodies that are either pass or TODO comments. CI passes. The agent reports success. No real assertions were made. The Foresight trading bot had exactly this happen during a refactor sprint — a stub file persisted for months, coverage stayed green, and the underlying functions had zero real test coverage the entire time.

The Audit Command

Run this before any merge that touches test infrastructure:

# Full stub audit: Python
echo "=== Python Stubs ===" && \
for f in $(find . -path "*/test*.py" -not -path "*/.venv/*" -not -path "*/node_modules/*"); do
  count=$(grep -cE "^\s*(async )?def test_" "$f" 2>/dev/null) || count=0
  [ "$count" -eq 0 ] && echo "  STUB: $f"
done

# Coverage threshold check
echo "=== Coverage Config ===" && \
grep -r "fail-under\|coverageThreshold" pyproject.toml pytest.ini setup.cfg \
  vitest.config.ts jest.config.ts 2>/dev/null | head -10

Add it to your CI pipeline as a non-blocking check first. Observe what it surfaces. After one sprint of observation, make it blocking.

The 90% floor is non-negotiable. But 90% built on stub files is 0% real confidence. The difference between those two numbers is the distance between the scoreboard and the game.