ASK KNOX
LESSON 166

The Audit Workflow

Most teams don't have a maintenance process — they have maintenance intent. 'We should clean that up sometime' is not a process, it's a wishlist. This is the five-phase workflow that converts intent into execution.

11 min read·Repo Hygiene & Cost Discipline

Most teams don't have a maintenance process. They have maintenance intent.

"We should clean that up sometime." "That CLAUDE.md is getting long, we'll prune it next sprint." "Those stub tests are temporary, we'll fill them in." These statements are made in good faith and then forgotten in the next standup.

Intent without process is a wishlist. Wishlists don't ship. What ships is a defined workflow with phases, criteria, and a clear finish line.

The five-phase audit workflow converts maintenance from something you mean to do into something you can execute in 45-90 minutes per repo, with a PR as the deliverable.

Phase 1 — Discovery

Discovery is read-only. No changes, no fixes, no judgment calls. Just information gathering.

For each repository in scope, collect:

  • Is this a git repo? (ls .git)
  • Does CLAUDE.md exist? How many lines?
  • Is there a test suite? (find . -name "test_*.py" -o -name "*.test.ts")
  • Does a CI workflow exist? (ls .github/workflows/)
  • What is the primary language?
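
The checklist above can be sketched as a small read-only script. A sketch, assuming the repos sit under one parent directory (the layout and output format are illustrative):

```python
#!/usr/bin/env python3
"""Discovery pass: collect read-only facts per repo. No changes are made."""
from pathlib import Path

def discover(repo: Path) -> dict:
    """Gather the Phase 1 facts for a single repository directory."""
    claude = repo / "CLAUDE.md"
    tests = list(repo.rglob("test_*.py")) + list(repo.rglob("*.test.ts"))
    return {
        "repo": repo.name,
        "git": (repo / ".git").is_dir(),
        "claude_md": (len(claude.read_text(errors="ignore").splitlines())
                      if claude.exists() else None),
        "tests": len(tests) > 0,
        "ci": (repo / ".github" / "workflows").is_dir(),
    }

if __name__ == "__main__":
    root = Path(".")  # parent directory holding the repos -- adjust to taste
    for repo_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        row = discover(repo_dir)
        claude_col = f"{row['claude_md']} lines" if row["claude_md"] else "MISSING"
        print(f"{row['repo']:<22}{claude_col:<12}"
              f"{'YES' if row['tests'] else 'NO':<8}"
              f"{'YES' if row['ci'] else 'NO':<7}")
```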

Print a discovery table before doing anything else. The table forces you to see the full scope before you prioritize. A repo you thought was clean might have three P0s. A repo you expected to be a problem might only need one P1 fix. The example below is drawn from the Foresight Polymarket trading bot portfolio, all discovered in one pass.

Repo                  CLAUDE.md   Tests   CI     Language
--------------------  ----------  ------  -----  --------
polymarket-bot        465 lines   YES     YES    Python
mission-control       187 lines   YES     YES    Python/TS
content-pipeline      MISSING     NO      NO     Python
prediction-service    212 lines   YES     YES    Python

Four repos, four different profiles. Discovery takes 10 minutes and tells you everything you need to know before you open a single file.

Phase 2 — Audit

Three sub-audits, run for each repo. Each one has a binary output: PASS or FINDING.

2A: CLAUDE.md Audit

Count lines. Mark PASS if 200 or under. If over 200, identify the sections contributing most to the line count. Common candidates: expanded tool documentation that belongs in a separate file, examples that made sense when added but are now obvious to the team, historical context that no one reads anymore.
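
A quick way to find the sections contributing most is to count lines per heading. A minimal sketch, assuming markdown-style `#` headings:

```python
import re
from pathlib import Path

def section_sizes(path: str) -> list[tuple[str, int]]:
    """Count lines under each markdown heading, largest section first."""
    sizes: dict[str, int] = {}
    current = "(preamble)"  # lines before the first heading
    for line in Path(path).read_text().splitlines():
        if re.match(r"#{1,6}\s", line):
            current = line.strip()
        sizes[current] = sizes.get(current, 0) + 1
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
```

The top two or three entries are usually the pruning candidates.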

The 200-line target is not arbitrary. It is calibrated to context window economics: a 200-line CLAUDE.md is a manageable context injection. A 400-line file doubles the overhead on every agent session. At 10 agent sessions per day, the cost multiplier compounds fast.

2B: Test Suite Audit

Two checks:

Stub detection. A stub file has the name and structure of a test file but zero real test functions. Look for test files where the function count is 0. These files game coverage thresholds — they create the appearance of tested surface area without providing any actual coverage.

Coverage threshold check. What does the CI workflow claim as the threshold? What does CLAUDE.md state as the standard? Do they match? A mismatch means someone updated one without updating the other — a correctness issue in the quality gate configuration.

2C: CI Audit

Three checks from the previous lesson, now applied systematically:

  1. Is there an actions/cache@v4 step for the virtualenv? In a Python repo, no cache = P1.
  2. Does Playwright or any long-running E2E job have a path filter? No filter = P1.
  3. Are there targeted npm test aliases (test:logic, test:components)? Missing = P2.
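
The first two checks correspond to workflow fragments like these (a sketch for a Python repo; step names, paths, and filter globs are illustrative):

```yaml
# Check 1: cache the virtualenv, keyed on the lockfile
- uses: actions/cache@v4
  with:
    path: .venv
    key: venv-${{ runner.os }}-${{ hashFiles('requirements.txt') }}

# Check 2: only trigger the E2E workflow when relevant paths change
on:
  pull_request:
    paths:
      - "frontend/**"
      - "e2e/**"
```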

Phase 3 — Planning

Every finding gets a priority. The framework has three tiers.

P0 — Blocking / Correctness. Something is actively wrong right now. The system is being gamed, a threshold is lying, a quality gate is providing false signal.

  • Stub test files (0 real tests) — coverage is being gamed
  • Coverage threshold mismatch between CI and CLAUDE.md

P1 — High Value. Not wrong, but significantly suboptimal. Fixing it has immediate measurable impact.

  • CLAUDE.md over 200 lines — inflated context on every agent session
  • Missing venv cache — 90 seconds of waste per job, per PR
  • Playwright without path filter — 6-8 minutes burned on irrelevant PRs

P2 — Maintenance. Real issues, lower urgency. Address in the current cycle but don't block on them.

  • Todo-placeholder test functions
  • Missing npm test aliases

A real plan output looks like this:

## Maintenance Plan

### P0 — Blocking / Correctness
- [polymarket-bot] Stub test file backend/tests/test_foo.py (0 tests)
- [prediction-service] Coverage threshold mismatch: CI says 85%, CLAUDE.md says 90%

### P1 — High Value
- [polymarket-bot] CLAUDE.md is 465 lines (target ≤200) — est. -280 lines
- [content-pipeline] No CI workflow exists — no test gate on any PR
- [prediction-service] Missing venv cache — 3 pip installs per PR (~90s wasted each)
- [prediction-service] CLAUDE.md is 212 lines — est. -20 lines

### P2 — Maintenance
- [polymarket-bot] 4 todo-placeholder test functions across 2 files
- [mission-control] No npm test aliases defined

The plan is also a communication artifact. Share it before you write a single line of code. P0s may need discussion — stub tests may be intentionally temporary (unlikely, but ask). P1s may have context you don't know. The plan creates alignment before execution.

Phase 4 — Execution

One PR per repo. Batch safe fixes together (CLAUDE.md pruning and CI optimization in the same PR). Keep stub tests in a separate PR — filling stubs requires understanding the untested code, and that work should be reviewable in isolation.

Branch naming convention:

  • chore/prune-claude-md — CLAUDE.md reduction
  • fix/stub-tests — filling stub test files
  • chore/ci-optimizations — venv cache, path filters, test aliases

When pruning CLAUDE.md, do not guess at what is safe to remove. Ask: does this section change how an agent approaches this repo? If an agent without this section would make the same decisions as one with it, the section is prunable. Move detailed context into linked files rather than deleting it — create docs/tool-setup.md for tool-specific instructions that are too long for CLAUDE.md.

When fixing stub tests, write real tests. Not minimal passing tests — real behavioral coverage of what the function actually does. A stub replaced by assert True is not a fix, it is a different lie.
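
The difference in practice, for a hypothetical `parse_price` helper (the function and its behavior are illustrative, not from the audited repos):

```python
def parse_price(raw: str) -> float:
    """Hypothetical helper under test: '$1,250.50' -> 1250.5."""
    cleaned = raw.replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        raise ValueError(f"not a price: {raw!r}")

# The stub that Phase 2 flagged: games coverage, tests nothing.
def test_parse_price_stub():
    assert True  # a different lie

# The fix: real behavioral coverage of what the function does.
def test_parse_price_strips_symbol_and_separator():
    assert parse_price("$1,250.50") == 1250.5

def test_parse_price_rejects_garbage():
    try:
        parse_price("nope")
        assert False, "expected ValueError"
    except ValueError:
        pass
```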

Phase 5 — Review and Merge

Create the PR. Wait two minutes. CodeRabbit and Gemini will have reviewed by then.

Two CodeRabbit findings appear on nearly every audit PR:

Stale default values. The reviewer may flag a number you wrote as a threshold or line count and suggest you verify it. Always grep the source before writing a number in a PR. If you wrote that a CLAUDE.md was 465 lines, verify it was 465 at the time the PR was written, not after you pruned it.

Missing language identifiers on fenced code blocks. Every ``` without a language tag (bash, yaml, json, text) triggers an MD040 Major finding. Fix it before the reviewer does: go through every code block in the PR diff and add the language identifier. It takes 30 seconds and eliminates the most common merge blocker on maintenance PRs.
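
Hunting for untagged fences is scriptable. A sketch that reports the line numbers of opening fences with no language identifier:

```python
from pathlib import Path

def untagged_fences(path: str) -> list[int]:
    """Line numbers of opening code fences that lack a language tag."""
    findings, in_block = [], False
    for n, line in enumerate(Path(path).read_text().splitlines(), 1):
        if line.strip().startswith("```"):
            # A bare ``` that opens a block has no language identifier.
            if not in_block and line.strip() == "```":
                findings.append(n)
            in_block = not in_block
    return findings
```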

Fix Major+ findings. Merge. Move to the next repo.

What the Workflow Teaches You

The audit's output is useful. The act of producing it is instructive in ways a dashboard never is.

Pruning CLAUDE.md forces you to ask "do we actually need this?" about every section. The answer is usually no for 30-40% of the file. That questioning reveals what your team actually uses versus what someone added six months ago and never referenced again.

Filling stub tests forces you to understand the untested surface area — concretely, function by function. You will find behaviors nobody documented and assumptions baked in that were never tested.

Adding path filters forces you to map what actually changes and why. The workflow is not housekeeping. It is architectural feedback delivered on a schedule.