CI Pipeline Debugging: Trust the Logs, Not Your Assumptions
Two CI runs, two PRs — except both runs belong to the same PR. Five minutes wasted on impossible debugging because you trusted adjacent run IDs instead of spending two seconds to verify. This lesson builds the systematic habit that eliminates that class of mistake permanently.
You open GitHub and see two CI runs in progress. Two PRs are open. You glance at the run IDs — they are adjacent numbers. Pattern-matching kicks in immediately: one run per PR, one PR per run. Obvious.
You check the failing run. It is reporting 0% coverage on a module you have never touched. The module does not exist on the branch you are looking at. You spend five minutes reading diff output, checking test configuration, wondering if the coverage tool has a bug.
It does not have a bug. Both runs belong to the same PR. You were analyzing the wrong thing the entire time.
This lesson is about the class of mistake that costs five minutes, ten minutes, or an entire debugging session — not because the problem is hard, but because you trusted intuition instead of spending two seconds to verify.
The Assumption Trap
When multiple PRs trigger CI concurrently, GitHub Actions assigns run IDs sequentially by trigger time. It does not care which PR came first. It does not group runs by PR. A PR with two pushes in quick succession will generate two adjacent run IDs. Two concurrent PRs might generate interleaved run IDs. The assignment is chronological, not organizational.
Human pattern-matching does not know this. It sees two PRs and two run IDs and distributes them one-to-one. The logic feels airtight. It is wrong.
The fix is a single command:
# Verify which branch a run belongs to — takes two seconds
gh run view <id> --json headBranch -q .headBranch
Two seconds of verification versus five minutes of impossible debugging. The ratio is not close.
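That verification can be wrapped in a guard so the check is impossible to forget. A minimal sketch, assuming the gh CLI is installed and authenticated; verify_run_branch is an illustrative name, not a gh subcommand.

```shell
# Refuse to analyze a run until its branch is confirmed.
verify_run_branch() {
  run_id="$1"
  expected="$2"
  actual="$(gh run view "$run_id" --json headBranch -q .headBranch)"
  if [ "$actual" = "$expected" ]; then
    echo "ok: run $run_id is on $expected"
  else
    echo "mismatch: run $run_id is on '$actual', not '$expected'" >&2
    return 1
  fi
}
```

Run it before opening a single log line; a non-zero exit means you were about to debug the wrong branch.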
Reading CI Logs as a Remote Debugger
There are valid reasons you do not run the full test suite locally before every push: memory pressure, slow network, missing dependencies, time constraints. CI is not a backup — it is often the primary environment where the actual test matrix runs. That means the CI log is your debugger.
Most engineers skim CI logs. The discipline is reading them systematically.
# Get the actual failure — last 25 lines of the failed step
gh run view <id> --log-failed 2>&1 | tail -25
# Find low-coverage modules without running anything locally
gh run view <id> --log 2>&1 | grep "src/" | grep -v "100%"
# Find the specific lines that failed an assertion (-E for portable alternation)
gh run view <id> --log-failed 2>&1 | grep -E -A 5 "FAILED|AssertionError|Error:"
# Verify which branch the run belongs to before analyzing anything
gh run view <id> --json headBranch -q .headBranch
The --log-failed flag is the highest-value flag in the gh run toolkit. It fetches only the output from failed steps, skipping everything that passed. On a 200-step CI run, this is the difference between reading 40,000 lines and reading 300.
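The branch check and the failure fetch combine naturally into a single triage step. A sketch assuming the gh CLI; ci_triage is an illustrative name.

```shell
# One-shot triage: confirm the branch, then read only the failed steps.
ci_triage() {
  run_id="$1"
  echo "branch: $(gh run view "$run_id" --json headBranch -q .headBranch)"
  gh run view "$run_id" --log-failed 2>&1 | tail -25
}
```

The branch line printed first means the context check happens even when you are in a hurry to see the stack trace.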
The CodeRabbit Comment Lifecycle
Automated code reviewers — CodeRabbit, Gemini Code Assist, others — post comments tied to specific commits. When you push a fix and the reviewer re-runs on the new commit, it posts a fresh review. The problem: it may re-post comments that reference code in an earlier commit. The comment exists in the PR timeline. It looks like an active unresolved issue.
It is not. It is a stale reference.
The trap is treating every comment in the current review cycle as a new unresolved issue requiring action. If a comment was surfaced on commit a3f2b1 and you are now on d9e4c8, you need to verify whether the referenced code still exists at HEAD before spending time on it.
Two filters make this fast:
# List comments on a PR with timestamps — spot the stale ones
gh api repos/<owner>/<repo>/pulls/<number>/comments \
--jq '.[] | {created_at, path, line, body: .body[0:80]}'
# Check if the flagged line still exists at HEAD
gh pr diff <number> | grep -n "<the flagged pattern>"
If the comment's line reference does not appear in the current diff and the created_at timestamp predates your last push, it is stale. Move on.
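The staleness rule can be reduced to a one-line helper. ISO 8601 timestamps in UTC compare correctly as plain strings, so no date parsing is needed; is_stale is an illustrative name, and the sketch assumes both timestamps are in the same UTC "Z" format the GitHub API returns.

```shell
# A comment is stale if it was created before the last push.
# ISO 8601 UTC strings sort lexicographically, so awk string
# comparison is enough.
is_stale() {
  comment_ts="$1"    # e.g. the created_at field from the comments query
  last_push_ts="$2"  # the timestamp of your most recent push, in UTC
  awk -v a="$comment_ts" -v b="$last_push_ts" 'BEGIN { exit !(a < b) }'
}
```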
The Review → Fix → Verify Loop
The disciplined CI workflow has a specific shape. Each step has a defined action and a defined exit condition.
Step 1 — PR created: CI runs. Automated reviewers post initial comments. Do not look at these immediately — wait for the full cycle to complete.
Step 2 — Filter for Major+ severity: Ignore nitpicks and informational comments during velocity work. CodeRabbit uses severity labels. Major and Critical require action. Low and Info are logged, not actioned during the sprint.
# Pull only Major+ comments from CodeRabbit
gh api repos/<owner>/<repo>/pulls/<number>/reviews \
--jq '.[] | select(.body | test("(?i)major|critical|high")) | {submitted_at, body: .body[0:120]}'
Step 3 — Fix → push → wait: Make the fix. Push. Wait for CI to complete and the reviewer to re-run. Do not analyze the next review until both are done.
Step 4 — New review: filter by timestamp, verify against HEAD: For each comment in the new review, check created_at. If it predates your push, compare the flagged code against HEAD. If the issue is resolved, mark it resolved and move on. If the same comment appears again with a new timestamp, it is genuinely unresolved.
Step 5 — All Majors resolved + CI green → merge.
The shape matters. Skipping step 4 is where most of the wasted time lives.
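Step 4's timestamp filter can be mechanized. A sketch assuming the comment list has been flattened to tab-separated created_at and summary columns (for example with a gh api --jq query); newer_than is an illustrative name.

```shell
# Keep only review comments newer than the last push.
# Input: lines of "created_at<TAB>summary".
# ISO 8601 UTC timestamps compare correctly as text.
newer_than() {
  awk -F '\t' -v cutoff="$1" '$1 > cutoff'
}
```

Anything the filter drops predates the push, so verify it against HEAD before spending any time on it.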
Coverage Reports as Architecture Maps
A coverage report is not just a gate. It is an X-ray.
A new module sitting at 0% tells you the test file is missing or broken. A module dropping from 90% to 77% after a code addition tells you the new execution paths are untested. These are not just numbers — they are a map of where confidence ends.
# Find weak spots (anything below 90%) without running anything locally.
# Note: a bare "0%" pattern would also match the "00%" inside "100%".
gh run view <id> --log 2>&1 | grep -E "(^| )([0-8]?[0-9])%"
# Find modules that dropped since the last run
gh run view <id> --log 2>&1 | grep "src/" | sort
# Compare against previous run
gh run view <previous-id> --log 2>&1 | grep "src/" | sort
The 0% filter is the highest-signal filter. A file at 0% is an uncovered module, a broken import in the test suite, or a file that was added without a corresponding test. All three are worth investigating before merge.
The drop filter — anything below your floor — catches regressions. If your project enforces 90% coverage and a module drops to 77%, something in the new code is not tested. Find it before the reviewer does.
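The floor check itself is mechanical. A sketch assuming coverage lines end in a percentage column (e.g. "src/foo.py   77%"); below_floor and that input format are assumptions about your coverage tool's output.

```shell
# Print any line whose trailing percentage is below the floor.
below_floor() {
  awk -v floor="$1" '{ pct = $NF; sub(/%$/, "", pct); if (pct + 0 < floor + 0) print }'
}
```

Piping the coverage section of a CI log through below_floor 90 surfaces every regression in one pass, before the reviewer finds it.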
The Meta-Lesson: Systems Over Intuition
Every mistake in this lesson has the same root cause: trusting intuition over verification.
Adjacent run IDs "obviously" belong to different PRs. A CodeRabbit comment "obviously" means the issue is unresolved. A module at 0% "obviously" means something is wrong with your branch. In each case, the obvious inference is possible but not confirmed — and acting on an unconfirmed inference burns time at best and creates confusion at worst.
The fix is not to think harder or be more careful. The fix is to build systems — commands, checklists, shell aliases — that verify instead of assume.
# Add to ~/.zshrc or ~/.bashrc — verify before you analyze.
# These must be functions, not aliases: an alias cannot take a $1 argument.
ci-branch() { gh run view "$1" --json headBranch -q .headBranch; }
ci-fail() { gh run view "$1" --log-failed 2>&1 | tail -25; }
ci-coverage() { gh run view "$1" --log 2>&1 | grep -E "(^| )([0-8]?[0-9])%"; }
Three helpers. The first one alone would have saved the five minutes at the top of this lesson.
Lesson 152 Drill
Pick the last three CI failures in any repo you work in. For each one:
- Run gh run view <id> --json headBranch -q .headBranch — confirm you were looking at the right branch
- Run gh run view <id> --log-failed 2>&1 | tail -25 — read the actual failure signal
- If there are automated review comments, check each one's created_at against the timestamp of the most recent push
- Run gh run view <id> --log 2>&1 | grep -E "(^| )([0-8]?[0-9])%" — identify any coverage weak spots
For at least one of those three runs, you will find either a context mismatch (wrong branch assumption), a stale comment you would have actioned, or a coverage gap you had not noticed. That is the return on two minutes of systematic verification.
Add the three shell helpers above to your shell config. The next time CI fails, run them before you start reading diffs.
Bottom Line
CI debugging wastes time in predictable ways: wrong branch context, stale automated review comments, unread log output. None of these require complex diagnosis. They require verification habits applied before analysis begins. gh run view <id> --json headBranch before any analysis. --log-failed to cut the noise. Timestamp filtering before actioning any automated comment. Coverage grep to surface the gaps without a local run. The commands are simple. The discipline is applying them every time, not just when something feels off.