Docker Cache Hides Fixes
When COPY is cached, your permission fix or code update doesn't take. Understanding Docker layer caching saves hours of debugging.
The Incident
March 29, 2026. A service was failing because a file inside a Docker container had permissions 600 (owner read/write only) when the running process needed 644 (world-readable). The fix seemed obvious: change the permissions on the source file and rebuild.
chmod 644 config/settings.json
docker compose build akashic
Build output looked encouraging. The step completed. The container was restarted. The service still failed with a permission error.
The culprit: a single word in the build output that most engineers skim past.
=> CACHED [3/6] COPY config/ /app/config/ 0.0s
The cache prevented the permission fix from reaching the image. Twice.
How Docker Layer Caching Works
Every instruction in a Dockerfile produces a layer — an immutable filesystem snapshot stored in Docker's content-addressable cache. When you rebuild, Docker checks each instruction against its cache to determine whether to reuse the stored layer or execute the instruction fresh.
The cache invalidation rules differ by instruction type:
RUN instructions — Docker compares the command string. If the string is identical to a previous build, the cached layer is used, regardless of what the command does at runtime.
COPY and ADD instructions — Docker computes a checksum of the source files' content. If the content hash matches a previous build, the cached layer is used. Metadata such as file permissions, ownership, and timestamps is not reliably part of this hash calculation across Docker versions and builders.
This last rule is the trap. A chmod 600 → 644 change modifies file metadata, not file content. The bytes of the file are identical. Docker sees an identical content hash and serves the cached layer. Your permission fix never enters the image.
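You can see the trap without Docker at all. A quick plain-shell check (the file name here is hypothetical, not from the incident) shows that chmod leaves a file's content hash untouched:

```shell
# Demo: chmod changes metadata, not content, so a content hash — the basis
# of Docker's COPY cache key — is identical before and after.
tmp=$(mktemp -d) && cd "$tmp"
echo '{"debug": false}' > settings.json
before=$(sha256sum settings.json | cut -d' ' -f1)
chmod 600 settings.json                 # tighten permissions on the host
after=$(sha256sum settings.json | cut -d' ' -f1)
[ "$before" = "$after" ] && echo "same hash: a COPY cache would still hit"
```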
The Cascade: Three Rebuilds, Three Failures
Here is exactly what happened across the three rebuild attempts on March 29:
Attempt 1 — Standard rebuild after chmod
chmod 644 config/settings.json
docker compose build akashic
docker compose up -d akashic
Result: CACHED on COPY step. Container starts with original permissions. Service fails.
Attempt 2 — docker compose build with --pull
docker compose build --pull akashic
--pull forces Docker to check for a newer base image. It does not bust the COPY cache if the base image is unchanged. Result: still CACHED on COPY. Service still fails.
Attempt 3 — --no-cache
docker compose build --no-cache akashic
docker compose up -d akashic
--no-cache disables all layer reuse. Every instruction executes from scratch. The chmod'd file is COPYed fresh. Service starts successfully.
The 21-minute cost came from the combination of --no-cache forcing a full pip install (5 min), re-downloading the embedding model (2 min), and the two wasted intermediate attempts.
When to Use --no-cache
--no-cache is a sledgehammer. It solves cache problems but abandons every performance benefit of layered builds. Use it deliberately.
Use --no-cache when:
- You changed file permissions or file ownership and the service depends on those
- You changed file timestamps (rare, but can affect behavior in some pipelines)
- You suspect a dependency has changed but the lockfile hash is identical
- A RUN command fetches external resources (git clone, curl) that may have updated
- You are debugging a "my fix isn't in the container" situation and need certainty
Do not default to --no-cache for every build. In repos with large dependency installs or model downloads, it can cost 10+ minutes per build.
Better Patterns: Eliminating the Problem
The real fix is to stop relying on host filesystem permissions and handle permissions explicitly in the Dockerfile. There are two reliable approaches, plus a hygiene practice that keeps cache behavior predictable.
Pattern 1: chmod Inside the Dockerfile
Set permissions as a RUN instruction after COPY:
COPY config/ /app/config/
RUN chmod 644 /app/config/settings.json
When the RUN instruction string changes (e.g., you update the path or mode), Docker invalidates that layer and all subsequent layers. The permission is set inside the image regardless of what the host filesystem says.
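A toy model helps show why this works. Docker's real cache key for a RUN layer is internal, but the essential property, that the instruction string itself is part of the key, can be sketched in plain shell (this is a simplification, not Docker's actual implementation):

```shell
# Toy model: hash the instruction string as a stand-in for the cache key.
# Editing the mode in `RUN chmod` yields a different key, so that layer
# and everything after it is rebuilt.
old=$(printf '%s' 'RUN chmod 600 /app/config/settings.json' | sha256sum)
new=$(printf '%s' 'RUN chmod 644 /app/config/settings.json' | sha256sum)
[ "$old" != "$new" ] && echo "keys differ: layer and everything after it rebuilds"
```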
Pattern 2: Multi-Stage Builds
Multi-stage builds allow you to separate the "dependency installation" layers (slow, rarely change) from the "application code" layers (fast, change frequently):
# Stage 1: Dependencies — cached aggressively
FROM python:3.11-slim AS deps
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Stage 2: Application — rebuilt on every code change
FROM deps AS app
RUN useradd --create-home appuser   # --chown below requires this user to exist in the image
COPY --chown=appuser:appuser src/ /app/src/
COPY --chown=appuser:appuser config/ /app/config/
RUN chmod 644 /app/config/*.json
With this layout, a code change busts only Stage 2's cache. The slow pip install in Stage 1 is unaffected. You get the cache performance on dependencies while ensuring application files and their permissions are always fresh.
Pattern 3: .dockerignore Discipline
Keeping a clean .dockerignore means fewer spurious files are included in the COPY checksum. When Docker only checksums the files you actually need, cache hits are more predictable and busting is less surprising.
# .dockerignore
__pycache__/
*.pyc
.git/
.env
*.log
data/
If data/ or *.log files are accidentally included in a COPY . instruction, any write to those files busts the entire application layer cache. Separate what changes frequently from what changes rarely.
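To see why noise files hurt, here is a rough stand-in for the checksum Docker computes over a COPY context (again a simplification; the directory and file names are illustrative only):

```shell
# Sketch: aggregate content hash of a build-context directory. Any write to
# a stray file — a log, a data dump — changes the aggregate, which is why
# `COPY .` busts on noise unless .dockerignore excludes it.
tmp=$(mktemp -d) && cd "$tmp"
mkdir ctx && echo 'print("app")' > ctx/app.py
h1=$(find ctx -type f | sort | xargs sha256sum | sha256sum | cut -d' ' -f1)
echo 'request served' >> ctx/debug.log && mv debug.log ctx/ 2>/dev/null || true
echo 'request served' >> ctx/debug.log   # unrelated noise file
h2=$(find ctx -type f | sort | xargs sha256sum | sha256sum | cut -d' ' -f1)
[ "$h1" != "$h2" ] && echo "context hash changed: whole COPY layer rebuilds"
```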
Verifying the Fix Is Actually In the Container
After any rebuild where you suspected a caching problem, verify the fix landed before restarting the service:
# Check the permissions inside the running container
docker exec akashic ls -la /app/config/settings.json
# Check the container's creation time (inspect the image itself for its build timestamp)
docker inspect akashic | grep -i created
# Start an interactive shell to explore
docker exec -it akashic /bin/bash
Do not trust the service behavior as the sole indicator. A cached layer can produce a container that starts successfully but fails only under specific code paths. Explicit verification takes 10 seconds and eliminates an entire class of "my fix didn't work" bugs.
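Those checks can be wrapped in a small helper. This is a sketch, not part of the incident's tooling; it is shown against the local filesystem so it runs anywhere, and the container name akashic is taken from the article:

```shell
# Hedged helper: assert a file's octal mode and fail loudly on a mismatch.
check_mode() {
  want="$1"; path="$2"
  got=$(stat -c '%a' "$path") || return 1  # GNU stat; use `stat -f '%Lp'` on BSD/macOS
  [ "$got" = "$want" ] || { echo "STALE? $path is $got, want $want" >&2; return 1; }
  echo "OK: $path is $got"
}
```

To check inside the container, run the stat through the container instead: `docker exec akashic stat -c '%a' /app/config/settings.json` and compare the output the same way.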
Reading Build Output Defensively
Develop the habit of reading docker compose build output line by line when debugging. The key signals:
=> CACHED [3/6] COPY config/ /app/config/ # Permission fix is NOT in this image
=> [3/6] COPY config/ /app/config/ # No CACHED prefix = fresh copy, fix IS in
When a layer is stale and you need it fresh, a more targeted approach is to add a throwaway ARG plus a trivial RUN that consumes it just before the affected COPY. The RUN is needed because only instructions that reference an ARG are guaranteed to miss the cache when its value changes; rebuilding that layer then forces everything below it to rebuild too:
ARG CACHE_BUST=1
RUN echo "bust=${CACHE_BUST}"
COPY config/ /app/config/
Then rebuild with:
docker compose build --build-arg CACHE_BUST=$(date +%s) akashic
This is more surgical than --no-cache — it preserves cached layers above the bust point (like the pip install stage) while forcing a fresh copy of everything below.
Key Takeaways
- Docker's COPY cache key is based on file content hashes, not file permissions or timestamps — a chmod alone will not bust the cache.
- --no-cache guarantees a full fresh build but forfeits all performance benefits; use it for debugging, not as default practice.
- Set permissions inside the Dockerfile with RUN chmod after every COPY — never depend on host filesystem metadata surviving into the image.
- Multi-stage builds let you cache slow dependency layers independently from fast-changing application code layers.
- Always verify fixes landed with docker exec ls -la rather than inferring from service behavior alone.
What's Next
A stale Docker image is a frustrating waste of time. But there is a class of problem that is not just wasteful — it is actively dangerous. In Lesson 239, we look at what happens when you have two instances of a stateful service running simultaneously, and why that scenario is a direct financial risk for any service that manages money or external state.