LESSON 175

Atomic Writes as Default

Four out of five state files in a production audit used direct writes. One crash mid-write, and the file is corrupt. The atomic write pattern takes three lines and ensures that a crash mid-write leaves the previous file intact.


A 3-agent audit of the InDecision signal pipeline found five JSON state files. Four of them called write_text() directly, the standard pathlib one-liner for writing a file to disk. One of them, signal_tracker.json, used a different pattern: write to a temp file, then rename it over the target.

signal_tracker.json was the only file in the audit that was crash-safe.

The other four were one bad restart away from corrupted state and silent cascading failures.

What Direct Write Failure Looks Like

# Direct write — standard, common, unsafe
path.write_text(json.dumps(state))

This is a single Python call. Under the hood, it opens the file for writing, erasing the previous content, then writes the new content. If the process crashes between "open" and "write complete," the file is in an undefined state: the old content is gone, the new content is incomplete.

# What happens on next startup after a crash mid-write:
with open(path) as f:
    state = json.load(f)  # JSONDecodeError: Expecting value: line 1 column 1 (char 0)
                           # That message is the empty-file case; a file cut off
                           # mid-object fails with a different JSONDecodeError.

There is no rollback. The OS does not know you were writing a JSON state file. It only knows the write did not finish.
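The failure can be reproduced without a real crash. The sketch below is illustrative only: simulate_interrupted_write and try_load are hypothetical helpers, and truncating the serialized payload stands in for a process that died partway through the write.

```python
import json
import tempfile
from pathlib import Path


def simulate_interrupted_write(path: Path, state: dict, keep_fraction: float = 0.5) -> None:
    """Stand-in for a crash mid-write: only part of the payload reaches disk."""
    payload = json.dumps(state)
    cut = int(len(payload) * keep_fraction)
    path.write_text(payload[:cut])  # the old content is already gone at this point


def try_load(path: Path):
    """Return the parsed state, or None if the file is not valid JSON."""
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError:
        return None


with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "state.json"
    # A valid file from a previous run
    path.write_text(json.dumps({"seen": ["sig-1", "sig-2"]}))
    # The next write is cut off halfway through
    simulate_interrupted_write(path, {"seen": ["sig-1", "sig-2", "sig-3"]})
    recovered = try_load(path)  # neither the old nor the new state can be read
```

After the interrupted write, neither the old state nor the new state is recoverable from the file.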

What Atomic Write Safety Looks Like

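import json
import os
from pathlib import Path

# Atomic write: the correct default for all runtime state files.
# `path` is a pathlib.Path to the state file; `state` is the dict to persist.
tmp = path.with_suffix('.tmp')      # sibling file in the same directory
tmp.write_text(json.dumps(state))   # a crash here leaves the original untouched
os.replace(tmp, path)               # atomic rename on POSIX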

The logic is simple:

  1. Write the new content to a sibling .tmp file
  2. If the process crashes during step 1 — the .tmp is incomplete or absent. The original file is untouched.
  3. os.replace() is a POSIX rename: the old file exists at the target path until the new one is fully in place. There is no window where the path is broken.

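On POSIX filesystems, rename is a single kernel syscall. It is either done or not; it does not partially apply. The guarantee holds only when the temp file and the target are on the same filesystem, which is why the temp file is written as a sibling of the target rather than in a system temp directory.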
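Atomicity and durability are separate guarantees. The rename ensures readers never see a half-written file after a process crash, but it does not by itself force the new bytes to disk; after a power loss, some filesystems may leave the renamed file empty or short. A minimal sketch of a stricter variant follows, assuming that extra durability is wanted. The name write_json_durable is hypothetical, and the unique temp name from tempfile.mkstemp is a design choice made here, not something the audited code is known to use.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_durable(path: Path, data: dict) -> None:
    """Atomic replace plus fsync, so the new bytes are flushed before the rename."""
    # Create the temp file in the target's directory so the rename
    # stays on one filesystem. mkstemp picks a unique name.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push file contents to disk
        os.replace(tmp_name, path)
    except BaseException:
        # Leave the original untouched and remove the partial temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

The fsync call adds latency to every write, so whether it belongs in the default helper depends on how often state is written and how costly a lost update is.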

The Pattern in Three Lines

Wrap it once and reuse everywhere:

def write_state_atomic(path: Path, data: dict) -> None:
    tmp = path.with_suffix('.tmp')
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)

Three lines. One function. Call it instead of write_text() for every runtime state file.

When This Matters

Every state file that is written at runtime and read on startup is at risk. The specific cases in production systems:

  • Trade records — position tracking, entry prices, stop-loss levels. A corrupt trade record opens a position that was already closed.
  • Signal dedup caches — which signals have fired. Corruption means every historical signal looks new. Every alert re-fires.
  • Retry queues — which items have been processed. Corruption means processed items re-enter the queue.
  • Config overrides — runtime config that differs from the default. Corruption silently reverts to default behavior.
  • Candle and market state — for bots that maintain local market snapshots.

One class of file where direct write is acceptable: config files written once at initial setup and not modified during normal operation. No runtime write risk, no need for the atomic pattern.

The Real Cost: The InDecision Signal Pipeline

signal_tracker.json tracks which signals have already been processed by the InDecision alert engine. It is read on every startup to initialize the dedup filter.

A corrupted file does not necessarily raise an obvious error. If the loading code treats an unreadable file as empty state, a common defensive fallback, the dedup filter still initializes; it just initializes with no history.

The consequence: every signal that has already fired looks new. The engine re-sends every alert. Positions that are already tracked get re-opened. The portfolio state diverges from reality.
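One way to keep this failure loud is a loader that distinguishes "no file yet" from "file exists but is unreadable". The sketch below is hypothetical: load_state_strict and CorruptStateError are names invented for illustration and are not part of the InDecision codebase.

```python
import json
from pathlib import Path


class CorruptStateError(RuntimeError):
    """Raised when a state file exists but cannot be parsed as a JSON object."""


def load_state_strict(path: Path) -> dict:
    """Return {} only for a missing file (first run); raise if the file is corrupt."""
    if not path.exists():
        return {}
    try:
        state = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        raise CorruptStateError(f"{path} is not valid JSON: {exc}") from exc
    if not isinstance(state, dict):
        raise CorruptStateError(
            f"{path} should contain a JSON object, got {type(state).__name__}"
        )
    return state
```

With a loader like this, a corrupt dedup cache stops the engine at startup instead of silently re-firing every historical alert.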

The crash that would cause this is not exotic — a signal ingestion spike, a deployment restart at the wrong moment, a system update during a write cycle. Any of them can put the process in the window between "old content erased" and "new content flushed."

signal_tracker.json already had the atomic pattern because the developer who wrote it had seen this failure before. The other four files had not been written by someone with that experience. The audit fixed all four in the same PR.

Test Coverage for Atomic Writes

Atomic writes should have a regression test that validates crash-safety:

import tempfile

def test_original_survives_crash_before_replace():
    """If the .tmp exists but os.replace never ran, the original file should survive."""
    # Use an isolated directory that is removed afterwards,
    # including the .tmp the simulated crash leaves behind.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "test_state.json"
        original = {"position": "open", "entry": 1.234}

        # Write valid original state atomically
        write_state_atomic(path, original)

        # Simulate crash: write .tmp but skip os.replace
        tmp = path.with_suffix('.tmp')
        tmp.write_text('{"incomplete":')  # partial JSON

        # Original file must still be valid
        loaded = json.loads(path.read_text())
        assert loaded == original

This test verifies the contract: even in the crash scenario, the original file survives intact.
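The test above writes the .tmp file by hand, so it never exercises a failure inside write_state_atomic() itself. A complementary sketch, assuming unittest.mock is acceptable in the test suite, forces os.replace to fail during a real call to the helper. The helper is repeated here only so the block runs on its own.

```python
import json
import os
import tempfile
from pathlib import Path
from unittest import mock


def write_state_atomic(path: Path, data: dict) -> None:
    # Same helper as in the lesson, repeated so this block is self-contained.
    tmp = path.with_suffix('.tmp')
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)


def test_original_survives_failed_replace():
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "state.json"
        original = {"position": "open", "entry": 1.234}
        write_state_atomic(path, original)

        # Make the rename step fail, as if the process died just before it.
        with mock.patch("os.replace", side_effect=OSError("simulated crash")):
            try:
                write_state_atomic(path, {"position": "closed"})
            except OSError:
                pass

        # The target still holds the last fully written state.
        assert json.loads(path.read_text()) == original
        # The interrupted write left its temp file behind.
        assert path.with_suffix(".tmp").exists()
```

The second assertion documents a side effect worth knowing about: an interrupted write leaves a stale .tmp file next to the target until the next successful write overwrites it.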

The Rule

Every JSON state file written during runtime uses atomic writes. Not most of them. All of them.

The rule is a default, not a case-by-case judgment call. Case-by-case judgment is how four out of five files end up unsafe while one file is safe because its author happened to know the pattern.

Lesson 175 Drill

Audit every JSON state file in your current project that is written during runtime. For each one:

  1. Is it using write_text() or equivalent direct write?
  2. What happens to the pipeline if that file is corrupted on next startup?
  3. How many minutes would it take to diagnose the failure vs. prevent it?

If any file fails question 1, replace it with write_state_atomic() before the next deployment. The fix is three lines and takes under five minutes per file. The alternative is a post-mortem about a failure mode that has been documented and preventable for decades.
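For question 1, a rough first pass over a Python codebase can be automated. The sketch below is a heuristic, not a complete audit: find_direct_writes is a hypothetical helper, the two patterns are assumptions about what a direct write usually looks like, and both false positives and misses are expected.

```python
import re
from pathlib import Path

# Patterns that often indicate a direct, non-atomic write (heuristic only).
WRITE_TEXT = re.compile(r"\.write_text\(")
OPEN_FOR_WRITE = re.compile(r"""\bopen\([^)]*,\s*["']w[bt+]*["']""")


def find_direct_writes(root: Path) -> list:
    """Return (file, line number, source line) for each suspicious write call."""
    hits = []
    for source in sorted(root.rglob("*.py")):
        lines = source.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if WRITE_TEXT.search(line) or OPEN_FOR_WRITE.search(line):
                hits.append((source, lineno, line.strip()))
    return hits
```

The scan will also flag the tmp.write_text() call inside write_state_atomic() itself, which is expected; every other hit is a candidate for questions 2 and 3.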