Fail-Open Defaults
Eight P0 bugs in one audit, all sharing the same root cause: the code returned an OK default object instead of None or raising an exception. Each was minor in isolation. The failure chain they created was not.
The code was not crashing. It was not throwing exceptions. It was not logging errors. It was returning default values that looked exactly like success — and execution was continuing downstream as if everything was fine.
This is the fail-open default pattern. It is the most common cause of silent production failures across the operator's trading systems, and it is the bug class that is most invisible to standard code review because each instance looks like defensive programming.
What Fail-Open Actually Means
A fail-open default means your code returns a default "OK" value when a failure occurs, instead of raising an exception or returning None. The execution path stays open. The downstream caller receives something that looks like a valid result. Processing continues.
The term comes from physical security: a fail-open lock unlocks when power is cut. The "safe" behavior (open) is the default failure state. In code, the analogy holds: when something goes wrong, the pipeline stays open and keeps processing — on data it does not have, with signals it did not compute, through checks it did not run.
The Sports Prediction Agent Failure Chain
Foresight is the market intelligence engine. A sports prediction agent is a separate system built on a similar architecture. The audit of the sports prediction agent surfaced eight P0 bugs, and tracing them backward revealed a single compounding chain.
Here is the chain, step by step:
Step 1: get_market_signal() fails to reach the data API. Rather than raising an exception, it returns a default neutral signal object. The function was written to "never crash" — the neutral signal is the defensive fallback.
Step 2: The downstream bet evaluator receives a valid-looking neutral signal. A neutral signal is not a stop signal — the evaluator fires a bet based on the default signal.
Step 3: detect_anomaly() times out on a DB query. Rather than raising, it returns False. No anomaly detected. The bet proceeds without any anomaly flag.
Step 4: The heartbeat monitor was querying the wrong table — a separate bug, but the fail-open default in detect_anomaly() removed the last check that might have caught it. The bet is invisible to monitoring.
Step 5: After resolution, the event is written to the deduplication table with the original event hash. The original outcome data was never written. The event is now permanently blocked — the hash exists, the data does not. Any future retry for this event will be rejected as a duplicate.
Five fail-open defaults. Each individually defensible. Together: an unrecoverable state, an invisible bet, and a poisoned dedup record.
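The first three steps of the chain can be sketched in a few lines. This is a hypothetical reconstruction for illustration: the function names get_market_signal() and detect_anomaly() come from the audit above, but the stubbed API/DB calls, the signal shape, and the bet recorder are invented.

```python
# Hypothetical reconstruction of the failure chain. The stubs simulate
# a total outage: the data API is unreachable and the anomaly query times out.

placed_bets = []

def data_api_fetch_signal(event_id):
    raise ConnectionError("data API unreachable")   # simulated outage

def db_query_anomalies(event_id):
    raise TimeoutError("anomaly query timed out")   # simulated slow DB

def get_market_signal(event_id):
    try:
        return data_api_fetch_signal(event_id)
    except ConnectionError:
        # Fail-open: the default looks exactly like a computed neutral signal
        return {"direction": "neutral", "confidence": 0.0}

def detect_anomaly(event_id):
    try:
        return db_query_anomalies(event_id)
    except TimeoutError:
        return False  # Fail-open: "no anomaly" vs. "check never ran" look identical

def evaluate_bet(event_id):
    signal = get_market_signal(event_id)       # receives a fabricated default
    if not detect_anomaly(event_id):           # False even though the check timed out
        placed_bets.append((event_id, signal)) # bet fires on data that was never computed

evaluate_bet("event-123")
```

No exception propagates, nothing is logged, and placed_bets gains an entry during a complete outage. Each except block is defensible on its own; the composition is not.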
The InDecision Engine Example
The same pattern in a different system. The InDecision engine's OHLCV fetcher:
```python
# Fail-open (what it was doing)
def fetch_ohlcv(asset: str) -> list:
    try:
        return api.get_ohlcv(asset)
    except Exception:
        return []  # caller thinks this is valid empty data
```
The caller could not distinguish between two entirely different conditions:
- [] because this asset has no historical data
- [] because the API was down, rate-limited, or authentication failed
In both cases, the caller received an empty list. In both cases, the scoring engine received zero data points. In both cases, scoring silently degraded to a neutral signal.
During API outages, the InDecision engine appeared to be operating normally. It was running its full pipeline on empty data, producing neutral scores, generating no alerts. The system looked healthy. It was processing nothing.
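Why the outage was invisible can be shown with a toy scorer. This is a sketch, not the InDecision engine's actual scoring logic: the score() function, the return formula, and the alert threshold are all assumptions made for illustration.

```python
# Hypothetical scoring stage that maps "no data" to a neutral score.
# During an outage, fetch_ohlcv returns [] and the pipeline still "works".

def score(candles: list) -> float:
    if not candles:
        return 0.0  # neutral — identical output for "no data" and "API down"
    # toy momentum score: relative change from first to last close
    return (candles[-1] - candles[0]) / candles[0]

def should_alert(s: float, threshold: float = 0.05) -> bool:
    return abs(s) > threshold
```

With real data, score([100.0, 110.0]) is 0.1 and crosses the alert threshold. With the fail-open empty list, score([]) is 0.0, no alert fires, and the dashboard shows a healthy pipeline processing nothing.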
Why Developers Write Fail-Open Defaults
The instinct is correct in context. Defensive programming tells you to return a safe default instead of crashing. For a user-facing endpoint, that instinct is right: return a graceful error page instead of a 500. Return cached data instead of failing completely.
The problem is applying that instinct uniformly to internal pipeline stages where "safe" means something different. In a data pipeline, "safe" means the caller knows whether the data is valid. A neutral signal object is not safe if the caller cannot tell it apart from a successfully computed neutral signal. An empty list is not safe if the caller cannot tell it apart from legitimately empty data.
The rule is about information, not exceptions: does the caller have enough information to make the right decision?
If the answer is no — if the return value looks identical whether the call succeeded or failed — then you have a fail-open default, and you need to change it.
The Correct Pattern
```python
# Fail-open (wrong for internal pipeline)
def fetch_ohlcv(asset: str) -> list:
    try:
        return api.get_ohlcv(asset)
    except Exception:
        return []  # caller cannot distinguish failure from empty
```

```python
# Fail-loud (correct)
def fetch_ohlcv(asset: str) -> list:
    try:
        return api.get_ohlcv(asset)
    except RateLimitError:
        raise DataFetchError(f"Rate limited fetching {asset}")
    except NetworkError as e:
        raise DataFetchError(f"Network failure fetching {asset}: {e}") from e
```
Typed exceptions give the caller information. DataFetchError tells the caller the fetch failed. The caller can then decide: skip this cycle and alert, use cached data from last cycle, halt the pipeline, backfill later. With a silent return [], the caller has no decision to make — it just proceeds on nothing.
The pattern is: raise typed exceptions after retries; let callers decide.
User-facing endpoints catch exceptions and return graceful HTTP errors. Internal pipeline stages raise exceptions and surface them to whoever is orchestrating the pipeline. Different layers, different contracts.
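The caller side of that contract can be sketched too. This assumes a fetch function that raises DataFetchError as above; the DataFetchError class, the per-asset cache, and the alert helper here are stand-ins invented for the example.

```python
# Hypothetical orchestration layer: the fetch raises, the caller decides.

class DataFetchError(Exception):
    """Raised when an upstream data fetch fails (stand-in definition)."""

_last_good = {}  # per-asset cache of the last successful fetch

def alert(msg: str) -> None:
    print(f"ALERT: {msg}")  # stand-in for real paging/alerting

def score_cycle(asset: str, fetch):
    try:
        candles = fetch(asset)
        _last_good[asset] = candles
    except DataFetchError as e:
        alert(f"OHLCV fetch failed for {asset}: {e}")  # failure is visible
        if asset in _last_good:
            candles = _last_good[asset]  # degrade deliberately, on known-stale data
        else:
            return None                  # skip this cycle; the caller knows why
    return sum(candles) / len(candles) if candles else None  # toy "score"
```

The key difference from the fail-open version is not the fallback itself (cached data is still a fallback) but that the decision happens at the layer with enough context to alert, and the degradation is recorded rather than silent.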
The Detection Pattern in Code Audits
```shell
# Find except blocks — these are all candidates.
# -A2 includes the two lines after each match, so the filter can see
# whether the block re-raises or logs before discarding it.
grep -n -A2 "except.*:" src/*.py | grep -Ev "raise|log"
# What remains: except blocks that silently swallow errors.
# Inspect each one: does it return a value that looks like success?
```
The grep surfaces candidates. Manual inspection determines which ones are fail-open. The question for each: if this except block fires during a production outage, will the caller know?
- return [] on a list-returning function: fail-open if an empty list looks like valid empty data
- return False on a boolean-returning function: fail-open if False means "no problem found"
- return default_signal on a signal function: fail-open if the default signal looks like a computed signal
- return None on a function where None is a valid result: fail-open if callers do not check for None
The Checklist for New Functions
Every function that handles failure in an internal pipeline gets three questions before it is committed:
1. Will the caller be able to tell the difference between "empty result" and "failed to fetch"? If no: raise a typed exception, not a default.
2. Is this a user-facing endpoint or an internal pipeline stage? User-facing: graceful default. Internal: raise.
3. If this function fails silently 100 times in a row, will any alert fire? If no: the function is producing invisible failures. Add typed exceptions and let the orchestration layer alert.
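The third question can be enforced mechanically. Here is one possible sketch: a small tracker at the orchestration layer that counts consecutive failures per stage and alerts past a threshold. The PipelineError type, the threshold, and the alert list are all illustrative assumptions, not a prescribed design.

```python
# Hypothetical orchestration-layer failure tracking: if a stage raises
# repeatedly, an alert fires — silent streaks become impossible.

class PipelineError(Exception):
    """Base type for fail-loud pipeline stages (stand-in definition)."""

class FailureTracker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streaks = {}   # stage name -> consecutive failure count
        self.alerts = []    # stand-in for a real alerting channel

    def run(self, stage_name, stage, *args):
        try:
            result = stage(*args)
            self.streaks[stage_name] = 0   # success resets the streak
            return result
        except PipelineError as e:
            self.streaks[stage_name] = self.streaks.get(stage_name, 0) + 1
            if self.streaks[stage_name] >= self.threshold:
                self.alerts.append(
                    f"{stage_name} failed {self.streaks[stage_name]}x: {e}"
                )
            raise  # stay fail-loud: the caller still sees the failure
```

This only works because the stages raise typed exceptions. A stage that returns a fail-open default would reset nothing, count nothing, and alert on nothing: the tracker, like every other layer, cannot see a failure dressed up as success.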
The fail-open default is the bug that looks like good engineering. It was written by a careful developer who did not want the system to crash. The problem is not the intent — it is the scope. Graceful degradation is correct at the system boundary. Inside the pipeline, you want fail-loud. You want to know when something breaks. Silent failures are not safe failures. They are invisible ones.