Fail-Open Defaults
Eight P0 bugs in one audit, all sharing the same root cause: the code returned an OK default object instead of None or raising an exception. Each was minor in isolation. The failure chain they created was not.
The code was not crashing. It was not throwing exceptions. It was not logging errors. It was returning default values that looked exactly like success — and execution was continuing downstream as if everything was fine.
This is the fail-open default pattern. It is the most common cause of silent production failures across the operator's trading systems, and it is the bug class that is most invisible to standard code review because each instance looks like defensive programming.
What Fail-Open Actually Means
A fail-open default means your code returns a default "OK" value when a failure occurs, instead of raising an exception or returning None. The execution path stays open. The downstream caller receives something that looks like a valid result. Processing continues.
The term comes from physical security: a fail-open lock unlocks when power is cut. The "safe" behavior (open) is the default failure state. In code, the analogy holds: when something goes wrong, the pipeline stays open and keeps processing — on data it does not have, with signals it did not compute, through checks it did not run.
The Sports Prediction Agent Failure Chain
Foresight is the market intelligence engine. A sports prediction agent is a separate system built on a similar architecture. The audit of the sports prediction agent surfaced eight P0 bugs, and tracing them backward revealed a single compounding chain.
Here is the chain, step by step:
Step 1: get_market_signal() fails to reach the data API. Rather than raising an exception, it returns a default neutral signal object. The function was written to "never crash" — the neutral signal is the defensive fallback.
Step 2: The downstream bet evaluator receives a valid-looking neutral signal. A neutral signal is not a stop signal — the evaluator fires a bet based on the default signal.
Step 3: detect_anomaly() times out on a DB query. Rather than raising, it returns False. No anomaly detected. The bet proceeds without any anomaly flag.
Step 4: The heartbeat monitor was querying the wrong table — a separate bug, but the fail-open default in detect_anomaly() removed the last check that might have caught it. The bet is invisible to monitoring.
Step 5: After resolution, the event is written to the deduplication table with the original event hash. The original outcome data was never written. The event is now permanently blocked — the hash exists, the data does not. Any future retry for this event will be rejected as a duplicate.
Five fail-open defaults. Each individually defensible. Together: an unrecoverable state, an invisible bet, and a poisoned dedup record.
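The first three steps of the chain can be sketched in a few lines. This is a hypothetical reconstruction for illustration: the function names get_market_signal() and detect_anomaly() come from the audit above, but the stubbed API/DB calls, the signal shape, and the bet recorder are invented.

```python
# Hypothetical reconstruction of the failure chain. The stubs simulate
# a total outage: the data API is unreachable and the anomaly query times out.

placed_bets = []

def data_api_fetch_signal(event_id):
    raise ConnectionError("data API unreachable")   # simulated outage

def db_query_anomalies(event_id):
    raise TimeoutError("anomaly query timed out")   # simulated slow DB

def get_market_signal(event_id):
    try:
        return data_api_fetch_signal(event_id)
    except ConnectionError:
        # Fail-open: the default looks exactly like a computed neutral signal
        return {"direction": "neutral", "confidence": 0.0}

def detect_anomaly(event_id):
    try:
        return db_query_anomalies(event_id)
    except TimeoutError:
        return False  # Fail-open: "no anomaly" vs. "check never ran" look identical

def evaluate_bet(event_id):
    signal = get_market_signal(event_id)       # receives a fabricated default
    if not detect_anomaly(event_id):           # False even though the check timed out
        placed_bets.append((event_id, signal)) # bet fires on data that was never computed

evaluate_bet("event-123")
```

No exception propagates, nothing is logged, and placed_bets gains an entry during a complete outage. Each except block is defensible on its own; the composition is not.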
The InDecision Engine Example
The same pattern in a different system. The InDecision engine's OHLCV fetcher:
```python
# Fail-open (what it was doing)
def fetch_ohlcv(asset: str) -> list:
    try:
        return api.get_ohlcv(asset)
    except Exception:
        return []  # caller thinks this is valid empty data
```
The caller could not distinguish between two entirely different conditions:
- [] because this asset has no historical data
- [] because the API was down, rate-limited, or authentication failed
In both cases, the caller received an empty list. In both cases, the scoring engine received zero data points. In both cases, scoring silently degraded to a neutral signal.
During API outages, the InDecision engine appeared to be operating normally. It was running its full pipeline on empty data, producing neutral scores, generating no alerts. The system looked healthy. It was processing nothing.
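Why the outage was invisible can be shown with a toy scorer. This is a sketch, not the InDecision engine's actual scoring logic: the score() function, the return formula, and the alert threshold are all assumptions made for illustration.

```python
# Hypothetical scoring stage that maps "no data" to a neutral score.
# During an outage, fetch_ohlcv returns [] and the pipeline still "works".

def score(candles: list) -> float:
    if not candles:
        return 0.0  # neutral — identical output for "no data" and "API down"
    # toy momentum score: relative change from first to last close
    return (candles[-1] - candles[0]) / candles[0]

def should_alert(s: float, threshold: float = 0.05) -> bool:
    return abs(s) > threshold
```

With real data, score([100.0, 110.0]) is 0.1 and crosses the alert threshold. With the fail-open empty list, score([]) is 0.0, no alert fires, and the dashboard shows a healthy pipeline processing nothing.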
Why Developers Write Fail-Open Defaults
The instinct is correct in context. Defensive programming tells you to return a safe default instead of crashing. For a user-facing endpoint, that instinct is right: return a graceful error page instead of a 500. Return cached data instead of failing completely.
The problem is applying that instinct uniformly to internal pipeline stages where "safe" means something different. In a data pipeline, "safe" means the caller knows whether the data is valid. A neutral signal object is not safe if the caller cannot tell it apart from a successfully computed neutral signal. An empty list is not safe if the caller cannot tell it apart from legitimately empty data.
The rule is about information, not exceptions: does the caller have enough information to make the right decision?
If the answer is no — if the return value looks identical whether the call succeeded or failed — then you have a fail-open default, and you need to change it.
The Correct Pattern
```python
# Fail-open (wrong for internal pipeline)
def fetch_ohlcv(asset: str) -> list:
    try:
        return api.get_ohlcv(asset)
    except Exception:
        return []  # caller cannot distinguish failure from empty
```

```python
# Fail-loud (correct)
def fetch_ohlcv(asset: str) -> list:
    try:
        return api.get_ohlcv(asset)
    except RateLimitError:
        raise DataFetchError(f"Rate limited fetching {asset}")
    except NetworkError as e:
        raise DataFetchError(f"Network failure fetching {asset}: {e}") from e
```
Typed exceptions give the caller information. DataFetchError tells the caller the fetch failed. The caller can then decide: skip this cycle and alert, use cached data from last cycle, halt the pipeline, backfill later. With a silent return [], the caller has no decision to make — it just proceeds on nothing.
The pattern is: raise typed exceptions after retries; let callers decide.
User-facing endpoints catch exceptions and return graceful HTTP errors. Internal pipeline stages raise exceptions and surface them to whoever is orchestrating the pipeline. Different layers, different contracts.
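The caller side of that contract can be sketched too. This assumes a fetch function that raises DataFetchError as above; the DataFetchError class, the per-asset cache, and the alert helper here are stand-ins invented for the example.

```python
# Hypothetical orchestration layer: the fetch raises, the caller decides.

class DataFetchError(Exception):
    """Raised when an upstream data fetch fails (stand-in definition)."""

_last_good = {}  # per-asset cache of the last successful fetch

def alert(msg: str) -> None:
    print(f"ALERT: {msg}")  # stand-in for real paging/alerting

def score_cycle(asset: str, fetch):
    try:
        candles = fetch(asset)
        _last_good[asset] = candles
    except DataFetchError as e:
        alert(f"OHLCV fetch failed for {asset}: {e}")  # failure is visible
        if asset in _last_good:
            candles = _last_good[asset]  # degrade deliberately, on known-stale data
        else:
            return None                  # skip this cycle; the caller knows why
    return sum(candles) / len(candles) if candles else None  # toy "score"
```

The key difference from the fail-open version is not the fallback itself (cached data is still a fallback) but that the decision happens at the layer with enough context to alert, and the degradation is recorded rather than silent.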
The Detection Pattern in Code Audits
```shell
# Find except blocks — these are all candidates.
# -A2 includes the two lines after each match, so the filter can see
# whether the block re-raises or logs before discarding it.
grep -n -A2 "except.*:" src/*.py | grep -Ev "raise|log"
# What remains: except blocks that silently swallow errors.
# Inspect each one: does it return a value that looks like success?
```
The grep surfaces candidates. Manual inspection determines which ones are fail-open. The question for each: if this except block fires during a production outage, will the caller know?
- return [] on a list-returning function: fail-open if an empty list looks like valid empty data
- return False on a boolean-returning function: fail-open if False means "no problem found"
- return default_signal on a signal function: fail-open if the default signal looks like a computed signal
- return None on a function where None is a valid result: fail-open if callers do not check for None
The Checklist for New Functions
Every function that handles failure in an internal pipeline gets three questions before it is committed:
1. Will the caller be able to tell the difference between "empty result" and "failed to fetch"? If no: raise a typed exception, not a default.
2. Is this a user-facing endpoint or an internal pipeline stage? User-facing: graceful default. Internal: raise.
3. If this function fails silently 100 times in a row, will any alert fire? If no: the function is producing invisible failures. Add typed exceptions and let the orchestration layer alert.
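The third question can be enforced mechanically. Here is one possible sketch: a small tracker at the orchestration layer that counts consecutive failures per stage and alerts past a threshold. The PipelineError type, the threshold, and the alert list are all illustrative assumptions, not a prescribed design.

```python
# Hypothetical orchestration-layer failure tracking: if a stage raises
# repeatedly, an alert fires — silent streaks become impossible.

class PipelineError(Exception):
    """Base type for fail-loud pipeline stages (stand-in definition)."""

class FailureTracker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streaks = {}   # stage name -> consecutive failure count
        self.alerts = []    # stand-in for a real alerting channel

    def run(self, stage_name, stage, *args):
        try:
            result = stage(*args)
            self.streaks[stage_name] = 0   # success resets the streak
            return result
        except PipelineError as e:
            self.streaks[stage_name] = self.streaks.get(stage_name, 0) + 1
            if self.streaks[stage_name] >= self.threshold:
                self.alerts.append(
                    f"{stage_name} failed {self.streaks[stage_name]}x: {e}"
                )
            raise  # stay fail-loud: the caller still sees the failure
```

This only works because the stages raise typed exceptions. A stage that returns a fail-open default would reset nothing, count nothing, and alert on nothing: the tracker, like every other layer, cannot see a failure dressed up as success.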
The fail-open default is the bug that looks like good engineering. It was written by a careful developer who did not want the system to crash. The problem is not the intent — it is the scope. Graceful degradation is correct at the system boundary. Inside the pipeline, you want fail-loud. You want to know when something breaks. Silent failures are not safe failures. They are invisible ones.