ASK KNOX
beta
LESSON 270

Hypothesis-Driven SQL

Write queries that test assumptions, not confirm them. The SELECT that surfaced the 89% calibrator-zero rate had one job: answer a specific question about distribution. Templates for component-distribution queries that make bugs obvious.

7 min read

Exploratory queries waste time. Hypothesis-driven queries diagnose bugs.

The difference is whether the query was written to answer a specific question with a clear interpretation. Exploratory SQL says "show me some data." Hypothesis-driven SQL says "if component X is dead, the zero rate will be above 50%; let me check."

The Template

For any suspected dead scoring component, run this first:

SELECT
  COUNT(*) AS total,
  COUNT(*) FILTER (WHERE grok_score = 0) * 100.0 / COUNT(*) AS grok_zero_pct,
  COUNT(*) FILTER (WHERE perplexity_score = 0) * 100.0 / COUNT(*) AS perplexity_zero_pct,
  COUNT(*) FILTER (WHERE calibration_score = 0) * 100.0 / COUNT(*) AS calibration_zero_pct,
  COUNT(*) FILTER (WHERE news_score = 0) * 100.0 / COUNT(*) AS news_zero_pct
FROM hermes_signals
WHERE created_at > NOW() - INTERVAL '14 days';

One row of output. Four percentages. The result of running this against Hermes was:

totalgrok_zero_pctperplexity_zero_pctcalibration_zero_pctnews_zero_pct
6412.518.889.14.7

Three components are firing normally. One is dead. The diagnosis is inescapable and takes thirty seconds.

The Percentile Query

For components that are firing but producing suspicious values, run the percentile distribution:

SELECT
  percentile_cont(0.10) WITHIN GROUP (ORDER BY calibration_score) AS p10,
  percentile_cont(0.25) WITHIN GROUP (ORDER BY calibration_score) AS p25,
  percentile_cont(0.50) WITHIN GROUP (ORDER BY calibration_score) AS p50,
  percentile_cont(0.75) WITHIN GROUP (ORDER BY calibration_score) AS p75,
  percentile_cont(0.90) WITHIN GROUP (ORDER BY calibration_score) AS p90,
  MAX(calibration_score) AS max_val
FROM hermes_signals
WHERE calibration_score > 0
  AND created_at > NOW() - INTERVAL '14 days';

This reveals whether a firing component is clustering where expected or drifting. If p50 is much lower than your design expected, the component is under-contributing even when it fires. If p90 is close to the component's theoretical max, the component is saturating.

The Clearance Query

For scoring systems with a threshold, the single most important diagnostic is the clearance rate:

SELECT
  COUNT(*) AS total_signals,
  COUNT(*) FILTER (WHERE composite_score >= 70) AS cleared,
  COUNT(*) FILTER (WHERE composite_score >= 70) * 100.0 / COUNT(*) AS clearance_pct,
  AVG(composite_score) AS avg_score,
  MAX(composite_score) AS max_score
FROM hermes_signals
WHERE created_at > NOW() - INTERVAL '14 days';

If cleared = 0, the system is not trading. If max_score < 70, even the best recent signal could not clear the bar — the ceiling itself is the problem. If avg_score is far below the threshold, the rubric is misaligned with the distribution of inputs.

Inline Diagram — Query Ladder

DIAGNOSTIC QUERY LADDER

The Discipline

The four queries above map to the four most common failure hypotheses for a scoring system. Run them in order. Most bugs reveal themselves at step 1 or step 2. The rare ones that survive all four are genuinely subtle and deserve a longer investigation — but until a query has falsified the simple hypotheses, do not assume the bug is subtle.

The Rule

Write queries that test hypotheses. State the hypothesis before running the query. Interpret the result as a binary answer. Move to the next hypothesis if falsified. This is the SQL side of the scientific method, and it is the fastest way through any debugging thread involving persisted data.