Hypothesis-Driven SQL
Write queries that test assumptions, not confirm them. The SELECT that surfaced the 89% calibrator-zero rate had one job: answer a specific question about distribution. Templates for component-distribution queries that make bugs obvious.
Exploratory queries waste time. Hypothesis-driven queries diagnose bugs.
The difference is whether the query was written to answer a specific question with a clear interpretation. Exploratory SQL says "show me some data." Hypothesis-driven SQL says "if component X is dead, the zero rate will be above 50%; let me check."
The Template
For any suspected dead scoring component, run this first:
```sql
SELECT
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE grok_score = 0) * 100.0 / COUNT(*) AS grok_zero_pct,
    COUNT(*) FILTER (WHERE perplexity_score = 0) * 100.0 / COUNT(*) AS perplexity_zero_pct,
    COUNT(*) FILTER (WHERE calibration_score = 0) * 100.0 / COUNT(*) AS calibration_zero_pct,
    COUNT(*) FILTER (WHERE news_score = 0) * 100.0 / COUNT(*) AS news_zero_pct
FROM hermes_signals
WHERE created_at > NOW() - INTERVAL '14 days';
```
One row of output. Four percentages. The result of running this against Hermes was:
| total | grok_zero_pct | perplexity_zero_pct | calibration_zero_pct | news_zero_pct |
|---|---|---|---|---|
| 64 | 12.5 | 18.8 | 89.1 | 4.7 |
Three components are firing normally. One is dead. The diagnosis is inescapable and takes thirty seconds.
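The template can be checked end to end without a Postgres instance. The sketch below uses Python's bundled SQLite, replacing the Postgres-only `FILTER` clause with `SUM` over a boolean comparison (which SQLite evaluates as 0/1). The table mirrors the article's `hermes_signals`; the rows are toy data invented for the example, not the real Hermes numbers.

```python
# Portable sketch of the zero-rate template. Assumption: SQLite stands in
# for Postgres, so FILTER (WHERE ...) becomes SUM(condition).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hermes_signals (
    grok_score REAL, perplexity_score REAL,
    calibration_score REAL, news_score REAL)""")

# Toy data: calibration_score is dead (always zero); the others fire.
rows = [(7.0, 5.0, 0.0, 3.0)] * 9 + [(0.0, 0.0, 0.0, 3.0)]
conn.executemany("INSERT INTO hermes_signals VALUES (?, ?, ?, ?)", rows)

row = conn.execute("""
    SELECT COUNT(*)                                          AS total,
           SUM(grok_score = 0)        * 100.0 / COUNT(*)     AS grok_zero_pct,
           SUM(perplexity_score = 0)  * 100.0 / COUNT(*)     AS perplexity_zero_pct,
           SUM(calibration_score = 0) * 100.0 / COUNT(*)     AS calibration_zero_pct,
           SUM(news_score = 0)        * 100.0 / COUNT(*)     AS news_zero_pct
    FROM hermes_signals
""").fetchone()
print(row)  # (10, 10.0, 10.0, 100.0, 0.0)
```

The one-row, one-question shape is the point: a dead component jumps out as the only percentage near 100 while its siblings sit near zero.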
The Percentile Query
For components that are firing but producing suspicious values, run the percentile distribution:
```sql
SELECT
    percentile_cont(0.10) WITHIN GROUP (ORDER BY calibration_score) AS p10,
    percentile_cont(0.25) WITHIN GROUP (ORDER BY calibration_score) AS p25,
    percentile_cont(0.50) WITHIN GROUP (ORDER BY calibration_score) AS p50,
    percentile_cont(0.75) WITHIN GROUP (ORDER BY calibration_score) AS p75,
    percentile_cont(0.90) WITHIN GROUP (ORDER BY calibration_score) AS p90,
    MAX(calibration_score) AS max_val
FROM hermes_signals
WHERE calibration_score > 0
  AND created_at > NOW() - INTERVAL '14 days';
```
This reveals whether a firing component is clustering where expected or drifting. If p50 is much lower than your design expected, the component is under-contributing even when it fires. If p90 is close to the component's theoretical max, the component is saturating.
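To sanity-check what the percentile query returns, it helps to know exactly what `percentile_cont` computes: linear interpolation between the two order statistics surrounding the requested fraction. Here is a minimal Python reimplementation of that semantics; the sample scores are illustrative, not real Hermes data.

```python
# Minimal reimplementation of Postgres's percentile_cont: position
# fraction * (n - 1) in the sorted values, linearly interpolated.
def percentile_cont(values, fraction):
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    pos = fraction * (len(xs) - 1)
    lo, frac = int(pos), pos - int(pos)
    if frac == 0:
        return xs[lo]
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

# Illustrative nonzero calibration scores (toy data).
scores = [2, 4, 4, 5, 6, 7, 9, 10, 12, 15, 18]
print(percentile_cont(scores, 0.25))  # 4.5
print(percentile_cont(scores, 0.50))  # 7
```

Running your expected score distribution through a function like this tells you where p50 and p90 *should* land, which is what makes drift and saturation visible in the real query's output.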
The Clearance Query
For scoring systems with a threshold, the single most important diagnostic is the clearance rate:
```sql
SELECT
    COUNT(*) AS total_signals,
    COUNT(*) FILTER (WHERE composite_score >= 70) AS cleared,
    COUNT(*) FILTER (WHERE composite_score >= 70) * 100.0 / COUNT(*) AS clearance_pct,
    AVG(composite_score) AS avg_score,
    MAX(composite_score) AS max_score
FROM hermes_signals
WHERE created_at > NOW() - INTERVAL '14 days';
```
If cleared = 0, the system is not trading. If max_score < 70, even the best recent signal could not clear the bar — the ceiling itself is the problem. If avg_score is far below the threshold, the rubric is misaligned with the distribution of inputs.
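The "ceiling" diagnosis is mechanical enough to script. The sketch below runs a SQLite-adapted clearance query (again swapping `FILTER` for a boolean `SUM`) against toy scores where nothing clears the article's threshold of 70:

```python
# Sketch of the clearance query against SQLite. Scores are toy values
# chosen so that no signal clears the 70 threshold.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hermes_signals (composite_score REAL)")
conn.executemany("INSERT INTO hermes_signals VALUES (?)",
                 [(s,) for s in [41, 52, 58, 63, 66]])

total, cleared, max_score = conn.execute("""
    SELECT COUNT(*),
           SUM(composite_score >= 70),
           MAX(composite_score)
    FROM hermes_signals
""").fetchone()
print(total, cleared, max_score)  # 5 0 66.0
# cleared = 0 and max_score < 70: the ceiling, not signal selection,
# is the problem.
```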
Query Ladder
1. Zero-rate query: is any component dead?
2. Percentile query: is a firing component drifting or saturating?
3. Clearance query: can any signal clear the threshold?
The Discipline
The three queries above map to the most common failure hypotheses for a scoring system. Run them in order. Most bugs reveal themselves at step 1 or step 2. The rare ones that survive all three are genuinely subtle and deserve a longer investigation, but until a query has falsified the simple hypotheses, do not assume the bug is subtle.
The Rule
Write queries that test hypotheses. State the hypothesis before running the query. Interpret the result as a binary answer. Move to the next hypothesis if falsified. This is the SQL side of the scientific method, and it is the fastest way through any debugging thread involving persisted data.
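The rule lends itself to a tiny harness: pair each named hypothesis with a query and a predicate that turns the result into a yes/no answer. Everything below is a hedged sketch — the queries are SQLite-adapted versions of the article's templates, and the data and 50%/70 cutoffs are the article's, applied to invented rows.

```python
# The discipline as a loop: state the hypothesis, run the query,
# read the result as a binary answer, move on if falsified.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hermes_signals (calibration_score REAL, composite_score REAL)")
# Toy rows: calibration is mostly dead, one signal clears the bar.
conn.executemany("INSERT INTO hermes_signals VALUES (?, ?)",
                 [(0.0, 45.0)] * 9 + [(6.0, 72.0)])

hypotheses = [
    ("calibration component is dead (zero rate > 50%)",
     "SELECT SUM(calibration_score = 0) * 100.0 / COUNT(*) FROM hermes_signals",
     lambda pct: pct > 50),
    ("no signal can clear the 70 threshold",
     "SELECT MAX(composite_score) FROM hermes_signals",
     lambda mx: mx < 70),
]
for claim, sql, confirmed in hypotheses:
    value = conn.execute(sql).fetchone()[0]
    verdict = "CONFIRMED" if confirmed(value) else "falsified"
    print(f"{claim}: {verdict} ({value})")
```

Each iteration is one rung of the ladder: a stated hypothesis, a query written to answer it, and an interpretation that is either confirmed or falsified — nothing exploratory in the loop.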