Fire-Rate Alerting
The auto-alert pattern. Every scoring component exports fire and evaluation counters, and a 24-hour rolling fire rate is derived from them. Below 20% is a warning; below 5% is critical. This is the dashboard that would have caught Hermes weeks earlier, and the template for every scoring bot in the ecosystem.
Fire-rate alerting is the single highest-ROI monitoring change any scoring system can make. It is also the change that would have caught the Hermes calibrator failure weeks earlier — and every similar silent failure in any system that depends on scoring components with variable fire rates.
The Alert Rules
Two tiers, one metric, 24-hour rolling window:
```yaml
- alert: ComponentFireRateWarning
  expr: |
    sum(rate(score_component_fired_total{component=~".*"}[24h])) by (component)
      /
    sum(rate(score_component_total{component=~".*"}[24h])) by (component)
    < 0.20
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Component {{ $labels.component }} fire rate below 20% over 24h"
    runbook: "https://runbook.knox/fire-rate-warning"

- alert: ComponentFireRateCritical
  expr: |
    sum(rate(score_component_fired_total{component=~".*"}[24h])) by (component)
      /
    sum(rate(score_component_total{component=~".*"}[24h])) by (component)
    < 0.05
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "Component {{ $labels.component }} fire rate below 5% over 24h — likely dead"
    runbook: "https://runbook.knox/fire-rate-critical"
```
The warning threshold of 20% catches slow drift — a component that used to fire at 60% and is now at 18% is telling you something changed before it goes fully silent. The critical threshold of 5% catches dead components like the Hermes calibrator at 3.8%. Both fire within a day of the condition appearing.
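The alert rules divide two counters, score_component_fired_total and score_component_total. A minimal sketch of how a component maintains them, using plain dicts in place of a real Prometheus client (a production system would use labeled Counter metrics with the same names; the "calibrator" component and its fire pattern here are illustrative):

```python
from collections import defaultdict

# In-memory stand-ins for the two counters the alert rules divide.
# A real system would use prometheus_client Counters with a
# "component" label instead of these dicts.
evaluated = defaultdict(int)  # score_component_total
fired = defaultdict(int)      # score_component_fired_total

def record(component: str, did_fire: bool) -> None:
    """Count one evaluation of a scoring component, and whether it fired."""
    evaluated[component] += 1
    if did_fire:
        fired[component] += 1

def fire_rate(component: str) -> float:
    """Fired / evaluated over the counters' lifetime. The alert rules
    compute the same ratio over a rolling 24h window via rate()."""
    total = evaluated[component]
    return fired[component] / total if total else 0.0

# A component that fires on roughly 1 in 26 evaluations, like the
# near-dead Hermes calibrator, lands well under the 5% threshold.
for i in range(1000):
    record("calibrator", did_fire=(i % 26 == 0))
```

The key design point is that the rate is derived at query time from two monotonic counters, so a restart or scrape gap cannot fake a healthy fire rate.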
The Dashboard
Alerts handle reactive response. Dashboards handle proactive inspection. Every scoring bot should have a single-panel dashboard showing all components' 24-hour fire rates side by side, with the warning and critical thresholds drawn as horizontal lines. Five bars, two lines, instant readability.
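The dashboard's color logic is just the two alert thresholds applied per component. A sketch of that classification (thresholds from the rules above; the five component names and rates are hypothetical):

```python
WARNING_THRESHOLD = 0.20
CRITICAL_THRESHOLD = 0.05

def classify(rate_24h: float) -> str:
    """Map a 24-hour fire rate onto the dashboard's three bands."""
    if rate_24h < CRITICAL_THRESHOLD:
        return "critical"
    if rate_24h < WARNING_THRESHOLD:
        return "warning"
    return "ok"

# Hypothetical snapshot of five components' 24h fire rates.
rates = {
    "momentum": 0.62,
    "volume": 0.41,
    "spread": 0.33,
    "drift": 0.18,        # below 20%: warning
    "calibrator": 0.038,  # below 5%: critical, likely dead
}
statuses = {name: classify(r) for name, r in rates.items()}
```

In Grafana, the same effect comes from a single bar-gauge panel on the ratio query with 0.05 and 0.20 as color thresholds, so the dashboard and the alerts can never disagree about what "unhealthy" means.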
Inline Diagram — Fire-Rate Dashboard Mockup
The Runbook Link
Every alert should include a runbook link. The runbook for a fire-rate-critical page looks like:
- Check the dashboard. Confirm which component is below 5%.
- Check for recent deploys. Any change to the component's code in the last 24h? If so, consider rollback.
- Check upstream dependencies. Is the data source responding? Is the API returning valid data?
- Query the component distribution directly. Run the zero-rate query from Lesson 270. Confirm the dashboard is accurate.
- Identify root cause. Is the component broken, is the input data missing, or is the match rate legitimately low (like the Hermes calibrator)?
- Mitigate. Rebalance weights if the component is legitimately dead for structural reasons. Fix the code if it is broken. Fix the data pipeline if the inputs are missing.
The runbook exists so the on-call engineer knows exactly what to do without having to reason from first principles at 3 AM.
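The "query the component distribution directly" step can be scripted so the on-call check is one command rather than hand-typed PromQL at 3 AM. A sketch against Prometheus's HTTP instant-query API; the query mirrors the alert expression, and Lesson 270's exact zero-rate query may differ:

```python
import json
import urllib.parse
import urllib.request

def fire_rate_query(component: str) -> str:
    """PromQL mirroring the alert expression, for one component."""
    return (
        f'sum(rate(score_component_fired_total{{component="{component}"}}[24h]))'
        " / "
        f'sum(rate(score_component_total{{component="{component}"}}[24h]))'
    )

def parse_instant_value(body: str) -> float:
    """Extract the value from a Prometheus instant-query response
    (resultType "vector" with a single series)."""
    data = json.loads(body)
    if data["status"] != "success" or not data["data"]["result"]:
        raise ValueError("no data returned for query")
    return float(data["data"]["result"][0]["value"][1])

def check_fire_rate(prometheus_url: str, component: str) -> float:
    """Run the query against a live Prometheus and return the 24h fire rate."""
    qs = urllib.parse.urlencode({"query": fire_rate_query(component)})
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/query?{qs}") as resp:
        return parse_instant_value(resp.read().decode())
```

A reading that matches the dashboard confirms the alert is trustworthy; a mismatch means the investigation shifts to the metrics pipeline instead of the component.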
The Rule
Fire-rate alerting is not optional for any production scoring system. The setup cost is small. The value is enormous. The Hermes calibrator failure is the exact type of bug this pattern was designed to catch — and the exact type of bug every scoring bot will eventually encounter if it depends on external data sources.