Fire-Rate Alerting
The auto-alert pattern. Every scoring component exports fire and evaluation counters, and a 24-hour rolling fire rate is derived from them. Below 20% is a warning; below 5% is critical. This is the dashboard that would have caught Hermes weeks earlier, and the template for every scoring bot in the ecosystem.
Fire-rate alerting is the single highest-ROI monitoring change any scoring system can make. It is also the change that would have caught the Hermes calibrator failure weeks earlier — and every similar silent failure in any system that depends on scoring components with variable fire rates.
The Alert Rules
Two tiers, one metric, 24-hour rolling window:
```yaml
- alert: ComponentFireRateWarning
  expr: |
    sum(rate(score_component_fired_total{component=~".*"}[24h])) by (component)
      /
    sum(rate(score_component_total{component=~".*"}[24h])) by (component)
    < 0.20
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Component {{ $labels.component }} fire rate below 20% over 24h"
    runbook: "https://runbook.knox/fire-rate-warning"

- alert: ComponentFireRateCritical
  expr: |
    sum(rate(score_component_fired_total{component=~".*"}[24h])) by (component)
      /
    sum(rate(score_component_total{component=~".*"}[24h])) by (component)
    < 0.05
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "Component {{ $labels.component }} fire rate below 5% over 24h — likely dead"
    runbook: "https://runbook.knox/fire-rate-critical"
```
The warning threshold of 20% catches slow drift — a component that used to fire at 60% and is now at 18% is telling you something changed before it goes fully silent. The critical threshold of 5% catches dead components like the Hermes calibrator at 3.8%. Both fire within a day of the condition appearing.
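The alert rules divide two counters, score_component_fired_total and score_component_total. A minimal sketch of how a component maintains them, using plain dicts in place of a real Prometheus client (a production system would use labeled Counter metrics with the same names; the "calibrator" component and its fire pattern here are illustrative):

```python
from collections import defaultdict

# In-memory stand-ins for the two counters the alert rules divide.
# A real system would use prometheus_client Counters with a
# "component" label instead of these dicts.
evaluated = defaultdict(int)  # score_component_total
fired = defaultdict(int)      # score_component_fired_total

def record(component: str, did_fire: bool) -> None:
    """Count one evaluation of a scoring component, and whether it fired."""
    evaluated[component] += 1
    if did_fire:
        fired[component] += 1

def fire_rate(component: str) -> float:
    """Fired / evaluated over the counters' lifetime. The alert rules
    compute the same ratio over a rolling 24h window via rate()."""
    total = evaluated[component]
    return fired[component] / total if total else 0.0

# A component that fires on roughly 1 in 26 evaluations, like the
# near-dead Hermes calibrator, lands well under the 5% threshold.
for i in range(1000):
    record("calibrator", did_fire=(i % 26 == 0))
```

The key design point is that the rate is derived at query time from two monotonic counters, so a restart or scrape gap cannot fake a healthy fire rate.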
The Dashboard
Alerts handle reactive response. Dashboards handle proactive inspection. Every scoring bot should have a single-panel dashboard showing all components' 24-hour fire rates side by side, with the warning and critical thresholds drawn as horizontal lines. Five bars, two lines, instant readability.
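The dashboard's color logic is just the two alert thresholds applied per component. A sketch of that classification (thresholds from the rules above; the five component names and rates are hypothetical):

```python
WARNING_THRESHOLD = 0.20
CRITICAL_THRESHOLD = 0.05

def classify(rate_24h: float) -> str:
    """Map a 24-hour fire rate onto the dashboard's three bands."""
    if rate_24h < CRITICAL_THRESHOLD:
        return "critical"
    if rate_24h < WARNING_THRESHOLD:
        return "warning"
    return "ok"

# Hypothetical snapshot of five components' 24h fire rates.
rates = {
    "momentum": 0.62,
    "volume": 0.41,
    "spread": 0.33,
    "drift": 0.18,        # below 20%: warning
    "calibrator": 0.038,  # below 5%: critical, likely dead
}
statuses = {name: classify(r) for name, r in rates.items()}
```

In Grafana, the same effect comes from a single bar-gauge panel on the ratio query with 0.05 and 0.20 as color thresholds, so the dashboard and the alerts can never disagree about what "unhealthy" means.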
Inline Diagram — Fire-Rate Dashboard Mockup
The Runbook Link
Every alert should include a runbook link. The runbook for a fire-rate-critical page looks like:
- Check the dashboard. Confirm which component is below 5%.
- Check for recent deploys. Any change to the component's code in the last 24h? If so, consider rollback.
- Check upstream dependencies. Is the data source responding? Is the API returning valid data?
- Query the component distribution directly. Run the zero-rate query from Lesson 270. Confirm the dashboard is accurate.
- Identify root cause. Is the component broken, is the input data missing, or is the match rate legitimately low (like the Hermes calibrator)?
- Mitigate. Rebalance weights if the component is legitimately dead for structural reasons. Fix the code if it is broken. Fix the data pipeline if the inputs are missing.
The runbook exists so the on-call engineer knows exactly what to do without having to reason from first principles at 3 AM.
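The "query the component distribution directly" step can be scripted so the on-call check is one command rather than hand-typed PromQL at 3 AM. A sketch against Prometheus's HTTP instant-query API; the query mirrors the alert expression, and Lesson 270's exact zero-rate query may differ:

```python
import json
import urllib.parse
import urllib.request

def fire_rate_query(component: str) -> str:
    """PromQL mirroring the alert expression, for one component."""
    return (
        f'sum(rate(score_component_fired_total{{component="{component}"}}[24h]))'
        " / "
        f'sum(rate(score_component_total{{component="{component}"}}[24h]))'
    )

def parse_instant_value(body: str) -> float:
    """Extract the value from a Prometheus instant-query response
    (resultType "vector" with a single series)."""
    data = json.loads(body)
    if data["status"] != "success" or not data["data"]["result"]:
        raise ValueError("no data returned for query")
    return float(data["data"]["result"][0]["value"][1])

def check_fire_rate(prometheus_url: str, component: str) -> float:
    """Run the query against a live Prometheus and return the 24h fire rate."""
    qs = urllib.parse.urlencode({"query": fire_rate_query(component)})
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/query?{qs}") as resp:
        return parse_instant_value(resp.read().decode())
```

A reading that matches the dashboard confirms the alert is trustworthy; a mismatch means the investigation shifts to the metrics pipeline instead of the component.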
The Rule
Fire-rate alerting is not optional for any production scoring system. The setup cost is small. The value is enormous. The Hermes calibrator failure is the exact type of bug this pattern was designed to catch — and the exact type of bug every scoring bot will eventually encounter if it depends on external data sources.