ASK KNOX
beta
LESSON 258

Fire-Rate Monitoring

Every scoring component needs a fire_rate_24h metric. Below 20%, it warns. Below 5%, it pages. The Hermes calibrator contributed zero on 96% of signals for weeks and nothing alerted — because the system was counting errors, not contributions.


Error rate is not the right metric for a scoring component. Fire rate is.

Error rate tells you whether the component crashed. Fire rate tells you whether the component contributed anything. These are completely different questions, and conflating them is how you end up with a bot that has 340 tests passing, no alerts firing, and zero trades for a month.

The Metric

For every scoring component C in your system, emit a standard log line once per score computation:

[score] component=calibrator value=0.0 fire=false threshold_contribution=0.0
[score] component=grok value=27.4 fire=true threshold_contribution=27.4

Roll this up into a fire_rate_24h metric: the percentage of the last 24 hours of signals where fire=true. Ship it to your existing metrics backend next to error rate, latency, and request count. The metric is as fundamental as any of those.

The Alerting Pattern

The warning threshold catches slow drift. A component that used to fire on 60% of signals and now fires on 18% is telling you something changed — maybe the data source shifted, maybe the matching logic is stale, maybe the input distribution moved. Investigate before the component goes fully silent.

The critical threshold catches dead components. At 5% or below, the component is contributing essentially nothing to your scoring. If it was load-bearing by design, the system is broken. If it was additive by design, you have dead weight to remove. Either way, you need to know right now, not when someone finally notices the bot has not traded in two weeks.
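Expressed as code, the two thresholds are a three-way classifier. A sketch using the 20% and 5% cut-offs from above (the function name is illustrative):

```python
def alert_level(fire_rate: float, warn: float = 0.20, critical: float = 0.05) -> str:
    """Map a 24h fire rate (as a fraction, 0.0-1.0) to an alert severity."""
    if fire_rate < critical:
        return "critical"  # effectively dead: page someone now
    if fire_rate < warn:
        return "warning"   # slow drift: investigate before it goes silent
    return "healthy"
```

The Hermes calibrator's 3.8% maps straight to critical; the 60%-to-18% drift maps to warning.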

Why Error Rate Fails

A dead component is typically well-behaved. It catches its exception, returns 0 or an empty default, logs nothing, and moves on. The error rate is zero because no error occurred. The component looks perfect on every dashboard.

This is exactly what happened to the Hermes calibrator. The semantic matcher would query Metaculus and Manifold, find no corresponding market, return an empty candidate list, and the scoring function would gracefully return 0. No exception. No warning. No error metric. Just a silent, steady stream of zeros that collapsed the effective ceiling of the scoring rubric and blocked every trade.
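The failure mode is easy to reproduce. A hypothetical sketch of a component that swallows its own failure — `find_candidates` stands in for the semantic matcher, and none of these names come from the real calibrator:

```python
def find_candidates(signal: str) -> list[str]:
    # Stand-in for the Metaculus/Manifold matcher: no market ever matches.
    return []

def calibrator_score(signal: str) -> float:
    try:
        candidates = find_candidates(signal)
        if not candidates:
            return 0.0  # "graceful" empty default: no exception, no log line
        return float(len(candidates))  # placeholder scoring
    except Exception:
        return 0.0  # even a crash would surface as a clean zero

signals = ["cpi-print", "fed-minutes", "election-odds"]
fires = sum(calibrator_score(s) > 0 for s in signals)
# Nothing ever raises, so error rate is 0% — while fire rate is also 0%.
```

Every dashboard built on error rate shows this component as perfectly healthy.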

If the calibrator had been emitting fire_rate_24h, the 3.8% rate would have paged on day one.

Fire-Rate Gauge

[Figure: fire-rate gauge, 24h alert zones — critical below 5%, warning from 5% to 20%, healthy above 20%. The Hermes calibrator sat at 3.8% for weeks; the Grok narrative component fires at 82%.]

The Detection Template

Add this to every scoring service:

import logging

logger = logging.getLogger(__name__)
# `metrics` is your existing statsd-style client; it only needs increment().

def record_score(component: str, value: float, maximum: float) -> None:
    """Emit the standard score log line and tick the fire-rate counters."""
    fired = value > 0
    logger.info(
        "score.component",
        extra={
            "component": component,
            "value": value,
            "maximum": maximum,
            "fire": fired,
            # Guard the ratio so a zero maximum cannot crash the scoring path.
            "ratio": value / maximum if maximum else 0.0,
        },
    )
    # One total tick per computation, one fired tick per contribution:
    # fire_rate_24h is the ratio of the two.
    metrics.increment(f"score.{component}.total")
    if fired:
        metrics.increment(f"score.{component}.fired")

Then set two alerts in your monitoring config:

- alert: ComponentFireRateWarning
  expr: sum by (component) (rate(score_fired[24h])) / sum by (component) (rate(score_total[24h])) < 0.20
  labels:
    severity: warning
- alert: ComponentFireRateCritical
  expr: sum by (component) (rate(score_fired[24h])) / sum by (component) (rate(score_total[24h])) < 0.05
  labels:
    severity: critical
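Both expressions reduce to the same ratio of two counters. A dict-backed sketch of that arithmetic, standing in for rate() over the 24h window:

```python
from collections import Counter

counters: Counter[str] = Counter()

def tick(component: str, fired: bool) -> None:
    counters[f"score.{component}.total"] += 1
    if fired:
        counters[f"score.{component}.fired"] += 1

def fire_rate(component: str) -> float:
    total = counters[f"score.{component}.total"]
    return counters[f"score.{component}.fired"] / total if total else 0.0

# 1 contribution in 25 signals: a 4% fire rate, below both thresholds.
for i in range(25):
    tick("calibrator", fired=(i == 0))
```

If the `fired` counter never moves while the `total` counter climbs, the ratio decays toward zero and the critical alert fires on its own.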

Thirty minutes of setup. A failure mode that eats weeks of production if you skip it.

The Rule

Silence is not health. Every scoring component emits fire rate. Every fire rate has two alert thresholds. No component ships without them. The Hermes calibrator ran dead for weeks because fire-rate monitoring did not exist. Now it does — and the pattern travels to every scoring bot in the ecosystem.