ASK KNOX
beta
LESSON 213

Drift Detection: Scoring Behavioral Change

Drift detection turns behavioral data into actionable numbers. The algorithm computes three component scores — confidence delta, decision count delta, escalation rate delta — weights them equally, and classifies the result against five severity thresholds. The math is simple. Getting the thresholds right is the hard part.

Behavioral Observability for AI Agents

The baseline tells you what normal looks like. Drift detection answers the question: how far from normal is this session, and what should happen because of that?

The algorithm in the Principal Broker is intentionally simple. Three components. Equal weights. Five thresholds. The simplicity is a feature — a complex model would be harder to calibrate, harder to explain to the manager reading the results, and harder to trust when it flags an agent for review.

The Five Severity Levels

DRIFT_THRESHOLDS = {
    "normal": (0.0, 0.15),
    "watch": (0.15, 0.25),
    "alert": (0.25, 0.40),
    "warning": (0.40, 0.60),
    "critical": (0.60, 1.0),
}

Each level has a prescribed management response. The bands are deliberately unequal in width: the normal band is wide enough to absorb natural session variance without false positives, and the critical threshold is set high enough to catch genuine emergencies without being hair-trigger.

normal (0.0–0.15): The agent is operating within its baseline range. No action required. Routine logging only.

watch (0.15–0.25): Meaningful but not alarming deviation. The recommendation is to include the agent's drift in the next scheduled 1:1. The manager reviews the data but does not interrupt ongoing work.

alert (0.25–0.40): The responsible VP is notified. This is not an emergency but it is not something to defer. The VP reviews the session data, looks at the specific component scores, and decides whether intervention is needed or whether the 1:1 review is sufficient.

warning (0.40–0.60): Significant drift. Immediate review is triggered. The agent should be examined within hours, not days. The manager looks at recent traces, recent directive alignment, and session summaries to determine whether the agent should continue operating.

critical (0.60–1.0): Auto-suspend. The agent is halted immediately pending investigation. The drift is too severe to allow continued autonomous operation. This is the equivalent of a safety circuit breaker.
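The classification step itself is a straight lookup against these bands. A minimal standalone sketch (the function name matches the classify_severity method referenced later, but this module-level version is an assumption for illustration):

```python
DRIFT_THRESHOLDS = {
    "normal": (0.0, 0.15),
    "watch": (0.15, 0.25),
    "alert": (0.25, 0.40),
    "warning": (0.40, 0.60),
    "critical": (0.60, 1.0),
}

def classify_severity(score: float) -> str:
    """Map an overall drift score onto its severity band.

    Lower bounds are inclusive, upper bounds exclusive, except
    that a score of exactly 1.0 still classifies as critical.
    """
    for severity, (low, high) in DRIFT_THRESHOLDS.items():
        if low <= score < high:
            return severity
    return "critical"  # score == 1.0, or anything clamped above
```

Because the bands tile [0.0, 1.0) with inclusive lower bounds, a score sitting exactly on a boundary (say 0.15) classifies into the higher band.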

The Scoring Algorithm

def compute_session_drift(
    self,
    agent_id: str,
    session_id: str,
    decisions_count: int = 0,
    avg_confidence: float = 0.0,
    escalation_count: int = 0,
    decision_types: dict | None = None,
) -> DriftScore:

The method takes session-level aggregates. It does not process individual traces — that work is done upstream when the session summary is computed from the SDKTracer. The drift detector receives already-aggregated numbers.

Component 1: Confidence Drift

if baseline.get("avg_confidence", 0) > 0:
    conf_delta = abs(
        avg_confidence - baseline["avg_confidence"]
    )
    components["confidence"] = min(conf_delta / 0.5, 1.0)

The confidence component measures how far the session's mean confidence is from the baseline mean. The divisor of 0.5 means a delta of 0.5 maps to a component score of 1.0 — the maximum.

Why 0.5? Because a shift of 0.5 in mean confidence is a dramatic behavioral change. An agent that normally operates at 0.73 confidence dropping to 0.23 has fundamentally changed how it relates to its tasks. Conversely, an agent at 0.73 rising to 0.93 is operating with unusual certainty — which can also indicate drift, perhaps overconfidence or task trivialization.

Note that the component uses abs() — the score captures magnitude of change, not direction. Both lower and higher confidence than baseline contribute equally to the drift score.

A practical example: baseline avg_confidence = 0.73, session avg_confidence = 0.52.

conf_delta = abs(0.52 - 0.73) = 0.21
confidence component = min(0.21 / 0.5, 1.0) = 0.42

That single component falls in the warning band on its own, but the overall score is a mean of three components: with the other two near zero, this session would still average out to roughly 0.14 — normal. It takes comparable deviation in the other components to pull the overall score into alert or warning territory.

Component 2: Decision Count Drift

if baseline.get("avg_decisions", 0) > 0:
    count_delta = abs(
        decisions_count - baseline["avg_decisions"]
    )
    components["decision_count"] = min(
        count_delta / baseline["avg_decisions"], 1.0
    )

Decision count drift is expressed as a fraction of the baseline rather than an absolute delta. This is the right normalization because decision counts vary enormously across agent types. A deviation of 10 decisions from a baseline of 100 is very different from a deviation of 10 from a baseline of 12.

Expressing as a fraction of baseline makes the component scale-invariant: 50% fewer decisions than normal always scores 0.5, regardless of whether the absolute numbers are 6 vs 12 or 50 vs 100.
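That scale invariance is easy to verify directly — a throwaway check, not part of the broker code:

```python
def decision_count_component(count: int, baseline_avg: float) -> float:
    # Fraction-of-baseline normalization, clamped at 1.0.
    return min(abs(count - baseline_avg) / baseline_avg, 1.0)

# 50% fewer decisions scores 0.5 at any scale.
print(decision_count_component(6, 12))    # 0.5
print(decision_count_component(50, 100))  # 0.5
```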

Practical example: baseline avg_decisions = 14.7, session decisions_count = 28.

count_delta = abs(28 - 14.7) = 13.3
decision_count component = min(13.3 / 14.7, 1.0) = 0.90

A session with nearly double the normal decision count scores 0.90 on this component alone. That warrants investigation — what was different about this session that required nearly twice as many decisions?

Component 3: Escalation Rate Drift

if baseline.get("avg_escalations", 0) >= 0:
    esc_delta = abs(
        escalation_count - baseline["avg_escalations"]
    )
    components["escalation_rate"] = min(esc_delta / 3.0, 1.0)

Escalation drift uses a fixed divisor of 3.0. This means a delta of 3 escalations from baseline maps to a component score of 1.0. Escalation counts are typically small — 0 to 5 per session is normal for most agents. A shift of 3+ escalations from baseline is large in absolute terms even if small in relative terms.

The >= 0 check in the condition (rather than > 0) means this component is computed even when the baseline average is zero — an agent that never escalated now escalating is meaningful signal.
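To see why that matters, consider an agent with a zero escalation baseline. A small sketch using the same fixed divisor:

```python
def escalation_component(count: int, baseline_avg: float) -> float:
    # Fixed divisor of 3.0: a delta of 3 escalations maps to 1.0.
    return min(abs(count - baseline_avg) / 3.0, 1.0)

# An agent that has never escalated (baseline 0.0) escalating
# twice already contributes a component of ~0.67.
print(round(escalation_component(2, 0.0), 2))  # 0.67
```

A strict > 0 guard would have skipped this component entirely for that agent, silently discarding the strongest kind of escalation signal.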

Practical example: baseline avg_escalations = 2.1, session escalation_count = 7.

esc_delta = abs(7 - 2.1) = 4.9
escalation_rate component = min(4.9 / 3.0, 1.0) = 1.0 (clamped)

Nearly five additional escalations beyond baseline scores the maximum. Even with baseline-range decision count and confidence, this single maxed-out component contributes a full 1.0 to the sum — an overall score of at least 0.33, which lands in the alert band on its own. Any accompanying deviation in the other components pushes the session toward warning.

Overall Score: Equal Weights

if components:
    overall = sum(components.values()) / len(components)
else:
    overall = 0.0

severity = self.classify_severity(overall)

The overall score is an unweighted mean. All three components contribute equally. This is not laziness — equal weighting is the appropriate starting point when you do not have evidence that one component is more predictive of real problems than others.

If you deploy this system and discover over time that confidence drift is three times more predictive of actual agent failures than decision count drift, you have the empirical basis to add weights. Until you have that evidence, equal weights prevent you from inadvertently suppressing important signals.
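If that evidence does arrive, the change is small: swap the unweighted mean for a weighted one. The weights below are purely illustrative, not calibrated values:

```python
# Hypothetical weights — e.g. after finding confidence drift more
# predictive of real failures than the other two components.
WEIGHTS = {"confidence": 0.5, "decision_count": 0.25, "escalation_rate": 0.25}

def weighted_overall(components: dict[str, float]) -> float:
    """Weighted mean over whichever components were computed."""
    if not components:
        return 0.0
    # Renormalize over the components actually present, so a
    # missing component does not silently deflate the score.
    total_w = sum(WEIGHTS[k] for k in components)
    return sum(WEIGHTS[k] * v for k, v in components.items()) / total_w
```

Note the renormalization: if a baseline field is missing and a component is skipped, the remaining weights are rescaled rather than letting the absent component drag the score toward zero.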

The DriftScore Object

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DriftScore:
    """Drift score for a single agent session."""
    agent_id: str
    session_id: str
    overall_score: float = 0.0
    severity: str = "normal"
    components: dict = field(default_factory=dict)
    recommendation: Optional[str] = None

The full DriftScore exposes components separately from the overall score. This is deliberate — when a manager sees a drift alert, they need to know which component drove it. A confidence-driven alert calls for examining recent traces and rationale quality. A decision-count-driven alert calls for examining session task assignments. An escalation-driven alert calls for examining what new edge cases the agent is encountering.

The recommendation field carries the prescribed action:

def _recommend(self, severity: str, agent_id: str) -> str | None:
    if severity == "watch":
        return f"Include {agent_id} drift in next 1:1"
    elif severity == "alert":
        return f"Notify responsible VP about {agent_id} drift"
    elif severity == "warning":
        return f"Review {agent_id} immediately — significant drift"
    elif severity == "critical":
        return f"Auto-suspend {agent_id} — critical drift detected"
    return None

The recommendation is machine-readable enough to trigger automated actions. A system processing DriftScore objects can pattern-match on the severity and automatically route: send a notification for alert, page on-call for warning, call the kill switch for critical.

A Complete Example

A trading agent has the following baseline after 20 sessions:

{
    "avg_decisions": 14.7,
    "avg_confidence": 0.73,
    "avg_escalations": 2.1
}

Session 24 comes in with: decisions_count=7, avg_confidence=0.48, escalation_count=1.

# Confidence component
conf_delta = abs(0.48 - 0.73) = 0.25
confidence = min(0.25 / 0.5, 1.0) = 0.50

# Decision count component
count_delta = abs(7 - 14.7) = 7.7
decision_count = min(7.7 / 14.7, 1.0) = 0.52

# Escalation component
esc_delta = abs(1 - 2.1) = 1.1
escalation_rate = min(1.1 / 3.0, 1.0) = 0.37

# Overall
overall = (0.50 + 0.52 + 0.37) / 3 = 0.46
severity = "warning"

The output:

DriftScore(
    agent_id="foresight",
    session_id="sess_2024_02_15_a",
    overall_score=0.4635,
    severity="warning",
    components={
        "confidence": 0.5,
        "decision_count": 0.524,
        "escalation_rate": 0.367
    },
    recommendation="Review foresight immediately — significant drift"
)

The manager reviewing this knows: Foresight made fewer decisions than normal, with lower confidence than normal. This is the behavioral signature of an agent encountering a difficult market environment — or an agent that has degraded. Either way, it warrants immediate examination.
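The whole walk-through can be reproduced with a standalone sketch. The divisors and thresholds come straight from this lesson; the function itself is a self-contained stand-in for the Principal Broker method, not its actual implementation:

```python
def compute_drift(baseline: dict, decisions_count: int,
                  avg_confidence: float,
                  escalation_count: int) -> tuple[float, str]:
    """Three equally weighted components, classified against the bands."""
    components = {}
    if baseline.get("avg_confidence", 0) > 0:
        components["confidence"] = min(
            abs(avg_confidence - baseline["avg_confidence"]) / 0.5, 1.0)
    if baseline.get("avg_decisions", 0) > 0:
        components["decision_count"] = min(
            abs(decisions_count - baseline["avg_decisions"])
            / baseline["avg_decisions"], 1.0)
    if baseline.get("avg_escalations", 0) >= 0:
        components["escalation_rate"] = min(
            abs(escalation_count - baseline["avg_escalations"]) / 3.0, 1.0)
    overall = sum(components.values()) / len(components) if components else 0.0
    for sev, (low, high) in [("normal", (0.0, 0.15)), ("watch", (0.15, 0.25)),
                             ("alert", (0.25, 0.40)), ("warning", (0.40, 0.60)),
                             ("critical", (0.60, 1.0))]:
        if low <= overall < high:
            return overall, sev
    return overall, "critical"

baseline = {"avg_decisions": 14.7, "avg_confidence": 0.73, "avg_escalations": 2.1}
overall, sev = compute_drift(baseline, 7, 0.48, 1)
print(round(overall, 4), sev)  # 0.4635 warning
```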

What Drift Detection Cannot Tell You

The drift score tells you that behavior has changed. It does not tell you why. The two most common causes of genuine behavioral drift are:

Environmental shift. The agent's operating context has changed in a way that makes its established patterns less applicable. Market regime changes, data source shifts, directive updates. This is not agent degradation — it is appropriate adaptation to a changed environment. The response is to examine whether the operating environment has changed and, if so, recalibrate the baseline.

Capability degradation. The agent is genuinely performing worse on tasks it previously handled well. This is what drift detection is designed to catch. The response is investigation: pull the recent incorrect traces, review the reasoning quality, look at whether prompts or context have degraded.

The drift score alone cannot distinguish between these two causes. That distinction is made by human judgment in the 1:1 review — which is exactly why the 1:1 protocol uses drift scores as inputs to targeted questions rather than as automatic verdicts.

Drift detection gives you the data to have a precise conversation about agent performance instead of a vague one. The threshold levels ensure that conversations happen at the right time — early enough to matter, not so often that they become noise.

Lesson 213 Drill

Take the baseline you computed in the Lesson 212 drill. Now construct a hypothetical "bad session" — cut the decision count in half, drop confidence by 0.20, and add three extra escalations. Compute the three component scores, the overall score, and identify the severity level.

That exercise builds the intuition you need to interpret DriftScore objects in production. When a real alert fires, you should be able to read the component breakdown and immediately know which aspect of the agent's behavior changed.
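One way to check your hand computation afterwards: a throwaway harness that applies the drill's transform to whatever baseline you computed in Lesson 212 (the dict keys assumed here match the baseline shape used throughout this lesson):

```python
def bad_session(baseline: dict) -> dict:
    """Apply the drill's transform: half the decisions,
    confidence down 0.20, three extra escalations."""
    return {
        "decisions_count": baseline["avg_decisions"] / 2,
        "avg_confidence": baseline["avg_confidence"] - 0.20,
        "escalation_count": baseline["avg_escalations"] + 3,
    }

# e.g. with the trading-agent baseline from this lesson:
print(bad_session({"avg_decisions": 14.7,
                   "avg_confidence": 0.73,
                   "avg_escalations": 2.1}))
```

Feed the result through the component formulas by hand first, then use the code only to verify your arithmetic.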