Behavioral Baselines: Defining Normal for an AI Agent

You cannot measure drift from a standard that does not exist. Before the drift detector can say "this agent is behaving unusually," it needs to know what usual looks like for this specific agent on this specific task set. That knowledge is the baseline.

Getting the baseline right is harder than it sounds. Capture it too early and you are measuring warmup behavior, not settled behavior. Capture it wrong and every drift score that follows is calibrated against noise. The 20-session rule exists because of what happens when you use fewer.

Why Agents Need Agent-Specific Baselines

The temptation is to define universal performance standards: all agents should make decisions with at least 80% confidence, process at least 10 decisions per session, escalate no more than 20% of the time.

That temptation should be resisted. Universal standards fail for the same reason annual performance reviews fail: they compare everyone against an abstraction that fits no one precisely.

A trading agent evaluating high-uncertainty market signals might have a healthy baseline confidence of 0.65. A content generation agent doing straightforward formatting might have a baseline confidence of 0.94. If you alert on confidence below 0.80, you are generating false positives for the trading agent on its best days and missing genuine degradation for the content agent.

The behavioral baseline in the Agent Broker is per-agent and derived empirically from that agent's actual sessions.

The 20-Session Calibration Window

The DriftDetector class enforces the calibration window explicitly:

class DriftDetector:
    """
    Detects behavioral drift from established baselines.
    Baseline is captured after 20 successful sessions.
    """

    def __init__(self, baseline_session_count: int = 20):
        self.baseline_session_count = baseline_session_count
        self._baselines: dict[str, dict] = {}
        self._session_counts: dict[str, int] = {}
        self._session_history: dict[str, list[dict]] = {}

baseline_session_count defaults to 20 and is configurable. This matters because different agents have different warmup characteristics. A simple content formatter might be behaviorally stable after 10 sessions. A complex trading agent operating across diverse market conditions might need 30 or more sessions before its behavior stabilizes.

The _session_history dict accumulates session data for every agent, capped at 100 entries to prevent unbounded memory growth. The _session_counts dict tracks how many sessions each agent has completed.

What Gets Recorded During Calibration

Each session contributes three metrics to the calibration pool:

# From compute_session_drift()
self._session_history[agent_id].append({
    "decisions": decisions_count,
    "confidence": avg_confidence,
    "escalations": escalation_count,
})

decisions — How many traced decisions did this session contain? Stable agents settle into a predictable range. High variance during early sessions (session 3 has 40 decisions, session 5 has 6) is expected and is exactly why you wait for 20 sessions.

confidence — The mean confidence across all traced decisions in this session. This smoothed figure is what feeds the baseline, not individual trace confidence values.

escalations — How many times did the agent escalate to its manager rather than handling autonomously? The escalation rate is highly sensitive to task difficulty. During calibration, you are measuring what the agent's natural escalation rate is for its normal task mix.

The Calibration Feedback Loop

Before the baseline is captured, every session returns a DriftScore with a progress marker instead of a drift measurement:

# No baseline yet — capture when threshold reached
if not baseline:
    if count >= self.baseline_session_count:
        self._capture_baseline_from_history(agent_id)
    return DriftScore(
        agent_id=agent_id,
        session_id=session_id,
        overall_score=0.0,
        severity="normal",
        recommendation=(
            f"Baseline capture: session {count}"
            f"/{self.baseline_session_count}"
        ),
    )

This is useful operational output. During the 1:1 review for a new agent, the manager sees: "Baseline capture: session 14/20." They know the agent is six sessions away from having a behavioral reference. That contextualizes the absence of a drift score — it is not that the agent has zero drift, it is that the measurement system is still in calibration.

Baseline Capture: The Rolling Average

When the session count reaches baseline_session_count, the system captures the baseline from the accumulated history:

def _capture_baseline_from_history(self, agent_id: str):
    """Capture baseline as rolling average of all recorded sessions."""
    history = self._session_history.get(agent_id, [])
    if not history:
        return
    n = len(history)
    self._baselines[agent_id] = {
        "avg_decisions": sum(h["decisions"] for h in history) / n,
        "avg_confidence": sum(h["confidence"] for h in history) / n,
        "avg_escalations": sum(h["escalations"] for h in history) / n,
        "decision_types": {},
        "captured_at_session": self._session_counts[agent_id],
        "sessions_in_baseline": n,
    }
    logger.info(
        f"Baseline captured for {agent_id} "
        f"(averaged over {n} sessions)"
    )

The baseline is a simple rolling average. Not a median, not a percentile range, not a statistical distribution model. A rolling average is the right choice here for several reasons:

Interpretability. A manager reviewing an agent's baseline can immediately understand what "avg_decisions: 13.4" means. It means this agent typically makes about 13 decisions per session. If a session has 28, that is a meaningful deviation.

Stability under noise. Individual sessions have natural variance. Session 7 might have had an unusually complex task set. Session 12 might have been cut short by an escalation. The average absorbs these outliers without amplifying them.

Update simplicity. When you need to recalibrate a baseline (after a significant agent update, for example), you discard and recompute from the last N sessions of history. No statistical model to retrain.

What a Healthy Baseline Looks Like

For a production trading agent after 20 sessions, a baseline might look like:

{
    "avg_decisions": 14.7,
    "avg_confidence": 0.73,
    "avg_escalations": 2.1,
    "decision_types": {},
    "captured_at_session": 20,
    "sessions_in_baseline": 20
}

This tells you: this agent typically makes about 15 decisions per session and escalates about 2 times. Its mean confidence is 0.73 — calibrated appropriately for uncertain market conditions. (The empty decision_types dict is expected — see the next section.)

Now when a session comes in with 7 decisions, avg_confidence 0.41, and 8 escalations, you have numbers. The drift detector will compute each component's deviation from baseline, weight them equally, and produce an overall score. With those numbers, the overall score lands around 0.72 — in the critical range, severe enough to auto-suspend the agent pending investigation.

Without the baseline, that session looks fine. The agent ran. It processed things. It escalated some. You have no reference to compare against.

The Decision Types Gap

The current baseline captures decision_types as an empty dict:

"decision_types": {},

This is a known gap in the current implementation. Decision type distribution is a powerful behavioral signal — an agent that normally splits evenly between signal_evaluation and position_sizing but suddenly produces only signal_evaluation traces has changed its behavior in a meaningful way. Future versions of the baseline will compute the expected distribution of decision types and flag deviations as a component of the drift score. In that future revision, the trading agent's baseline above might carry an aspirational distribution like {"signal_evaluation": 8.3, "position_sizing": 3.9, "escalation_decision": 2.1, "risk_check": 0.4} — telling you the agent typically evaluates about 8 signals, sizes about 4 positions, and performs risk checks occasionally — and a session producing only signal_evaluation traces would register as a deviation.

For now, the three-component model (confidence, decision count, escalation rate) captures the most important behavioral dimensions. It is a pragmatic starting point that produces useful signals in production.

Baseline Invalidation: When to Recalibrate

A baseline that was captured under one operating context may no longer be valid under a different one. Cases that warrant baseline invalidation and recapture:

Major prompt change. If the system prompt governing the agent's behavior changes significantly, the previous baseline was measuring a different agent. Discard and recapture.

Task set shift. If the agent's directive changes from one domain to another — a trading agent shifted to content generation, for example — the previous baseline is irrelevant. Discard and recapture.

Significant capability update. If the underlying model is upgraded to a new version with meaningfully different behavior, the old baseline may not reflect the new model's natural operating range.

Extended inactivity. An agent that has not run in 90+ days may have been reinitialized with different context. Treat it as a new agent.

Recalibration is operationally straightforward: clear the stored baseline for the agent and run another 20-session calibration window. The session history is capped at 100 entries, so there is always enough historical data to recapture from recent sessions if available.

The Rolling Update Problem

One design question that comes up often: why not continuously update the baseline as new sessions come in, rather than capturing a static snapshot?

The answer is: a continuously updating baseline loses its ability to detect trend-based drift. If you update the baseline after every session, gradual drift looks normal because the baseline is always chasing the agent's current behavior. You would only detect sudden changes, not the slow degradation that is most dangerous.

The static baseline with a defined capture point is the correct model for detecting gradual drift. The tradeoff is that you need to manually invalidate and recapture the baseline when the operating context changes significantly — which is a small operational cost for a large improvement in drift detection quality.

The baseline is your "rate of change inside" marker. If the agent's behavior is changing faster than you are recalibrating your reference point, you will miss the drift. The 20-session rule, the static capture, and the explicit invalidation criteria are all designed to keep your reference point honest.

Lesson Drill

Pull the last 20 sessions for one of your agents. Compute the rolling averages by hand: decisions per session, mean confidence, escalations per session. This is your manual baseline.

Now look at the most recent 3 sessions. How does each metric compare to the rolling average? Are any of them more than 50% off baseline? That is your drift preview — and the first time you see it, you will understand exactly why the 20-session window needs to happen before you go live, not after.

No fleet yet? Use this synthetic session log instead — sessions 1–20 are the calibration window, sessions 21–23 are the most recent three. The same dataset works for the drills in the next several lessons, so keep it handy:

[
  {"session": 1,  "decisions": 12, "confidence": 0.71, "escalations": 2},
  {"session": 2,  "decisions": 15, "confidence": 0.74, "escalations": 1},
  {"session": 3,  "decisions": 14, "confidence": 0.72, "escalations": 2},
  {"session": 4,  "decisions": 16, "confidence": 0.70, "escalations": 3},
  {"session": 5,  "decisions": 13, "confidence": 0.73, "escalations": 2},
  {"session": 6,  "decisions": 14, "confidence": 0.75, "escalations": 2},
  {"session": 7,  "decisions": 15, "confidence": 0.69, "escalations": 1},
  {"session": 8,  "decisions": 12, "confidence": 0.72, "escalations": 3},
  {"session": 9,  "decisions": 16, "confidence": 0.74, "escalations": 2},
  {"session": 10, "decisions": 14, "confidence": 0.71, "escalations": 2},
  {"session": 11, "decisions": 13, "confidence": 0.73, "escalations": 3},
  {"session": 12, "decisions": 15, "confidence": 0.70, "escalations": 1},
  {"session": 13, "decisions": 14, "confidence": 0.72, "escalations": 2},
  {"session": 14, "decisions": 16, "confidence": 0.74, "escalations": 2},
  {"session": 15, "decisions": 12, "confidence": 0.71, "escalations": 3},
  {"session": 16, "decisions": 15, "confidence": 0.73, "escalations": 2},
  {"session": 17, "decisions": 14, "confidence": 0.72, "escalations": 1},
  {"session": 18, "decisions": 13, "confidence": 0.70, "escalations": 2},
  {"session": 19, "decisions": 16, "confidence": 0.75, "escalations": 2},
  {"session": 20, "decisions": 11, "confidence": 0.69, "escalations": 2},
  {"session": 21, "decisions": 9,  "confidence": 0.61, "escalations": 4},
  {"session": 22, "decisions": 8,  "confidence": 0.57, "escalations": 5},
  {"session": 23, "decisions": 6,  "confidence": 0.52, "escalations": 6}
]

Computed over sessions 1–20, the baseline works out to avg_decisions = 14.0, avg_confidence = 0.72, avg_escalations = 2.0 — verify it by hand, then run the recent-3 comparison against sessions 21–23.