Why Agents Need Performance Reviews
Agents don't fail loudly. They degrade quietly — making slightly worse decisions, drifting from directives, losing confidence in edge cases. By the time you notice, weeks of compounded drift have done the damage. Here is why behavioral observability is not optional.
You have a trading bot that made 87% correct directional calls in November. By February it is making 71% correct calls. No exception was thrown. No service went down. No alert fired. The bot ran every session, processed every signal, committed every memory. It just quietly got worse.
This is the problem that behavioral observability solves. And it is a problem that traditional monitoring cannot even see.
The Silent Degradation Problem
Traditional software has a clear failure mode: it works, or it throws an error. A database query either returns results or raises an exception. An HTTP call either succeeds or returns a 4xx/5xx. Infrastructure monitoring is well-understood because the failure signal is unambiguous.
Agents are different. An agent that has drifted does not raise an exception. It returns outputs. It makes decisions. It commits memory. It runs its full session lifecycle. The drift happens inside the reasoning — in how the agent weighs evidence, frames problems, and selects responses. Nothing in the execution layer knows this is happening.
Consider what a drifting agent actually looks like in production:
- A research agent that was thorough and cited sources starts producing shorter, citation-free summaries. Each one is plausible. None throws a validation error.
- A trading agent that previously escalated uncertainty now makes low-confidence calls autonomously. The decisions look like the right format. They are just wrong more often.
- A content agent starts producing outputs that drift away from brand voice over weeks. Each piece is grammatically correct. The cumulative effect destroys consistency.
Why Traditional Monitoring Fails Here
Standard observability — latency, error rate, throughput — measures infrastructure behavior. It answers: is the agent running? Is it fast enough? Is it crashing?
It cannot answer: is the agent making good decisions? Is it acting consistently with last month? Is it drifting away from its directives?
The gap between "is the agent running" and "is the agent behaving correctly" is the entire observability problem for AI systems.
A production agent fleet needs both kinds of visibility. Most teams have the first. Almost none have the second — until something expensive breaks.
What Drift Looks Like in the Numbers
The Principal Broker's drift detection system tracks three primary behavioral signals across every session:
Decision confidence. How certain is the agent about its outputs? Baseline confidence of 0.82 drifting to 0.61 across sessions is a signal that the agent is encountering more uncertainty — either because the problem space has changed, or the agent's calibration has degraded.
Decision count. How many decisions does the agent make per session? If an agent that normally makes 12 decisions per session drops to 4, or spikes to 28, that is a behavioral change worth investigating.
Escalation rate. How often does the agent escalate to its manager versus handle autonomously? An agent trained on a task set should have a stable escalation rate. If it starts escalating more, it is encountering edge cases beyond its confidence. If it stops escalating on tasks it used to escalate, it may be overstepping its authority envelope.
None of these signals alone is definitive. Together, they form the behavioral fingerprint of a healthy session. Deviation from that fingerprint is drift.
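The three signals above can be combined into a single drift score. Here is a minimal sketch of one way to do it; `SessionSignals`, `drift_score`, and the equal-weight normalization are all illustrative assumptions, not the system's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SessionSignals:
    """The three behavioral signals recorded for one session."""
    avg_confidence: float   # mean decision confidence, 0.0-1.0
    decision_count: int     # decisions made this session
    escalation_rate: float  # escalations / decisions, 0.0-1.0

def drift_score(baseline: SessionSignals, current: SessionSignals) -> float:
    """Fold per-signal deviations from baseline into one 0.0-1.0 score.

    Each deviation is 0.0 when the session matches baseline exactly.
    Averaging the three keeps any single noisy signal from dominating.
    Weights and normalization here are illustrative, not calibrated.
    """
    conf_dev = abs(baseline.avg_confidence - current.avg_confidence)
    count_dev = abs(baseline.decision_count - current.decision_count) / max(
        baseline.decision_count, 1
    )
    esc_dev = abs(baseline.escalation_rate - current.escalation_rate)
    return min(1.0, (conf_dev + min(count_dev, 1.0) + esc_dev) / 3)
```

With this framing, the confidence slide described earlier (0.82 to 0.61, other signals stable) alone yields a score of about 0.07 — still "normal," but visibly above zero, which is exactly the early-signal property drift detection depends on.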
The Baseline Requirement
You cannot measure drift without a baseline. This is the deceptively simple insight that most teams miss.
After deployment, the first 20 sessions are the calibration window. The system records decisions, confidence levels, and escalation patterns across those sessions. The rolling average becomes the baseline — the definition of "normal" for this agent, on this task set, in this environment.
Every session thereafter is measured against that baseline. Not against an abstract ideal of what a good agent should look like. Against this agent's established behavior.
This matters because agents are not interchangeable. A trading agent's normal confidence level might be 0.75. A content agent's might be 0.92. A drift of 0.15 from baseline means different things for each. The baseline is agent-specific, earned over real sessions.
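A rolling baseline is a small amount of per-agent state. A minimal sketch, assuming a single scalar signal stands in for each tracked metric (`RollingBaseline` is a hypothetical class name, not the system's API):

```python
from collections import deque

# Matches the 20-session calibration window described above.
CALIBRATION_SESSIONS = 20

class RollingBaseline:
    """Rolling average of one behavioral signal for one agent.

    The real system tracks several signals per session; a single
    float per session stands in for any one of them here.
    """
    def __init__(self, window: int = CALIBRATION_SESSIONS):
        # deque with maxlen drops the oldest session automatically.
        self.values: deque[float] = deque(maxlen=window)

    def record(self, value: float) -> None:
        self.values.append(value)

    @property
    def calibrated(self) -> bool:
        # No drift measurement until the calibration window is full.
        return len(self.values) == self.values.maxlen

    @property
    def mean(self) -> float:
        if not self.values:
            raise ValueError("no sessions recorded yet")
        return sum(self.values) / len(self.values)
```

The `maxlen` deque gives the "rolling" behavior for free: once the window fills, each new session displaces the oldest, so the baseline keeps tracking this agent's recent normal rather than its launch-day behavior.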
```python
# From broker/observability/drift_detector.py
DRIFT_THRESHOLDS = {
    "normal":   (0.0, 0.15),
    "watch":    (0.15, 0.25),
    "alert":    (0.25, 0.40),
    "warning":  (0.40, 0.60),
    "critical": (0.60, 1.0),
}
```
Five levels. Each has a prescribed response. Normal requires nothing. Watch surfaces in the next 1:1. Alert notifies the responsible VP. Warning triggers immediate review. Critical auto-suspends the agent.
The thresholds are not arbitrary. They are calibrated to give a manager enough lead time to intervene before a drifting agent does serious damage. A score of 0.15 is an early signal. A score of 0.60 is an emergency.
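Classifying a score against those bands is a range lookup. A minimal sketch — `classify_drift` is an illustrative helper name, and the thresholds are repeated so the snippet runs standalone:

```python
# Same bands as broker/observability/drift_detector.py above.
DRIFT_THRESHOLDS = {
    "normal":   (0.0, 0.15),
    "watch":    (0.15, 0.25),
    "alert":    (0.25, 0.40),
    "warning":  (0.40, 0.60),
    "critical": (0.60, 1.0),
}

def classify_drift(score: float) -> str:
    """Map a 0.0-1.0 drift score to its threshold level.

    Bands are treated as half-open ranges [low, high), so a score
    sitting exactly on a boundary lands in the more severe level;
    a score of exactly 1.0 is still "critical".
    """
    for level, (low, high) in DRIFT_THRESHOLDS.items():
        if low <= score < high:
            return level
    return "critical"  # score == 1.0 (or above, if unclamped)
```

The half-open-range choice is deliberate in this sketch: when a score lands exactly on a boundary like 0.15, it is safer to escalate to "watch" than to stay "normal."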
The Manager-Agent Relationship
Here is the mental model that makes behavioral observability click: every AI agent is an employee, and every employee needs a performance review.
Not because we distrust them. Because performance review is how humans and organizations detect drift in any system — biological, social, or computational. The weekly 1:1 exists in human organizations precisely because weekly feedback catches degradation that annual reviews miss.
An agent that runs for 90 days without any behavioral review is an agent that could have been drifting for 60 of those days. The damage compounds silently until someone notices something is wrong — usually at the worst possible time.
The Principal Broker's observability layer is an immune system for the agent fleet. It does not replace good agent design. It detects when good agent design has degraded — so you can intervene before the degradation compounds into a failure.
What Comes Next
This track covers the full behavioral observability stack:
- Reasoning Traces (Lesson 211): How to record decision rationale and build the data foundation for drift detection.
- Behavioral Baselines (Lesson 212): The 20-session calibration window and what "normal" means for an agent.
- Drift Detection (Lesson 213): The scoring algorithm, threshold levels, and escalation logic.
- Goal Alignment (Lesson 214): Checking session work against active directives.
- Decision Replay (Lesson 215): Reconstructing causal chains for post-mortems and audits.
- The 1:1 Protocol (Lesson 216): Automated performance reviews that incorporate drift data into personalized questions.
The through-line is a single principle:
The Inverted Risk Model
Most engineering teams think about observability in terms of catching failures. Behavioral observability for agents requires an inverted risk model: you are not watching for failures. You are watching for drift that will eventually become failure.
The difference matters operationally. Failure-monitoring is reactive — you respond after something breaks. Drift-monitoring is proactive — you respond before the accumulated degradation becomes critical.
In a trading context, the difference is the gap between catching a bot before it makes 20 bad trades versus catching it after the 20th. In a content context, it is the difference between correcting brand drift before 50 articles are published versus after. In any context where agents run autonomously on high-stakes tasks, the latency between drift onset and detection is where the damage happens.
The Staffing Analogy
Think about how a competent manager handles a high-performing employee. They do not watch them every minute. They do not distrust them. But they do have regular check-ins. They do track output quality over time. They do notice when someone who was submitting 90% complete work starts submitting 70% complete work.
The response to early drift is not termination. It is a conversation. What changed? What are you uncertain about? What do you need? In most cases, the answer is correctable without major intervention.
The response to undetected drift that has been compounding for three months is much more expensive. Correcting 90 days of bad outputs is not a conversation. It is a crisis.
Behavioral observability gives you the data to have the conversation early. The 1:1 protocol (Lesson 216) is how that conversation is structured. But it only works if you have been collecting the behavioral data — traces, baselines, drift scores — throughout those sessions.
That data collection starts in Lesson 211.