ASK KNOX
beta
LESSON 216

The 1:1 Protocol: Automated Performance Reviews for AI Agents

The 1:1 is the human management layer of behavioral observability. Every week, every director-level agent receives a personalized review request: standard questions plus targeted questions generated from drift scores and incorrect traces. The agent responds. The manager synthesizes. Knox approves what requires human judgment. The loop closes.

14 min read·Behavioral Observability for AI Agents

Every piece of behavioral data collected in the previous five lessons — the traces, the baselines, the drift scores, the alignment checks, the replay chains — builds toward this: a weekly performance review that gives every director-level agent a structured opportunity to report on their work, surface blockers, and propose changes, while giving their manager the information needed to evaluate performance with data rather than intuition.

This is the 1:1 protocol. It is not a chatbot interaction. It is not a manual review. It is a formalized, automated review cycle that closes the feedback loop between agent behavior and management oversight.

Why Automated Reviews Work

Human organizations run 1:1s because they work. Weekly check-ins catch small problems before they become large ones, surface blockers that would otherwise remain invisible, build trust between manager and report, and create a structured venue for proposing changes to working parameters.

These benefits apply equally to AI agent management. The difference is scale and consistency. A human manager can maintain genuine weekly 1:1s with 5-7 direct reports. A human managing 20+ agents cannot. The review cadence collapses under the load, and agents go weeks without feedback — which is exactly when drift compounds silently.

The automated 1:1 protocol solves the scale problem. It runs weekly for every director-level agent without human initiation. The questions are generated from data. The response is structured. The synthesis is guided. The items requiring Knox's judgment are surfaced explicitly. The entire process completes whether or not anyone is paying attention — and the data is waiting when the manager does review.

The OneOnOneRequest: What Gets Sent to the Agent

import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class OneOnOneRequest:
    """A 1:1 review request sent to an agent."""
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    agent_id: str = ""
    manager_id: str = ""
    questions: list[str] = field(default_factory=list)
    drift_context: dict = field(default_factory=dict)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

The request is sent from the manager to the agent. The questions list is the heart of it — a combination of standard questions that every agent receives and personalized questions generated from that agent's specific behavioral data.

The Standard Questions

STANDARD_QUESTIONS = [
    "What were your key accomplishments this period?",
    "What blockers or challenges are you facing?",
    "Are there any proposed changes to your operating parameters?",
    "What support do you need from your manager or peers?",
]

These four questions are unchanged across every agent and every review cycle. They establish a consistent baseline for the review — every agent answers the same questions every week, making trend analysis possible across reviews.

The specific wording matters. "Proposed changes to your operating parameters" is an explicit invitation for the agent to flag if its prompts, directives, or authority limits need adjustment. Without this question, an agent operating near the limits of its configuration might not raise the issue proactively.

Personalized Question Generation

def generate_request(
    self,
    agent_id: str,
    manager_id: str,
    drift_score: dict | None = None,
    incorrect_traces: list[dict] | None = None,
    drift_data: dict | None = None,
    outcome_data: dict | None = None,
) -> OneOnOneRequest:
    questions = list(STANDARD_QUESTIONS)

    # Add personalized questions based on drift data
    if drift_score:
        severity = drift_score.get("severity", "normal")
        if severity in ("alert", "warning", "critical"):
            questions.append(
                f"Your behavioral drift score is {severity}. "
                f"What factors are contributing to this change?"
            )
        components = drift_score.get("components", {})
        if components.get("confidence", 0) > 0.2:
            questions.append(
                "Your decision confidence has shifted from baseline. "
                "What's driving this change?"
            )
        if components.get("escalation_rate", 0) > 0.2:
            questions.append(
                "Your escalation rate has changed. "
                "Are you encountering more edge cases?"
            )

    # Add questions about incorrect decisions
    if incorrect_traces:
        questions.append(
            f"You had {len(incorrect_traces)} decisions with "
            f"incorrect outcomes. What patterns do you see?"
        )

    # Assemble drift_context from drift_data and outcome_data
    # (shown in the drift_context section below), then build and store
    # the request so submit_response() can validate against it.
    request = OneOnOneRequest(
        agent_id=agent_id,
        manager_id=manager_id,
        questions=questions,
    )
    self._requests[request.request_id] = request
    return request

The personalization logic has two sources.

Drift-driven questions fire when the drift score is elevated. The system generates questions at three levels:

  • Severity-level question: fires when severity is alert, warning, or critical. This is the top-level acknowledgment that something behavioral has changed and asks the agent to explain it.
  • Confidence component question: fires when confidence drift component exceeds 0.2. Specifically targets the confidence shift rather than the overall score.
  • Escalation component question: fires when escalation rate component exceeds 0.2. Targets the escalation pattern specifically.

Outcome-driven questions fire when the agent has incorrect trace outcomes from the period. This is the most direct performance feedback mechanism in the protocol: you had N decisions that produced wrong results, tell me what you see in retrospect.

This combination — behavioral drift data and outcome correctness data — is what makes the 1:1 questions genuinely useful rather than boilerplate. An agent that has never had a high drift score and has no incorrect outcomes will receive exactly the four standard questions. An agent with a warning drift score, an elevated confidence component, and three incorrect traces will receive seven questions, three of which are specifically calibrated to its situation.

The OneOnOneResponse: What the Agent Returns

@dataclass
class OneOnOneResponse:
    """An agent's response to a 1:1 review."""
    request_id: str = ""
    agent_id: str = ""
    answers: list[dict] = field(default_factory=list)
    decisions_made: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
    proposed_changes: list[dict] = field(default_factory=list)
    submitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

The response structure separates four types of content:

answers — The agent's direct responses to each question, structured as {question: str, answer: str} pairs for clean rendering and storage.

decisions_made — A list of significant decisions the agent made during the period, extracted from its session memory. This provides a factual record independent of the answer prose.

blockers — Explicit list of things preventing the agent from performing better. Blockers are surfaced to the manager for action — they may require changes to tool access, directive clarity, data availability, or cross-agent coordination.

proposed_changes — Structured proposals for parameter changes: {parameter: str, current: value, proposed: value, rationale: str}. This is the formal channel for an agent to request adjustments to its operating configuration.
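A concrete entry in that shape — the parameter name and values mirror the Foresight position-size example later in this lesson, and the rationale text is illustrative:

```python
# One proposed-change entry, following the
# {parameter, current, proposed, rationale} structure.
proposed_change = {
    "parameter": "max_position_size",
    "current": 500,
    "proposed": 750,
    "rationale": "High-conviction signals are being capped below full sizing.",
}
```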

The submit_response() method validates that the request exists before accepting a response:

def submit_response(
    self,
    request_id: str,
    answers: list[dict],
    decisions_made: list[str] | None = None,
    blockers: list[str] | None = None,
    proposed_changes: list[dict] | None = None,
) -> OneOnOneResponse | None:
    request = self._requests.get(request_id)
    if not request:
        return None
    # Request exists: build the response and store it for manager synthesis.
    response = OneOnOneResponse(
        request_id=request_id,
        agent_id=request.agent_id,
        answers=answers,
        decisions_made=decisions_made or [],
        blockers=blockers or [],
        proposed_changes=proposed_changes or [],
    )
    self._responses[request_id] = response
    return response

None is returned if the request ID is unknown — which would indicate either a stale reference or an attempt to submit a response for a review that was never sent. This guard prevents orphaned responses from cluttering the synthesis pipeline.

The ManagerSynthesis: Closing the Loop

@dataclass
class ManagerSynthesis:
    """Manager's synthesis of a 1:1 response."""
    request_id: str = ""
    agent_id: str = ""
    manager_id: str = ""
    performance_rating: str = "satisfactory"
    approved_changes: list[dict] = field(default_factory=list)
    rejected_changes: list[dict] = field(default_factory=list)
    action_items: list[dict] = field(default_factory=list)
    knox_required: list[dict] = field(default_factory=list)
    synthesized_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

The synthesis is where the manager acts on the agent's response. The structure forces explicit decisions on every piece of content the agent surfaced.

performance_rating — One of three values: exceeds, satisfactory, or needs-improvement. This is the manager's top-level assessment for the period. It goes into the agent's performance record and trends over time. An agent with three consecutive needs-improvement ratings needs intervention at a higher level.
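The three-in-a-row rule can be sketched as a small helper. The function name and the check over the most recent ratings are illustrative, not part of the protocol:

```python
def needs_intervention(ratings: list[str], threshold: int = 3) -> bool:
    """True if the most recent `threshold` ratings are all needs-improvement."""
    recent = ratings[-threshold:]
    return len(recent) == threshold and all(
        r == "needs-improvement" for r in recent
    )

history = [
    "satisfactory",
    "needs-improvement",
    "needs-improvement",
    "needs-improvement",
]
```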

approved_changes and rejected_changes — Every proposed change must be explicitly approved or rejected. There is no "pending" state — the synthesis closes the loop on every proposal. Rejected changes are preserved in the record, not discarded. This matters for pattern detection: an agent that keeps proposing the same change that keeps getting rejected has a communication problem or the manager has an explanation problem.

action_items — What the manager commits to doing as a result of this 1:1. May include clearing blockers, adjusting directives, coordinating with other agents, or requesting additional tooling.

knox_required — Items that exceed the manager's authority and require Knox's approval. This is the escalation mechanism for the 1:1 system.

def get_knox_required_items(self) -> list[dict]:
    """Get all items requiring Knox approval across all 1:1s."""
    items = []
    for syn in self._syntheses.values():
        for item in syn.knox_required:
            items.append({
                **item,
                "agent_id": syn.agent_id,
                "request_id": syn.request_id,
            })
    return items

This method aggregates all Knox-required items across every active synthesis. The Principal Broker's Action Queue reads this regularly and surfaces pending items to Knox through the notification system — typically via Discord, since that is the primary communication channel.

The drift_context: Behavioral Data in the Request

The request carries drift_context — behavioral data embedded directly in the review request for the agent to see when formulating its response:

context: dict = {}  # becomes the request's drift_context

if drift_data is not None:
    baseline = drift_data.get("baseline") or {}
    context["behavioral_context"] = {
        "drift_score": baseline.get("overall_score"),
        "severity": drift_data.get("severity"),
        "has_baseline": drift_data.get("has_baseline", False),
        "recommendation": drift_data.get("recommendation"),
    }

if outcome_data is not None:
    context["outcome_summary"] = {
        "recent_decisions": outcome_data.get("recent_decisions"),
        "correct_outcomes": outcome_data.get("correct_outcomes"),
        "incorrect_outcomes": outcome_data.get("incorrect_outcomes"),
        "accuracy": outcome_data.get("accuracy"),
    }

Sharing this context with the agent is a deliberate design choice. An agent that knows its drift score is warning and its accuracy is 0.71 when answering "what factors are contributing to the change?" can give a more grounded and useful answer than an agent receiving only the question with no data.

This is the difference between "you have been behaving unusually" and "here are the numbers: your confidence dropped by 0.21 from baseline, your decision count is 48% below normal, your accuracy this period is 0.71 against a baseline of 0.83." The second framing produces richer, more actionable responses.

A Complete 1:1 Cycle

A weekly cycle for a director-level trading agent looks like:

Sunday night, automated:

  1. Drift detector runs final session of the week through compute_session_drift()
  2. Traces with outcome_correct = False are queried for the week
  3. generate_request() builds a personalized question set incorporating drift score and incorrect trace count
  4. OneOnOneRequest is delivered to the agent's inbox via the A2A message bus

Monday morning, automated:

  5. Agent processes the request in its next session
  6. Agent calls submit_response() with answers, blockers, and proposed changes
  7. Response is stored in the protocol for manager review

Monday afternoon, manager-driven or automated synthesis:

  8. Manager agent (or the OpenClaw acting as interim manager) calls synthesize()
  9. Each proposed change is explicitly approved or rejected
  10. Action items are assigned
  11. Knox-required items are added to the Action Queue

Monday evening, Knox reviews Action Queue:

  12. Knox sees: "Foresight requests authority to increase max position size from $500 to $750 for high-conviction signals. Manager-approved pending Knox confirmation."
  13. Knox approves or rejects with a single command
  14. Approved changes are applied. Rejected changes are logged.

And it happens automatically, every week, for every director-level agent.

Making It Compound

The 1:1 protocol is not just a management tool. It is a learning system. Over time, the data it accumulates becomes increasingly valuable:

Performance trends. Three months of weekly performance_rating values for an agent shows whether its performance is improving, stable, or degrading. A trend from satisfactory to needs-improvement is a much stronger signal than a single bad week.

Proposed change patterns. An agent that proposes the same parameter change in three consecutive 1:1s is sending a signal. Either the change is genuinely needed (and the manager should approve it or explain the refusal clearly) or the agent has a misunderstanding that needs correction.
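One way to sketch this pattern check, assuming stored responses are available as dicts with a proposed_changes list (the helper name and the three-review threshold are illustrative):

```python
from collections import Counter

def repeated_proposals(responses: list[dict], min_repeats: int = 3) -> list[str]:
    """Parameters an agent has proposed changing in at least min_repeats reviews."""
    counts = Counter(
        change["parameter"]
        for response in responses
        for change in response.get("proposed_changes", [])
    )
    return [param for param, n in counts.items() if n >= min_repeats]

# Three consecutive weekly responses; one parameter keeps coming back.
weeks = [
    {"proposed_changes": [{"parameter": "max_position_size"}]},
    {"proposed_changes": [{"parameter": "max_position_size"}]},
    {"proposed_changes": [{"parameter": "max_position_size"},
                          {"parameter": "escalation_threshold"}]},
]
flagged = repeated_proposals(weeks)
```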

Blocker resolution latency. How long do blockers persist? If a blocker from six weeks ago is still in the blocker list, something is wrong with the resolution process.
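A sketch of latency tracking under a simplifying assumption: blockers are compared by exact string, and each week's blocker list is available in order:

```python
def blocker_ages(weekly_blockers: list[list[str]]) -> dict[str, int]:
    """Weeks each currently-open blocker has persisted (first appearance to now)."""
    first_seen: dict[str, int] = {}
    for week, blockers in enumerate(weekly_blockers):
        for blocker in blockers:
            first_seen.setdefault(blocker, week)
    current_week = len(weekly_blockers) - 1
    still_open = set(weekly_blockers[-1])
    return {b: current_week - first_seen[b] + 1 for b in still_open}

# Illustrative history: one blocker has persisted for three weeks,
# another was resolved and no longer appears.
history = [
    ["no sandbox access"],
    ["no sandbox access", "stale market data"],
    ["no sandbox access"],
]
ages = blocker_ages(history)
```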

Drift correlation with outcomes. Does elevated drift in week N predict lower accuracy in week N+1? If so, the drift threshold for manager intervention should probably be lowered — acting on watch-level drift might prevent the accuracy drop from materializing.

None of this analysis is possible without a consistent, structured review cycle. The 1:1 protocol is the data collection mechanism for the long-term behavioral intelligence you need to run an agent fleet with confidence.

The 1:1 protocol is the mechanism for telling agents what you expect and giving them a structured channel to tell you what they need. The behavioral data makes the conversation precise. The automated cadence makes it consistent. The Knox approval layer makes it safe.

That combination — precision, consistency, and safety — is the foundation of a trustworthy autonomous agent fleet.

Track Completion

You now have the full behavioral observability stack:

  • Lesson 210: The drift problem and why traditional monitoring fails
  • Lesson 211: Reasoning traces — the data foundation
  • Lesson 212: Behavioral baselines — defining normal
  • Lesson 213: Drift detection — scoring behavioral change
  • Lesson 214: Goal alignment — checking session work against directives
  • Lesson 215: Decision replay — reconstructing causal chains
  • Lesson 216: The 1:1 protocol — closing the feedback loop

The through-line: behavioral observability is not a single tool. It is a stack where each layer depends on the previous one. Traces feed baselines. Baselines enable drift scoring. Drift scores inform alignment checks. All of it feeds the 1:1 review. The 1:1 review produces parameter changes that improve the traces in the next cycle.

The compound effect of this loop — behavioral data collected, analyzed, reviewed, and applied — is how an agent fleet improves over time rather than degrading silently.

Lesson 216 Drill

Run a manual 1:1 with your most important agent today. Use the four standard questions. Pull its most recent session traces and add one personalized question based on what you observe. Write down the response you would expect based on what you know about the agent's recent performance.

Then compare what you expected to what the agent actually says. The gap between expectation and response — in either direction — is signal. Build the automated protocol around narrowing that gap until the reviews produce no surprises. That is when you know the observability stack is working.