ASK KNOX
LESSON 211

Reasoning Traces: Recording What Your Agent Actually Did

A reasoning trace is not a log line. It is a structured record of why an agent made a decision — the rationale, the confidence, the inputs, and eventually the outcome. Get the schema right and you unlock drift detection, replay, and performance reviews. Get it wrong and you have expensive noise.

Behavioral Observability for AI Agents

Your agent made 40 decisions today. You know they happened because the outputs exist. What you do not know — unless you instrumented this explicitly — is why each decision was made, how confident the agent was, what inputs it was working from, and whether the outcome was correct.

Without that data, drift detection is guesswork. Post-mortems are reconstruction from fragments. Performance reviews are based on output samples, not decision patterns. The behavioral observability stack starts here, with the reasoning trace.

The ReasoningTrace Schema

The trace schema in Principal Broker's observe.db is deliberate. Every field earns its place.

-- From broker/observability/tracer.py
CREATE TABLE IF NOT EXISTS reasoning_traces (
    trace_id TEXT PRIMARY KEY,
    agent_id TEXT NOT NULL,
    session_id TEXT,
    correlation_id TEXT,
    decision_type TEXT NOT NULL,
    rationale TEXT NOT NULL,
    confidence REAL,
    inputs_json TEXT,
    outcome TEXT,
    outcome_correct INTEGER,
    created_at TEXT NOT NULL
);

Let us walk through each field and why it matters.

trace_id — A UUID generated at record time. Immutable. The primary key for looking up or updating a specific trace.

agent_id — Which agent made this decision. Not the session. Not the task. The specific agent identity. This is what allows you to build per-agent baselines rather than fleet-wide baselines.

session_id — Which session this trace belongs to. Sessions are the unit of behavioral analysis. Grouping traces by session is what lets you compute session-level metrics: total decisions, average confidence, escalation count.

correlation_id — The trace's membership in a causal chain. A single trade might involve traces from Foresight evaluating the signal, the political events prediction agent confirming execution context, and the VP Trading agent approving position size. All three share a correlation_id. This is the thread that makes replay possible.

decision_type — A categorical label for what kind of decision this was. Examples: signal_evaluation, position_sizing, escalation_decision, content_approval. Decision types are the vocabulary of behavioral analysis — the distribution of types across sessions is one of the signals the drift detector watches.

rationale — The agent's stated reasoning. This is the most important field for human review. Not a summary of inputs. Not a description of the output. The actual reasoning chain the agent used. When a trace surfaces in a 1:1 review, this is what the manager reads to understand what the agent was thinking.

confidence — A normalized float from 0.0 to 1.0. How certain was the agent about this decision? Confidence calibration across sessions is one of the three primary drift signals. Agents that were reliably high-confidence on certain task types but are now expressing lower confidence are showing a behavioral change worth investigating.

inputs_json — The serialized inputs the agent had at decision time. What data, what context, what prior outputs fed into this decision? Not always needed for every trace, but invaluable when you need to replay a decision and understand whether the reasoning was appropriate given the inputs.

outcome — What actually happened as a result of this decision. Not always knowable immediately. Often populated hours or days later when the result becomes observable.

outcome_correct — Boolean judgment: was this decision correct? NULL when unknown, 1 for correct, 0 for incorrect. The accuracy rate derived from this field is one of the most direct measurements of agent performance over time.

created_at — UTC ISO timestamp. All behavioral time-series analysis depends on accurate timestamps.
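Assembled in code, a fully populated trace looks like this. All the values below are hypothetical (invented agent, session, and signal names); only the keys are prescribed by the schema:

```python
import json
import uuid
from datetime import datetime, timezone

# A fully populated trace row, field by field. Every value here is
# illustrative, but the keys match the schema columns exactly.
example_trace = {
    "trace_id": str(uuid.uuid4()),           # immutable primary key
    "agent_id": "foresight",                 # per-agent baselines key off this
    "session_id": "session-042",             # the unit of behavioral analysis
    "correlation_id": "trade-7781",          # membership in a causal chain
    "decision_type": "signal_evaluation",    # categorical vocabulary
    "rationale": "3 of 4 factors confirm the signal; volume diverges.",
    "confidence": 0.82,                      # normalized 0.0 to 1.0
    "inputs_json": json.dumps({"signal_id": "sig-19", "market": "EURUSD"}),
    "outcome": None,                         # populated hours or days later
    "outcome_correct": None,                 # NULL until evaluable
    "created_at": datetime.now(timezone.utc).isoformat(),
}
```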

The AgentTracer: Server-Side Recording

The AgentTracer class in broker/observability/tracer.py is the server-side implementation — used by the broker to record traces for agents it manages. It writes directly to SQLite with WAL mode enabled for concurrent reads.

import json
import sqlite3
import uuid
from datetime import datetime, timezone
from pathlib import Path

class AgentTracer:
    def __init__(self, db_path: str | Path):
        self.db_path = Path(db_path).expanduser()
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        self._conn = sqlite3.connect(
            str(self.db_path), check_same_thread=False
        )
        self._conn.row_factory = sqlite3.Row
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.executescript(SCHEMA_SQL)
        self._conn.commit()

WAL (Write-Ahead Logging) mode is the right choice here. The drift detector and replay builder read traces while new ones are being written. Under WAL, readers do not block the writer and the writer does not block readers, a property you need in a busy agent fleet.

The check_same_thread=False setting is intentional. The tracer is used from multiple threads in an async FastAPI application. Note that WAL is not what makes this safe: sharing one connection across threads relies on SQLite's default serialized threading mode, which serializes access to the connection internally.

The decision() Method: Fire-and-Forget

def decision(
    self,
    agent_id: str,
    decision_type: str,
    rationale: str,
    confidence: float | None = None,
    inputs: dict | None = None,
    session_id: str | None = None,
    correlation_id: str | None = None,
) -> str:
    """
    Record a decision trace. Returns trace_id.
    This should be called fire-and-forget in async code.
    """
    trace_id = str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()

    self._conn.execute(
        """INSERT INTO reasoning_traces
            (trace_id, agent_id, session_id, correlation_id,
             decision_type, rationale, confidence, inputs_json,
             created_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            trace_id, agent_id, session_id, correlation_id,
            decision_type, rationale, confidence,
            json.dumps(inputs) if inputs else None,
            now,
        ),
    )
    self._conn.commit()
    return trace_id

The docstring says "fire-and-forget in async code" — this is the cardinal rule of behavioral instrumentation.

If recording a trace could block an agent's decision loop, two things happen. First, you have introduced latency into the agent's primary task. Second, and more subtly, you have created a scenario where the agent behaves differently when being observed versus when observation fails — which invalidates your baseline data.

In practice, calling decision() in async agent code looks like this:

# In an async agent. decision() is a synchronous method, so the
# blocking write is handed to a worker thread and scheduled without
# being awaited.
asyncio.create_task(
    asyncio.to_thread(
        tracer.decision,
        agent_id=self.agent_id,
        decision_type="signal_evaluation",
        rationale=f"Evaluated signal {signal_id}: confidence {conf:.2f} based on {n_factors} factors",
        confidence=conf,
        inputs={"signal_id": signal_id, "market": market},
        session_id=self.session_id,
        correlation_id=trade_correlation_id,
    )
)

asyncio.to_thread() wraps the synchronous decision() call in a coroutine that runs it in a worker thread; passing the bare return value of decision() to create_task() would raise a TypeError, since create_task() expects a coroutine. asyncio.create_task() then schedules that coroutine on the event loop without awaiting it. The agent continues immediately. The trace is recorded asynchronously. If the trace write fails (which should be rare with SQLite WAL), the failure is logged but does not propagate to the agent.
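If you want the never-propagate guarantee enforced in one place, a small wrapper helps. trace_decision_nowait below is a hypothetical convenience helper, not part of the tracer API:

```python
import asyncio
import logging

logger = logging.getLogger("broker.tracer")

def trace_decision_nowait(tracer, **kwargs) -> None:
    """Fire-and-forget wrapper around a synchronous tracer.decision().

    Hypothetical helper: the blocking SQLite write runs in a worker
    thread, and any failure is logged rather than raised, so the agent
    never behaves differently because observation failed.
    """
    async def _record():
        try:
            # to_thread keeps the synchronous write off the event loop
            await asyncio.to_thread(tracer.decision, **kwargs)
        except Exception:
            logger.exception("trace write failed; continuing without it")

    # Schedule on the running loop without awaiting; the agent moves on.
    asyncio.create_task(_record())
```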

The record_outcome() Method

def record_outcome(
    self,
    trace_id: str,
    outcome: str,
    correct: bool | None = None,
) -> bool:
    """Update a trace with the observed outcome."""
    cursor = self._conn.execute(
        """UPDATE reasoning_traces
        SET outcome=?, outcome_correct=?
        WHERE trace_id=? AND outcome IS NULL""",
        (outcome, None if correct is None else int(correct), trace_id),
    )
    self._conn.commit()
    return cursor.rowcount > 0

Two details worth noting. First, the WHERE outcome IS NULL clause: an outcome can only be recorded once, and this is enforced at the SQL level. A trigger in the schema additionally aborts any attempt to update an already-recorded outcome. Traces are append-only records, and that immutability guarantee is what makes them trustworthy for post-mortem analysis.

Second, correct is genuinely ternary: True, False, or None. Not every outcome can be evaluated for correctness. A decision that led to an outcome that depends on future market movement might not be evaluable for weeks. None is the correct state until evaluation is possible.
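Outcome recording typically happens in a separate job, long after the decision. Here is a hypothetical settlement hook; on_trade_settled and the trade fields are assumptions for illustration, not code from the broker:

```python
# Hypothetical settlement hook: hours or days after the decision, the
# result becomes observable and the trace can finally be closed out.
def on_trade_settled(tracer, trade) -> bool:
    # trade.trace_id is the id returned by decision() at record time.
    return tracer.record_outcome(
        trade.trace_id,
        outcome=f"pnl={trade.pnl:+.2f}",
        # Ternary correctness: True/False when evaluable, None otherwise.
        correct=(trade.pnl > 0) if trade.evaluable else None,
    )
```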

The SDKTracer: Agent-Side Recording

The SDKTracer in principal_sdk/tracer.py is the lightweight version designed to run inside the agent itself. It does not write to SQLite — it stores traces in memory and publishes them to the broker.

class SDKTracer:
    def __init__(
        self,
        agent_id: str,
        broker_client=None,
    ):
        self.agent_id = agent_id
        self._client = broker_client
        self._traces: list[dict] = []
        self._trace_index: dict[str, dict] = {}

The dual storage is deliberate: a list for ordered access, a dict for lookup by trace_id. record_outcome() needs the fast lookup the dict provides in O(1); get_session_traces() needs the ordered access the list provides in a single O(n) pass.

Session Summary for Memory Commits

def get_session_summary(self) -> dict:
    """Summarize traces for session memory commit."""
    total = len(self._traces)
    with_outcome = sum(
        1 for t in self._traces if t.get("outcome") is not None
    )
    correct = sum(
        1 for t in self._traces
        if t.get("outcome_correct") is True
    )
    incorrect = sum(
        1 for t in self._traces
        if t.get("outcome_correct") is False
    )

    return {
        "total_decisions": total,
        "with_outcome": with_outcome,
        "correct": correct,
        "incorrect": incorrect,
        "accuracy": (
            correct / with_outcome if with_outcome > 0 else None
        ),
        "decision_types": self._count_by_type(),
    }

This summary is what gets committed to the semantic memory layer at session end. It is also what the drift detector reads to compute the session's contribution to the behavioral baseline. Every field serves the downstream analysis:

  • total_decisions feeds the decision count drift signal
  • accuracy feeds the outcome quality metric
  • decision_types shows whether the agent's task mix has changed
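The _count_by_type() helper referenced in the summary is not shown in the excerpt. A plausible implementation (written here as a module-level function over the trace list; on the class it would read self._traces) is a one-liner with collections.Counter:

```python
from collections import Counter

def count_by_type(traces: list[dict]) -> dict[str, int]:
    """Tally traces per decision_type.

    Sketch of the _count_by_type() helper referenced by
    get_session_summary(); the real method is not shown in the excerpt.
    """
    return dict(Counter(t["decision_type"] for t in traces))
```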

The Immutability Guarantee

The schema includes a trigger that prevents updating a trace once its outcome is recorded:

CREATE TRIGGER IF NOT EXISTS prevent_trace_update
BEFORE UPDATE ON reasoning_traces
WHEN OLD.outcome IS NOT NULL
BEGIN
    SELECT RAISE(ABORT, 'Trace outcome already recorded');
END;

This is not pedantry. Immutability is what makes the trace database useful for audits and post-mortems. If traces could be overwritten after the fact, the behavioral record would be unreliable. You need to be able to say: "This is what the agent decided, this is why it decided it, and this is what happened. None of those facts have been altered."

The same principle that makes append-only audit logs trustworthy in financial systems makes append-only trace databases trustworthy for agent performance reviews.
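The guarantee is easy to verify against an in-memory database. This standalone sketch reproduces a condensed version of the schema and trigger from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reasoning_traces (
    trace_id TEXT PRIMARY KEY,
    agent_id TEXT NOT NULL,
    decision_type TEXT NOT NULL,
    rationale TEXT NOT NULL,
    outcome TEXT,
    outcome_correct INTEGER,
    created_at TEXT NOT NULL
);
CREATE TRIGGER prevent_trace_update
BEFORE UPDATE ON reasoning_traces
WHEN OLD.outcome IS NOT NULL
BEGIN
    SELECT RAISE(ABORT, 'Trace outcome already recorded');
END;
""")
conn.execute(
    "INSERT INTO reasoning_traces VALUES (?, ?, ?, ?, NULL, NULL, ?)",
    ("t1", "a1", "signal_evaluation", "because", "2025-01-01T00:00:00+00:00"),
)
# The first outcome write succeeds: outcome was NULL, the trigger stays quiet.
conn.execute(
    "UPDATE reasoning_traces SET outcome='filled', outcome_correct=1 "
    "WHERE trace_id='t1'"
)
# A second write hits the trigger and aborts, leaving the row untouched.
try:
    conn.execute("UPDATE reasoning_traces SET outcome='changed' WHERE trace_id='t1'")
    tampered = True
except sqlite3.DatabaseError:
    tampered = False
```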

Instrumentation Strategy

Not every agent decision needs to be traced. Some operations are too granular and would produce noise. The principle is: trace decisions, not steps. One trace per judgment, however many computations feed it.

A signal evaluation pipeline might have 15 computational steps — normalize data, apply filters, compute scores, combine outputs. One trace. One decision_type of signal_evaluation. One rationale that explains the conclusion and key factors. One confidence score.

What to trace:

  • Judgment calls — cases where the agent could have gone multiple ways
  • Escalation decisions — when the agent decides to surface something to a manager
  • High-stakes actions — anything with irreversible consequences
  • Classification outputs — when the agent categorizes or routes based on content

What not to trace:

  • Pure computation — mathematical transformations with no judgment involved
  • Data retrieval — fetching records from a database
  • Formatting operations — converting between output formats

The session summary metrics — decisions per session, average confidence, escalation rate — are meaningful only if you are tracing consistent decision points across sessions. If your tracing is inconsistent, your baselines are meaningless.

What the Data Enables

After 20 sessions of consistent tracing, the behavioral picture becomes clear:

# A session summary that feeds the drift detector
{
    "total_decisions": 14,
    "with_outcome": 8,
    "correct": 7,
    "incorrect": 1,
    "accuracy": 0.875,
    "decision_types": {
        "signal_evaluation": 8,
        "escalation_decision": 3,
        "position_sizing": 3
    }
}

The drift detector compares these numbers against the baseline. Confidence lower than usual? Watch level. Decision count half of normal? Alert level. Escalation rate doubled? Investigate.
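Sketched as code, that comparison might look like this. The threshold values, the avg_confidence field, and the level names are illustrative assumptions, not the Principal Broker's actual detector:

```python
def drift_level(summary: dict, baseline: dict) -> list[tuple[str, str]]:
    """Compare one session summary against the per-agent baseline.

    Illustrative sketch: thresholds are hard-coded here, where a real
    detector would derive them statistically from the baseline window.
    """
    findings = []
    if summary["total_decisions"] < 0.5 * baseline["avg_decisions"]:
        findings.append(("alert", "decision count under half of normal"))
    avg_conf = summary.get("avg_confidence")
    if avg_conf is not None and avg_conf < baseline["avg_confidence"] - 0.1:
        findings.append(("watch", "confidence lower than usual"))
    escalations = summary["decision_types"].get("escalation_decision", 0)
    if baseline["avg_escalations"] > 0 and escalations >= 2 * baseline["avg_escalations"]:
        findings.append(("investigate", "escalation rate doubled"))
    return findings
```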

The replay builder queries traces by session or correlation_id. The 1:1 protocol pulls incorrect traces to generate targeted questions. Every downstream behavioral analysis tool in the Principal Broker depends on this data being collected consistently, from session one.

Lesson 211 Drill

Take the most important decision your agent makes in a typical session. Write the decision() call for it: what is the decision_type? What does a good rationale look like? What fields belong in inputs? How would you determine correct when recording the outcome?

If you can answer those four questions for your most important decision type, the rest of the tracing schema follows naturally.