Decision Replay: Reconstructing What Happened and Why
When something goes wrong in a multi-agent system, the hardest question is not 'what happened' — your logs can answer that. The hard question is 'why did that decision get made at that moment?' Decision replay reconstructs causal chains from traces and audit entries, turning scattered records into a timeline you can actually reason about.
A trading agent made a position sizing decision that was 40% larger than normal. The position moved against the portfolio. You want to understand why. You have logs. You have traces. You have audit entries. But none of them individually tells you the story — each tells you a fragment. Decision replay is how you assemble those fragments into a single causal chain.
The Replay Mental Model
Every significant event in a multi-agent system has a causal chain. A trade has a chain: signal received → signal evaluated → position sized → execution approved → order placed → outcome recorded. Each step has a reasoning trace. The steps are connected by a correlation_id that was threaded through the entire sequence.
Decision replay is the process of pulling that chain together from the trace database and audit log. The ReplayBuilder class is the tool that does it.
from dataclasses import dataclass, field

@dataclass
class ReplayChain:
    """A reconstructed causal chain."""

    chain_type: str  # trade | session | directive | incident
    chain_id: str
    events: list[dict] = field(default_factory=list)
    traces: list[dict] = field(default_factory=list)
    audit_entries: list[dict] = field(default_factory=list)
A ReplayChain is the output. It holds the chain type (what kind of event this was), the chain ID (the specific event you are investigating), and the reconstructed data: events, traces, and audit entries in chronological order.
class ReplayBuilder:
    """Builds replay chains from traces and audit log data."""

    def __init__(self, tracer, audit_log):
        self.tracer = tracer
        self.audit_log = audit_log
The builder takes two dependencies: the AgentTracer (for querying reasoning traces) and the audit_log (for querying immutable audit entries). These two data sources capture different layers of the same events: traces capture agent-internal reasoning, audit entries capture inter-agent messages and system actions.
The Four Chain Types
Trade Chains
def build_trade_chain(self, correlation_id: str) -> ReplayChain:
    """Build causal chain for a trade: signal → eval → exec → outcome."""
    chain = ReplayChain(chain_type="trade", chain_id=correlation_id)
    # Get audit entries for this correlation
    audit_entries = self.audit_log.query(correlation_id=correlation_id)
    chain.audit_entries = sorted(
        audit_entries, key=lambda e: e.get("created_at", "")
    )
    # Get reasoning traces filtered by correlation_id at DB level
    chain.traces = self.tracer.query_traces(
        correlation_id=correlation_id
    )
    return chain
A trade chain is anchored to a correlation_id — a UUID generated at the moment a trade signal is identified and threaded through every subsequent message and trace until the position is closed.
The trade chain contains two categories of evidence:
Audit entries — the record of every message exchanged between agents about this trade. The signal router sent the signal to the evaluation agent. The evaluation agent sent the evaluation result to the position sizing agent. The position sizing agent requested VP approval. All of these are in the audit log, sortable by timestamp.
Reasoning traces — the internal reasoning of each agent at each step. Why did the evaluation agent assign 0.71 confidence to this signal? What inputs did the position sizing agent have when it sized the position 40% larger than normal?
Together, these two data sources answer the post-mortem question. The audit entries show you the sequence of decisions. The traces show you the reasoning behind each decision.
The sorted(audit_entries, key=lambda e: e.get("created_at", "")) call is critical. In a multi-agent system, audit entries arrive from different agents that may have slight clock skew or processing delays. The audit log stores them in insertion order, which is not necessarily chronological order. Sorting by created_at reconstructs the actual temporal sequence.
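A minimal illustration of the insertion-order versus chronological-order distinction — the entry fields here are illustrative, but the sort key mirrors the builder's `e.get("created_at", "")` usage:

```python
# Entries appear in audit-log insertion order, which differs from the
# order the events actually happened in (the sizing agent's entry landed
# first because of a processing delay elsewhere).
audit_entries = [
    {"action": "position_sized", "created_at": "2024-05-01T09:00:03Z"},
    {"action": "signal_routed", "created_at": "2024-05-01T09:00:01Z"},
    {"action": "signal_evaluated", "created_at": "2024-05-01T09:00:02Z"},
]

# ISO-8601 UTC timestamps sort correctly as plain strings, so the sorted()
# call reconstructs the temporal sequence: routed -> evaluated -> sized.
timeline = sorted(audit_entries, key=lambda e: e.get("created_at", ""))
```

The empty-string default means entries missing a timestamp sort to the front rather than raising, which keeps a partially instrumented chain readable.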
Session Chains
def build_session_chain(self, session_id: str) -> ReplayChain:
    """Build chain for an entire session."""
    chain = ReplayChain(chain_type="session", chain_id=session_id)
    chain.traces = self.tracer.query_traces(session_id=session_id)
    return chain
Session chains are the simplest form of replay. All traces from a single agent session, ordered by creation time. This is the primary tool for reviewing a specific session's behavioral data — used in the 1:1 review to examine the decisions that drove an unusual drift score.
Note that session chains contain only traces — no audit entries. The session is contained within one agent, so the inter-agent message layer is not relevant. The reasoning layer is the entire story.
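Because session chains depend on only one data source, the builder can be exercised entirely in memory. A hedged sketch — `StubTracer` is hypothetical, standing in for the real AgentTracer, which is assumed to expose `query_traces(session_id=...)`:

```python
from dataclasses import dataclass, field

@dataclass
class ReplayChain:
    """A reconstructed causal chain (as defined earlier)."""
    chain_type: str
    chain_id: str
    events: list = field(default_factory=list)
    traces: list = field(default_factory=list)
    audit_entries: list = field(default_factory=list)

class StubTracer:
    """Hypothetical in-memory stand-in for the AgentTracer."""
    def __init__(self, traces):
        self._traces = traces

    def query_traces(self, session_id=None, **kwargs):
        return [t for t in self._traces if t.get("session_id") == session_id]

def build_session_chain(tracer, session_id):
    chain = ReplayChain(chain_type="session", chain_id=session_id)
    chain.traces = tracer.query_traces(session_id=session_id)
    return chain

tracer = StubTracer([
    {"session_id": "s-1", "decision_type": "signal_evaluation"},
    {"session_id": "s-2", "decision_type": "position_sizing"},
])
chain = build_session_chain(tracer, "s-1")  # only s-1's trace is returned
```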
Directive Chains
def build_directive_chain(self, directive_id: str) -> ReplayChain:
    """Build causal chain for a directive: issuance → execution → decisions → outcome."""
    chain = ReplayChain(chain_type="directive", chain_id=directive_id)
    # Audit entries where this directive appears as the correlation id
    audit_entries = self.audit_log.query(
        correlation_id=directive_id, limit=500
    )
    chain.audit_entries = sorted(
        audit_entries, key=lambda e: e.get("created_at", "")
    )
    # Reasoning traces tied to directive execution
    chain.traces = self.tracer.query_traces(
        correlation_id=directive_id, limit=500
    )
    return chain
A directive chain reconstructs the full lifecycle of a specific directive: how it was issued, what messages it generated as the agent worked on it, what reasoning decisions it produced, and eventually how it was completed or escalated.
The limit=500 is larger than the default 100 because directives can span many sessions and generate significant trace volume. A long-running directive might produce hundreds of traces across multiple work sessions.
Directive chains are particularly useful for governance questions: "Did the agent actually complete this directive?" and "How long did it take?" and "What obstacles did it encounter?" The chain provides the evidence to answer each question.
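The "how long did it take?" question can be answered directly from the chain's audit entries. A minimal sketch, assuming `created_at` is an ISO-8601 UTC timestamp as in the sorting discussion above:

```python
from datetime import datetime

def directive_duration_seconds(audit_entries):
    """Elapsed seconds between the first and last audit entry in the chain."""
    if len(audit_entries) < 2:
        return 0.0
    stamps = sorted(
        # .replace() keeps the Z-suffix parseable on Pythons before 3.11
        datetime.fromisoformat(e["created_at"].replace("Z", "+00:00"))
        for e in audit_entries
    )
    return (stamps[-1] - stamps[0]).total_seconds()

entries = [
    {"created_at": "2024-05-01T09:00:00Z"},
    {"created_at": "2024-05-01T09:45:30Z"},
]
duration = directive_duration_seconds(entries)  # 45 minutes 30 seconds
```

This measures wall-clock span between the first and last recorded entry, which is an upper bound on active work time, not a measure of effort.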
Incident Chains
def build_incident_chain(self, incident_id: str) -> ReplayChain:
    """Build cross-agent timeline for an incident."""
    chain = ReplayChain(chain_type="incident", chain_id=incident_id)
    chain.audit_entries = self.audit_log.query(
        correlation_id=incident_id, limit=500
    )
    return chain
Incident chains are the most important and the most austere. They contain only audit entries — the inter-agent message layer — because incidents are cross-agent events that are best understood through the communication timeline rather than any single agent's reasoning.
When a kill switch fires, when an agent escalates an emergency to the VP, when a safety breach is detected — these events generate audit entries from multiple agents across the system. The incident chain aggregates all of them into a single timeline.
When a post-mortem asks "why did the system respond this way?", the incident chain provides the evidence.
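A post-mortem usually starts by rendering that timeline as text. A hedged sketch — the `agent_id` and `action` field names are assumptions about the audit entry schema, not confirmed by the source:

```python
def format_incident_timeline(audit_entries):
    """Render a cross-agent incident timeline, one line per entry, oldest first."""
    ordered = sorted(audit_entries, key=lambda e: e.get("created_at", ""))
    return "\n".join(
        f'{e["created_at"]}  {e["agent_id"]}  {e["action"]}' for e in ordered
    )

entries = [
    {"created_at": "2024-05-01T09:00:02Z", "agent_id": "vp", "action": "kill_switch_ack"},
    {"created_at": "2024-05-01T09:00:01Z", "agent_id": "risk", "action": "kill_switch_fired"},
]
timeline = format_incident_timeline(entries)
```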
Threading correlation_id Through a Trade
The replay system only works if correlation_id is correctly threaded through the entire causal chain. This is an instrumentation discipline problem, not a framework problem. The ReplayBuilder can only surface data that was recorded with consistent correlation IDs.
Here is the instrumentation pattern for a trade lifecycle:
import uuid

# When the trade signal is identified
trade_correlation_id = str(uuid.uuid4())

# Signal evaluation agent records its decision
tracer.decision(
    agent_id="foresight",
    decision_type="signal_evaluation",
    rationale=f"Signal {signal_id} evaluated: high confidence bullish bias on {market}",
    confidence=0.74,
    inputs={"signal_id": signal_id, "market": market, "factors": factor_count},
    session_id=session_id,
    correlation_id=trade_correlation_id,  # <-- the thread
)

# Position sizing agent, same correlation_id
tracer.decision(
    agent_id="foresight",
    decision_type="position_sizing",
    rationale=f"Sizing at 1.4x normal based on elevated conviction from {n_confirming_signals} confirming signals",
    confidence=0.68,
    inputs={"base_size": base_size, "conviction_multiplier": 1.4},
    session_id=session_id,
    correlation_id=trade_correlation_id,  # <-- same thread
)
Both traces share the same correlation_id. When the post-mortem runs build_trade_chain(trade_correlation_id), it retrieves both — plus all audit entries from any agent that handled messages containing that correlation ID. The full causal picture assembles automatically from properly threaded data.
A Post-Mortem Walkthrough
Consider the scenario from the opening: a position sized 40% larger than normal. The post-mortem workflow:
Step 1: Identify the trade's correlation ID from the outcome record in the trading database.
Step 2: Build the trade chain.
chain = replay_builder.build_trade_chain(correlation_id=trade_id)
Step 3: Read the audit entries in chronological order. This shows you the full message sequence: when the signal arrived, who evaluated it, what the evaluation said, when position sizing was invoked, whether VP approval was requested.
Step 4: Read the traces for the position sizing decision specifically. The decision_type == "position_sizing" trace will contain the rationale and inputs that explain the 1.4x multiplier. Was there a legitimate conviction_multiplier input? Were the confirming signals actually valid?
Step 5: Check the confidence on the position sizing trace. Was the agent highly confident when it sized large, or was it uncertain? High confidence on an incorrect large position suggests a calibration problem. Low confidence on the same decision suggests the agent knew it was pushing limits.
Step 6: Look at the outcome trace, if recorded. Was outcome_correct set to False? Did the record_outcome call happen?
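Steps 4 and 5 amount to filtering the chain's traces by decision type and reading the fields recorded by `tracer.decision`. A minimal sketch over plain dicts, with trace fields following the instrumentation example above:

```python
def find_decision(traces, decision_type):
    """Return the first trace matching a decision_type, or None if absent."""
    return next(
        (t for t in traces if t.get("decision_type") == decision_type), None
    )

# Traces as a trade chain would hold them, in chronological order
traces = [
    {"decision_type": "signal_evaluation", "confidence": 0.74},
    {"decision_type": "position_sizing", "confidence": 0.68,
     "inputs": {"base_size": 100, "conviction_multiplier": 1.4}},
]

sizing = find_decision(traces, "position_sizing")
# sizing["inputs"] now answers "was there a legitimate conviction_multiplier?"
# and sizing["confidence"] answers "how sure was the agent when it sized large?"
```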
What Replay Enables Beyond Post-Mortems
Post-mortem analysis is the most obvious use of replay. But the same infrastructure enables several other important operations:
Training data generation. Traces with outcome_correct = False are examples of incorrect reasoning. Traces with outcome_correct = True are examples of correct reasoning. These are gold-standard training data for prompt improvement — the rationale field shows you what the agent said, the outcome field shows you whether that reasoning produced the right result.
Pattern analysis. Querying for all traces with decision_type = "position_sizing" and grouping by confidence versus outcome correctness reveals calibration quality. Are high-confidence sizing decisions actually more correct than low-confidence ones? If not, confidence calibration needs work.
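That grouping can be sketched in a few lines. The `confidence` and `outcome_correct` field names follow the document; the 0.1-wide buckets are an illustrative choice:

```python
from collections import defaultdict

def calibration_table(traces):
    """Map each 0.1-wide confidence bucket -> [correct, total] counts."""
    buckets = defaultdict(lambda: [0, 0])
    for t in traces:
        if t.get("outcome_correct") is None:
            continue  # outcome not yet recorded; exclude from calibration
        bucket = round(int(t["confidence"] * 10) / 10, 1)
        buckets[bucket][1] += 1
        if t["outcome_correct"]:
            buckets[bucket][0] += 1
    return dict(buckets)

traces = [
    {"confidence": 0.85, "outcome_correct": True},
    {"confidence": 0.82, "outcome_correct": False},
    {"confidence": 0.45, "outcome_correct": True},
    {"confidence": 0.50, "outcome_correct": None},  # skipped: no outcome
]
table = calibration_table(traces)
```

If the correct/total ratio does not rise with the bucket, high confidence is not predicting correctness and calibration needs work.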
Compliance documentation. For any decision that requires documented rationale — a large position, an escalation to Knox approval, a safety override — the trace provides the immutable record of what reasoning produced that decision. The outcome IS NULL check in the immutability trigger ensures the record cannot be revised after the fact.
1:1 preparation. The OneOnOneProtocol (Lesson 216) pulls traces with incorrect outcomes to generate targeted review questions. The trace rationale becomes the basis for "here is a decision that produced a wrong outcome — what do you see now that you did not see then?"
The Correlation ID Discipline
Replay is only as useful as your correlation ID threading. The most common failure mode: correlation IDs generated correctly but not passed through to all downstream operations.
A checklist for ensuring correct threading:
- Generate correlation_id once per trade/directive/incident at the initiating agent
- Pass it explicitly as an argument to every downstream function that makes a traced decision
- Include it in every A2A message related to this causal chain
- Never reuse correlation IDs across unrelated events
The ReplayBuilder has no way to verify that a correlation chain is complete. It returns what was recorded with that correlation ID. If a step in the chain was not instrumented with the shared ID, it is invisible to the replay — which is worse than having no replay, because you will draw conclusions from an incomplete picture.
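A caller can at least check a reconstructed chain against the decision types the lifecycle is expected to contain. A hedged sketch — the expected set here is illustrative, following the signal → eval → sizing sequence described above:

```python
# Illustrative: the decision types a complete trade chain should contain
EXPECTED_TRADE_DECISIONS = {"signal_evaluation", "position_sizing"}

def missing_decisions(traces, expected=frozenset(EXPECTED_TRADE_DECISIONS)):
    """Return expected decision types that have no trace in the chain."""
    seen = {t.get("decision_type") for t in traces}
    return expected - seen

# A chain where the sizing step was never instrumented with the shared ID
traces = [{"decision_type": "signal_evaluation"}]
gaps = missing_decisions(traces)  # flags the uninstrumented step
```

A non-empty result does not prove the step never happened, only that it left no trace under this correlation ID, which is exactly the incomplete-picture risk described above.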
The instrumentation discipline — threading correlation IDs, recording rationale, capturing outcome correctness — is the sweating in peace. When you need replay for a post-mortem, you will be grateful for every trace that was recorded with the right correlation ID.
Lesson 215 Drill
Pick one decision your most important agent made last week. Trace its correlation ID through every system it touched: which agents handled messages with that ID? What traces exist for that correlation? What does the rationale field say?
If you cannot reconstruct the causal chain, you have found your instrumentation gap. Close it this week — before the next significant event that requires a post-mortem.