ASK KNOX
beta
LESSON 214

Goal Alignment: Is Your Agent Actually Working on What You Assigned?

Drift detection measures behavioral consistency. Goal alignment asks a different question: is the agent's work actually connected to the directives you assigned? An agent can behave perfectly consistently while working on entirely the wrong things. Alignment scoring catches that.

11 min read · Behavioral Observability for AI Agents

A drill sergeant can run the same route every morning for 30 days with perfect consistency. That behavior is completely stable, completely predictable, and completely off-mission if the orders were to prepare for mountain terrain. Behavioral consistency and mission alignment are different properties.

This is the gap that goal alignment detection fills. Drift detection tells you that the agent's behavior has changed. Goal alignment tells you whether the agent is working on what you actually assigned.

The Alignment Problem

When an agent receives directives — "analyze competitor pricing in the APAC region," "generate three content briefs for the Tesseract Intelligence product launch," "monitor for position risk above the 0.35 threshold" — those directives represent the manager's intent. The session summary represents what the agent actually did.

These two things should overlap. If they do not, something is wrong:

  • The agent may have misunderstood its directives
  • The directives may have been poorly specified
  • The agent may have been hijacked by a different task during the session
  • The agent may have completed its directives and moved on to unauthorized work

All of these are important signals for the manager to investigate. But none of them show up in a drift score, because drift measures behavioral consistency from historical baseline — not alignment with current assignments.

The check_goal_alignment() Method

def check_goal_alignment(
    self,
    agent_id: str,
    session_summary: dict,
    active_directives: list[dict],
) -> dict:
    """Check if a session's work aligns with active directives."""

The method takes three inputs: the agent's identity, the session summary from the SDKTracer, and the list of active directives for this agent. Directives are structured objects with a command field containing the natural language instruction.

Edge Case Handling

The method handles missing data gracefully:

if not session_summary or not active_directives:
    return {
        "aligned": False,
        "score": 0.0,
        "matched_directives": [],
        "unmatched_work": [],
        "recommendation": (
            "No active directives assigned — session work cannot be evaluated"
            if not active_directives
            else "Empty session summary — no work to evaluate"
        ),
    }

Two distinct cases, two distinct messages. An agent with no active directives is not a drifting agent — it is an unassigned agent whose work cannot be evaluated. The distinction matters for the 1:1 review: an unassigned agent needs to receive directives, not a performance improvement plan.

Term Extraction

# Collect session output terms from key_decisions and outcomes
session_terms: set[str] = set()
for decision in session_summary.get("key_decisions", []):
    session_terms.update(str(decision).lower().split())
for outcome in session_summary.get("outcomes", []):
    session_terms.update(str(outcome).lower().split())

The alignment check is a keyword-overlap algorithm. It extracts terms from two session summary fields:

key_decisions — the agent's major decision points during the session, expressed in natural language by the agent itself when committing to memory.

outcomes — the results produced by the session. What did the agent actually deliver?

Combining both gives a comprehensive vocabulary of what the session worked on. Splitting on whitespace and converting to lowercase normalizes the terms for comparison.
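Run against a toy summary, the extraction produces a single normalized term set (the summary values below are illustrative, not from a real session):

```python
# Toy session summary — values are illustrative, not from a real agent session.
session_summary = {
    "key_decisions": ["Selected competitor pricing sources for APAC"],
    "outcomes": ["Pricing analysis delivered"],
}

# Same extraction as in the method: lowercase, whitespace-split, deduplicated.
session_terms: set[str] = set()
for decision in session_summary.get("key_decisions", []):
    session_terms.update(str(decision).lower().split())
for outcome in session_summary.get("outcomes", []):
    session_terms.update(str(outcome).lower().split())

# "Pricing" from the outcome and "pricing" from the decision collapse
# into one term because of the lowercase normalization.
print(sorted(session_terms))
```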

Stop Word Filtering

_STOP = {"a", "an", "the", "to", "for", "and", "or", "in", "on",
         "with", "of", "is", "are", "be", "by", "at"}
directive_terms -= _STOP
session_terms_clean = session_terms - _STOP

Stop words appear in every sentence. Without filtering, any directive with "the" or "and" would match any session that contains those words — which is every session. The match would be meaningless noise.

Stop word filtering ensures that only content words — "analyze," "pricing," "APAC," "risk," "threshold" — participate in the alignment comparison. These are the words that carry semantic meaning about what the agent was actually working on.
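The effect is easy to see on a single command (a sketch using the same `_STOP` set; the directive text is a variant of the running example):

```python
_STOP = {"a", "an", "the", "to", "for", "and", "or", "in", "on",
         "with", "of", "is", "are", "be", "by", "at"}

command = "Analyze competitor pricing in the APAC region"
directive_terms = set(command.lower().split()) - _STOP

# Only content words survive; "in" and "the" are filtered out.
print(sorted(directive_terms))  # → ['analyze', 'apac', 'competitor', 'pricing', 'region']
```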

Directive Matching

matched_directives: list[dict] = []
session_terms_clean = session_terms - _STOP
for directive in active_directives:
    command = directive.get("command", "")
    directive_terms = set(command.lower().split())
    directive_terms -= _STOP
    if directive_terms and (directive_terms & session_terms_clean):
        matched_directives.append(directive)

A directive is "matched" if any of its content terms appear in the session's content terms. This is a generous matching criterion — it requires only one overlapping content word. The rationale: directive commands are typically specific enough that their content words will not appear in unrelated session work.

A directive like "analyze competitor pricing in APAC" has content terms: {analyze, competitor, pricing, apac}. If the session summary contains "pricing" or "APAC" or "competitor" in its decisions or outcomes, the directive is matched. A session that produced content briefs instead would not contain any of those terms.
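As a runnable check of that claim (the session terms mirror the APAC example; the second directive is there to show a non-match):

```python
_STOP = {"a", "an", "the", "to", "for", "and", "or", "in", "on",
         "with", "of", "is", "are", "be", "by", "at"}

session_terms = {"selected", "competitor", "pricing", "sources", "apac", "data"}
active_directives = [
    {"command": "Analyze competitor pricing in APAC region"},
    {"command": "Generate content briefs for Tesseract Intelligence launch"},
]

matched_directives = []
session_terms_clean = session_terms - _STOP
for directive in active_directives:
    directive_terms = set(directive.get("command", "").lower().split()) - _STOP
    # One overlapping content word is enough to count as a match.
    if directive_terms and (directive_terms & session_terms_clean):
        matched_directives.append(directive)

print(len(matched_directives))  # → 1: only the pricing directive overlaps
```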

Unmatched Work Detection

# Collect unmatched work items (stop words filtered, as in directive matching)
all_directive_terms: set[str] = set()
for d in active_directives:
    all_directive_terms.update(d.get("command", "").lower().split())
all_directive_terms -= _STOP
for outcome in session_summary.get("outcomes", []):
    outcome_terms = set(str(outcome).lower().split()) - _STOP
    if not (outcome_terms & all_directive_terms):
        unmatched_work.append(str(outcome))

Unmatched work is the complement: session outcomes whose content terms do not appear in any active directive. This is the "the agent did something we did not ask for" detector.

Unmatched work is not automatically a problem — it may represent useful initiative, necessary context-gathering, or completion of a task that a poorly-worded directive did not capture. But it is worth surfacing in the 1:1 review. "You produced these outcomes that do not appear in your current directives — what motivated this work?" is a useful management question.
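A runnable sketch of the complement check (the sample outcomes are illustrative; stop words are filtered, consistent with the matching step):

```python
_STOP = {"a", "an", "the", "to", "for", "and", "or", "in", "on",
         "with", "of", "is", "are", "be", "by", "at"}

active_directives = [{"command": "Analyze competitor pricing in APAC region"}]
outcomes = [
    "Competitor pricing analysis for APAC",        # overlaps: competitor, pricing, apac
    "Drafted onboarding checklist for new hires",  # no overlap — surfaces as unmatched
]

all_directive_terms: set[str] = set()
for d in active_directives:
    all_directive_terms.update(d.get("command", "").lower().split())
all_directive_terms -= _STOP

unmatched_work = [
    o for o in outcomes
    if not (set(o.lower().split()) - _STOP) & all_directive_terms
]
print(unmatched_work)  # → ['Drafted onboarding checklist for new hires']
```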

Alignment Score and Recommendation

score = len(matched_directives) / len(active_directives)
aligned = score >= 0.5

if score == 0.0:
    recommendation = (
        f"No directives matched for {agent_id} — "
        "session work may be off-mission"
    )
elif score < 0.5:
    recommendation = (
        f"Partial alignment for {agent_id} — "
        "review unmatched work in next 1:1"
    )
elif score < 1.0:
    recommendation = (
        f"Good alignment for {agent_id} — "
        "some directives not yet addressed"
    )
else:
    recommendation = f"Full alignment for {agent_id} — all directives addressed"
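The tier boundaries can be exercised directly. Below is a standalone sketch of the same branching, driven by a few matched/total pairs (the agent name is illustrative):

```python
def recommend(agent_id: str, score: float) -> str:
    """Standalone sketch of the recommendation tiers shown above."""
    if score == 0.0:
        return f"No directives matched for {agent_id} — session work may be off-mission"
    if score < 0.5:
        return f"Partial alignment for {agent_id} — review unmatched work in next 1:1"
    if score < 1.0:
        return f"Good alignment for {agent_id} — some directives not yet addressed"
    return f"Full alignment for {agent_id} — all directives addressed"

for matched, total in [(0, 3), (1, 3), (2, 3), (3, 3)]:
    score = matched / total
    # aligned flips from False to True once the score reaches 0.5.
    print(f"{score:.2f} aligned={score >= 0.5}: {recommend('agent-7', score)}")
```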

What a Full Alignment Check Looks Like

Suppose an agent has three active directives:

active_directives = [
    {"command": "Analyze competitor pricing in APAC region"},
    {"command": "Generate content briefs for Tesseract Intelligence launch"},
    {"command": "Monitor for position risk above threshold"},
]

And the session summary contains:

session_summary = {
    "key_decisions": [
        "Selected top 3 competitor pricing sources for APAC data",
        "Identified pricing gap in enterprise tier"
    ],
    "outcomes": [
        "Competitor pricing analysis for APAC: three vendors identified",
        "Escalated unusual pricing pattern to VP Product"
    ]
}

The session terms include: competitor, pricing, apac, sources, data, enterprise, tier, analysis, vendors, escalated, unusual, pattern, product.

Directive 1 (analyze competitor pricing APAC): Terms competitor, pricing, apac all appear in session terms. Matched.

Directive 2 (generate content briefs Tesseract Intelligence launch): None of these terms appear in session terms. Not matched.

Directive 3 (monitor position risk threshold): None of these terms appear in session terms. Not matched.

Result: score = 1/3 ≈ 0.33, aligned = False, recommendation = "Partial alignment — review unmatched work in next 1:1".

The unmatched work check comes back empty: both outcomes share the term "pricing" with directive 1, so neither is flagged. Notice how generous that is. "Escalated unusual pricing pattern to VP Product" passes on a single shared word, even though a manager might treat the escalation as a separate piece of work. This is the flip side of term-based matching: one common word can make tangential work look covered, just as different vocabulary can make genuinely aligned work look unmatched.

The 0.33 score and the "partial alignment" recommendation do not mean the agent failed. They mean the manager should discuss directives 2 and 3 in the next 1:1: were they completed but not surfaced? Are they in progress? Were they not understood?
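The full walkthrough can be reproduced with a standalone sketch assembled from the fragments in this lesson (stop words filtered in every comparison; the agent id is illustrative):

```python
_STOP = {"a", "an", "the", "to", "for", "and", "or", "in", "on",
         "with", "of", "is", "are", "be", "by", "at"}

def check_goal_alignment(agent_id: str, session_summary: dict,
                         active_directives: list[dict]) -> dict:
    """Standalone sketch of the method, assembled from the lesson's fragments."""
    if not session_summary or not active_directives:
        return {
            "aligned": False, "score": 0.0,
            "matched_directives": [], "unmatched_work": [],
            "recommendation": (
                "No active directives assigned — session work cannot be evaluated"
                if not active_directives
                else "Empty session summary — no work to evaluate"
            ),
        }

    # Term extraction: decisions + outcomes, lowercased and whitespace-split.
    session_terms: set[str] = set()
    for decision in session_summary.get("key_decisions", []):
        session_terms.update(str(decision).lower().split())
    for outcome in session_summary.get("outcomes", []):
        session_terms.update(str(outcome).lower().split())
    session_terms_clean = session_terms - _STOP

    # Directive matching: one overlapping content word counts.
    matched_directives = [
        d for d in active_directives
        if (set(d.get("command", "").lower().split()) - _STOP) & session_terms_clean
    ]

    # Unmatched work: outcomes sharing no content term with any directive.
    all_directive_terms: set[str] = set()
    for d in active_directives:
        all_directive_terms.update(d.get("command", "").lower().split())
    all_directive_terms -= _STOP
    unmatched_work = [
        str(o) for o in session_summary.get("outcomes", [])
        if not (set(str(o).lower().split()) - _STOP) & all_directive_terms
    ]

    score = len(matched_directives) / len(active_directives)
    if score == 0.0:
        rec = f"No directives matched for {agent_id} — session work may be off-mission"
    elif score < 0.5:
        rec = f"Partial alignment for {agent_id} — review unmatched work in next 1:1"
    elif score < 1.0:
        rec = f"Good alignment for {agent_id} — some directives not yet addressed"
    else:
        rec = f"Full alignment for {agent_id} — all directives addressed"

    return {"aligned": score >= 0.5, "score": score,
            "matched_directives": matched_directives,
            "unmatched_work": unmatched_work, "recommendation": rec}

result = check_goal_alignment(
    "apac-analyst",  # illustrative agent id
    {
        "key_decisions": [
            "Selected top 3 competitor pricing sources for APAC data",
            "Identified pricing gap in enterprise tier",
        ],
        "outcomes": [
            "Competitor pricing analysis for APAC: three vendors identified",
            "Escalated unusual pricing pattern to VP Product",
        ],
    },
    [
        {"command": "Analyze competitor pricing in APAC region"},
        {"command": "Generate content briefs for Tesseract Intelligence launch"},
        {"command": "Monitor for position risk above threshold"},
    ],
)
print(round(result["score"], 2), result["aligned"])  # → 0.33 False
```

Running the sketch reproduces the 1/3 score and the "partial alignment" recommendation. `unmatched_work` comes back empty, because every outcome shares at least the term "pricing" with directive 1.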

Limitations and Honest Tradeoffs

The keyword-overlap approach to alignment detection is deliberately simple. It has known limitations:

Semantic equivalence. "Monitor" and "observe" are semantically equivalent but will not match. "Analyze" and "examine" will not match. An agent that correctly completed a directive but used different vocabulary in its session summary may score as unaligned.

Acronym handling. "APAC" and "Asia-Pacific" will not match. Domain-specific abbreviations need to be in both the directive and the session summary, or neither.

Directive granularity. A directive like "improve agent performance" has content terms {improve, agent, performance} that are so general they will match almost any technical session. Alignment detection is only as precise as the directives are specific.
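Both failure modes are easy to demonstrate. The helper below is hypothetical, not part of the method; it just applies the same overlap test to a directive and a line of session text:

```python
_STOP = {"a", "an", "the", "to", "for", "and", "or", "in", "on",
         "with", "of", "is", "are", "be", "by", "at"}

def overlaps(command: str, session_text: str) -> bool:
    """Hypothetical helper: do a directive and session text share a content word?"""
    directive_terms = set(command.lower().split()) - _STOP
    session_terms = set(session_text.lower().split()) - _STOP
    return bool(directive_terms & session_terms)

# Semantic equivalence: same work, different vocabulary — no match.
print(overlaps("Monitor position risk above threshold",
               "Observed exposure levels and flagged anomalies"))  # → False

# Directive granularity: a generic term matches almost any technical session.
print(overlaps("Improve agent performance",
               "Refactored the agent scheduling loop"))  # → True ("agent" overlaps)
```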

The recommended response to these limitations is not to build a more sophisticated NLP system. It is to write better directives. Specific, action-oriented directives with distinctive vocabulary produce reliable alignment scores. Vague directives produce noisy ones.

Similarly: alignment scores are approximate, but alignment checking is indispensable. The value is not in the precision of the number — it is in the practice of regularly asking whether the agent's work is connected to the mission it was given.

Integrating Alignment with Drift

Goal alignment is a complement to drift detection, not a replacement. A complete behavioral picture of an agent session includes both:

  • Drift score: Is the agent behaving consistently with its established patterns?
  • Alignment score: Is the agent working on what it was assigned?

The 1:1 protocol (Lesson 216) consumes both signals. An agent with high drift and low alignment is a serious concern — it is behaving unusually while working on the wrong things. An agent with normal drift and low alignment may have excellent execution skills applied to the wrong problem. Each pattern calls for different questions.
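The four combinations can be sketched as a small triage helper. The boolean inputs and the posture labels are illustrative, not part of the lesson's API:

```python
def review_posture(drift_high: bool, aligned: bool) -> str:
    """Map the two behavioral signals to a 1:1 review posture (labels illustrative)."""
    if drift_high and not aligned:
        return "serious concern: unusual behavior applied to the wrong work"
    if drift_high and aligned:
        return "investigate: right work, but the approach has changed"
    if not drift_high and not aligned:
        return "redirect: good execution applied to the wrong problem"
    return "healthy: consistent behavior on assigned work"

print(review_posture(drift_high=True, aligned=False))
print(review_posture(drift_high=False, aligned=False))
```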

Lesson 214 Drill

Write three directives for one of your agents using specific, actionable language with distinctive vocabulary. Then review that agent's last session summary. Do the terms from your directives appear in the session's key decisions and outcomes? If not, trace the mismatch: was the agent working on something different, or did the vocabulary mismatch make aligned work look unaligned?

The drill builds two skills simultaneously: writing directives that produce clean alignment signals, and reading session summaries with the alignment question in mind.