The $200 Weekend Problem — Why AI Agents Need FinOps
Uncapped LLM spend is not a billing quirk — it is a system design flaw. Real war stories from runaway agent spend, and why FinOps is the first layer every agent platform needs before it goes live.
The first time you build a multi-agent system and leave it running overnight, you will either get lucky or you will open your billing dashboard to a number that makes your stomach drop.
This lesson is about why that happens and how to prevent it.
The Anatomy of Runaway Spend
Agent spend goes wrong in predictable ways. Not random ways — predictable ones. Understanding the failure modes is the first step toward designing a system that survives contact with production.
The Loop of Death. An agent enters a reasoning loop where it cannot make progress. It calls the LLM to try again, gets a similar response, calls again. This can run for hundreds of turns before anyone notices. If each turn costs $0.05, a 200-turn loop costs $10. If five agents hit this simultaneously, that's $50 while you sleep.
The Wrong Model. A developer tests with claude-haiku (cheap, $0.25/MTok input) and deploys with claude-opus ($15.00/MTok input) — a 60x price multiplier. With identical token usage, a system that cost $2/day in testing costs $120/day in production. This is not hypothetical. It is a class of mistake that has happened at companies with entire AI infrastructure teams.
The Context Accumulation Bug. An agent that re-sends its full conversation history on every turn does not have a linear cost curve; it has a quadratic one. Each turn's input grows with the number of turns so far, so the total tokens billed across a session grow with the square of the session length: a session that runs twice as long costs roughly four times as much, and a session ten times as long costs roughly a hundred times as much.
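A short sketch makes the quadratic curve concrete. The token counts here are illustrative, not measured from any real agent:

```python
def total_input_tokens(turns: int, tokens_per_turn: int = 1_000) -> int:
    """Total input tokens billed when every turn re-sends all accumulated context."""
    # Turn t re-sends t chunks of history: 1k, 2k, 3k, ... tokens.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

short_session = total_input_tokens(50)   # 1,275,000 tokens
long_session = total_input_tokens(500)   # 125,250,000 tokens
print(long_session / short_session)      # ~98x: 10x the turns, roughly 100x the tokens
```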
The Retry Storm. An agent encounters an error and retries. The retry logic has a bug — it does not back off, it does not have a maximum, it does not distinguish between transient and permanent failures. It retries in a tight loop, spending tokens on every attempt, never succeeding.
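The standard antidote is bounded, backed-off retries that distinguish failure types. A minimal sketch (the exception classes and the wrapped call are placeholders, not from the principal-broker code):

```python
import time

class TransientError(Exception): ...   # worth retrying (timeouts, rate limits)
class PermanentError(Exception): ...   # never worth retrying (bad request, auth)

def call_with_backoff(fn, max_attempts: int = 5, base_delay_s: float = 1.0):
    """Retry only transient failures, back off exponentially, stop at a hard cap."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except PermanentError:
            raise                                   # retrying cannot help; stop spending
        except TransientError:
            if attempt == max_attempts - 1:
                raise                               # cap reached: fail loudly, not forever
            time.sleep(base_delay_s * 2 ** attempt)
```

Each of the three properties the broken retry loop lacked appears here: backoff, a maximum, and the transient/permanent distinction.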
The Cron Miscalculation. A cron job that should run hourly is configured to run every minute. Sixty times the intended spend, invisible until you check the dashboard the next morning.
Each of these failure modes has a real cost attached. The cost is not just money — it is trust. If the team that approved your agent platform sees a $200 weekend charge, you will spend the next month justifying every AI investment.
Real Numbers
Before building any budget system, ground yourself in the actual pricing. Here is the principal-broker cost model, straight from the source:
MODEL_PRICING_USD_PER_MTOK = {
    "claude-haiku-4-5-20251001": (0.25, 1.25),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (15.00, 75.00),
    "gemini-2.0-flash": (0.00, 0.00),
}
The tuple is (input_price_per_mtok, output_price_per_mtok). Input tokens are what you send to the model — the system prompt, conversation history, tool definitions. Output tokens are what the model generates — the response. Output is more expensive than input on every paid model.
Working through a concrete scenario: a multi-expert advisory agent runs a 50-turn analysis session. Each turn sends 8,000 input tokens (system prompt + history + tools) and generates 2,000 output tokens. On claude-sonnet-4-6:
Per turn:
input: (8,000 / 1,000,000) × $3.00 = $0.024
output: (2,000 / 1,000,000) × $15.00 = $0.030
total: $0.054
50-turn session: $0.054 × 50 = $2.70
That is a reasonable cost for a legitimate multi-expert advisory session. Now imagine that session runs in a loop — an agent that cannot terminate its reasoning and keeps issuing turns. By turn 100, the session has cost $5.40. Double that at turn 200. By turn 500, spend has climbed to $27.00. A single looping agent can exhaust the entire day's global budget ceiling.
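That arithmetic generalizes into a small helper over the pricing table (the prices come from the table above; the function name is ours):

```python
MODEL_PRICING_USD_PER_MTOK = {
    "claude-haiku-4-5-20251001": (0.25, 1.25),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (15.00, 75.00),
    "gemini-2.0-flash": (0.00, 0.00),
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call, priced per million tokens."""
    in_price, out_price = MODEL_PRICING_USD_PER_MTOK[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

turn = call_cost_usd("claude-sonnet-4-6", 8_000, 2_000)
print(f"per turn: ${turn:.3f}, 50-turn session: ${turn * 50:.2f}")
# per turn: $0.054, 50-turn session: $2.70
```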
Why FinOps Is Infrastructure, Not Accounting
The typical response to this problem is "we'll add monitoring." Monitoring is necessary but insufficient. Monitoring tells you what happened. FinOps prevents it from happening.
The billing dashboard is where you discover yesterday's problem. The call site is where you prevent tomorrow's. Every LLM call in a well-designed agent system passes through a cost tracking layer that:
- Records the call with full attribution — which agent, which session, which model
- Updates running spend counters in real time
- Checks the updated spend against configured budgets
- Blocks or warns before the next call, not after
This is the principal-broker approach. The CostTracker class intercepts every LLM call. There is a linter rule that catches direct anthropic.messages.create() calls that bypass the tracker. No agent gets to call the LLM without going through the cost tracking layer first.
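A stripped-down illustration of the pattern (a sketch of the idea, not the actual CostTracker implementation; the names and structure are ours):

```python
class BudgetExceeded(Exception):
    """Raised at the call site, before tokens are spent."""

class CostTracker:
    """Sketch: record every call, update counters, check budget before the next call."""

    def __init__(self, daily_budget_usd: float):
        self.daily_budget_usd = daily_budget_usd
        self.spent_today = 0.0
        self.records = []  # attribution: (agent_id, session_id, cost_usd)

    def charge(self, agent_id: str, session_id: str, cost_usd: float) -> None:
        # Block BEFORE spending, not after the bill arrives.
        if self.spent_today + cost_usd > self.daily_budget_usd:
            raise BudgetExceeded(
                f"{agent_id} would push spend past ${self.daily_budget_usd:.2f}"
            )
        self.spent_today += cost_usd
        self.records.append((agent_id, session_id, cost_usd))

tracker = CostTracker(daily_budget_usd=1.00)
tracker.charge("foresight", "s1", 0.30)
tracker.charge("foresight", "s1", 0.30)
print(f"spent so far: ${tracker.spent_today:.2f}")  # spent so far: $0.60
```

In the real system the charge happens inside the wrapper that issues the LLM call, which is why a linter rule banning direct client calls is enough to make the layer unavoidable.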
The Three Budget Layers
A production agent FinOps system needs three distinct layers of budget enforcement:
Per-agent daily budgets. Each agent has a spending allowance calibrated to its role. An analytics agent that runs five times per day has different cost expectations than a cron job that sends a simple status ping. Calibrated budgets catch anomalies — if Foresight is spending three times its daily allocation, something is wrong with the session, not the budget.
Warning thresholds. At 80% of any budget, fire a warning. The session continues, but the operator knows. This prevents hard stops from being the first signal of a problem.
Global daily ceiling. A hard ceiling across all agents. If the entire fleet has burned through $25 in a day, everything stops regardless of individual agent allocations. This is the last line of defense against correlated failures — if three agents hit loops simultaneously, the global ceiling contains the damage.
The principal-broker implementation has all three layers. The global ceiling is $25.00. Individual agent budgets range from $3.00 for the OpenClaw down to $0.25 for small utility agents like the CFO reporter and the documentation sync engine.
DAILY_BUDGETS_USD = {
    "openclaw": 3.00,
    "advisory-system": 2.00,
    "content-pipeline": 2.00,
    "analyst-system": 1.50,
    "vp-trading": 1.00,
    "foresight": 1.00,
    "sports-agent": 0.75,
    "political-agent": 0.75,
    # ... 20 agents total
}
GLOBAL_DAILY_CEILING = 25.00
ALERT_AT_PCT = 0.80
Adding these up: if every agent runs at maximum allocation, the system spends at most $13.25/day in agent budgets against a $25 global ceiling. The gap is intentional — it accounts for overrides during incidents and gives headroom for legitimate high-utilization days without triggering the global stop.
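Put together, the three layers reduce to one decision evaluated before every call. A sketch using the config above (the function and its string return values are ours; the budget dict is an excerpt):

```python
DAILY_BUDGETS_USD = {"foresight": 1.00, "openclaw": 3.00}  # excerpt of the full table
GLOBAL_DAILY_CEILING = 25.00
ALERT_AT_PCT = 0.80

def budget_decision(agent_id: str, agent_spent: float, global_spent: float) -> str:
    """Return 'stop', 'warn', or 'ok' for the next call."""
    agent_budget = DAILY_BUDGETS_USD[agent_id]
    if global_spent >= GLOBAL_DAILY_CEILING or agent_spent >= agent_budget:
        return "stop"   # layer 1 or 3: hard limit hit
    if (global_spent >= ALERT_AT_PCT * GLOBAL_DAILY_CEILING
            or agent_spent >= ALERT_AT_PCT * agent_budget):
        return "warn"   # layer 2: operator notified, session continues
    return "ok"

print(budget_decision("foresight", 0.50, 10.00))  # ok
print(budget_decision("foresight", 0.85, 10.00))  # warn: past 80% of $1.00
print(budget_decision("foresight", 0.40, 25.00))  # stop: global ceiling hit
```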
The Attribution Problem
Budget enforcement without attribution is theater. If you can see that $18 was spent today but cannot see which agents spent it, on which sessions, using which models, you cannot improve the system. You can only watch the number and hope.
The CostRecord data model captures everything needed for root cause analysis:
from dataclasses import dataclass

@dataclass
class CostRecord:
    record_id: str      # UUID for this specific call
    agent_id: str       # which agent made the call
    session_id: str     # which conversation session
    model: str          # which model was used
    input_tokens: int   # input token count
    output_tokens: int  # output token count
    cost_usd: float     # calculated cost at record time
    timestamp: str      # UTC ISO timestamp
This is the event log that makes FinOps operational. When the nightly CFO report shows an anomalous spend day, you query the records by agent_id to find the culprit. You query by session_id to find the loop. You query by model to find the tier violation.
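Those queries are simple aggregations over the records. A sketch over in-memory rows (field names follow CostRecord; the rows and the helper are illustrative):

```python
from collections import defaultdict

records = [  # illustrative rows, not real spend data
    {"agent_id": "foresight", "session_id": "s1", "model": "claude-sonnet-4-6", "cost_usd": 0.054},
    {"agent_id": "foresight", "session_id": "s1", "model": "claude-sonnet-4-6", "cost_usd": 0.054},
    {"agent_id": "openclaw", "session_id": "s2", "model": "claude-haiku-4-5-20251001", "cost_usd": 0.004},
]

def spend_by(records: list[dict], key: str) -> dict[str, float]:
    """Group total spend by any attribution field: agent_id, session_id, model."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cost_usd"]
    return dict(totals)

print(spend_by(records, "agent_id"))    # find the culprit agent
print(spend_by(records, "session_id"))  # find the looping session
print(spend_by(records, "model"))       # find the tier violation
```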
Attribution is not an analytics feature — it is the prerequisite for every other FinOps capability.
Why Gemini Is Free
You may have noticed gemini-2.0-flash is priced at (0.00, 0.00) in the pricing table. This is not an error — it reflects Google's free tier for Flash.
In a cost-optimized agent system, free models handle low-stakes work: routing decisions, simple classifications, status check responses, anything that does not require the reasoning depth of Sonnet or Opus. The model tier routing system (covered in Lesson 205) automates this — agents that do not need premium models do not pay for them.
The temptation when a free model exists is to route everything through it. Resist this. Gemini Flash has limitations that matter for complex reasoning tasks. The right answer is to use the cheapest model that meets the quality bar for each task, not the cheapest model for all tasks.
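The rule "cheapest model that meets the quality bar" can be sketched as a cheapest-first scan over capability tiers. The tier assignments below are illustrative assumptions, not the Lesson 205 routing table:

```python
# Cheapest-first candidates: (model, capability_tier). Tiers are assumed for illustration.
CANDIDATES = [
    ("gemini-2.0-flash", 0),           # free: routing, classification, status pings
    ("claude-haiku-4-5-20251001", 1),  # cheap: simple generation
    ("claude-sonnet-4-6", 2),          # standard reasoning
    ("claude-opus-4-6", 3),            # premium: hardest tasks only
]

def route(required_tier: int) -> str:
    """Pick the cheapest model whose capability meets the task's bar."""
    for model, tier in CANDIDATES:
        if tier >= required_tier:
            return model
    raise ValueError("no model meets the required tier")

print(route(0))  # gemini-2.0-flash
print(route(2))  # claude-sonnet-4-6
```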
What You Will Build
The next five lessons in this track cover the specific components of a production FinOps system:
- Lesson 205 — Model tier routing: Nano/Micro/Standard/Premium tiers, automatic downgrade logic, enforcement at the call site
- Lesson 206 — Per-agent daily budgets: the 80% warning, 100% stop, and global $25 ceiling, with the critical exception for incident response
- Lesson 207 — Loop detection: Jaccard similarity on turn content, warning at 3 consecutive similar turns, termination at 5
- Lesson 208 — cost.attributed events: the event schema, the emit callback pattern, and the nightly CFO report structure
- Lesson 209 — Budget overrides with audit trail: the admin REST endpoint, the Knox approval flow, and why the reason field is mandatory
By the end of this track, you will have a complete blueprint for a FinOps layer that goes in front of every LLM call in your agent fleet. Not monitoring after the fact — enforcement at the call site.
The Weekend Test
Before moving to the next lesson, ask yourself this about every agent system you currently have running: if it entered a loop at 11pm Friday and ran until Monday morning, what would the charge be?
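One way to get an answer: multiply a plausible loop rate by the hours between Friday 11pm and Monday morning. Assuming one turn every 30 seconds at the $0.054 Sonnet per-turn cost worked out earlier (both are assumptions; substitute your own numbers):

```python
HOURS = 58             # Friday 11pm to Monday 9am, roughly
TURNS_PER_HOUR = 120   # assumed loop speed: one turn every 30 seconds
COST_PER_TURN = 0.054  # the Sonnet scenario from the Real Numbers section

weekend_cost = HOURS * TURNS_PER_HOUR * COST_PER_TURN
print(f"${weekend_cost:.2f}")  # $375.84 for a single looping agent
```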
If you do not know the answer, or if the answer is uncomfortable, that is the gap this track closes.