AI / LLM Integration AI · 05 · 01

LLM cost budgets: token asymmetry, routing, and the kill switch

Output tokens cost ~5x input, and the system prompt + history + RAG context get re-sent every turn — so a runaway loop burns thousands overnight. Routing, caching, and hard per-request budget caps keep spend bounded.

AI Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

Friday night, an agent ships behind a feature flag. It has a tool loop — observe, think, act, repeat — and no iteration cap. A document it fetches references another document, which references another. The agent keeps appending each one to its context and re-sends the whole growing transcript to the model every turn. By Monday the bill shows $4,300 for one “task” that never finished. The monthly $1,000 spend alert fired — at 2am, into a Slack channel nobody reads on weekends.

In ten minutes you’ll know exactly why the loop accelerated, why the monthly cap couldn’t stop it, and which levers actually keep the bill bounded.

Output tokens are the expensive half

The first thing that surprises engineers reading their first invoice: input and output are not priced the same. Across current frontier models, output runs roughly 5x the input rate. As of May 2026 Anthropic lists Haiku 4.5 at $1/M input and $5/M output, Sonnet 4.6 at $3/$15, and Opus 4.7 at $5/$25. OpenAI’s GPT-4o sits at $2.50/$10 — the same 4x asymmetry.

The lever this hands you is concrete: a verbose answer costs far more than a verbose question. Capping max_tokens, asking for terse or structured output, and stopping the model from “thinking out loud” in the final response all attack the expensive side of the ledger. A senior reviewing a cost spike checks output volume first, not input.

Model (May 2026)	Input /M	Output /M	Output : input ratio
Claude Haiku 4.5	$1.00	$5.00	5x
Claude Sonnet 4.6	$3.00	$15.00	5x
Claude Opus 4.7	$5.00	$25.00	5x
OpenAI GPT-4o	$2.50	$10.00	4x
GPT-4o mini	$0.15	$0.60	4x

Where the money actually accumulates

Before you reach for a cheaper model or a higher cap, ask yourself: how many times is each token actually sent? The answer is where the real bill lives. The bill is built from the fact that LLMs are stateless: every turn re-sends the entire context. Three things quietly inflate that re-sent payload.

The system prompt is sent on every single turn. A 4,000-token system prompt in a 50-turn conversation is not 4,000 tokens — it is 200,000 input tokens, paid 50 times. The conversation history grows linearly: turn 20 carries turns 1 through 19 with it, so cost per turn climbs as the chat goes on, and a long session is quadratic in total. RAG context is the heaviest single addition — stuffing 30 retrieved chunks (say 15,000 tokens) into every request, when 3 would have answered, is a 5x input cost multiplier that no one sees until the invoice.

This is why the runaway-loop postmortem is the canonical AI cost incident. An agent that appends a tool result to context each iteration, with no cap, doesn’t just make more calls — each call is bigger than the last. Cost grows superlinearly. Reported real cases include an agent making 14,000 redundant tool calls and developers watching $15 evaporate in under 10 minutes. The damage is done before a human ever looks.

▸Why this works

Why didn’t the monthly cap save the Friday-night agent? Provider spend limits operate at the billing-cycle level — they are a backstop measured in days, not seconds. A $1,000/month cap cannot stop a loop that burns $6 in 30 seconds; by the time the cycle total crosses the line, thousands are already spent. The brake has to live in-process, on cost velocity per request and per session, not in the billing dashboard.

Routing: don’t pay Opus prices for a Haiku job

Most requests in a real product are not hard. Classification, extraction, short rewrites, routing decisions — a small cheap model handles these at a fraction of the cost. The senior pattern is a cascade: send the request to the cheap model first, and escalate to the expensive model only when a quality check fails or the task is detected as hard up front. Routing the easy 80% to Haiku ($1/$5) instead of Opus ($5/$25) is a 5x cut on that slice with no user-visible loss.

The tradeoff is real and worth naming: routing adds a classification step (latency, and sometimes its own model call), and a wrong route degrades quality silently. You pay for an escalation path — the cheap call plus the expensive retry costs more than going straight to the expensive model when the cheap one fails. Routing wins only when the cheap model succeeds often enough that the average lands well below the flat expensive rate. Measure the escalation rate; if it’s above ~30%, your router or your cheap model is wrong.

Prompt caching: stop paying for the unchanging prefix

The single biggest lever against re-sent context is prompt caching. You mark the stable prefix — the system prompt, tool definitions, a fixed RAG corpus — and the provider caches it. Anthropic’s cache reads cost 0.1x the base input rate (a 90% discount), while the write to populate the cache costs 1.25x (5-minute TTL) or 2x (1-hour TTL) the base rate. The arithmetic: a 5-minute cache pays for itself after a single hit; a 1-hour cache after the second.

That turns the “4,000-token system prompt sent 50 times” disaster into one full-price write plus 49 reads at a tenth of the price. Caching does not help the parts that change every turn (the latest user message, fresh retrieval), so the engineering move is to order your prompt cache-friendly: stable content first, volatile content last, so the cacheable prefix is as long as possible.

Pick the best fit

A chat product re-sends a 4,000-token system prompt every turn across 50-turn sessions, and the input bill is dominating spend. Pick the highest-leverage fix.

Quiz

Across current frontier models, how do output token prices compare to input?

Quiz

An autonomous agent loop with no per-request budget runs overnight and bills $4,300. Why didn't the $1,000 monthly spend cap stop it?

Order the steps

Order the cost controls from cheapest/first-line to last-resort kill switch:

1 Route the easy majority to a small cheap model; escalate only on failure
2 Prompt-cache the stable prefix so re-sent context drops to 0.1x
3 Cap max_tokens and trim history/RAG to what the task needs
4 Enforce a per-request and per-user budget in-process, before each call
5 Hard kill switch: trip the circuit breaker on cost velocity or loop count

Every request pays for re-sent system prompt + history + RAG plus the output it generates. Apply the levers cheapest-first: route the easy majority to a small model, cache the stable prefix at 0.1x, cap output and trim context, enforce a per-request budget in-process, and trip the kill switch on cost velocity.

Recall before you leave

01
Walk through exactly why a long agent loop costs superlinearly, not just linearly, with respect to the number of iterations.
02
Your input bill is dominating output, and tracing shows the same 4,000-token system prompt sent on every turn. What concrete steps cut this, and what's the order of leverage?

Recap

LLM spend is driven less by the headline per-token rate than by what gets re-sent every turn and how much output you generate. Output costs roughly 5x input across current models (Haiku 4.5 $1/$5, Sonnet 4.6 $3/$15, Opus 4.7 $5/$25, GPT-4o $2.50/$10), so cap output and ask for terse answers first. The real accumulator is context: a stateless model re-sends the system prompt, the whole growing history, and all RAG chunks on every call, which is why a long agent loop costs superlinearly and a runaway loop can burn thousands overnight. Route the easy majority to a cheap model and escalate only on failure; prompt-cache the stable prefix to drop re-sent context to 0.1x; trim history and tighten retrieval. Above all, enforce per-request and per-user budgets in-process with a hard kill switch on cost velocity — monthly provider caps are a backstop measured in days, far too slow to stop a loop spending dollars per second. Now when you see a cost spike, check output volume and re-sent context before you touch a cap — and put the brake in-process, not in the billing dashboard.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Grounded RAG ServiceA RAG demo that answers from a corpus is easy; a RAG service you'd trust in front of users is not. The hard part isn't retrieval, it's grounding: making the model say only what the retrieved text supports, attaching citations the reader can check, and proving with an eval set that the answers don't drift into confident fiction. You'll build the whole loop — chunk, embed, store, retrieve top-k, ground, cite, score — and feel exactly where it leaks.