LLM cost budgets: multiple-choice review
Six questions that cut across the whole unit. Each mirrors a call you make on a real cost incident — not a definition to recite, but a tradeoff to weigh while the meter is running.
Confirm you can connect token pricing, re-sent context, routing, caching, and in-process budgets into one decision — the synthesis the overview lesson built toward.
A support chatbot on Sonnet 4.6 ($3/M in, $15/M out) sends a 200-token question and gets a 1,500-token answer, mostly chain-of-thought the user never sees. Where is the spend, and what is the first lever?
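As a rough check on the arithmetic in this scenario, the sketch below splits one request's cost into its input and output parts using the prices and token counts quoted in the question. The function name `request_cost` is hypothetical and only for illustration.

```python
def request_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Return (input_cost, output_cost) in dollars for one request.

    Prices are dollars per million tokens, as quoted in the question.
    """
    input_cost = in_tokens * in_price_per_m / 1_000_000
    output_cost = out_tokens * out_price_per_m / 1_000_000
    return input_cost, output_cost


# 200-token question, 1,500-token answer, at $3/M in and $15/M out.
input_cost, output_cost = request_cost(200, 1500, 3.0, 15.0)
output_share = output_cost / (input_cost + output_cost)

print(f"input:  ${input_cost:.4f}")
print(f"output: ${output_cost:.4f}")
print(f"output share of spend: {output_share:.0%}")
```

With these numbers the output side comes to about $0.0225 against $0.0006 of input, so nearly all of the per-request spend sits in the generated tokens.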
A 50-turn chat re-sends a 4,000-token system prompt every turn, and the input bill is climbing. Which fix has the highest leverage?
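To put a number on the re-sent prompt, here is a minimal sketch under stated assumptions: it borrows the $3/M input price from the Sonnet example, and `cached_read_multiplier=0.1` is an illustrative discount, not a quoted price. Real cache pricing (including any write premium, which this sketch ignores) varies by provider.

```python
def resent_prompt_cost(turns, prompt_tokens, price_per_m, cached_read_multiplier=None):
    """Dollar cost of sending the same system prompt on every turn.

    cached_read_multiplier is a hypothetical fraction of the input price
    charged for cached reads after the first turn; None means no caching.
    """
    if cached_read_multiplier is None:
        billed_tokens = turns * prompt_tokens
    else:
        # First turn pays full price; later turns pay the discounted rate.
        billed_tokens = prompt_tokens + (turns - 1) * prompt_tokens * cached_read_multiplier
    return billed_tokens * price_per_m / 1_000_000


uncached = resent_prompt_cost(50, 4000, 3.0)
cached = resent_prompt_cost(50, 4000, 3.0, cached_read_multiplier=0.1)

print(f"uncached: ${uncached:.2f} per conversation")
print(f"cached:   ${cached:.2f} per conversation")
```

Uncached, the prompt alone accounts for 50 x 4,000 = 200,000 input tokens per conversation. Under the assumed discount that share drops by roughly an order of magnitude, before any history trimming.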
A team routes the easy 80% of requests to Haiku ($1/$5) and escalates failures to Opus ($5/$25). After launch the bill barely moved. Most likely cause?
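One way to reason about this scenario is to model the blended cost per request as a function of the escalation rate. The sketch below assumes that the 20% of traffic not routed to Haiku goes straight to Opus, that an escalated request pays for both the Haiku attempt and the Opus retry, and that every call uses the same token counts. None of these assumptions come from the question, and `blended_cost` is a hypothetical helper.

```python
def blended_cost(cheap_share, escalation_rate, cheap_cost, strong_cost):
    """Expected cost per request under a cheap-first routing policy.

    cheap_share: fraction of requests sent to the cheap model first.
    escalation_rate: fraction of those that fail and are retried on the strong model.
    """
    direct = (1 - cheap_share) * strong_cost
    # Escalated requests pay for the cheap attempt AND the strong retry.
    routed = cheap_share * (cheap_cost + escalation_rate * strong_cost)
    return direct + routed


# Costs relative to an all-Opus baseline of 1.0; Haiku at $1/$5 vs Opus at
# $5/$25 is one fifth of the price for identical token counts.
for rate in (0.0, 0.25, 0.5, 0.75):
    cost = blended_cost(0.8, rate, cheap_cost=0.2, strong_cost=1.0)
    print(f"escalation rate {rate:.0%}: {cost:.2f}x the all-Opus bill")
```

In this model the savings shrink quickly as the escalation rate rises, which is why the rate is worth measuring before concluding that routing works.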
An autonomous agent loop with no iteration cap runs overnight and bills $4,300. Why didn't the $1,000/month spend cap stop it?
Why does an uncapped agent loop cost superlinearly in the number of iterations, not just linearly?
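The growth can be sketched with a simple model: if each iteration re-sends the whole history and appends a fixed number of new tokens, the cumulative input is an arithmetic series. The parameters below (`base_tokens`, `tokens_added_per_step`) are illustrative assumptions, not figures from the unit.

```python
def cumulative_input_tokens(iterations, base_tokens, tokens_added_per_step):
    """Total input tokens billed across a loop that re-sends its full history.

    Iteration k (1-indexed) sends base_tokens + (k - 1) * tokens_added_per_step.
    """
    total = 0
    context = base_tokens
    for _ in range(iterations):
        total += context                  # the whole context is billed again
        context += tokens_added_per_step  # history grows after every step
    return total


ten = cumulative_input_tokens(10, base_tokens=1000, tokens_added_per_step=500)
twenty = cumulative_input_tokens(20, base_tokens=1000, tokens_added_per_step=500)

print(f"10 iterations: {ten:,} input tokens")
print(f"20 iterations: {twenty:,} input tokens ({twenty / ten:.1f}x)")
```

In closed form the total is n * base + step * n * (n - 1) / 2, so the quadratic term dominates as n grows. In this example, doubling the iterations more than triples the input tokens.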
You are designing cost controls for an LLM feature. Which ordering — from cheapest first-line defense to last-resort — reflects the unit's priority?
The through-line is one decision tree. Output costs about 5x input per token, so cap it first. The model is stateless and re-sends the system prompt, history, and RAG context every turn, so cache the stable prefix and trim the volatile parts. Route the easy majority to the cheap model and watch the escalation rate. And because a runaway loop's cost grows superlinearly overnight while a monthly spend cap operates on a timescale of days, the real brake is an in-process budget plus a kill switch on cost velocity. Every control reduces or bounds re-sent context and output before it bounds the bill.
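An in-process budget with a velocity-based kill switch might look roughly like the sketch below. All names here (`RunBudget`, `BudgetExceeded`, `charge`) and all limits are hypothetical. The per-call cost is passed in by the caller, since how you obtain it depends on your client library.

```python
import time
from collections import deque


class BudgetExceeded(RuntimeError):
    """Raised when a run crosses one of its in-process limits."""


class RunBudget:
    """Per-run guard: iteration cap, total spend cap, and spend-per-window cap."""

    def __init__(self, max_usd, max_iterations, max_usd_per_window,
                 window_seconds=60.0, clock=time.monotonic):
        self.max_usd = max_usd
        self.max_iterations = max_iterations
        self.max_usd_per_window = max_usd_per_window
        self.window_seconds = window_seconds
        self.spent_usd = 0.0
        self.iterations = 0
        self._recent = deque()  # (timestamp, usd) pairs inside the window
        self._clock = clock

    def charge(self, usd):
        """Record one model call; raise BudgetExceeded if any limit is crossed."""
        now = self._clock()
        self.iterations += 1
        self.spent_usd += usd
        self._recent.append((now, usd))
        # Drop entries that have aged out of the velocity window.
        while self._recent and now - self._recent[0][0] > self.window_seconds:
            self._recent.popleft()
        window_spend = sum(cost for _, cost in self._recent)

        if self.iterations > self.max_iterations:
            raise BudgetExceeded("iteration cap exceeded")
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded("run spend cap exceeded")
        if window_spend > self.max_usd_per_window:
            raise BudgetExceeded("cost velocity limit exceeded")


# Usage: call charge() after every model call inside the agent loop.
budget = RunBudget(max_usd=1.00, max_iterations=100, max_usd_per_window=0.50)
try:
    while True:
        budget.charge(0.30)  # stand-in for the measured cost of one call
except BudgetExceeded as stop:
    print(f"stopped after {budget.iterations} calls: {stop}")
```

The check runs inside the process that is spending, so it can stop a loop on the very next call instead of waiting for a billing system to catch up.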