LLM cost budgets: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the cost model stick.
Reconstruct the unit’s spine — token asymmetry, where context accumulates, routing economics, prompt caching, and the in-process kill switch — without looking back at the lesson.
- 01 Why is output the expensive half of an LLM bill, and what concrete levers attack it?
- 02 A stateless model re-sends context every turn. Name the three things that inflate the re-sent payload and how each grows.
- 03 When does model routing (cheap-first cascade) actually save money, and when does it backfire?
- 04 Explain prompt caching: what gets discounted, by how much, and how do you structure a prompt to maximise the benefit?
- 05 Why does an uncapped agent loop burn money superlinearly, and why can't a monthly provider cap stop it?
- 06 List the LLM cost controls in priority order, cheapest first-line to last-resort, and say what each one bounds.
If you could reconstruct each answer from memory, you hold the unit’s spine. Output costs ~5x input, so cap it. The system prompt, history, and RAG context are all re-sent every turn (a fixed cost, linear growth, and multiplicative growth respectively). Routing saves money only at a low escalation rate. Caching the stable prefix drops its cost to 0.1x and pays off on the first hit. And because a runaway loop burns money superlinearly while a monthly cap reacts on a timescale of days, the real brake is an in-process budget plus a kill switch on cost velocity.
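To make the last point concrete, here is a minimal sketch of an in-process budget with a kill switch on cost velocity. Everything in it is an assumption made for illustration: the names `BudgetGuard` and `BudgetExceeded`, the per-million-token price parameters, and the 60-second window are hypothetical and do not come from the lesson or from any provider SDK.

```python
import time
from collections import deque


class BudgetExceeded(RuntimeError):
    """Raised when the current run should stop spending."""


class BudgetGuard:
    """Hypothetical guard: a hard total cap plus a cap on spend per time window."""

    def __init__(self, max_total_usd, max_usd_per_window,
                 window_s=60.0, clock=time.monotonic):
        self.max_total_usd = max_total_usd
        self.max_usd_per_window = max_usd_per_window
        self.window_s = window_s
        self.clock = clock          # injectable so the guard can be tested
        self.total_usd = 0.0
        self.recent = deque()       # (timestamp, cost) pairs inside the window

    def record(self, input_tokens, output_tokens,
               in_price_per_mtok, out_price_per_mtok):
        """Add one call's cost; raise if either limit is now exceeded."""
        cost = (input_tokens * in_price_per_mtok
                + output_tokens * out_price_per_mtok) / 1_000_000
        now = self.clock()
        self.total_usd += cost
        self.recent.append((now, cost))

        # Drop calls that have aged out of the velocity window.
        while self.recent and now - self.recent[0][0] > self.window_s:
            self.recent.popleft()
        window_usd = sum(c for _, c in self.recent)

        if self.total_usd > self.max_total_usd:
            raise BudgetExceeded(f"total ${self.total_usd:.2f} over budget")
        if window_usd > self.max_usd_per_window:
            raise BudgetExceeded(
                f"${window_usd:.2f} spent in the last {self.window_s:.0f}s")
        return cost
```

In an agent loop you would call `guard.record(...)` with the token counts reported for each response and let `BudgetExceeded` end the run. The total cap bounds what a single run can spend, while the window cap trips within seconds of a runaway loop, long before a monthly provider limit would notice.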