Crux Read real token-accounting code and pricing math, predict the spend, and pick the highest-leverage fix — cost arithmetic, per-tenant budgets, routing, and cache savings.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
Cost incidents are diagnosed in token counters and pricing math, not in prose. Read the code and the numbers, then choose the fix a senior engineer would make first.
Goal
Practise the loop you run on every cost spike: read the call site, do the per-token arithmetic in your head, find where the budget leaks, and reach for the highest-leverage fix before raising any cap.
Which statement about this session's cost is correct?
Heads-up Per-token output is 5x, but there are only 20,000 output tokens (~$0.30) against ~935,000 input tokens (~$2.80). Volume beats rate here — the re-sent prefix and growing history dwarf the output.
Heads-up The system prompt is constant but history grows ~600 tokens/turn, so per-turn input climbs from 4,000 to ~33,400. The sum is quadratic in turns, not flat.
Heads-up A stateless model re-sends the full prefix every turn unless you prompt-cache it. Here it is paid all 50 times — 200,000 tokens — which is the prime caching target.
Snippet 2 — per-tenant budget enforcement
async function callModel(tenantId: string, req: Request) { const spent = await budget.getMonthlySpend(tenantId); // async read if (spent >= LIMIT[tenantId]) throw new BudgetExceeded(); const res = await llm.create(req); // bills tokens await budget.add(tenantId, costOf(res)); // record after return res;}
Quiz
Completed
This per-tenant budget guard has a flaw a burst of concurrent requests will expose. What is it, and what is the fix?
Heads-up await yields to the event loop — it does not serialise across concurrent requests. Multiple in-flight calls read the same pre-update spend and all pass the check. That's the classic check-then-act race.
Heads-up Recording after is necessary because you only know the true token cost from the response. The defect is the non-atomic read-check-write, not the timing of the record — pre-reserve an estimate, then reconcile.
Heads-up Per-tenant limits are correct for fair, isolated cost control. The flaw is concurrency-safety and the lack of a per-request cap, not the granularity of the limit.
Snippet 3 — model-routing decision
// Haiku $1/$5, Opus $5/$25 (per 1M in/out). Avg job: 1k in, 500 out.const opusOnly = 1000 * 5/1e6 + 500 * 25/1e6; // straight to Opus// Cascade: Haiku first; on failure, also pay Opus. p = escalation rate.function cascade(p) { const haiku = 1000 * 1/1e6 + 500 * 5/1e6; const opus = 1000 * 5/1e6 + 500 * 25/1e6; return haiku + p * opus; // always pay Haiku}
Quiz
Completed
opusOnly ≈ $0.0175/job. At what escalation rate p does the cascade stop saving money?
Heads-up At p = 0.5, cascade = 0.0035 + 0.5 × 0.0175 = $0.01225, still below the $0.0175 Opus-only cost. Break-even on cost alone is ~0.8; the operational threshold is lower but not 0.5.
Heads-up Only while p is low enough. Because you always pay the Haiku call plus a fraction of the Opus call, a high escalation rate makes the cascade more expensive than Opus-only — break-even here is p ≈ 0.8.
Heads-up p ≈ 0.2 still saves money on cost ($0.0035 + 0.2 × 0.0175 ≈ $0.007). The ~30% figure is a quality/latency alarm, not the cost break-even, which is ~0.8 for these numbers.
Snippet 4 — cache-savings calculation
// Anthropic: cache read = 0.1x base input; 5-min write = 1.25x base input.// Stable prefix = 10,000 tokens, base input $3/1M, reused across N turns.function prefixCost(N: number, cached: boolean) { const base = 10000 * 3/1e6; // full-price send if (!cached) return N * base; // re-sent every turn const write = 10000 * (3 * 1.25)/1e6; // one cache write const read = 10000 * (3 * 0.1)/1e6; // each cache read return write + (N - 1) * read;}
Quiz
Completed
For a 10,000-token prefix reused across 50 turns, roughly how much does caching save, and when does it pay off?
Heads-up 0.1x is the per-read rate, not the total saving. Across 50 turns you replace 50 full-price sends with one 1.25x write plus 49 reads at 0.1x — roughly an 88% reduction on the prefix, not 10%.
Heads-up The 1.25x write is paid once. Even by the second turn the write-plus-read total (~1.35x) beats two full-price sends (2x); over 50 turns the 0.1x reads dominate and you save ~88%.
Heads-up Prompt caching discounts re-sent input — the stable prefix. Output is generated fresh each turn and is never cached, so the saving is entirely on the repeated input prefix.
Recap
Every cost decision is read in counters and arithmetic: a re-sent prefix plus growing history makes a long session quadratic in input; a per-tenant budget guard must be atomic (reserve/commit) and paired with a per-request cap or concurrency races blow the limit; a Haiku→Opus cascade only saves while the escalation rate stays low (cost break-even ~0.8, operational alarm ~30%); and prompt-caching a 10k prefix over 50 turns cuts ~88% and pays off on the first hit. Do the math first, then fix the biggest term — usually re-sent context — before touching a cap.