AI / LLM Integration AI · 05 · 09

LLM cost budgets: code and cost arithmetic

Read real token-accounting code and pricing math, predict the spend, and pick the highest-leverage fix — cost arithmetic, per-tenant budgets, routing, and cache savings.

AI Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Cost incidents are diagnosed in token counters and pricing math, not in prose. Read the code and the numbers, then choose the fix a senior engineer would make first.

Goal

Practise the loop you run on every cost spike: read the call site, do the per-token arithmetic in your head, find where the budget leaks, and reach for the highest-leverage fix before raising any cap.

Snippet 1 — per-turn cost arithmetic

// Sonnet 4.6: $3 / 1M input, $15 / 1M output
const PRICE = { in: 3 / 1e6, out: 15 / 1e6 };

// 50-turn session, 4,000-token system prompt re-sent every turn,
// history grows ~600 tokens/turn, each answer ~400 output tokens.
let inputTotal = 0;
for (let turn = 1; turn <= 50; turn++) {
  const history = 600 * (turn - 1);
  inputTotal += 4000 + history;        // system prompt + accumulated history
}
const outputTotal = 50 * 400;
const cost = inputTotal * PRICE.in + outputTotal * PRICE.out;

Quiz

Which statement about this session's cost is correct?

Snippet 2 — per-tenant budget enforcement

async function callModel(tenantId: string, req: Request) {
  const spent = await budget.getMonthlySpend(tenantId);   // async read
  if (spent >= LIMIT[tenantId]) throw new BudgetExceeded();
  const res = await llm.create(req);                        // bills tokens
  await budget.add(tenantId, costOf(res));                  // record after
  return res;
}

Quiz

This per-tenant budget guard has a flaw a burst of concurrent requests will expose. What is it, and what is the fix?

Snippet 3 — model-routing decision

// Haiku $1/$5, Opus $5/$25 (per 1M in/out). Avg job: 1k in, 500 out.
const opusOnly = 1000 * 5/1e6 + 500 * 25/1e6;             // straight to Opus
// Cascade: Haiku first; on failure, also pay Opus. p = escalation rate.
function cascade(p) {
  const haiku = 1000 * 1/1e6 + 500 * 5/1e6;
  const opus  = 1000 * 5/1e6 + 500 * 25/1e6;
  return haiku + p * opus;                                  // always pay Haiku
}

Quiz

opusOnly ≈ $0.0175/job. At what escalation rate p does the cascade stop saving money?

Snippet 4 — cache-savings calculation

// Anthropic: cache read = 0.1x base input; 5-min write = 1.25x base input.
// Stable prefix = 10,000 tokens, base input $3/1M, reused across N turns.
function prefixCost(N: number, cached: boolean) {
  const base = 10000 * 3/1e6;                 // full-price send
  if (!cached) return N * base;               // re-sent every turn
  const write = 10000 * (3 * 1.25)/1e6;       // one cache write
  const read  = 10000 * (3 * 0.1)/1e6;        // each cache read
  return write + (N - 1) * read;
}

Quiz

For a 10,000-token prefix reused across 50 turns, roughly how much does caching save, and when does it pay off?

Recap

Every cost decision is read in counters and arithmetic: a re-sent prefix plus growing history makes a long session quadratic in input; a per-tenant budget guard must be atomic (reserve/commit) and paired with a per-request cap or concurrency races blow the limit; a Haiku→Opus cascade only saves while the escalation rate stays low (cost break-even ~0.8, operational alarm ~30%); and prompt-caching a 10k prefix over 50 turns cuts ~88% and pays off on the first hit. Do the math first, then fix the biggest term — usually re-sent context — before touching a cap. Now when you read a token-accounting snippet on a real incident, you’ll know which number to compute first and which fix to reach for before raising any limit.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.