
Prompt caching: multiple-choice review

Multiple-choice synthesis across the prompt-caching unit: prefix matching, write/read economics, TTL choice, ordering rules, and silent prefix poisoning.



Six questions that cut across the whole unit. Each mirrors a real decision — where to put the breakpoint, which TTL to buy, why the invoice tripled — not a definition to recite.

Goal

Confirm you can connect token-for-token prefix matching, the write/read economics, TTL choice, and the silent failure mode that the lesson built toward.

Quiz

Two requests carry the same 30k-token system prompt and document, but request B has one extra space inside that block. How much of B reads from cache?

Quiz

A prefix is written once (1.25x, default tier) then read twice (0.1x each) before expiry. Versus paying full rate (1.0x) three times, did caching help?

Quiz

A RAG service has a 30k stable prefix but bursty traffic: requests cluster, then go quiet for ~15 minutes. Which caching setup is right?

Quiz

Cache hit rate fell to near zero after a deploy. No errors fired and outputs look correct. Most likely cause?

Quiz

On Sonnet you cache a 600-token system prompt and see no read discount on repeated requests, with no error. Why?

Quiz

You have a long layered prompt: tools, a static system block, then a large document that changes once a day. How should you place the up-to-4 cache breakpoints?

Recap

The unit's through-line is one rule with an economic engine behind it. The match is token-for-token from position zero, so stable content (tools, then system, then large context) goes first and volatile content (timestamps, retrieved docs, the user's question) goes last, with the breakpoint on the final unchanging block.

You pay 1.25x once to write and 0.1x on every read, so caching wins from the first re-read inside the TTL: 1.25 + 0.1 = 1.35x against 2.0x uncached. The TTL is 5 minutes by default; the 1-hour option covers bursty traffic with longer quiet gaps. Below the model's minimum cacheable length nothing caches, and no error tells you so.

The characteristic production failure is prefix poisoning: a changed byte near token zero invalidates everything after it, flips every request from a 0.1x read to a 1.25x write, and the only signal is the usage block. When input spend climbs without a traffic change, open the usage block first. If cache_creation is high and cache_read is near zero, something near token zero shifted.
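The arithmetic and the usage-block check from the recap can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a client for any real API: the multipliers are the ones quoted above, `Usage` is a hypothetical stand-in whose field names follow the recap's shorthand (`cache_creation`, `cache_read`), and the 0.5 read-share threshold is an arbitrary value chosen for the example.

```python
from dataclasses import dataclass

# Multipliers quoted in the recap, relative to the base input rate.
WRITE_MULTIPLIER = 1.25  # default-tier cache write
READ_MULTIPLIER = 0.10   # cache read
BASE_MULTIPLIER = 1.00   # uncached input


def relative_cost(requests: int, cached: bool) -> float:
    """Cost of sending the same prefix `requests` times inside one TTL window,
    measured in units of one uncached pass over that prefix."""
    if requests <= 0:
        return 0.0
    if not cached:
        return requests * BASE_MULTIPLIER
    # The first request writes the cache; every later request reads it.
    return WRITE_MULTIPLIER + (requests - 1) * READ_MULTIPLIER


@dataclass
class Usage:
    """Hypothetical stand-in for a response's usage block (shorthand names)."""
    cache_creation: int  # tokens written to the cache on this request
    cache_read: int      # tokens served from the cache on this request


def looks_poisoned(usage: Usage, min_read_share: float = 0.5) -> bool:
    """Flag a request whose cached tokens are mostly writes rather than reads.

    The threshold is an illustrative assumption, not a documented value.
    """
    cached_tokens = usage.cache_creation + usage.cache_read
    if cached_tokens == 0:
        # Nothing was cached at all, e.g. the prefix is below the minimum length.
        return False
    return usage.cache_read / cached_tokens < min_read_share
```

With one write and two reads, `relative_cost(3, cached=True)` comes to 1.45 units against 3.0 uncached. A usage block of `Usage(cache_creation=30_000, cache_read=0)` on a request that should have been a read is the signature described above. In practice a single write after expiry is expected, so a check like this is more useful over a window of requests than on any one response.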
