AI / LLM Integration
Prompt caching: multiple-choice review
Six questions that cut across the whole unit. Each mirrors a real decision — where to put the breakpoint, which TTL to buy, why the invoice tripled — not a definition to recite.
Confirm you can connect token-for-token prefix matching, the write/read economics, TTL choice, and the silent failure mode that the lesson built toward.
Two requests carry the same 30k-token system prompt and document, but request B has one extra space inside that block. How much of B reads from cache?
A prefix is written once (1.25x, default tier), then read twice (0.1x each) before it expires. Compared with paying the full rate (1.0x) on all three requests, did caching help?
A RAG service has a 30k stable prefix but bursty traffic: requests cluster, then go quiet for ~15 minutes. Which caching setup is right?
Cache hit rate fell to near zero after a deploy. No errors fired and outputs look correct. Most likely cause?
On Sonnet you mark a 600-token system prompt for caching and see no read discount on repeated requests, with no error. Why?
You have a long, layered prompt: tools, a static system block, then a large document that changes once a day. How should you place your cache breakpoints (up to four are allowed)?
The unit’s through-line is one rule with an economic engine behind it. The match is token-for-token from position zero, so stable content (tools, then system, then large context) goes first and volatile content (timestamps, retrieved docs, the user’s question) goes last, with the breakpoint on the final unchanging block. On the default tier you pay 1.25x once to write and 0.1x on every read, so caching comes out ahead from the first re-read inside the TTL: 1.25 + 0.1 = 1.35x against 2.0x uncached. The TTL is 5 minutes by default; the 1-hour option costs more to write and suits bursty traffic with longer quiet gaps. Below the model’s minimum cacheable length nothing caches, and no error says so. The production failure mode is always the same: a changed byte near token zero invalidates the whole prefix, flips every request from a 0.1x read to a 1.25x write, and the only signal is the usage block.
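The arithmetic and the failure mode in this summary can be sketched in a few lines of Python. This is a minimal illustration, not a provider client: the multipliers are the ones quoted in this unit, characters stand in for tokens, and the usage field names `cache_read_input_tokens` and `cache_creation_input_tokens` are an assumption about the response shape, so check them against your provider's API reference. All function names are hypothetical.

```python
# Multipliers relative to the base input-token price, as quoted in this unit.
WRITE_MULT = 1.25  # default-tier cache write
READ_MULT = 0.10   # cache read
FULL_MULT = 1.00   # ordinary uncached input


def shared_prefix_len(a, b):
    """Count the leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cached_cost(reads, write_mult=WRITE_MULT):
    """Relative prefix cost: one write, then `reads` hits inside the TTL."""
    return write_mult + reads * READ_MULT


def uncached_cost(reads):
    """Relative prefix cost for the same 1 + `reads` requests with no caching."""
    return (1 + reads) * FULL_MULT


def cache_status(usage):
    """Classify one response from its usage block.

    The field names are an assumption about the provider's response shape.
    A response that both reads and writes is reported as "read".
    """
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "read"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"
    return "none"


if __name__ == "__main__":
    # Characters stand in for tokens; the second prompt has one extra space.
    a = list("You are a helpful assistant.")
    b = list("You are a  helpful assistant.")
    print("identical leading items:", shared_prefix_len(a, b))

    # One write plus two reads, versus three full-price requests.
    print("cached:", cached_cost(2), "uncached:", uncached_cost(2))

    # A request that should have been a read but shows up as a write.
    print(cache_status({"cache_creation_input_tokens": 30000,
                        "cache_read_input_tokens": 0}))
```

`shared_prefix_len` only shows where two inputs diverge. In a real cache the match is evaluated up to a breakpoint, so a difference anywhere before the breakpoint means that block is not read from cache at all. Logging `cache_status` per request is one way to notice a deploy that turned every read into a write, since nothing else in the response changes.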