AI / LLM Integration
Prompt caching: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the mechanism stick.
Reconstruct the unit’s core mechanisms — token-for-token prefix matching, the write/read economics, TTL choice, the ordering rule, and silent prefix poisoning — without looking back at the lesson.
- 01 Why is prompt caching positional rather than semantic, and what does that imply for prompt design?
- 02 Walk through the write/read economics and how the break-even falls out.
- 03 How does the TTL work, when does the 1-hour tier earn its 2x premium, and how do you reason about the break-even between tiers?
- 04 What is the minimum cacheable length, and what is the dangerous thing about falling below it?
- 05 Explain silent prefix poisoning: how can a single careless edit raise the input bill roughly 10x with no error?
- 06 What are cache breakpoints, and why do people stack them on a long layered prompt?
If you could reconstruct each answer from memory, you hold the unit’s spine. Matching is positional and token-for-token from position zero, so stable content goes first and volatile content last, with the breakpoint on the final unchanging block. A cache write costs 1.25x the base input price and each read costs 0.1x, so caching wins from the first re-read inside the TTL: two calls cost 1.35x cached versus 2x uncached. The TTL is 5 minutes by default, with a 1-hour tier for bursty traffic whose gaps outlast that. Below the model’s minimum cacheable length nothing caches, and no error tells you so. The production failure mode to watch for is prefix poisoning near token zero, visible only in the usage block.
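To check your recalled numbers, the write/read arithmetic and the usage-block signal can be sketched in a few lines. This is a minimal sketch, not a billing tool: the 1.25x and 0.1x multipliers come from the recap, the 2.0 write multiplier for the 1-hour tier is an assumption read off the "2x premium" in question 03, and the usage field names (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`) are assumed to follow the Anthropic Messages API; rename them for another provider. The function names are hypothetical.

```python
# Multipliers relative to the base input-token price.
WRITE_5M = 1.25  # cache write, 5-minute TTL (from the recap)
WRITE_1H = 2.0   # cache write, 1-hour TTL (assumption: the "2x premium")
READ = 0.1       # cache read (from the recap)


def cached_cost(prefix_tokens: int, calls: int, write_mult: float = WRITE_5M) -> float:
    """Cost, in base-token units, of sending the same prefix `calls` times.

    Assumes one write on the first call and a cache hit on every later
    call, i.e. each call arrives before the TTL expires.
    """
    if calls <= 0:
        return 0.0
    return prefix_tokens * (write_mult + READ * (calls - 1))


def uncached_cost(prefix_tokens: int, calls: int) -> float:
    """Cost of the same traffic with no caching: full price every call."""
    return float(prefix_tokens * max(calls, 0))


def break_even_calls(write_mult: float) -> int:
    """Smallest number of calls at which caching is strictly cheaper."""
    calls = 1
    while cached_cost(1000, calls, write_mult) >= uncached_cost(1000, calls):
        calls += 1
    return calls


def cache_hit_ratio(usage: dict) -> float:
    """Share of input tokens served from cache, read from a usage block.

    A ratio that drops to 0.0 on a prompt that used to hit is the only
    visible symptom of prefix poisoning, since no error is raised.
    """
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0
```

Under these assumptions `break_even_calls(WRITE_5M)` returns 2 (the first re-read already pays for the write) and `break_even_calls(WRITE_1H)` returns 3, which is one way to frame the tier question: the 1-hour tier needs one more hit to pay for itself, so it earns its premium only when the gaps between calls would otherwise let a 5-minute entry expire.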