AI / LLM Integration
Prompt caching: how a stable prefix cuts input cost 10x
The dashboard said cache hit rate: 94%. The invoice said otherwise — input spend had tripled month over month for a workload whose traffic was flat. The culprit was four characters: someone had added Current time: {now} to the top of the system prompt “for context.” Every single request now had a unique prefix from token zero, so every request paid the cache-write premium and never once read from cache. The 94% was measuring a metric that no longer fired. A timestamp at the top of a prompt is one of the most expensive lines of code you can ship.
What actually gets cached: a prefix, matched token-for-token
Prompt caching is not semantic. The provider does not understand that two prompts “mean the same thing.” It hashes the prefix of your request — the leading run of tokens — and looks for a stored entry that matches it exactly, token for token, from position zero. The match must be identical up to and including a marked boundary called a cache breakpoint. The instant the streams diverge — a different word, an extra space, a reordered block — the match ends there, and everything from that point on is processed (and billed) as fresh input.
This is why the order of your prompt is the whole game. The cacheable thing is always the front. Anthropic’s request hierarchy is fixed: tools come first, then system, then messages. So the design rule writes itself: put everything stable at the front (tool definitions, the system prompt, a large document, few-shot examples) and everything volatile at the back (the user’s actual question, per-request data, anything with a timestamp). You set the breakpoint on the last block that is identical across requests. Set it one block too late — on something that changes — and the lookback finds no prior write at that position, so you get a clean miss every time while still paying to write.
The economics: a 1.25x toll for a 10x discount
Caching is a bet, and the odds are printed on the box. Writing to the cache costs more than a normal request: the 5-minute tier bills the cached tokens at 1.25x the base input rate, and the 1-hour tier at 2x. Reading from the cache is where you win: a hit bills those tokens at 0.1x — a 90% discount. Concretely, on Opus 4.7 the standard input rate is $5.00 / MTok and a cache hit drops it to $0.50 / MTok; on Sonnet 4.6 it goes $3.00 to $0.30.
The break-even falls out of that ratio. You pay 1.25x once on the write, then 0.1x on every subsequent read. The first cached request is a small loss; by the second request you are ahead, and the third onward is nearly free. The senior framing is not “should I cache” but “will this prefix be re-read inside the TTL window often enough to amortise the write premium.” A prefix that is written and then never re-read before it expires is strictly worse than not caching at all — you paid the 1.25x toll for nothing.
| Factor | Multiplier vs base input | What it means |
|---|---|---|
| Cache write, 5-min TTL | 1.25x | One-time premium to store the prefix; default tier |
| Cache write, 1-hour TTL | 2x | Costs more to write but survives 12x longer |
| Cache read (hit) | 0.1x | 90% discount; the entire reason caching exists |
| Uncached input | 1.0x | Anything after the divergence point pays full rate |
TTL: 5 minutes by default, 1 hour if you pay up front
The cache entry is ephemeral. The default time-to-live is 5 minutes, and crucially the clock resets on every read — an active prefix stays warm as long as it keeps getting hit. The 1-hour tier exists for slower or bursty traffic where requests arrive more than 5 minutes apart but still within the hour. The trade is direct: 1-hour costs 2x to write instead of 1.25x, so it only pays if the longer window converts misses into hits you would otherwise have lost. As of early 2026 the default sits at 5 minutes; teams that built their cost model assuming a longer default got a quiet bill shock when an entry they expected to survive between requests had already expired.
There is also a floor: a prefix below the model’s minimum cacheable length is silently not cached — no error, just full-rate billing. The threshold depends on the model.
| Model | Minimum cacheable tokens |
|---|---|
| Opus 4.7 / 4.6 / 4.5 | 4,096 |
| Sonnet 4.6 / 4.5, Opus 4.1 | 1,024 |
| Haiku 4.5 | 4,096 |
| Haiku 3.5 | 2,048 |
Why this works
You get at most 4 explicit cache_control breakpoints per request, and each searches back a limited window for a prior write. This is why people stack breakpoints on a long, layered prompt — one after the tools, one after the static system block, one after a big document — so a change low down only invalidates the tail, not the expensive top. The breakpoints are nested prefixes, not independent caches: a hit at a later breakpoint implies the earlier ones matched too.
The production failure mode: silent prefix poisoning
The dangerous part is that breaking the cache never throws. Nothing in the response says “you missed.” Output is byte-identical whether you hit or paid full rate; the only signal lives in the usage block, in cache_read_input_tokens versus cache_creation_input_tokens. So a single careless edit can take a workload from 90% reads to 0% reads and the only symptom is the invoice three weeks later.
The classic poisoners are all things that feel harmless: a timestamp or request id injected at the top of the system prompt; tools serialized in a non-deterministic order (a Map or a dict that doesn’t preserve insertion order across runs); a library upgrade that re-pretty-prints your JSON tool schema with different whitespace; appending the user’s message before the static context instead of after. Each one shifts the prefix at or near token zero, so the entire cached body — possibly tens of thousands of tokens of system prompt and documents — re-bills at the 1.25x write rate on every call. The fix is discipline: freeze the prefix, assert it byte-for-byte in a test, and gate any change to system/tools behind a review that asks “does this move the cache breakpoint.”
A RAG service sends a 30k-token system prompt + retrieved docs, then the user question. Traffic is bursty: requests cluster, then go quiet for ~15 minutes. Pick the caching setup.
Your cache hit rate dropped to near zero after a deploy, but no errors appeared and outputs look correct. What is the most likely cause?
A prefix is written to cache (1.25x) and then read three times (0.1x each) before it expires. Versus paying full input rate (1.0x) four times, did caching save money?
Order a prompt from front (cached) to back (fresh each request) for maximum cache reuse:
- 1 Tool definitions (stable JSON schemas, deterministic order)
- 2 System prompt / role instructions (stable across requests)
- 3 Large static context or few-shot examples (the cache breakpoint goes here)
- 4 Per-request retrieved documents (change every call)
- 5 The user's actual question (most volatile, always last)
- 01Walk through exactly how a single injected timestamp at the top of a system prompt can 10x your input bill, and why no error fires.
- 02When is the 1-hour TTL worth its 2x write premium over the default 5-minute tier, and how do you reason about the break-even?
Prompt caching reuses an exact, token-for-token prefix at roughly 0.1x the base input price, but it costs a one-time premium to write — 1.25x for the default 5-minute tier, 2x for the 1-hour tier. The match is positional, not semantic, so the design rule is absolute: stable content (tools, then system, then large context and examples) goes first, volatile content (per-request data, timestamps, the user’s question) goes last, and the breakpoint sits on the final unchanging block. Below the model’s minimum cacheable length (1,024 to 4,096 tokens depending on the model) nothing is cached and you are billed at full rate with no warning. The production failure mode is silent prefix poisoning: a timestamp, a reordered tool schema, or a reformatted whitespace change near token zero invalidates the whole cached body, flipping every request from a 0.1x read to a 1.25x write with no error and identical output. The only signal is the usage block. Freeze the prefix, assert it byte-for-byte in tests, and the discount holds.