AI / LLM Integration
Prompt caching: measure the savings
Reading about cache hit rate is not the same as moving the number. Build a service with a large stable prefix, instrument it honestly, turn caching on, and prove what it bought you — in input dollars and in time-to-first-token — with evidence at every step.
Turn the unit’s mental model into a reproducible loop: order the prompt for caching, read the usage block to compute a real cache-hit rate, defend prefix stability with a test, and verify the cost and latency win with before/after numbers under identical load.
Take an LLM service with a large reused prefix (a RAG or agent app with a big system prompt + documents) and add prompt caching, then prove it cut input cost and TTFT without changing outputs — measuring the cache-hit rate, cost, and latency before and after.
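A minimal sketch of the measurement step, assuming the usage block exposes `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`, and that `input_tokens` counts only the uncached part of the prompt. The `Usage` class, the helper names, and the price multipliers (1.25x for writes, 0.1x for reads) are illustrative assumptions; substitute your provider's actual fields and rates.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Usage:
    """Token counts copied from one response's usage block (hypothetical wrapper)."""
    input_tokens: int = 0                 # uncached input tokens
    cache_creation_input_tokens: int = 0  # tokens written to the cache
    cache_read_input_tokens: int = 0      # tokens served from the cache


# Assumed multipliers relative to the base input price; check your price sheet.
WRITE_MULT = 1.25
READ_MULT = 0.1


def cache_hit_rate(usages: Iterable[Usage]) -> float:
    """Fraction of all input tokens that were served from the cache."""
    rows: List[Usage] = list(usages)
    read = sum(u.cache_read_input_tokens for u in rows)
    total = sum(
        u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens
        for u in rows
    )
    return read / total if total else 0.0


def input_cost(usages: Iterable[Usage], base_usd_per_mtok: float) -> float:
    """Input cost in dollars, pricing cache writes and reads separately."""
    weighted = sum(
        u.input_tokens
        + WRITE_MULT * u.cache_creation_input_tokens
        + READ_MULT * u.cache_read_input_tokens
        for u in usages
    )
    return weighted / 1_000_000 * base_usd_per_mtok


def uncached_cost(usages: Iterable[Usage], base_usd_per_mtok: float) -> float:
    """Baseline: the same tokens billed entirely at the base input price."""
    total = sum(
        u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens
        for u in usages
    )
    return total / 1_000_000 * base_usd_per_mtok
```

Feed both functions the usage blocks logged during the same load run, so the hit rate and the before/after cost come from identical traffic rather than from an estimate.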
- A before/after table: cache-hit rate, input cost per 1k requests (write/read priced separately), and p50/p99 TTFT — all measured under the same load, not estimated.
- The usage block on a warm request shows cache_read_input_tokens covering the full prefix and cache_creation_input_tokens near zero, proving the prefix is being read from the cache rather than re-written on every request.
- Cached input cost on the stable prefix drops by roughly an order of magnitude (≈90%) versus the uncached baseline, and TTFT on warm requests measurably improves.
- A one-paragraph write-up of the TTL decision with the gap distribution that justifies it, and confirmation that outputs are byte-identical cached vs uncached.
- Add a deliberately poisoned variant (inject 'Current time: {now}' at the top of the system prompt) and show, with the usage block, the hit rate collapsing to zero and the input bill jumping. Then revert and re-confirm the original numbers. Make the stability test catch the poisoned variant.
- Stack multiple breakpoints on a layered prompt (after tools, after system, after the document) and demonstrate that changing only the document re-writes just the tail while the tools+system prefix keeps reading at 0.1x.
- Add a cache-hit-rate panel to your dashboard from runtime metrics and an alert that fires when the rolling hit rate drops below a threshold. Without this signal, prefix poisoning can hide for weeks.
- Compare a 5-minute and a 1-hour TTL under a simulated bursty traffic pattern (clusters separated by gaps longer than 5 minutes), quantify how many writes the 1-hour tier converts into reads, then compute whether its 2x write price paid off.
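The TTL comparison can be sketched as a small simulation. This is an illustrative model, not a provider's billing logic: it assumes every cache hit refreshes the TTL, that a request arriving after the TTL has lapsed pays a full write, and that writes cost 1.25x (5-minute tier) or 2x (1-hour tier) of the base input price while reads cost 0.1x. The function names and the sample arrival pattern are made up for the example.

```python
from typing import Iterable, Tuple


def simulate_cache(arrivals_s: Iterable[float], ttl_s: float) -> Tuple[int, int]:
    """Count (writes, reads) for one stable prefix under a given TTL.

    Assumes each read refreshes the TTL, so only a gap of ttl_s or more
    between consecutive requests forces a new write.
    """
    writes = reads = 0
    last = None
    for t in sorted(arrivals_s):
        if last is not None and (t - last) < ttl_s:
            reads += 1
        else:
            writes += 1
        last = t
    return writes, reads


def prefix_cost_units(writes: int, reads: int, write_mult: float,
                      read_mult: float = 0.1) -> float:
    """Cost of the prefix in multiples of one uncached prefix send."""
    return writes * write_mult + reads * read_mult


# Hypothetical bursty traffic: three bursts of three requests, 10 minutes apart.
arrivals = [0, 10, 20, 620, 630, 640, 1240, 1250, 1260]

w5, r5 = simulate_cache(arrivals, ttl_s=300)    # 5-minute tier
w60, r60 = simulate_cache(arrivals, ttl_s=3600)  # 1-hour tier

cost_5m = prefix_cost_units(w5, r5, write_mult=1.25)  # assumed 5-minute write price
cost_1h = prefix_cost_units(w60, r60, write_mult=2.0)  # assumed 1-hour write price
print(f"5m: {w5} writes, {r5} reads, cost {cost_5m:.2f} prefix units")
print(f"1h: {w60} writes, {r60} reads, cost {cost_1h:.2f} prefix units")
```

Replace `arrivals` with timestamps from your own request log; the 1-hour tier only pays off when enough gaps fall between the two TTLs to offset its higher write price.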
This is the loop you will run on every caching change in production: order the prompt stable-first, place the breakpoint on the last unchanging block, read the usage block to compute a true cache-hit rate, defend the prefix with a byte-identical test, choose the TTL from your real gap distribution, and verify the cost and TTFT win with before/after numbers under identical load. Doing it once on your own service makes the production version — including catching the silent poisoner — muscle memory.
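The prefix stability test in that loop might look like the sketch below. It fingerprints the serialized prefix and asserts that two builds produce identical bytes. `build_prefix`, `build_poisoned_prefix`, and the JSON serialization are hypothetical stand-ins: in a real service, hash whatever bytes your client actually sends up to the cache breakpoint.

```python
import hashlib
import json
from typing import Any, Dict, List


def build_prefix(tools: List[Dict[str, Any]], system: str, document: str) -> bytes:
    """Serialize the stable part of the prompt deterministically (hypothetical)."""
    payload = {"tools": tools, "system": system, "document": document}
    # sort_keys and fixed separators keep the bytes stable across runs.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def build_poisoned_prefix(tools: List[Dict[str, Any]], system: str,
                          document: str, now: str) -> bytes:
    """The poisoned variant: a timestamp injected at the top of the system prompt."""
    return build_prefix(tools, f"Current time: {now}\n{system}", document)


def fingerprint(prefix: bytes) -> str:
    """SHA-256 of the prefix bytes; any single-byte change alters it."""
    return hashlib.sha256(prefix).hexdigest()


def is_stable(first: bytes, second: bytes) -> bool:
    """True when two builds of the prefix are byte-identical."""
    return fingerprint(first) == fingerprint(second)


# Illustrative inputs only.
TOOLS = [{"name": "search", "description": "Look up a document"}]
SYSTEM = "You are a support assistant."
DOCUMENT = "Refund policy: 30 days."


def test_prefix_is_stable() -> None:
    a = build_prefix(TOOLS, SYSTEM, DOCUMENT)
    b = build_prefix(TOOLS, SYSTEM, DOCUMENT)
    assert is_stable(a, b)


def test_poisoned_prefix_is_caught() -> None:
    a = build_poisoned_prefix(TOOLS, SYSTEM, DOCUMENT, now="2024-01-01T00:00:00Z")
    b = build_poisoned_prefix(TOOLS, SYSTEM, DOCUMENT, now="2024-01-01T00:00:01Z")
    assert not is_stable(a, b)
```

Run the stability check in CI with two builds taken at different times and for different users, so anything time- or request-dependent that leaks into the prefix fails the build before it reaches the bill.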