awesome-everything RU
↑ Back to the climb

AI / LLM Integration

Prompt caching: measure the savings

Crux Hands-on project — add prompt caching to a real LLM service, instrument hit rate and cost, and prove the before/after savings in input spend and TTFT.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 210 min

Reading about cache hit rate is not the same as moving the number. Build a service with a large stable prefix, instrument it honestly, turn caching on, and prove what it bought you — in input dollars and in time-to-first-token — with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible loop: order the prompt for caching, read the usage block to compute a real cache-hit rate, defend prefix stability with a test, and verify the cost and latency win with before/after numbers under identical load.

Project
0 of 7
Objective

Take an LLM service with a large reused prefix (a RAG or agent app with a big system prompt + documents) and add prompt caching, then prove it cut input cost and TTFT without changing outputs — measuring the cache-hit rate, cost, and latency before and after.

Requirements
Acceptance criteria
  • A before/after table: cache-hit rate, input cost per 1k requests (write/read priced separately), and p50/p99 TTFT — all measured under the same load, not estimated.
  • The usage block on a warm request shows cache_read_input_tokens covering the full prefix and cache_creation near zero — proving reads, not repeated writes.
  • Cached input cost on the stable prefix drops by roughly an order of magnitude (≈90%) versus the uncached baseline, and TTFT on warm requests measurably improves.
  • A one-paragraph write-up of the TTL decision with the gap distribution that justifies it, and confirmation that outputs are byte-identical cached vs uncached.
Senior stretch
  • Add a deliberately-poisoned variant (inject 'Current time: {now}' at the top of the system prompt) and show, with the usage block, the hit rate collapsing to zero and the input bill jumping — then revert and re-confirm. Make the stability test catch it.
  • Stack multiple breakpoints on a layered prompt (after tools, after system, after the document) and demonstrate that changing only the document re-writes just the tail while the tools+system prefix keeps reading at 0.1x.
  • Add a cache-hit-rate panel to your dashboard from runtime metrics and an alert that fires when the rolling hit rate drops below a threshold — the missing signal that lets prefix poisoning hide for weeks.
  • Compare a 5-minute and 1-hour TTL under a simulated bursty traffic pattern (clusters with >5-minute gaps) and quantify how many writes the 1-hour tier converts into reads, then compute whether the 2x write premium paid off.
Recap

This is the loop you will run on every caching change in production: order the prompt stable-first, place the breakpoint on the last unchanging block, read the usage block to compute a true cache-hit rate, defend the prefix with a byte-identical test, choose the TTL from your real gap distribution, and verify the cost and TTFT win with before/after numbers under identical load. Doing it once on your own service makes the production version — including catching the silent poisoner — muscle memory.

Continue the climb ↑Tool calls: the round-trip loop, schema validation, and the guard against runaway agents
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources2
expand
  1. 01
  2. 02

Trademarks belong to their respective owners. Editorial reference only.