Prompt caching: request and usage reading

Read real Anthropic SDK request bodies and usage blocks, predict where the cache breakpoint lands and whether it hits, and pick the highest-leverage fix.

The request body decides what gets cached; the usage block tells you whether it worked. Read both, predict the behaviour, and choose the fix a senior engineer makes first — before touching the TTL.

Goal

Practise the loop you run on every caching incident: read where the cache_control breakpoint sits, read the usage fields to confirm a hit or a silent miss, and reach for the ordering fix before the tuning knob.
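
As a rough aid for the "read the usage fields" step, the sketch below turns a usage block into fractions. It assumes the usage block has already been parsed into a plain dict with the field names used in this lesson, and that total input is the sum of the three input counters. `CacheReading` and `read_usage` are hypothetical names for illustration, not part of the SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheReading:
    total_input: int       # uncached + cache-written + cache-read tokens
    read_fraction: float   # share of input served from the cache
    write_fraction: float  # share of input written to the cache


def read_usage(usage: dict) -> CacheReading:
    """Summarise one usage block (hypothetical helper, not an SDK call)."""
    uncached = usage.get("input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    total = uncached + written + read
    if total == 0:
        return CacheReading(0, 0.0, 0.0)
    return CacheReading(total, read / total, written / total)
```

Logging `read_fraction` and `write_fraction` per request, rather than eyeballing single responses, makes a pattern across steady traffic much easier to spot.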

Snippet 1 — where does the breakpoint go?

system = [
    {"type": "text", "text": SYSTEM_PROMPT},          # ~2k tokens, stable
    {"type": "text", "text": BIG_POLICY_DOC,           # ~28k tokens, stable
     "cache_control": {"type": "ephemeral"}},
]
messages = [
    {"role": "user", "content": f"As of {now()}: {question}"},
]

Quiz

The breakpoint is placed correctly on the last stable block, yet the cache never hits. Why?

Snippet 2 — the usage block

{
  "usage": {
    "input_tokens": 41,
    "cache_creation_input_tokens": 30218,
    "cache_read_input_tokens": 0,
    "output_tokens": 215
  }
}

Quiz

This usage block recurs on every request in a steady, high-frequency workload. What does it tell you?

Snippet 3 — TTL and tool ordering

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,                            # required by messages.create
    tools=serialize_tools(registry.values()),   # dict → list, order not guaranteed
    system=[{"type": "text", "text": SYSTEM,
             "cache_control": {"type": "ephemeral", "ttl": "1h"}}],
    messages=[{"role": "user", "content": q}],
)

Quiz

The 1-hour TTL is set and the system prompt is stable, but hit rate is erratic across deploys. What is the highest-leverage fix?

Snippet 4 — break-even arithmetic

# Sonnet 4.6: base input $3.00/MTok, cache write (5m) $3.75/MTok, cache read $0.30/MTok
# 20k-token stable prefix, re-read N times within the TTL window
write_cost = 20_000/1e6 * 3.75          # one write
read_cost  = 20_000/1e6 * 0.30 * N      # N reads
uncached   = 20_000/1e6 * 3.00 * (N+1)  # same N+1 requests, no cache

Quiz

With this pricing, after how many reads does caching the 20k prefix become cheaper than not caching at all?
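
To check your answer, the Snippet 4 arithmetic can be wrapped in runnable functions. This is a minimal sketch that uses only the prices quoted in the snippet's comment, counts prefix tokens only, and assumes every read lands inside the TTL window. The function names are hypothetical.

```python
# Prices in USD per million tokens, copied from the Snippet 4 comment.
BASE_INPUT = 3.00
CACHE_WRITE_5M = 3.75
CACHE_READ = 0.30


def cached_cost(prefix_tokens: int, reads: int) -> float:
    """One cache write followed by `reads` cache reads of the same prefix."""
    mtok = prefix_tokens / 1e6
    return mtok * CACHE_WRITE_5M + mtok * CACHE_READ * reads


def uncached_cost(prefix_tokens: int, reads: int) -> float:
    """The same reads + 1 requests, each paying the base input rate."""
    return prefix_tokens / 1e6 * BASE_INPUT * (reads + 1)


def first_cheaper_read_count(prefix_tokens: int, max_reads: int = 100):
    """Smallest number of reads at which caching costs less, or None."""
    for n in range(max_reads + 1):
        if cached_cost(prefix_tokens, n) < uncached_cost(prefix_tokens, n):
            return n
    return None
```

Call `first_cheaper_read_count(20_000)` after committing to an answer, then try other prefix sizes to see whether the break-even point depends on them.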

Recap

Every caching question is answered by reading the request body and the usage block. The breakpoint caches the whole prefix up to and including its block, so it belongs on the last stable block. A stable breakpoint is worthless, though, if the blocks in front of it (tools first, then system) are not byte-identical from request to request, which is why non-deterministic tool ordering and re-serialised whitespace are top poisoners. The usage fields are the only truth: a high cache_creation_input_tokens with cache_read_input_tokens near zero on steady traffic means a poisoned prefix, not success. And the break-even arithmetic is brutally in caching’s favour: at the Snippet 4 prices, a re-read prefix beats the full rate after a single read, so the real work is keeping the prefix stable, not tuning the TTL. When you open a usage block and see persistent cache creation with zero cache reads, the question is not “what TTL should I use” but “what changed near token zero.”
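
One way to keep the front of the prefix stable is to fix the tool order explicitly and log a fingerprint of what you are about to send. The sketch below assumes `registry` maps tool names to tool-definition dicts. `serialize_tools_stable` and `prefix_fingerprint` are hypothetical helpers, and the fingerprint only approximates the bytes the SDK actually sends.

```python
import hashlib
import json


def serialize_tools_stable(registry: dict) -> list:
    """Return tool definitions sorted by name, independent of insertion order."""
    return [registry[name] for name in sorted(registry)]


def prefix_fingerprint(tools: list, system: list) -> str:
    """Hash the cacheable prefix so a change shows up in logs.

    Keys are deliberately NOT sorted here: if key order drifts between
    deploys, the fingerprint should change so the drift is visible.
    """
    canonical = json.dumps(
        {"tools": tools, "system": system},
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging the fingerprint next to each usage block lets you line up a change in the prefix with a run of cache writes, which points at the deploy or code path that altered it.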
