awesome-everything RU
↑ Back to the climb

AI / LLM Integration

Composing a production LLM app: the bug lives in the seam

Crux Caching, RAG, streaming, tools, agents, and evals each pass their own tests, then fail together. Trace one request end to end, because the bug lives in the seam between two correct layers — not inside either one.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at junior altitude — the surface
◷ 17 min

The assistant passed every unit. Cache test: 90% hit rate. RAG test: top-5 recall is great. Streaming test: tokens flow. Tool test: the function fires. Agent test: it finishes the task. Then you ship the RAG-backed agentic assistant and the bill triples while answers feel slower. Nobody’s component is broken. The cache read rate dropped to near zero in prod because RAG context is now stitched into the cached prefix — and it changes on every single request. Six green components, one red invoice.

One request, every layer

A capstone request looks innocent: a user asks “what’s our refund window for EU orders?” and your assistant retrieves policy docs, reasons, maybe calls a tool, and streams an answer. Under the hood that single turn crosses six layers built in earlier units, each tested alone:

  1. Prompt cache — a long static prefix (system rules, tool schemas) marked with cache_control so repeat calls skip re-encoding it.
  2. RAG — retrieve the top-k policy chunks for this query and inject them as context.
  3. Tool calls — the model may call lookup_order or get_policy mid-turn.
  4. Streaming — tokens are sent to the user as they’re generated, over SSE.
  5. Agent loop — if the first answer needs another retrieval or tool, loop again.
  6. Evals — an offline suite that gate-keeps deploys.

Each layer is correct in isolation. The production failure is always at a seam — where the output contract of one layer silently violates the input assumption of the next. You cannot find these by testing pieces. You find them by tracing one real request through every layer and asking, at each boundary, “what does the next layer assume that this layer just changed?”

Seam 1: RAG poisons the cache prefix

Anthropic prompt caching is a prefix match: the request is laid out as tools → system → messages, and the cache reuses the longest byte-for-byte identical prefix up to your last cache_control breakpoint. Default TTL is 5 minutes (or 1 hour). The naive composition is to build one big system block: rules + tool schemas + “Here are the relevant docs:” + the retrieved chunks. It demos perfectly. In prod the retrieved chunks differ on every query, so the prefix differs from byte one, the cache read rate collapses toward zero, and you pay full input price every call — exactly the case where caching mattered most.

The fix is a seam-aware layout: put the static prefix (system rules, tool schemas) before the breakpoint, and the per-request RAG context after it, in the messages. The static prefix stays cached; only the cheap dynamic tail re-encodes. This is the whole reason cache_control lets you place breakpoints at boundaries instead of caching the entire request blob.

SeamEach side is correctComposed failureFix at the boundary
RAG → cacheRetrieval ranks well; cache hits in unit testPer-request chunks sit inside the cached prefix → hit rate ≈ 0, full input costStatic prefix before the breakpoint; RAG after it
Tools → streamingTool fires; stream delivers tokensstop_reason: tool_use mid-stream; UI keeps spinning waiting for proseTreat the stream as a state machine: pause render, run tool, resume
Agent loop → budgetLoop converges on the happy pathBad input retries forever; no step/$ ceiling → runaway spendHard caps: step ≤ MAX, spent ≤ BUDGET, dedupe repeated calls
Evals → retrievalGeneration evals pass on fixed contextSuite never varies retrieval → retrieval regressions ship greenEnd-to-end evals that include the live retrieval path

Seam 2: a tool call breaks the stream

Streaming and tool use each work, but they share a wire. A streamed turn does not always end in prose: the model can emit stop_reason: tool_use partway through. If your frontend treats the stream as “tokens until done,” it renders the partial text, then hangs — the spinner never resolves because the real continuation is another request you haven’t sent yet (the tool result, fed back). Worse failure modes from the field: a network glitch truncates the stream with a tool_use stop reason but zero tool-call blocks, so the agent finds nothing to execute and goes idle silently; or a process crash mid-tool-execution orphans the result and the next request is rejected with unexpected tool_use_id found in tool_result block, leaving the session unrecoverable without manual surgery.

The composition rule: the stream is a state machine, not a token pipe. States are text, tool_use_requested, awaiting_tool_result, resumed. The tool-use stop is a transition, not an end. And every tool_use id must be matched by exactly one tool_result in the next request — track them, or the API rejects the whole turn.

Why this works

Why does this only bite in prod? In a demo you ask one clean question and the model answers in prose — the tool_use branch never fires, so the streaming-plus-tools seam is never exercised. The first real user who triggers a tool mid-answer is the first traffic that ever crosses that seam. “Works on my machine” here means “I never hit the branch that breaks.”

Seam 3: the agent loop with no budget

An agent loop is “call the model, run tools, feed results back, repeat until done.” The unit test ends because the task succeeds. Production input does not cooperate: a tool returns a malformed result, the model rephrases and retries, the result is still malformed, it retries again — and because each iteration ships the entire growing transcript back to the model, cost climbs super-linearly. The widely-cited postmortem: four agents with no step cap entered a loop, ran for 11 days, and burned $47,000 before anyone noticed. The lesson there is sharp — token-budget alerts are not enforcement. Alerts fire after the spend; enforcement refuses the next call.

The composition fix is three asserts before every model call: step ≤ MAX_STEPS, spent ≤ BUDGET_USD, and hash(tool_name, args) not in seen to kill repeat-the-same-call loops. A budget-aware gateway returns an error instead of forwarding the request once the ceiling is hit. Teams that add this typically cut agent cost 55–75%.

Pick the best fit

Your RAG-backed assistant has cache hit rate near 0% in prod despite a long static system prompt. Pick the fix.

The senior thesis: model the flow, not the parts

The through-line of every seam above: a layer that is correct by its own contract changes something the next layer silently depended on. RAG changed the prefix the cache assumed was stable. A tool call changed the stream the renderer assumed was prose. Real input changed the loop the budget assumed would terminate. The retrieval path changed under an eval suite that assumed fixed context. None of these is a bug in a component — each is a bug between components. So the senior skill at the capstone is not building better pieces; it’s threat- and cost-modeling the whole request path: what does each boundary assume, and which upstream layer can violate it? Trace one real request end to end and the seams light up.

Quiz

A streamed turn ends with stop_reason: tool_use partway through, and the UI hangs on a spinner. What's the correct mental model?

Quiz

Your offline eval suite is green on every deploy, but users report worse answers after a retriever change. Why did evals miss it?

Order the steps

Order how to debug a composed LLM app whose cost tripled and answers feel slower:

  1. 1 Trace ONE real production request through every layer (cache, RAG, tools, stream, loop)
  2. 2 At each boundary, ask what the next layer assumed that this layer just changed
  3. 3 Spot the seam: per-request RAG context is inside the cached prefix → hit rate ≈ 0
  4. 4 Move RAG context after the cache_control breakpoint; keep static rules before it
  5. 5 Add an end-to-end eval that varies retrieval, so the seam can't silently regress again
Recall before you leave
  1. 01
    Explain why a RAG-backed assistant with a long static system prompt can still see near-zero cache hit rate in production, and how to fix it.
  2. 02
    What is the 'bug lives in the seam' thesis, and how does it change how you debug a composed LLM app versus a single component?
Recap

A production LLM application is a composition of layers — prompt caching, RAG, tool calls, streaming, an agent loop, and evals — and every one of them can pass its own unit test while the system fails. The failures live in the seams. Per-request RAG context stitched into the cached prefix collapses the cache hit rate to near zero, because caching is a byte-for-byte prefix match and the chunks change every query; the fix is to keep the static prefix before the cache_control breakpoint and the dynamic context after it. A tool-use stop mid-stream hangs a renderer that assumes prose, so model the stream as a state machine and match every tool_use id with a tool_result. An agent loop with no step or dollar ceiling retries forever on bad input — the famous case ran 11 days for $47,000 — so enforce caps, not just alerts. And generation evals on frozen context ship retrieval regressions green, so add end-to-end evals that vary the live retrieval path. The senior move is not better components; it is to trace one real request through every layer and, at each boundary, ask what the next layer assumed that the last one just changed. Threat- and cost-model the whole flow.

Continue the climb ↑Composing LLM apps: multiple-choice review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.