AI / LLM Integration
Composing a production LLM app: the bug lives in the seam
The assistant passed every unit. Cache test: 90% hit rate. RAG test: top-5 recall is great. Streaming test: tokens flow. Tool test: the function fires. Agent test: it finishes the task. Then you ship the RAG-backed agentic assistant and the bill triples while answers feel slower. Nobody’s component is broken. The cache read rate dropped to near zero in prod because RAG context is now stitched into the cached prefix — and it changes on every single request. Six green components, one red invoice.
One request, every layer
A capstone request looks innocent: a user asks “what’s our refund window for EU orders?” and your assistant retrieves policy docs, reasons, maybe calls a tool, and streams an answer. Under the hood that single turn crosses six layers built in earlier units, each tested alone:
- Prompt cache — a long static prefix (system rules, tool schemas) marked with
cache_controlso repeat calls skip re-encoding it. - RAG — retrieve the top-k policy chunks for this query and inject them as context.
- Tool calls — the model may call
lookup_orderorget_policymid-turn. - Streaming — tokens are sent to the user as they’re generated, over SSE.
- Agent loop — if the first answer needs another retrieval or tool, loop again.
- Evals — an offline suite that gate-keeps deploys.
Each layer is correct in isolation. The production failure is always at a seam — where the output contract of one layer silently violates the input assumption of the next. You cannot find these by testing pieces. You find them by tracing one real request through every layer and asking, at each boundary, “what does the next layer assume that this layer just changed?”
Seam 1: RAG poisons the cache prefix
Anthropic prompt caching is a prefix match: the request is laid out as tools → system → messages, and the cache reuses the longest byte-for-byte identical prefix up to your last cache_control breakpoint. Default TTL is 5 minutes (or 1 hour). The naive composition is to build one big system block: rules + tool schemas + “Here are the relevant docs:” + the retrieved chunks. It demos perfectly. In prod the retrieved chunks differ on every query, so the prefix differs from byte one, the cache read rate collapses toward zero, and you pay full input price every call — exactly the case where caching mattered most.
The fix is a seam-aware layout: put the static prefix (system rules, tool schemas) before the breakpoint, and the per-request RAG context after it, in the messages. The static prefix stays cached; only the cheap dynamic tail re-encodes. This is the whole reason cache_control lets you place breakpoints at boundaries instead of caching the entire request blob.
| Seam | Each side is correct | Composed failure | Fix at the boundary |
|---|---|---|---|
| RAG → cache | Retrieval ranks well; cache hits in unit test | Per-request chunks sit inside the cached prefix → hit rate ≈ 0, full input cost | Static prefix before the breakpoint; RAG after it |
| Tools → streaming | Tool fires; stream delivers tokens | stop_reason: tool_use mid-stream; UI keeps spinning waiting for prose | Treat the stream as a state machine: pause render, run tool, resume |
| Agent loop → budget | Loop converges on the happy path | Bad input retries forever; no step/$ ceiling → runaway spend | Hard caps: step ≤ MAX, spent ≤ BUDGET, dedupe repeated calls |
| Evals → retrieval | Generation evals pass on fixed context | Suite never varies retrieval → retrieval regressions ship green | End-to-end evals that include the live retrieval path |
Seam 2: a tool call breaks the stream
Streaming and tool use each work, but they share a wire. A streamed turn does not always end in prose: the model can emit stop_reason: tool_use partway through. If your frontend treats the stream as “tokens until done,” it renders the partial text, then hangs — the spinner never resolves because the real continuation is another request you haven’t sent yet (the tool result, fed back). Worse failure modes from the field: a network glitch truncates the stream with a tool_use stop reason but zero tool-call blocks, so the agent finds nothing to execute and goes idle silently; or a process crash mid-tool-execution orphans the result and the next request is rejected with unexpected tool_use_id found in tool_result block, leaving the session unrecoverable without manual surgery.
The composition rule: the stream is a state machine, not a token pipe. States are text, tool_use_requested, awaiting_tool_result, resumed. The tool-use stop is a transition, not an end. And every tool_use id must be matched by exactly one tool_result in the next request — track them, or the API rejects the whole turn.
Why this works
Why does this only bite in prod? In a demo you ask one clean question and the model answers in prose — the tool_use branch never fires, so the streaming-plus-tools seam is never exercised. The first real user who triggers a tool mid-answer is the first traffic that ever crosses that seam. “Works on my machine” here means “I never hit the branch that breaks.”
Seam 3: the agent loop with no budget
An agent loop is “call the model, run tools, feed results back, repeat until done.” The unit test ends because the task succeeds. Production input does not cooperate: a tool returns a malformed result, the model rephrases and retries, the result is still malformed, it retries again — and because each iteration ships the entire growing transcript back to the model, cost climbs super-linearly. The widely-cited postmortem: four agents with no step cap entered a loop, ran for 11 days, and burned $47,000 before anyone noticed. The lesson there is sharp — token-budget alerts are not enforcement. Alerts fire after the spend; enforcement refuses the next call.
The composition fix is three asserts before every model call: step ≤ MAX_STEPS, spent ≤ BUDGET_USD, and hash(tool_name, args) not in seen to kill repeat-the-same-call loops. A budget-aware gateway returns an error instead of forwarding the request once the ceiling is hit. Teams that add this typically cut agent cost 55–75%.
Your RAG-backed assistant has cache hit rate near 0% in prod despite a long static system prompt. Pick the fix.
The senior thesis: model the flow, not the parts
The through-line of every seam above: a layer that is correct by its own contract changes something the next layer silently depended on. RAG changed the prefix the cache assumed was stable. A tool call changed the stream the renderer assumed was prose. Real input changed the loop the budget assumed would terminate. The retrieval path changed under an eval suite that assumed fixed context. None of these is a bug in a component — each is a bug between components. So the senior skill at the capstone is not building better pieces; it’s threat- and cost-modeling the whole request path: what does each boundary assume, and which upstream layer can violate it? Trace one real request end to end and the seams light up.
A streamed turn ends with stop_reason: tool_use partway through, and the UI hangs on a spinner. What's the correct mental model?
Your offline eval suite is green on every deploy, but users report worse answers after a retriever change. Why did evals miss it?
Order how to debug a composed LLM app whose cost tripled and answers feel slower:
- 1 Trace ONE real production request through every layer (cache, RAG, tools, stream, loop)
- 2 At each boundary, ask what the next layer assumed that this layer just changed
- 3 Spot the seam: per-request RAG context is inside the cached prefix → hit rate ≈ 0
- 4 Move RAG context after the cache_control breakpoint; keep static rules before it
- 5 Add an end-to-end eval that varies retrieval, so the seam can't silently regress again
- 01Explain why a RAG-backed assistant with a long static system prompt can still see near-zero cache hit rate in production, and how to fix it.
- 02What is the 'bug lives in the seam' thesis, and how does it change how you debug a composed LLM app versus a single component?
A production LLM application is a composition of layers — prompt caching, RAG, tool calls, streaming, an agent loop, and evals — and every one of them can pass its own unit test while the system fails. The failures live in the seams. Per-request RAG context stitched into the cached prefix collapses the cache hit rate to near zero, because caching is a byte-for-byte prefix match and the chunks change every query; the fix is to keep the static prefix before the cache_control breakpoint and the dynamic context after it. A tool-use stop mid-stream hangs a renderer that assumes prose, so model the stream as a state machine and match every tool_use id with a tool_result. An agent loop with no step or dollar ceiling retries forever on bad input — the famous case ran 11 days for $47,000 — so enforce caps, not just alerts. And generation evals on frozen context ship retrieval regressions green, so add end-to-end evals that vary the live retrieval path. The senior move is not better components; it is to trace one real request through every layer and, at each boundary, ask what the next layer assumed that the last one just changed. Threat- and cost-model the whole flow.