AI / LLM Integration AI · 08 · 01

Composing a production LLM app: the bug lives in the seam

Caching, RAG, streaming, tools, agents, and evals each pass their own tests, then fail together. Trace one request end to end, because the bug lives in the seam between two correct layers — not inside either one.

AI Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

Already know this unit? Take a 1-minute quick check →

The assistant passed every unit. Cache test: 90% hit rate. RAG test: top-5 recall is great. Streaming test: tokens flow. Tool test: the function fires. Agent test: it finishes the task. Then you ship the RAG-backed agentic assistant and the bill triples while answers feel slower. Nobody’s component is broken. The cache read rate dropped to near zero in prod because RAG context is now stitched into the cached prefix — and it changes on every single request. Six green components, one red invoice.

By the end of this lesson you’ll know exactly where each of those six layers can silently break its neighbor — and the one trace that makes every seam visible.

One request, every layer

A capstone request looks innocent: a user asks “what’s our refund window for EU orders?” and your assistant retrieves policy docs, reasons, maybe calls a tool, and streams an answer. Under the hood that single turn crosses six layers built in earlier units, each tested alone:

Prompt cache — a long static prefix (system rules, tool schemas) marked with cache_control so repeat calls skip re-encoding it.
RAG — retrieve the top-k policy chunks for this query and inject them as context.
Tool calls — the model may call lookup_order or get_policy mid-turn.
Streaming — tokens are sent to the user as they’re generated, over SSE.
Agent loop — if the first answer needs another retrieval or tool, loop again.
Evals — an offline suite that gate-keeps deploys.

Each layer is correct in isolation. The production failure is always at a seam — where the output contract of one layer silently violates the input assumption of the next. You cannot find these by testing pieces. You find them by tracing one real request through every layer and asking, at each boundary, “what does the next layer assume that this layer just changed?”

Together these six layers mean that testing each in isolation is not enough — what your unit tests miss is exactly the contract mismatch between step N’s output and step N+1’s assumption. Without that cross-layer view, seam bugs ship green every time.

Seam 1: RAG poisons the cache prefix

Anthropic prompt caching is a prefix match: the request is laid out as tools → system → messages, and the cache reuses the longest byte-for-byte identical prefix up to your last cache_control breakpoint. Default TTL is 5 minutes (or 1 hour). The naive composition is to build one big system block: rules + tool schemas + “Here are the relevant docs:” + the retrieved chunks. It demos perfectly. In prod the retrieved chunks differ on every query, so the prefix differs from byte one, the cache read rate collapses toward zero, and you pay full input price every call — exactly the case where caching mattered most.

The fix is a seam-aware layout: put the static prefix (system rules, tool schemas) before the breakpoint, and the per-request RAG context after it, in the messages. The static prefix stays cached; only the cheap dynamic tail re-encodes. This is the whole reason cache_control lets you place breakpoints at boundaries instead of caching the entire request blob.

Seam	Each side is correct	Composed failure	Fix at the boundary
RAG → cache	Retrieval ranks well; cache hits in unit test	Per-request chunks sit inside the cached prefix → hit rate ≈ 0, full input cost	Static prefix before the breakpoint; RAG after it
Tools → streaming	Tool fires; stream delivers tokens	`stop_reason: tool_use` mid-stream; UI keeps spinning waiting for prose	Treat the stream as a state machine: pause render, run tool, resume
Agent loop → budget	Loop converges on the happy path	Bad input retries forever; no step/$ ceiling → runaway spend	Hard caps: `step ≤ MAX`, `spent ≤ BUDGET`, dedupe repeated calls
Evals → retrieval	Generation evals pass on fixed context	Suite never varies retrieval → retrieval regressions ship green	End-to-end evals that include the live retrieval path

Seam 2: a tool call breaks the stream

Streaming and tool use each work, but they share a wire. A streamed turn does not always end in prose: the model can emit stop_reason: tool_use partway through. If your frontend treats the stream as “tokens until done,” it renders the partial text, then hangs — the spinner never resolves because the real continuation is another request you haven’t sent yet (the tool result, fed back). Worse failure modes from the field: a network glitch truncates the stream with a tool_use stop reason but zero tool-call blocks, so the agent finds nothing to execute and goes idle silently; or a process crash mid-tool-execution orphans the result and the next request is rejected with unexpected tool_use_id found in tool_result block, leaving the session unrecoverable without manual surgery.

The composition rule: the stream is a state machine, not a token pipe. States are text, tool_use_requested, awaiting_tool_result, resumed. The tool-use stop is a transition, not an end. And every tool_use id must be matched by exactly one tool_result in the next request — track them, or the API rejects the whole turn.

▸Why this works

Why does this only bite in prod? In a demo you ask one clean question and the model answers in prose — the tool_use branch never fires, so the streaming-plus-tools seam is never exercised. The first real user who triggers a tool mid-answer is the first traffic that ever crosses that seam. “Works on my machine” here means “I never hit the branch that breaks.”

Seam 3: the agent loop with no budget

An agent loop is “call the model, run tools, feed results back, repeat until done.” The unit test ends because the task succeeds. Production input does not cooperate: a tool returns a malformed result, the model rephrases and retries, the result is still malformed, it retries again — and because each iteration ships the entire growing transcript back to the model, cost climbs super-linearly. The widely-cited postmortem: four agents with no step cap entered a loop, ran for 11 days, and burned $47,000 before anyone noticed. The lesson there is sharp — token-budget alerts are not enforcement. Alerts fire after the spend; enforcement refuses the next call.

The composition fix is three asserts before every model call: step ≤ MAX_STEPS, spent ≤ BUDGET_USD, and hash(tool_name, args) not in seen to kill repeat-the-same-call loops. A budget-aware gateway returns an error instead of forwarding the request once the ceiling is hit. Teams that add this typically cut agent cost 55–75%.

Pick the best fit

Your RAG-backed assistant has cache hit rate near 0% in prod despite a long static system prompt. Pick the fix.

The senior thesis: model the flow, not the parts

The through-line of every seam above: a layer that is correct by its own contract changes something the next layer silently depended on. RAG changed the prefix the cache assumed was stable. A tool call changed the stream the renderer assumed was prose. Real input changed the loop the budget assumed would terminate. The retrieval path changed under an eval suite that assumed fixed context. None of these is a bug in a component — each is a bug between components. So the senior skill at the capstone is not building better pieces; it’s threat- and cost-modeling the whole request path: what does each boundary assume, and which upstream layer can violate it? Trace one real request end to end and the seams light up.

Quiz

A streamed turn ends with stop_reason: tool_use partway through, and the UI hangs on a spinner. What's the correct mental model?

Quiz

Your offline eval suite is green on every deploy, but users report worse answers after a retriever change. Why did evals miss it?

Order the steps

Order how to debug a composed LLM app whose cost tripled and answers feel slower:

1 Trace ONE real production request through every layer (cache, RAG, tools, stream, loop)
2 At each boundary, ask what the next layer assumed that this layer just changed
3 Spot the seam: per-request RAG context is inside the cached prefix → hit rate ≈ 0
4 Move RAG context after the cache_control breakpoint; keep static rules before it
5 Add an end-to-end eval that varies retrieval, so the seam can't silently regress again

A single turn flows through prompt cache → RAG → tools → streaming → agent loop, gated by evals. Each layer passes its own test; the failures are at the seams: RAG chunks bust the cached prefix, a tool_use stop stalls the stream, an uncapped loop runs away, and generation evals on frozen context never exercise live retrieval. Trace one real request and the seams light up.

Recall before you leave

01
Explain why a RAG-backed assistant with a long static system prompt can still see near-zero cache hit rate in production, and how to fix it.
02
What is the 'bug lives in the seam' thesis, and how does it change how you debug a composed LLM app versus a single component?

Recap

A production LLM application is a composition of layers — prompt caching, RAG, tool calls, streaming, an agent loop, and evals — and every one of them can pass its own unit test while the system fails. The failures live in the seams. Per-request RAG context stitched into the cached prefix collapses the cache hit rate to near zero, because caching is a byte-for-byte prefix match and the chunks change every query; the fix is to keep the static prefix before the cache_control breakpoint and the dynamic context after it. A tool-use stop mid-stream hangs a renderer that assumes prose, so model the stream as a state machine and match every tool_use id with a tool_result. An agent loop with no step or dollar ceiling retries forever on bad input — the famous case ran 11 days for $47,000 — so enforce caps, not just alerts. And generation evals on frozen context ship retrieval regressions green, so add end-to-end evals that vary the live retrieval path. The senior move is not better components; it is to trace one real request through every layer and, at each boundary, ask what the next layer assumed that the last one just changed. Now when you see six green component tests and a red invoice in prod, your first move is to trace one real request — the seam will light up within two boundaries.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.