Crux Read real composition snippets — a cache layout, a streaming/tool state machine, a budget-gated agent loop, and a RAG retrieve+inject path — and pick the highest-leverage fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
The seams show up in code, not in slides. Read each snippet the way you’d read a PR for a production LLM feature: find the boundary where one layer quietly violates the next layer’s assumption, and pick the fix a senior engineer makes first.
Goal
Practise the loop you run when composing layers: read the request layout or loop, find which boundary breaks under real traffic, and reach for the structural fix — not a tuning knob.
The breakpoint is placed after the retrieved chunks. What does this layout actually cache, and how do you fix it?
Heads-up The cache reuses the longest identical prefix up to the LAST breakpoint. With the breakpoint after the chunks, the cached span includes the chunks, which change every request — so it never hits. Only content before the chunks should be under the breakpoint.
Heads-up Per-request chunks are exactly what you don't want cached — they change every query, so caching them is wasted. The reusable, expensive part is the long static rules + schemas; those must sit before the breakpoint.
Heads-up More breakpoints inside per-request content can't help — the prefix still differs from the first byte the chunks change. The fix is to keep dynamic content out of the cached prefix entirely, not to add breakpoints to it.
Snippet 2 — the streaming / tool state machine
async for event in stream: if event.type == "content_block_delta": ui.append(event.delta.text) elif event.type == "message_stop": ui.done() # spinner resolves here# (no other branches)
Quiz
Completed
A tool-using turn streams text, then ends with stop_reason: tool_use. Trace this loop's behaviour and pick the fix.
Heads-up message_stop fires for every stop_reason, including tool_use. Treating it as 'turn complete' resolves the UI while the model is actually waiting for a tool result you never sent.
Heads-up Appending deltas is correct streaming behaviour. The missing branch is handling the tool_use stop reason — buffering changes nothing about the unhandled tool transition.
Heads-up A timeout hides the stall instead of completing the turn; the user still gets a half-answer. You must detect the tool_use stop, run the tool, and resume the turn.
Snippet 3 — the agent loop
def run_agent(task): msgs = [task] while True: resp = client.messages.create(model=MODEL, messages=msgs, tools=TOOLS) if resp.stop_reason != "tool_use": return resp for call in tool_uses(resp): msgs.append(tool_result(call, execute(call))) # transcript grows each loop
Quiz
Completed
A malformed tool result makes the model rephrase and retry the same call forever. What does this loop cost, and what is the minimal fix?
Heads-up Cost is not flat — msgs grows every loop and the whole transcript is resent each call, so per-step input cost rises with the iteration count. An unterminated loop is both slow AND super-linearly expensive.
Heads-up Retrying a deterministically malformed result just adds more iterations — it feeds the loop. The loop needs a terminating gate (step/budget caps + dedupe), not more attempts.
Heads-up There's no exception — the tool returns a value the model dislikes, so it loops on valid responses. Only an explicit step/budget ceiling and call dedupe stop it.
Snippet 4 — the RAG retrieve + cost calc
hits = vectordb.query(embed(user_q), top_k=20) # no rerank, no thresholdcontext = "\n".join(h.text for h in hits) # ~20 chunks × ~800 tok ≈ 16k tok# input ≈ 16k context + 2k prompt = 18k tok; price $3 / 1M input tokcost_per_call = 18_000 / 1_000_000 * 3 # ≈ $0.054, every request, uncached
Quiz
Completed
This retrieve-and-inject path is correct in isolation but expensive in the composed app. Read the numbers and pick the highest-leverage change.
Heads-up Doubling top_k doubles the dynamic token cost and the distraction, while recall past the first few relevant chunks adds little. The lever is fewer, better chunks (rerank + threshold), not more chunks.
Heads-up This context is per-query, so it can't be cached — caching needs a stable prefix. The win is shrinking the dynamic context (rerank) and keeping it out of the cached prefix, not caching what changes every call.
Heads-up A cheaper model lowers the per-token price but you're still shipping 16k of mostly-irrelevant context every call, hurting quality and cost. Fix the retrieval first: rerank down to the few chunks that matter.
Recap
Composition bugs are visible in the code: a cache breakpoint placed after per-request RAG chunks caches nothing reusable; a streaming loop with no tool_use branch resolves the UI while the model waits for a tool result; an agent loop with no step/budget gate resends a growing transcript forever; and an unreranked top-20 retrieval injects ~16k uncacheable tokens at ~$0.054 a call. The fix in every case is structural — relocate the dynamic content, branch on the stop reason, gate the loop, rerank the context — not a tuning knob.