
Composing LLM apps: code and trace reading

Read real composition snippets — a cache layout, a streaming/tool state machine, a budget-gated agent loop, and a RAG retrieve+inject path — and pick the highest-leverage fix.

AI Senior ◷ 14 min


The seams show up in code, not in slides. Read each snippet the way you’d read a PR for a production LLM feature: find the boundary where one layer quietly violates the next layer’s assumption, and pick the fix a senior engineer makes first.

Goal

Practise the loop you run when composing layers: read the request layout or loop, find which boundary breaks under real traffic, and reach for the structural fix — not a tuning knob.

Snippet 1 — the cache layout

system = [
    {"type": "text", "text": SYSTEM_RULES},          # static
    {"type": "text", "text": tool_schemas_json},      # static
    {"type": "text", "text": "Relevant docs:\n" + retrieved_chunks,
     "cache_control": {"type": "ephemeral"}},          # per-request RAG, then breakpoint
]
resp = client.messages.create(model=MODEL, system=system,
                              messages=[{"role": "user", "content": query}])

Quiz

The breakpoint is placed after the retrieved chunks. What does this layout actually cache, and how do you fix it?
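One way to apply the structural fix, sketched under assumptions: the cached prefix is everything up to the breakpoint, so with the per-request chunks inside it, each request writes an entry the next request (with different chunks) is unlikely to reuse. The sketch moves the breakpoint to the last static block and moves the chunks into the user turn. `build_request`, the placeholder constants, and the "Question:" framing are illustrative and not part of the original snippet.

```python
# Placeholders standing in for the snippet's static content (assumed values).
SYSTEM_RULES = "You are a support assistant. Answer only from the provided docs."
TOOL_SCHEMAS_JSON = '{"tools": []}'


def build_request(query: str, retrieved_chunks: str) -> dict:
    """Static prefix first, breakpoint on its last block, dynamic content after."""
    system = [
        {"type": "text", "text": SYSTEM_RULES},            # static
        {"type": "text", "text": TOOL_SCHEMAS_JSON,        # static
         "cache_control": {"type": "ephemeral"}},          # breakpoint closes the static prefix
    ]
    # Per-request RAG content now sits after the breakpoint, in the user turn,
    # so it no longer changes the bytes of the cached prefix.
    user_text = "Relevant docs:\n" + retrieved_chunks + "\n\nQuestion: " + query
    return {
        "system": system,
        "messages": [{"role": "user", "content": user_text}],
    }
```

The returned dict can be unpacked into the call from the snippet, for example `client.messages.create(model=MODEL, **build_request(query, retrieved_chunks))`, along with whatever other parameters that call needs. The property to check is that two requests with different chunks produce an identical `system` list.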

Snippet 2 — the streaming / tool state machine

async for event in stream:
    if event.type == "content_block_delta":
        ui.append(event.delta.text)
    elif event.type == "message_stop":
        ui.done()                       # spinner resolves here
# (no other branches)

Quiz

A tool-using turn streams text, then ends with stop_reason: tool_use. Trace this loop's behaviour and pick the fix.
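A minimal sketch of the branch the loop is missing, assuming the stream emits a `message_delta` event that carries `stop_reason` before `message_stop`, and that tool arguments arrive as `input_json_delta` deltas with no `.text` field. `FakeUI`, its `awaiting_tool` method, `consume`, and `fake_tool_turn` are hypothetical names used only to make the sketch runnable without a network call.

```python
import asyncio
from types import SimpleNamespace as NS


class FakeUI:
    """Stand-in for the snippet's `ui`; records what the loop did."""
    def __init__(self):
        self.text, self.state = "", "streaming"

    def append(self, chunk):
        self.text += chunk

    def done(self):
        self.state = "done"

    def awaiting_tool(self):
        self.state = "awaiting_tool"


async def consume(stream, ui):
    """Drive the UI from stream events and return the turn's stop reason."""
    stop_reason = None
    async for event in stream:
        if event.type == "content_block_delta":
            # Only text deltas are user-facing; tool-argument deltas are skipped.
            if event.delta.type == "text_delta":
                ui.append(event.delta.text)
        elif event.type == "message_delta":
            stop_reason = getattr(event.delta, "stop_reason", None) or stop_reason
        elif event.type == "message_stop":
            if stop_reason == "tool_use":
                ui.awaiting_tool()      # keep the spinner: the turn is not finished
            else:
                ui.done()
    return stop_reason


async def fake_tool_turn():
    """Simulated turn: some text, partial tool arguments, then stop_reason tool_use."""
    yield NS(type="content_block_delta", delta=NS(type="text_delta", text="Checking... "))
    yield NS(type="content_block_delta", delta=NS(type="input_json_delta", partial_json='{"id":'))
    yield NS(type="message_delta", delta=NS(stop_reason="tool_use"))
    yield NS(type="message_stop")
```

The caller can then branch on the returned stop reason: on `tool_use` it runs the tool and starts the next request, and only a non-tool stop reason resolves the spinner.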

Snippet 3 — the agent loop

def run_agent(task):
    msgs = [{"role": "user", "content": task}]
    while True:
        resp = client.messages.create(model=MODEL, messages=msgs, tools=TOOLS)
        if resp.stop_reason != "tool_use":
            return resp
        msgs.append({"role": "assistant", "content": resp.content})   # keep the tool call in the transcript
        for call in tool_uses(resp):
            msgs.append(tool_result(call, execute(call)))   # transcript grows each loop

Quiz

A malformed tool result makes the model rephrase and retry the same call forever. What does this loop cost, and what is the minimal fix?
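Because every iteration resends the whole transcript, input tokens per call grow with each step and the cumulative spend grows roughly quadratically with the number of steps. A minimal gate might look like the sketch below. It is written against injected `call_model` and `execute` callables with an assumed dict response shape (`stop_reason`, `content`, `tool_calls`, `tokens_used`) so it runs without a client; those names, `BudgetExceeded`, and the default limits are illustrative assumptions.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the loop hits its step or token limit without a final answer."""


def run_agent(task, call_model, execute, max_steps=8, max_total_tokens=200_000):
    """Agent loop with a hard step cap and a cumulative token budget.

    call_model(msgs) is assumed to return a dict with the keys
    'stop_reason', 'content', 'tool_calls', and 'tokens_used'.
    """
    msgs = [{"role": "user", "content": task}]
    spent = 0
    for _ in range(max_steps):                      # replaces `while True`
        resp = call_model(msgs)
        spent += resp["tokens_used"]
        if resp["stop_reason"] != "tool_use":
            return resp
        if spent >= max_total_tokens:
            raise BudgetExceeded(f"spent {spent} tokens without a final answer")
        msgs.append({"role": "assistant", "content": resp["content"]})
        for call in resp["tool_calls"]:
            msgs.append({"role": "user", "content": execute(call)})
    raise BudgetExceeded(f"no final answer after {max_steps} steps")
```

Raising instead of returning a partial response makes the failure visible to the caller, which can then log the transcript, surface an error, or fall back to a non-agent path.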

Snippet 4 — the RAG retrieve + cost calc

hits = vectordb.query(embed(user_q), top_k=20)        # no rerank, no threshold
context = "\n".join(h.text for h in hits)             # ~20 chunks × ~800 tok ≈ 16k tok
# input ≈ 16k context + 2k prompt = 18k tok; price $3 / 1M input tok
cost_per_call = 18_000 / 1_000_000 * 3                # ≈ $0.054, every request, uncached

Quiz

This retrieve-and-inject path is correct in isolation but expensive in the composed app. Read the numbers and pick the highest-leverage change.
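To make the leverage concrete, here is a sketch of reranking with a score threshold plus the same cost arithmetic as the snippet. The scorer is passed in as a callable; `overlap_score` is a toy lexical stand-in for a real reranker, and `Hit`, `select_context`, `input_cost`, the `keep=4` cutoff, and the `min_score` value are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    text: str


def select_context(query: str, hits: List[Hit],
                   score: Callable[[str, str], float],
                   keep: int = 4, min_score: float = 0.2) -> List[Hit]:
    """Rerank hits with `score`, drop those under the threshold, keep the best few."""
    scored = [(score(query, h.text), h) for h in hits]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in scored[:keep]]


def input_cost(context_tokens: int, prompt_tokens: int = 2_000,
               usd_per_mtok: float = 3.0) -> float:
    """Input cost in USD for one call, using the snippet's price and prompt size."""
    return (context_tokens + prompt_tokens) / 1_000_000 * usd_per_mtok


def overlap_score(query: str, text: str) -> float:
    """Toy scorer: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

With the snippet's numbers, 20 chunks of about 800 tokens give `input_cost(16_000)`, roughly $0.054 per call. Keeping 4 chunks gives `input_cost(3_200)`, roughly $0.0156, which is a bit under a third of the original cost before any caching of the static prompt is considered.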

Recap

Composition bugs are visible in the code: a cache breakpoint placed after per-request RAG chunks caches nothing reusable; a streaming loop with no tool_use branch resolves the UI while the model waits for a tool result; an agent loop with no step/budget gate resends a growing transcript forever; and an unreranked top-20 retrieval injects ~16k uncacheable tokens at ~$0.054 a call. The fix in every case is structural — relocate the dynamic content, branch on the stop reason, gate the loop, rerank the context — not a tuning knob. Now when you read a PR that touches caching, streaming, or an agent loop, the first thing you scan is the seam: what does the next layer assume, and does this diff break that assumption?
