AI / LLM Integration
Composing LLM apps: free-recall review
Retrieval beats re-reading. For each prompt, reconstruct a full answer from memory — across the whole track, not one layer — before you open the model answer. The effort of recall is what makes the seam-level reasoning stick.
Reconstruct the track’s spine: how caching, RAG, streaming, tool calls, agent loops, and evals compose — and where each pair of correct layers breaks at the seam.
- 01. Why can a RAG-backed assistant with a long static system prompt still see near-zero cache hit rate in prod, and how do you fix it?
- 02. Why is a streamed turn a state machine rather than a token pipe, and what happens if you ignore that?
- 03. Why are token-budget alerts not the same as budget enforcement for an agent, and what does enforcement look like?
- 04. Why does a green offline eval suite still let a retrieval regression ship, and how do you close the gap?
- 05. State the 'bug lives in the seam' thesis and how it changes how you debug a composed LLM app.
- 06. Walk the order you'd debug a composed assistant whose cost tripled and answers feel slower after shipping.
If you can reconstruct each answer from memory, you hold the track’s spine. Caching is a byte-for-byte prefix match, so dynamic RAG context belongs after the cache breakpoint. The stream is a state machine in which tool_use is a transition and every tool-call id must be paired with its result. Agent loops need enforced step and dollar caps, not alerts. Evals must run the live retrieval path, or retrieval regressions ship green. The meta-lesson over all of it: the bug lives in the seam, so trace one real request end to end and model the whole flow.
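To make the difference between an alert and an enforced cap concrete, here is a minimal sketch of a loop that refuses to start another step once a step cap or dollar cap is reached. Every name in it (`Budget`, `BudgetExceeded`, `run_agent`, and the `step` callback that returns `(done, cost_in_dollars)`) is hypothetical and chosen for illustration. A real agent would derive the cost from the token usage its provider reports.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when the loop is about to run past a hard cap."""


@dataclass
class Budget:
    # Hypothetical budget object: hard caps plus running totals.
    max_steps: int
    max_dollars: float
    steps: int = 0
    dollars: float = 0.0

    def check(self) -> None:
        # Enforcement, not alerting: stop before the next step starts.
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step cap reached ({self.steps})")
        if self.dollars >= self.max_dollars:
            raise BudgetExceeded(f"dollar cap reached (${self.dollars:.2f})")

    def record(self, cost: float) -> None:
        # Account for a step that has just finished.
        self.steps += 1
        self.dollars += cost


def run_agent(step: Callable[[int], Tuple[bool, float]], budget: Budget) -> int:
    """Call step(i) until it reports done; return the number of steps taken.

    `step` stands in for one model call plus any tool execution and
    returns (done, cost_in_dollars). This interface is an assumption.
    """
    while True:
        budget.check()
        done, cost = step(budget.steps)
        budget.record(cost)
        if done:
            return budget.steps


if __name__ == "__main__":
    def runaway(i: int) -> Tuple[bool, float]:
        return (False, 0.25)  # never finishes, 25 cents per step

    budget = Budget(max_steps=10, max_dollars=1.0)
    try:
        run_agent(runaway, budget)
    except BudgetExceeded as exc:
        print(f"stopped after {budget.steps} steps: {exc}")
```

Because the check runs before each step and the cost is recorded after it, the final step can overshoot the dollar cap by up to one step's cost. If that matters, one option is to reserve a worst-case per-step cost up front and check against the cap minus that reserve.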