AI / LLM Integration AI · 08 · 10

Capstone: ship a composed LLM feature that holds at the seams

Capstone — build one production LLM feature that composes prompt caching, tool calls, RAG, and streaming under an enforced cost budget, gated by end-to-end evals, and prove every seam holds.

AI Senior ◷ 240 min

Level

FoundationsJuniorMiddleSenior

Every earlier unit gave you one layer that passed its own test. The capstone is the part that breaks in production: putting all of them in one request path and proving the seams hold. Build a real RAG-backed assistant that caches, calls tools, streams, stays inside a hard budget — and let an end-to-end eval suite tell you it actually works.

Goal

Turn the track’s mental model into a shippable feature: compose caching + RAG + tools + streaming under an enforced budget, then verify at every seam with measurements and end-to-end evals — not per-component tests.

Project

0 of 7

Objective

Build a production-grade RAG-backed assistant for a real corpus (your docs, a support KB, or a policy set) that composes prompt caching, tool calls, RAG, and streaming under an enforced per-conversation cost budget, gated by end-to-end evals — and prove each seam holds with before/after numbers.

Requirements

Acceptance criteria

A seam table with before/after numbers: cache hit rate (target ≥70% on the static prefix under varied queries), cost per conversation (under the enforced budget), streamed time-to-first-token, and end-to-end answer/retrieval scores — all measured, not estimated.
A demonstrated runaway-prevention test: feed an input that triggers a repeat-call loop and show the budget gate refuses the next call and returns an error instead of spending unbounded dollars.
A demonstrated stream/tool test: a question that triggers a tool mid-turn streams text, runs the tool, and resumes to a complete answer with the spinner resolving correctly — and a truncated/orphaned tool_use is detected and recovered, not left dangling.
A demonstrated eval gate: intentionally degrade retrieval (e.g. drop the reranker or shrink top_k) and show the end-to-end suite goes red in CI while a generation-only suite on frozen context would have stayed green.
A one-page write-up tracing one real request through all four layers, naming what each boundary assumes and how your composition prevents the upstream layer from violating it.

Senior stretch

Add an on-call runbook: the four seam symptoms (cache hit rate ≈ 0, spinner hang on tool_use, runaway loop, green-evals-but-worse-answers), the trace-one-request triage, and the structural fix for each.
Add prompt-injection defence at the RAG seam: treat retrieved chunks as untrusted data, fence them from instructions, and add an eval case where a poisoned chunk tries to override the system prompt — show it's contained.
Add a budget-aware gateway in front of the agent that tracks spend per conversation across requests and surfaces a remaining-budget header, so the client can degrade gracefully near the cap.
Run a small A/B on context size: compare answer quality and cost at top_k 3 vs 10 vs 20 and show that reranking to fewer chunks improves both cost and faithfulness, not just cost.

Recap

This is the build you’ll repeat for every real LLM feature: compose the layers in one request path, lay the cache around the dynamic content, treat the stream as a state machine, enforce the budget with a hard gate, rerank retrieval, and gate deploys with end-to-end evals that include live retrieval. Then prove each seam holds with before/after numbers and a one-request trace. Six green components are not a working system — a composition whose seams hold is. Building it once on a real corpus makes the production version muscle memory.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.