AI / LLM Integration
Capstone: ship a composed LLM feature that holds at the seams
Every earlier unit gave you one layer that passed its own test. The capstone covers the part that breaks in production: putting all of those layers in one request path and proving the seams hold. Build a real RAG-backed assistant that caches, calls tools, streams, and stays inside a hard budget, then let an end-to-end eval suite tell you whether it actually works.
Turn the track’s mental model into a shippable feature: compose caching + RAG + tools + streaming under an enforced budget, then verify at every seam with measurements and end-to-end evals — not per-component tests.
Build a production-grade RAG-backed assistant for a real corpus (your docs, a support KB, or a policy set) that composes prompt caching, tool calls, RAG, and streaming under an enforced per-conversation cost budget, gated by end-to-end evals — and prove each seam holds with before/after numbers.
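The enforced budget is the layer most often left as a dashboard alert rather than a gate. Here is a minimal sketch of a hard gate checked before every model call, assuming costs are tracked in integer micro-dollars and each call has a known worst-case cost. The names `BudgetGate`, `BudgetExceeded`, `run_agent_loop`, and the `call_model` callback are hypothetical and not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised instead of making a call that could overshoot the budget."""


@dataclass
class BudgetGate:
    limit_micros: int        # hard cap for one conversation, in micro-dollars
    spent_micros: int = 0    # running total of actual spend

    def reserve(self, worst_case_micros: int) -> None:
        """Refuse the call up front if its worst-case cost does not fit."""
        if worst_case_micros <= 0:
            raise ValueError("worst-case cost must be positive")
        if self.spent_micros + worst_case_micros > self.limit_micros:
            raise BudgetExceeded(
                f"spent {self.spent_micros}, next call may cost "
                f"{worst_case_micros}, limit {self.limit_micros}"
            )

    def record(self, actual_micros: int) -> None:
        """Add the measured cost of a completed call."""
        self.spent_micros += actual_micros

    @property
    def remaining_micros(self) -> int:
        return max(0, self.limit_micros - self.spent_micros)


def run_agent_loop(
    gate: BudgetGate,
    call_model: Callable[[], Tuple[int, bool]],  # returns (cost, wants_another_call)
    worst_case_micros: int,
) -> int:
    """Call the model until it stops asking for more or the gate refuses.

    Returns the number of calls made. A runaway loop ends with
    BudgetExceeded rather than with unbounded spend.
    """
    calls = 0
    while True:
        gate.reserve(worst_case_micros)   # check BEFORE spending
        cost, wants_another = call_model()
        gate.record(cost)
        calls += 1
        if not wants_another:
            return calls
```

The check runs before the call rather than after it, so the worst case is a refused request, never an overshoot. `remaining_micros` is the value a gateway could surface to the client as a remaining-budget header.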
- A seam table with before/after numbers: cache hit rate (target ≥70% on the static prefix under varied queries), cost per conversation (under the enforced budget), streamed time-to-first-token, and end-to-end answer/retrieval scores — all measured, not estimated.
- A demonstrated runaway-prevention test: feed an input that triggers a repeat-call loop and show the budget gate refuses the next call and returns an error instead of spending unbounded dollars.
- A demonstrated stream/tool test: a question that triggers a tool mid-turn streams text, runs the tool, and resumes to a complete answer with the spinner resolving correctly — and a truncated/orphaned tool_use is detected and recovered, not left dangling.
- A demonstrated eval gate: intentionally degrade retrieval (e.g. drop the reranker or shrink top_k) and show the end-to-end suite goes red in CI while a generation-only suite on frozen context would have stayed green.
- A one-page write-up tracing one real request through all four layers, naming what each boundary assumes and how your composition prevents the upstream layer from violating it.
- Add an on-call runbook: the four seam symptoms (cache hit rate ≈ 0, spinner hang on tool_use, runaway loop, green-evals-but-worse-answers), the trace-one-request triage, and the structural fix for each.
- Add prompt-injection defence at the RAG seam: treat retrieved chunks as untrusted data, fence them from instructions, and add an eval case where a poisoned chunk tries to override the system prompt — show it's contained.
- Add a budget-aware gateway in front of the agent that tracks spend per conversation across requests and surfaces a remaining-budget header, so the client can degrade gracefully near the cap.
- Run a small A/B on context size: compare answer quality and cost at top_k 3 vs 10 vs 20 and show that reranking to fewer chunks improves both cost and faithfulness, not just cost.
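The seam table in the list above needs numbers derived from logged usage rather than estimates. One possible way to compute a token-weighted cache hit rate on the static prefix and turn it into a pass/fail row is sketched below. The `RequestUsage` fields and both function names are illustrative assumptions; map them to whatever usage counters your provider actually reports.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestUsage:
    prefix_tokens: int      # tokens in the static prefix sent with this request
    cache_read_tokens: int  # how many of those were served from the cache


def cache_hit_rate(usages: Iterable[RequestUsage]) -> float:
    """Token-weighted hit rate: cached prefix tokens / all prefix tokens."""
    usages = list(usages)
    total = sum(u.prefix_tokens for u in usages)
    if total == 0:
        return 0.0
    return sum(u.cache_read_tokens for u in usages) / total


def seam_row(name: str, before: float, after: float, target: float,
             higher_is_better: bool = True) -> dict:
    """One row of the seam table, with an explicit pass/fail against the target."""
    passed = after >= target if higher_is_better else after <= target
    return {"seam": name, "before": before, "after": after,
            "target": target, "pass": passed}


# Example: the first request writes the cache, the next three read it.
usages = [
    RequestUsage(prefix_tokens=1000, cache_read_tokens=0),
    RequestUsage(prefix_tokens=1000, cache_read_tokens=1000),
    RequestUsage(prefix_tokens=1000, cache_read_tokens=1000),
    RequestUsage(prefix_tokens=1000, cache_read_tokens=1000),
]
row = seam_row("cache hit rate", before=0.0,
               after=cache_hit_rate(usages), target=0.70)
```

The same `seam_row` shape works for cost per conversation or time-to-first-token by passing `higher_is_better=False`.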
This is the build you’ll repeat for every real LLM feature: compose the layers in one request path, lay out the prompt so the cached static prefix comes before the dynamic content, treat the stream as a state machine, enforce the budget with a hard gate, rerank retrieval, and gate deploys with end-to-end evals that include live retrieval. Then prove each seam holds with before/after numbers and a one-request trace. Six green components are not a working system; a composition whose seams hold is. Building it once on a real corpus makes the production version muscle memory.
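Treating the stream as a state machine can be sketched as a fold over events that tracks which tool blocks are open. The event shapes below (`text_delta`, `tool_use_start`, `tool_input_delta`, `tool_use_stop`, `message_stop`) are simplified, hypothetical stand-ins and not any provider's actual wire format. The sketch only detects a truncated or orphaned tool_use; the recovery policy (retry the turn, discard the partial call) is left to the caller.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """What one streamed turn has produced so far."""
    text: list[str] = field(default_factory=list)
    open_tools: dict[int, list[str]] = field(default_factory=dict)  # index -> JSON fragments
    tool_calls: list[dict] = field(default_factory=list)            # fully parsed tool inputs
    finished: bool = False


def consume(events):
    """Fold events into a StreamState and return (state, problems)."""
    state, problems = StreamState(), []
    for ev in events:
        kind = ev["type"]
        if kind == "text_delta":
            state.text.append(ev["text"])
        elif kind == "tool_use_start":
            state.open_tools[ev["index"]] = []
        elif kind == "tool_input_delta":
            if ev["index"] in state.open_tools:
                state.open_tools[ev["index"]].append(ev["partial_json"])
            else:
                problems.append(f"delta for unopened tool block {ev['index']}")
        elif kind == "tool_use_stop":
            fragments = state.open_tools.pop(ev["index"], None)
            if fragments is None:
                problems.append(f"stop for unopened tool block {ev['index']}")
                continue
            try:
                # An empty input is treated as an empty JSON object.
                state.tool_calls.append(json.loads("".join(fragments) or "{}"))
            except json.JSONDecodeError:
                problems.append(f"tool block {ev['index']} has truncated JSON input")
        elif kind == "message_stop":
            state.finished = True
    # Anything still open when the events run out is an orphaned tool_use.
    for index in state.open_tools:
        problems.append(f"tool block {index} never closed (orphaned tool_use)")
    if not state.finished:
        problems.append("stream ended without message_stop")
    return state, problems
```

A UI spinner tied to `state.open_tools` being non-empty resolves only when every tool block closes, and a non-empty `problems` list is the signal to recover instead of leaving the turn dangling.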