AI / LLM Integration
RAG architecture: free-recall review
Retrieval practice beats re-reading. For each prompt, reconstruct a full answer from memory before you open the model answer; the effort of recall is what makes the RAG pipeline stick as a mental model.
Rebuild the unit’s spine from memory — why retrieval (not generation) dominates failures, the chunking knife-edge, embedding cost, the two-stage retrieve-then-rerank pattern, context ordering, and the abstain gate.
- 01 Why is retrieval, not generation, the dominant failure mode in production RAG, and what does a retrieval miss actually do?
- 02 Explain the chunking size-vs-recall knife-edge and the role of overlap.
- 03 How is embedding dimensionality a cost lever, and what's the Matryoshka trade?
- 04 Describe the two-stage retrieve-wide-then-rerank-narrow pattern, and why one embedding top-k isn't enough.
- 05 What is 'lost in the middle', and how should you assemble the final context because of it?
- 06 What is the confident-hallucination failure mode, and how do you defend against it (including stale and poisoned indexes)?
If you reconstructed each answer from memory, you hold the unit’s spine:
- Retrieval, not generation, is where production RAG fails.
- Chunking sets the ceiling: size chunks to the answer-bearing unit, with ~10–15% overlap.
- Embedding dimensionality is a truncatable cost lever.
- The recall-vs-precision split is solved by retrieve-wide-then-rerank-narrow.
- Context ordering must dodge lost-in-the-middle by putting the best evidence at the edges.
- The confident-hallucination failure, made worse by stale and poisoned indexes, is defended by a score gate, a freshness pipeline, and an instruction to answer only from context or abstain.
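The mechanical parts of that summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the function names (`chunk_with_overlap`, `truncate_embedding`, `order_for_edges`, `select_context`), the 12.5% default overlap, and the `min_score` threshold are all hypothetical choices made for the example. Truncating an embedding this way is only sound for models trained Matryoshka-style, and the rerank scores are assumed to come from whatever reranker you already use.

```python
import math


def chunk_with_overlap(tokens, size, overlap_frac=0.125):
    """Split a token list into windows of `size` that share ~overlap_frac of a window."""
    if size <= 0 or not 0 <= overlap_frac < 1:
        raise ValueError("size must be positive and overlap_frac in [0, 1)")
    step = size - int(size * overlap_frac)  # advance by the non-overlapping part
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reached the end
            break
    return chunks


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and L2-renormalize.

    Assumption: the embedding model was trained so that prefixes stay useful
    (Matryoshka-style). For other models this can badly hurt recall.
    """
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def order_for_edges(ranked):
    """Given passages ranked best-first, put the strongest at the start and end.

    Even ranks fill the front, odd ranks fill the back in reverse, so the
    weakest passages end up in the middle of the context.
    """
    front = ranked[0::2]
    back = ranked[1::2]
    return front + back[::-1]


def select_context(reranked, min_score, top_n):
    """Apply a score gate to reranked (score, text) pairs sorted best-first.

    Returns the edge-ordered texts, or None to signal "abstain".
    `min_score` is a hypothetical threshold that must be tuned per reranker.
    """
    kept = [text for score, text in reranked[:top_n] if score >= min_score]
    if not kept:
        return None  # nothing cleared the gate: abstain instead of guessing
    return order_for_edges(kept)
```

For five passages ranked best-first as `a` to `e`, `order_for_edges` returns `a, c, e, d, b`: the top two sit at the two edges and the weakest lands in the middle. When `select_context` returns `None`, the caller should return an abstention rather than prompt the model with weak evidence.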