AI / LLM Integration
RAG architecture: multiple-choice review
Six questions that cut across the whole pipeline. None asks you to recite a definition; each is a decision you make while a RAG system is silently returning confident, wrong answers in production.
Confirm you can connect chunking, embedding cost, two-stage retrieval, context assembly, and the retrieval-driven failure mode into one diagnosis — the synthesis the unit built toward.
A support bot confidently reports a specific but wrong Q3 churn number; the answer survived review for a month. Where does the fault most often lie, and why does it stay invisible?
A policy doc states a rule whose exception lives one paragraph later. Queries about the exception return half-truths. Which chunking change is the right first move?
Index search p99 latency is too high at 3072 dimensions, and storage cost is climbing. What trade-off would a senior engineer make before changing infrastructure?
Why is 'retrieve wide (k=20–50) then rerank narrow to 3–8 with a cross-encoder' better than a single embedding top-3?
You have a 128k-token window, so you stuff in all 40 retrieved chunks in retrieval order, and the decisive one ends up near the middle. What does the literature predict?
A RAG bot that worked last week now returns last quarter's policy with no code change and no error. Most likely cause, and the structural fix?
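The chunking and context-ordering questions above turn on two small mechanics that are easier to see in code. The following is a minimal sketch, not a reference implementation: the function names `chunk_with_overlap` and `order_for_edges`, the paragraph-level granularity, and the default `size` and `overlap` values are all assumptions made for illustration.

```python
def chunk_with_overlap(paragraphs, size=2, overlap=1):
    """Group paragraphs into windows of `size`, repeating `overlap`
    paragraphs across each seam so a rule and the exception that
    follows it can land in the same chunk.
    (Hypothetical helper; defaults are illustrative only.)"""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append(paragraphs[start:start + size])
        if start + size >= len(paragraphs):
            break  # this window already reaches the end of the document
    return chunks


def order_for_edges(ranked):
    """Take chunks sorted best-first and alternate them between the
    front and the back of the context, so the strongest evidence sits
    at the edges and the weakest ends up in the middle.
    (Hypothetical helper.)"""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


if __name__ == "__main__":
    doc = ["rule", "exception", "unrelated"]
    print(chunk_with_overlap(doc))            # rule and exception share a chunk
    print(order_for_edges([1, 2, 3, 4, 5]))   # 1 = best-ranked chunk
```

With paragraph-level overlap, the rule and its exception appear together in at least one chunk. With edge ordering, the top-ranked chunk opens the context and the second-ranked chunk closes it. Whether paragraphs are the right unit depends on the corpus.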
The unit’s through-line is one diagnosis: retrieval, not generation, is where production RAG most often fails. Chunking sets the ceiling on answer quality: size chunks to the answer-bearing unit and overlap them so content that straddles a seam survives. Embedding dimensionality is a real cost lever: with a model trained to support it, you can truncate vectors to cut storage and search latency, usually at a small cost in recall. The recall-versus-precision tension is resolved by retrieving wide and then reranking narrow. Context ordering must avoid the lost-in-the-middle effect by placing the best evidence at the edges of the prompt. Finally, stale or poisoned indexes turn a silent retrieval miss into a confident wrong answer; the defenses are a relevance-score gate, freshness checks, and an instruction to answer only from the context or say “I don’t know.”
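The three defenses named in the summary can be combined in a few lines. This is a sketch under stated assumptions, not the unit's own implementation: the `Hit` type, the `gate` and `build_prompt` functions, the 0.5 score threshold, and the 30-day age limit are all hypothetical. Real thresholds would need tuning against labeled queries, and index age is only one possible proxy for freshness.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Hit:
    text: str
    score: float          # assumed: reranker relevance in [0, 1], higher is better
    indexed_at: datetime  # assumed: when the source was last (re)indexed


def gate(hits, now, min_score=0.5, max_age=timedelta(days=30)):
    """Keep only hits that are both relevant enough and fresh enough.
    Both thresholds are illustrative placeholders."""
    return [
        h for h in hits
        if h.score >= min_score and now - h.indexed_at <= max_age
    ]


def build_prompt(question, hits, now):
    """Return a grounded prompt, or None when nothing passes the gate,
    so the caller can reply "I don't know" without calling the model."""
    kept = gate(hits, now)
    if not kept:
        return None
    context = "\n\n".join(h.text for h in kept)
    return (
        "Answer only from the context below. "
        "If the context does not contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    hits = [
        Hit("Current refund policy text.", 0.9, now - timedelta(days=2)),
        Hit("Last quarter's refund policy text.", 0.95, now - timedelta(days=200)),
        Hit("Unrelated onboarding note.", 0.1, now),
    ]
    print(build_prompt("What is the refund policy?", hits, now))
```

The stale chunk has the highest relevance score and is still dropped. A score gate alone would not catch that failure, because an outdated document can be a near-perfect semantic match.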