AI / LLM Integration
RAG architecture: the pipeline that fails at retrieval, not generation
A support bot ships on a doc corpus. A user asks “what’s our Q3 churn?” The retriever pulls a chunk with Q2 churn and another with Q3 revenue — close, but not the answer. The model does not say “I can’t find Q3 churn.” It blends the two and emits a confident, specific, wrong number. Nobody catches it for a month because the answer looks right. The generation step was flawless. The retrieval step missed, and a missed retrieval doesn’t fail loudly — it hallucinates fluently.
The pipeline, end to end
RAG (retrieval-augmented generation) is a fixed sequence of stages, and most production pain lives in the early ones. At index time you chunk documents into passages, embed each chunk into a vector (a fixed-length array of floats), and store those vectors in a vector store with an approximate-nearest-neighbour (ANN) index. At query time you embed the question, run top-k retrieval to pull the k closest chunks, optionally rerank them with a more accurate model, fit the survivors into the context-window budget, assemble the prompt, and let the LLM generate.
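To make the stages concrete, here is a minimal sketch of index-time and query-time retrieval. It assumes a toy bag-of-words `embed` function standing in for a real embedding model and a brute-force scan standing in for an ANN index; the chunk texts and figures are invented for illustration. It reproduces the near-miss from the opening: both retrieved chunks look related, yet neither contains Q3 churn.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Index time: embed every chunk once and store it next to its text."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[tuple[float, str]]:
    """Query time: embed the question, score every chunk, keep the k closest."""
    q = embed(question)
    scored = sorted(((cosine(q, vec), chunk) for chunk, vec in index), reverse=True)
    return scored[:k]

# Invented corpus: note that no chunk states Q3 churn.
chunks = [
    "Q2 churn was 4.1 percent.",
    "Q3 revenue grew 12 percent.",
    "Offices close on public holidays.",
]
index = build_index(chunks)
hits = retrieve(index, "What is our Q3 churn?", k=2)
for score, chunk in hits:
    print(f"{score:.2f}  {chunk}")
```

The retriever returns something plausible either way; nothing in this path signals that the answer is missing.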
Two facts reframe everything below. First: the generator can only be as good as what retrieval hands it — garbage chunks in, confident garbage out. Second: in production RAG, the dominant failure is retrieval, not generation. Industry write-ups put the majority of bad answers on the retrieval side: the wrong chunks were fetched, or the right chunk was never indexed. So the engineering effort that pays off is mostly upstream of the model.
Chunking: the size-vs-recall knife-edge
Chunking is the decision that quietly caps your ceiling. Embed a chunk and you compress its whole meaning into one vector; the chunk is the atomic unit retrieval can ever return. Get it wrong and no reranker recovers.
The tradeoff is sharp. Small chunks (say 128–256 tokens) embed precisely — one vector, one tight idea — so similarity search targets well. But they fragment any rule that spans a boundary: the condition lands in chunk 7, the exception in chunk 8, and a top-k that grabs only one returns a half-truth. Large chunks (800–1000+ tokens) keep context intact but dilute the embedding — one vector now averages several ideas, so the signal for the specific sub-fact you need gets washed out, hurting recall for narrow queries. A common production starting point is 300–800 tokens for prose with 10–15% overlap, then tuned on an eval set.
Overlap is the cheap insurance against boundary loss: repeat the last ~10–15% of each chunk at the start of the next so a fact straddling the seam survives in at least one chunk. Too much overlap and the index fills with near-duplicates — the same passage retrieved three times, eating top-k slots and context budget.
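A sliding-window chunker with overlap can be sketched as follows. It is a minimal illustration that assumes the text is already tokenised into a list (real pipelines would use the embedding model's tokenizer); `chunk_tokens` and its defaults are hypothetical names and values, with 50 of 400 tokens giving 12.5% overlap.

```python
def chunk_tokens(tokens: list[str], size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each repeating the last `overlap`
    tokens of the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Stand-in document of 1000 tokens.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, size=400, overlap=50)
print([len(c) for c in chunks])
```

A fact that straddles token 399/400 falls entirely inside the second window only if it is shorter than the overlap, which is one reason to size the overlap to the longest rule you expect to keep whole.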
| Knob | Turn it up → | Turn it down → | What a senior engineer weighs |
|---|---|---|---|
| Chunk size | Context intact, but diluted embedding hurts narrow-query recall | Precise embedding, but rules split across boundaries | Match chunk to the answer-bearing unit; tune on evals |
| Overlap | Boundary facts survive; index bloats with duplicates | Lean index; facts on the seam get lost | ~10–15% for prose; less for FAQs |
| Top-k | Higher recall; context bloat, cost, and lost-in-the-middle | Cheap, focused; one miss = no answer | Retrieve wide (k=20–50), rerank down to 3–8 |
| Embedding dims | Better accuracy; larger index, slower ANN, more storage | Faster, cheaper search; some accuracy lost | 1024 dims often ~matches 3072 at 1/3 storage |
Embedding and the vector store
An embedding model maps text to a vector whose geometry encodes meaning: close vectors mean similar meaning. Dimensionality is a real cost lever, not a detail. OpenAI text-embedding-3-small defaults to 1536 dims; text-embedding-3-large to 3072. Bigger vectors generally retrieve more accurately but inflate the index and slow ANN search. As a rough illustration rather than a benchmark, dropping from ~1536 to ~768 dims can cut search latency from around 50ms toward 20ms; the real numbers depend on the index, the hardware, and the corpus size. With Matryoshka-style truncation (the text-embedding-3 models accept a dimensions request parameter for this) you can run text-embedding-3-large at 1024 dims and land near 3072-dim quality at one-third the storage (≈4KB vs ≈12KB per vector, at 4 bytes per float32 dimension), a senior’s default trade when storage and p99 latency matter.
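The storage arithmetic can be checked in a few lines. This sketch assumes vectors are stored as float32 (4 bytes per dimension) and ignores ANN index overhead; `truncate_and_renormalize` is a hypothetical helper showing the client-side form of truncation, which is only meaningful for models trained to support it.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as float32

def vector_bytes(dims: int) -> int:
    """Raw storage for one vector, excluding index overhead."""
    return dims * BYTES_PER_FLOAT32

def index_gib(num_vectors: int, dims: int) -> float:
    """Raw vector storage for a whole corpus, in GiB."""
    return num_vectors * vector_bytes(dims) / 2**30

def truncate_and_renormalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length,
    so cosine and dot-product scores stay comparable."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

print(vector_bytes(1024), vector_bytes(3072))
print(round(index_gib(1_000_000, 1024), 2), round(index_gib(1_000_000, 3072), 2))
short = truncate_and_renormalize([3.0, 4.0, 12.0], dims=2)
```

At a million chunks the gap is roughly 3.8 GiB versus 11.4 GiB of raw vectors, before the ANN graph adds its own overhead.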
The vector store doesn’t scan every vector — at million-scale that’s too slow. It uses an ANN index (HNSW is the common choice) that trades a sliver of recall for a huge speedup, answering nearest-neighbour queries in single-digit to low-tens of milliseconds. “Approximate” is the word that bites: the index can silently miss the true nearest neighbour, so the chunk that holds the answer exists but never gets fetched. That miss is invisible — it looks identical to “the answer isn’t in the corpus.”
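Because an ANN miss is invisible at query time, one way to surface it offline is to compare the index's results with an exact brute-force search on a sample of queries. The sketch below assumes unit-length vectors held in a plain dict and a hard-coded `approx` list standing in for whatever your ANN index returned; all names and values are illustrative.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Ground truth: score every vector, no approximation."""
    ranked = sorted(vectors, key=lambda vid: dot(query, vectors[vid]), reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the true top-k that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
query = [1.0, 0.0]
truth = exact_top_k(query, vectors, k=2)
approx = ["a", "c"]  # pretend ANN output that silently missed "b"
recall = recall_at_k(approx, truth)
print(truth, recall)
```

Tracking this number over a fixed query sample gives a signal that index parameters need tuning, which the live system cannot give you.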
Top-k, reranking, and the context budget
Top-k retrieval is a recall-vs-noise dial. A small k (3) is cheap and focused but unforgiving: one retrieval miss and the answer simply isn’t in the prompt. A large k (50) almost guarantees the right chunk is somewhere in the set, but if only about five of those chunks are relevant, you’ve stuffed the context with 45 irrelevant passages that cost tokens, money, latency, and, worse, distraction.
The senior pattern is two-stage: retrieve wide, rerank narrow. Stage one (the embedding ANN search) optimises recall — cast a wide net, k=20–50 candidates, cheaply. Stage two runs a cross-encoder reranker that reads the query and each candidate together (not as independent vectors) and scores relevance far more accurately, then keeps the top 3–8. Rerankers are slower per item — that’s why you only run them on the shortlist, not the whole corpus.
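The two-stage shape can be sketched with pluggable scorers. This is a toy under stated assumptions: `word_overlap` stands in for the cheap embedding search and `phrase_match` for a cross-encoder that reads query and chunk together; neither is a real model, and the corpus is invented.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    wide_k: int = 20,
    keep: int = 3,
) -> list[str]:
    """Stage 1: shortlist wide_k candidates with the cheap scorer.
    Stage 2: rescore only that shortlist with the precise scorer."""
    shortlist = sorted(corpus, key=lambda c: cheap_score(query, c), reverse=True)[:wide_k]
    reranked = sorted(shortlist, key=lambda c: precise_score(query, c), reverse=True)
    return reranked[:keep]

def word_overlap(query: str, chunk: str) -> float:
    """Cheap stand-in for embedding similarity: shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))

def phrase_match(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: rewards the exact query phrase."""
    return 1.0 if query.lower() in chunk.lower() else 0.0

corpus = [
    "Churn in Q2 was flat",
    "Q3 revenue and churn targets",
    "Q3 churn rose slightly",
    "Holiday schedule",
]
cheap_only = sorted(corpus, key=lambda c: word_overlap("Q3 churn", c), reverse=True)[:1]
best = retrieve_then_rerank("Q3 churn", corpus, word_overlap, phrase_match, wide_k=3, keep=1)
print(cheap_only, best)
```

The cheap scorer alone ranks the revenue chunk first because it shares both words; the second stage only fixes that because the wide first stage kept the right chunk in the shortlist.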
Whatever survives must fit the context-window budget: the model’s window is finite and shared with the system prompt, the question, and the generation. Even with a 128k-token window, more retrieved text is not free or even neutral — which is where ordering bites.
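Fitting survivors into the budget is simple arithmetic: subtract what the system prompt, question, and expected generation need, then pack ranked chunks into what is left. The sketch below assumes a whitespace word count as a stand-in for a real tokenizer; `fit_to_budget` and the numbers are hypothetical.

```python
from typing import Callable

def fit_to_budget(
    ranked_chunks: list[str],
    window_tokens: int,
    reserved_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[str]:
    """Greedily keep chunks in rank order while they fit the remaining budget.
    reserved_tokens covers system prompt + question + room for the answer."""
    budget = window_tokens - reserved_tokens
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # too big for what is left; a smaller, lower-ranked chunk may still fit
        kept.append(chunk)
        used += cost
    return kept

ranked = [
    "one two three four five",  # 5 "tokens"
    "six seven eight nine",     # 4
    "ten eleven twelve",        # 3
]
kept = fit_to_budget(ranked, window_tokens=20, reserved_tokens=12)
print(kept)
```

Skipping rather than stopping at the first chunk that does not fit is a design choice; stopping keeps strict rank order at the cost of unused budget.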
Why this works
“Lost in the middle”: LLMs show a U-shaped accuracy curve over context position. They use evidence best when it sits at the start or end of the context and worst when it sits in the middle, even when the middle passage is the most relevant one. When the answer-bearing chunk lands in the middle of a long stuffed prompt, measured accuracy can drop by 30%+ versus the same chunk at the edges. The practical move: don’t dump 50 chunks in retrieval order. Rerank, keep few, and place the strongest evidence at the very start or very end of the assembled context.
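One simple way to place the strongest evidence at the edges is to deal the reranked chunks alternately to the front and the back of the context. This is a sketch of one possible ordering, not a standard algorithm; `order_for_edges` is a hypothetical helper and assumes its input is sorted best-first.

```python
def order_for_edges(ranked: list[str]) -> list[str]:
    """Given chunks sorted best-first, return an order where the best chunk is
    first, the second best is last, and the weakest end up in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the strongest evidence
ordered = order_for_edges(ranked)
print(ordered)
```

With five chunks the two strongest sit at positions one and five, and the weakest lands in the middle, where a miss costs least.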
The failure mode that defines RAG in production
The dangerous failure is not a crash; it’s a confident wrong answer. When retrieval misses (an ANN miss, bad chunking, a fact that was never indexed, or one that was indexed but is now stale), the model still receives some context and is trained to be helpful, so it extrapolates from whatever is near and emits a fluent, specific, wrong answer instead of “I don’t know.” The Q3-churn bug from the opening is exactly this: nearby-but-wrong chunks, blended with confidence.
Two cousins make it worse. Stale index: the source changed, the embeddings didn’t — a query that worked last Tuesday returns last quarter’s policy today, with zero code changes and no error. Poisoned index: an attacker (or a careless ingest) plants a malicious or contradictory chunk; research shows a single injected passage can flip an answer, and the model won’t flag the contradiction — it picks one and presents it as settled fact. The mitigations are all about admitting ignorance: gate on the retrieval similarity score, instruct the model to answer only from provided context and to say “I don’t know” when nothing clears the bar, and keep the index fresh and access-controlled.
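The score gate and the answer-only-from-context instruction can be sketched together. Everything here is an assumption for illustration: the 0.75 threshold is arbitrary and should be calibrated on an eval set, `generate` is whatever function calls your LLM, and `fake_generate` is a stub so the example runs without one.

```python
from typing import Callable, Optional

ABSTAIN = "I don't know."

def build_prompt(question: str, hits: list[tuple[float, str]], min_score: float = 0.75) -> Optional[str]:
    """hits are (similarity, chunk) pairs. Returns None when nothing clears the bar."""
    grounded = [chunk for score, chunk in hits if score >= min_score]
    if not grounded:
        return None
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(grounded, 1))
    return (
        "Answer using only the context below. "
        f'If the context does not contain the answer, reply exactly: "{ABSTAIN}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def answer(question: str, hits: list[tuple[float, str]], generate: Callable[[str], str],
           min_score: float = 0.75) -> str:
    """Abstain before calling the model when retrieval found nothing strong enough."""
    prompt = build_prompt(question, hits, min_score)
    return ABSTAIN if prompt is None else generate(prompt)

calls: list[str] = []

def fake_generate(prompt: str) -> str:
    """Stub LLM: records the prompt it was given."""
    calls.append(prompt)
    return "stub answer"

weak = answer("What is our Q3 churn?", [(0.41, "Q2 churn ..."), (0.38, "Q3 revenue ...")], fake_generate)
strong = answer("What is our Q3 churn?", [(0.91, "Q3 churn ...")], fake_generate)
print(weak, "|", strong)
```

The gate only catches misses that show up as low scores; a stale or poisoned chunk can score high, which is why freshness and access control remain separate controls.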
A legal-doc RAG bot must never fabricate citations, and answers can lag the corpus by an hour. Pick the retrieval setup.
In production RAG, a user gets a confident but wrong answer. Where does the fault most often lie?
You retrieve 40 candidate chunks and place the most relevant one in the exact middle of the prompt. What does 'lost in the middle' predict?
Order the RAG query-time pipeline from question to answer:
1. Embed the user's question into a query vector
2. ANN top-k retrieval: pull the k nearest chunks (wide, recall-first)
3. Rerank candidates with a cross-encoder; keep the top few
4. Fit survivors into the context budget; order strongest evidence at the edges
5. Assemble the prompt and let the LLM generate (grounded, or abstain)
- 01 Walk through why retrieval, not generation, is the dominant failure mode in production RAG, and what a retrieval miss actually does.
- 02 Explain the two-stage retrieve-wide-then-rerank-narrow pattern, and why a single embedding top-k isn't enough.
RAG is a retrieval problem wearing a generation costume, and almost all production pain is upstream of the model. Chunking sets the ceiling: small chunks embed precisely but split rules across boundaries, large chunks keep context but dilute the embedding and hurt recall — so match the chunk to the answer-bearing unit, use ~10–15% overlap, and tune on evals. Embedding dimensionality is a real cost lever (1536 vs 3072; 1024 often ≈ 3072 at a third the storage), and the ANN index trades a sliver of recall for millisecond search — but “approximate” means it can silently miss the true neighbour. Retrieve wide (k=20–50) then rerank narrow with a cross-encoder to 3–8, fit the context budget, and place the strongest evidence at the edges to dodge lost-in-the-middle, where mid-context accuracy can fall 30%+. The defining failure is a miss that the model papers over with a confident, fluent, wrong answer; stale and poisoned indexes make it worse. The fix is to admit ignorance: gate on the retrieval score, instruct the model to answer only from context and to say “I don’t know,” and keep the index fresh and access-controlled.