
RAG architecture: the pipeline that fails at retrieval, not generation

RAG is a retrieval problem wearing a generation costume. Chunk size, top-k, reranking, and context order each move recall and cost, and a retrieval miss makes the model confidently invent the answer instead of saying it doesn't know.


A support bot ships on a doc corpus. A user asks “what’s our Q3 churn?” The retriever pulls a chunk with Q2 churn and another with Q3 revenue — close, but not the answer. The model does not say “I can’t find Q3 churn.” It blends the two and emits a confident, specific, wrong number. Nobody catches it for a month because the answer looks right. The generation step was flawless. The retrieval step missed, and a missed retrieval doesn’t fail loudly — it hallucinates fluently.

In the next few minutes you’ll see exactly where that miss originates, why the obvious fixes don’t prevent it, and which knobs actually move the needle — so when your own RAG bot starts hallucinating fluently, you know where to look first.

The pipeline, end to end

RAG (retrieval-augmented generation) is a fixed sequence of stages, and most production pain lives in the early ones. At index time you chunk documents into passages, embed each chunk into a vector (a fixed-length array of floats), and store those vectors in a vector store with an approximate-nearest-neighbour (ANN) index. At query time you embed the question, run top-k retrieval to pull the k closest chunks, optionally rerank them with a more accurate model, fit the survivors into the context-window budget, assemble the prompt, and let the LLM generate.
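The stages above can be sketched end to end in a few lines. This is a toy under loud assumptions: `embed` is a hypothetical stand-in that uses normalised word counts instead of a real embedding model, the "vector store" is a plain list scanned exhaustively rather than an ANN index, and all function names are made up for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

def build_index(chunks: list[str]) -> list[tuple[str, dict[str, float]]]:
    """Index time: embed every chunk once and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, question: str, k: int) -> list[str]:
    """Query time: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def assemble_prompt(question: str, chunks: list[str]) -> str:
    """Put the retrieved chunks and the question into one prompt string."""
    context = "\n---\n".join(chunks)
    return (
        "Answer only from the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

index = build_index([
    "q2 churn was 4 percent",
    "q3 revenue grew 10 percent",
    "office parking policy",
])
top = retrieve(index, "q3 churn", k=2)
prompt = assemble_prompt("q3 churn", top)
```

Note what the toy corpus does with "q3 churn": it returns the Q2-churn and Q3-revenue chunks, because nothing better exists. The generator would receive plausible-looking context with no answer in it.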

Two facts reframe everything below. First: the generator can only be as good as what retrieval hands it — garbage chunks in, confident garbage out. Second: in production RAG, the dominant failure is retrieval, not generation. Industry write-ups put the majority of bad answers on the retrieval side: the wrong chunks were fetched, or the right chunk was never indexed. So the engineering effort that pays off is mostly upstream of the model.

Chunking: the size-vs-recall knife-edge

Before you touch the model or the retriever, ask yourself: what is the smallest self-contained unit of meaning in your corpus? That question is chunking, and getting it wrong silently poisons every stage downstream.

Chunking is the decision that quietly caps your ceiling. Embed a chunk and you compress its whole meaning into one vector; the chunk is the atomic unit retrieval can ever return. Get it wrong and no reranker recovers.

The tradeoff is sharp. Small chunks (say 128–256 tokens) embed precisely — one vector, one tight idea — so similarity search targets well. But they fragment any rule that spans a boundary: the condition lands in chunk 7, the exception in chunk 8, and a top-k that grabs only one returns a half-truth. Large chunks (800–1000+ tokens) keep context intact but dilute the embedding — one vector now averages several ideas, so the signal for the specific sub-fact you need gets washed out, hurting recall for narrow queries. A common production starting point is 300–800 tokens for prose with 10–15% overlap, then tuned on an eval set.

Overlap is the cheap insurance against boundary loss: repeat the last ~10–15% of each chunk at the start of the next so a fact straddling the seam survives in at least one chunk. Too much overlap and the index fills with near-duplicates — the same passage retrieved three times, eating top-k slots and context budget.
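A minimal sketch of fixed-size chunking with overlap, to make the seam behaviour concrete. It assumes the text is already split into tokens (whitespace words stand in for real tokenizer output here), and `chunk_tokens` is a hypothetical helper name, not a library function.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each repeating the last
    `overlap` tokens of the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

words = "the refund applies within 30 days unless the item was opened".split()
chunks = chunk_tokens(words, size=6, overlap=1)
```

With `size=6, overlap=1` the word "days" closes the first chunk and opens the second, so a fact sitting on that seam appears in both. For a 500-token chunk, 10–15% overlap would mean `overlap` somewhere around 50–75.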

Knob	Turn it up →	Turn it down →	The senior weighs
Chunk size	Context intact, but diluted embedding hurts narrow-query recall	Precise embedding, but rules split across boundaries	Match chunk to the answer-bearing unit; tune on evals
Overlap	Boundary facts survive; index bloats with duplicates	Lean index; facts on the seam get lost	~10–15% for prose; less for FAQs
Top-k	Higher recall; context bloat, cost, and lost-in-the-middle	Cheap, focused; one miss = no answer	Retrieve wide (k=20–50), rerank down to 3–8
Embedding dims	Better accuracy; larger index, slower ANN, more storage	Faster, cheaper search; some accuracy lost	1024 dims often ~matches 3072 at 1/3 storage

Embedding and the vector store

An embedding model maps text to a vector whose geometry encodes meaning — close vectors mean similar meaning. When you pick a model and configure its dimensionality, you’re not just choosing precision; you’re setting a storage and latency budget you’ll pay on every query. Dimensionality is a real cost lever, not a detail. OpenAI text-embedding-3-small defaults to 1536 dims; text-embedding-3-large to 3072. Bigger vectors generally retrieve more accurately but inflate the index and slow ANN search: roughly, dropping from ~1536 to ~768 dims can cut search latency from around 50ms toward 20ms. With Matryoshka-style truncation you can run text-embedding-3-large at 1024 dims and land near 3072-dim quality at one-third the storage (≈4KB vs ≈12KB per vector) — a senior’s default trade when storage and p99 latency matter.
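The storage figures are simple arithmetic once you assume each dimension is stored as a 4-byte float32 (an assumption; quantised indexes store less). The sketch below also shows the usual way to shorten a Matryoshka-style embedding by hand: keep the leading dimensions and renormalise. That only preserves quality for models trained to support truncation, and the helper names are hypothetical.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

def vector_bytes(dims: int) -> int:
    """Raw storage for one vector, ignoring index overhead."""
    return dims * BYTES_PER_FLOAT32

def index_bytes(num_vectors: int, dims: int) -> int:
    """Raw storage for all vectors, ignoring ANN graph overhead."""
    return num_vectors * vector_bytes(dims)

def truncate_and_renormalise(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]

small = vector_bytes(1024)   # 4096 bytes, about 4 KB
large = vector_bytes(3072)   # 12288 bytes, about 12 KB
corpus = index_bytes(1_000_000, 3072)  # roughly 12.3 GB of raw vectors
```

At a million chunks, the gap between 1024 and 3072 dims is about 8 GB of raw vector data before any index overhead, which is why the dimension setting belongs in capacity planning rather than in a footnote.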

The vector store doesn’t scan every vector — at million-scale that’s too slow. It uses an ANN index (HNSW is the common choice) that trades a sliver of recall for a huge speedup, answering nearest-neighbour queries in single-digit to low-tens of milliseconds. “Approximate” is the word that bites: the index can silently miss the true nearest neighbour, so the chunk that holds the answer exists but never gets fetched. That miss is invisible — it looks identical to “the answer isn’t in the corpus.”
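One way to make that invisible miss visible is to measure it: on a sample of queries, compare what the ANN index returns against an exhaustive scan and compute recall@k. The sketch below assumes you can get chunk ids back from your ANN index (here just a hard-coded list) and uses a brute-force dot-product scan as ground truth. Both function names are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Ground truth: score every vector and return the ids of the k best."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: dot(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]

def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the true top-k that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
truth = exact_top_k([1.0, 0.0], vectors, k=2)

# Pretend the ANN index returned ids 0 and 2: it found one true
# neighbour and silently missed the other.
ann_result = [0, 2]
recall = recall_at_k(ann_result, truth)
```

The exhaustive scan is too slow to serve traffic, but running it offline on a few hundred sampled queries tells you how often the index drops a true neighbour at your current settings.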

Top-k, reranking, and the context budget

Top-k retrieval is a recall-vs-noise dial. A small k (3) is cheap and focused but unforgiving: one retrieval miss and the answer simply isn’t in the prompt. A large k (50) almost guarantees the right chunk is somewhere in the set — but now you’ve stuffed the context with 45 irrelevant passages that cost tokens, money, latency, and, worse, distraction.

The senior pattern is two-stage: retrieve wide, rerank narrow. Stage one (the embedding ANN search) optimises recall — cast a wide net, k=20–50 candidates, cheaply. Stage two runs a cross-encoder reranker that reads the query and each candidate together (not as independent vectors) and scores relevance far more accurately, then keeps the top 3–8. Rerankers are slower per item — that’s why you only run them on the shortlist, not the whole corpus.
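The two-stage shape fits in one function. This is a sketch, not a specific library's API: `cheap_score` stands in for the embedding similarity used by the ANN search, `rerank_score` stands in for a cross-encoder that reads query and candidate together, and both are passed in as plain callables so the example runs without any model.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (query, candidate) -> relevance

def retrieve_then_rerank(
    query: str,
    candidates: Sequence[str],
    cheap_score: Scorer,
    rerank_score: Scorer,
    wide_k: int = 30,
    final_k: int = 5,
) -> list[str]:
    """Stage 1 keeps a wide, recall-first shortlist using the cheap scorer.
    Stage 2 rescoring touches only that shortlist, then keeps the top few."""
    shortlist = sorted(
        candidates, key=lambda c: cheap_score(query, c), reverse=True
    )[:wide_k]
    reranked = sorted(
        shortlist, key=lambda c: rerank_score(query, c), reverse=True
    )
    return reranked[:final_k]

# Toy scorers: word overlap as the cheap signal, and a stricter check
# (all query words present) standing in for the expensive reranker.
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))

def contains_all_words(query: str, doc: str) -> float:
    return 1.0 if set(query.split()) <= set(doc.split()) else 0.0

docs = ["q2 churn report", "q3 revenue report", "q3 churn report", "parking policy"]
best = retrieve_then_rerank(
    "q3 churn", docs, word_overlap, contains_all_words, wide_k=3, final_k=1
)
```

The expensive scorer runs `wide_k` times per query, never once per corpus document, which is the whole reason the pattern is affordable.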

Whatever survives must fit the context-window budget: the model’s window is finite and shared with the system prompt, the question, and the generation. Even with a 128k-token window, more retrieved text is not free or even neutral — which is where ordering bites.

Why this works

“Lost in the middle”: LLMs show a U-shaped attention bias — they attend best to the start and end of the context and worst to the middle, regardless of relevance. When the answer-bearing chunk lands in the middle of a long stuffed prompt, measured accuracy can drop by 30%+ versus the same chunk at the edges. The practical move: don’t dump 50 chunks in retrieval order. Rerank, keep few, and place the strongest evidence at the very start or very end of the assembled context.
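Here is one possible way to apply the budget and the edge placement together, assuming the chunks arrive already ranked best-first by the reranker. The token counter is a crude whitespace approximation (a real system would use the model's tokenizer), and the alternating placement is just one simple scheme that puts the best chunk first, the second-best last, and the weakest in the middle.

```python
from typing import Callable

def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())

def fit_budget(
    ranked_chunks: list[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Keep the longest best-first prefix that fits within the token budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that would overflow
        kept.append(chunk)
        used += cost
    return kept

def order_for_edges(ranked_chunks: list[str]) -> list[str]:
    """Alternate chunks between the front and the back of the context so
    the strongest evidence sits at the edges and the weakest in the middle."""
    front, back = [], []
    for position, chunk in enumerate(ranked_chunks):
        (front if position % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]  # A is the strongest evidence
layout = order_for_edges(ranked)    # ['A', 'C', 'E', 'D', 'B']
```

Run `fit_budget` before `order_for_edges`, so that what gets cut is the lowest-ranked material rather than whatever happened to land at the end of the layout.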

The failure mode that defines RAG in production

The dangerous failure is not a crash; it is a confident wrong answer. When retrieval misses (an ANN miss, bad chunking, a fact that was never indexed, or one that was indexed but is stale), the model still receives some context and is trained to be helpful, so it extrapolates from whatever is nearby and emits a fluent, specific, wrong answer instead of “I don’t know.” The Q3-churn bug in the opening example is exactly this: nearby-but-wrong chunks, blended with confidence.

Two cousins make it worse. Stale index: the source changed, the embeddings didn’t — a query that worked last Tuesday returns last quarter’s policy today, with zero code changes and no error. Poisoned index: an attacker (or a careless ingest) plants a malicious or contradictory chunk; research shows a single injected passage can flip an answer, and the model won’t flag the contradiction — it picks one and presents it as settled fact. The mitigations are all about admitting ignorance: gate on the retrieval similarity score, instruct the model to answer only from provided context and to say “I don’t know” when nothing clears the bar, and keep the index fresh and access-controlled.
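The score gate and the answer-only-from-context instruction can be combined in one small function. This is a sketch under assumptions: the `Hit` type, the `min_score` value of 0.75, and the prompt wording are all hypothetical, and a usable threshold depends on your embedding model and has to be tuned on an eval set.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hit:
    text: str
    score: float  # retrieval similarity, higher means closer

ABSTAIN_MESSAGE = "I don't know: nothing in the indexed documents answers this."

def grounded_prompt_or_none(
    question: str, hits: list[Hit], min_score: float = 0.75
) -> Optional[str]:
    """Return a grounded prompt built only from hits that clear the bar,
    or None when nothing does, so the caller can abstain without calling
    the model at all."""
    usable = [h for h in hits if h.score >= min_score]
    if not usable:
        return None
    context = "\n---\n".join(h.text for h in usable)
    return (
        "Answer using only the context below. If the context does not "
        "contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

hits = [Hit("Q2 churn was 4%.", 0.62), Hit("Q3 revenue grew 10%.", 0.58)]
prompt = grounded_prompt_or_none("What is our Q3 churn?", hits)
reply = ABSTAIN_MESSAGE if prompt is None else "(send prompt to the LLM)"
```

With the near-miss scores from the churn example, nothing clears the bar and the service abstains instead of letting the model blend two wrong chunks. The gate does nothing against a poisoned chunk that scores high, which is why index access control stays on the list.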

Pick the best fit

A legal-doc RAG bot must never fabricate citations, and answers can lag the corpus by an hour. Pick the retrieval setup.

Quiz

In production RAG, a user gets a confident but wrong answer. Where does the fault most often lie?

Quiz

You retrieve 40 candidate chunks and place the most relevant one in the exact middle of the prompt. What does 'lost in the middle' predict?

Order the steps

Order the RAG query-time pipeline from question to answer:

1 Embed the user's question into a query vector
2 ANN top-k retrieval: pull the k nearest chunks (wide, recall-first)
3 Rerank candidates with a cross-encoder; keep the top few
4 Fit survivors into the context budget; order strongest evidence at the edges
5 Assemble the prompt and let the LLM generate (grounded, or abstain)

Most production failures happen upstream of generation: a retrieval miss makes the model invent a confident answer. Retrieve wide (k=20–50), rerank down to a few, fit the context budget, then generate.

Recall before you leave

01 Walk through why retrieval, not generation, is the dominant failure mode in production RAG, and what a retrieval miss actually does.
02 Explain the two-stage retrieve-wide-then-rerank-narrow pattern, and why a single embedding top-k isn't enough.

Recap

RAG is a retrieval problem wearing a generation costume, and most production pain is upstream of the model. Chunking sets the ceiling: small chunks embed precisely but split rules across boundaries, while large chunks keep context but dilute the embedding and hurt recall. So match the chunk to the answer-bearing unit, use ~10–15% overlap, and tune on evals. Embedding dimensionality is a real cost lever (1536 vs 3072 dims; 1024 often ≈ 3072 at a third of the storage), and the ANN index trades a sliver of recall for millisecond search, but “approximate” means it can silently miss the true neighbour. Retrieve wide (k=20–50), then rerank narrow with a cross-encoder to 3–8, fit the context budget, and place the strongest evidence at the edges to dodge lost-in-the-middle, where mid-context accuracy can fall 30%+. The defining failure is a miss that the model papers over with a confident, fluent, wrong answer; stale and poisoned indexes make it worse. The fix is to admit ignorance: gate on the retrieval score, instruct the model to answer only from context and to say “I don’t know,” and keep the index fresh and access-controlled. Now when a RAG bot returns a confident but wrong answer, your first reflex is to check retrieval, not the model.


Apply this

Put this lesson to work on a real build.

Grounded RAG Service

A RAG demo that answers from a corpus is easy; a RAG service you'd trust in front of users is not. The hard part isn't retrieval, it's grounding: making the model say only what the retrieved text supports, attaching citations the reader can check, and proving with an eval set that the answers don't drift into confident fiction. You'll build the whole loop (chunk, embed, store, retrieve top-k, ground, cite, score) and feel exactly where it leaks.