RAG architecture: build and evaluate a retrieval pipeline

Hands-on project — build a RAG pipeline on a real corpus, then measure retrieval quality, grounding, and latency/cost, and prove each improvement with before/after numbers on a held-out eval set.

Track: AI · Level: Senior · Duration: 240 min

Reading about retrieval misses is not the same as catching one on your own corpus. Build a RAG pipeline end to end, then drive it with an eval set until the numbers (recall, grounding, latency, cost) tell you it actually retrieves before it answers instead of confidently inventing a response.

Goal

Turn the unit’s mental model into a measurable engineering loop: chunk and index a corpus, retrieve wide then rerank narrow, assemble context that dodges lost-in-the-middle, gate on score to abstain, and prove every tuning decision with before/after metrics — not vibes.
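One way to assemble context that dodges lost-in-the-middle is to put the strongest chunks at the start and end of the prompt and let the weakest sink toward the middle. The sketch below assumes the chunks arrive sorted best-first from the reranker; `order_best_at_edges` is a hypothetical helper name, not part of any library.

```python
from typing import List, TypeVar

T = TypeVar("T")


def order_best_at_edges(ranked: List[T]) -> List[T]:
    """Reorder best-first items so the strongest sit at both ends.

    Items at even positions (rank 1, 3, 5, ...) fill the front in order.
    Items at odd positions (rank 2, 4, 6, ...) fill the back in reverse,
    so the weakest evidence lands in the middle of the context.
    """
    front = ranked[0::2]
    back = ranked[1::2]
    return front + back[::-1]


# Usage: c1 is the best chunk, c5 the weakest.
ordered = order_best_at_edges(["c1", "c2", "c3", "c4", "c5"])
# ordered == ["c1", "c3", "c5", "c4", "c2"]
```

Whether this ordering actually helps depends on the model and context length, so treat it as a hypothesis to check against your eval set rather than a guaranteed win.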

Project

Objective

Build a working RAG pipeline over a real document corpus and tune it against a held-out evaluation set so that retrieval recall, answer grounding, p99 latency, and per-query cost all hit their targets, with the confident-hallucination failure mode demonstrably contained by an abstain gate.
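The abstain gate can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: `ScoredChunk`, `gate_context`, and `answer` are hypothetical names, `generate` stands in for whatever LLM call you use, and the threshold is an assumption you would tune on the eval set. It also assumes higher scores mean more relevant.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

ABSTAIN_MESSAGE = "I don't know"


@dataclass
class ScoredChunk:
    text: str
    score: float  # reranker relevance; assumed higher = more relevant


def gate_context(
    chunks: List[ScoredChunk], threshold: float, max_chunks: int = 5
) -> Optional[List[ScoredChunk]]:
    """Return the best chunks that clear the gate, or None to abstain."""
    passing = sorted(
        (c for c in chunks if c.score >= threshold),
        key=lambda c: c.score,
        reverse=True,
    )
    return passing[:max_chunks] if passing else None


def answer(
    question: str,
    chunks: List[ScoredChunk],
    threshold: float,
    generate: Callable[[str, List[ScoredChunk]], str],
) -> str:
    """Only call the model when at least one chunk clears the gate."""
    context = gate_context(chunks, threshold)
    if context is None:
        return ABSTAIN_MESSAGE  # a retrieval miss abstains instead of guessing
    return generate(question, context)
```

Placing the gate before generation means an out-of-corpus question never reaches the model with weak evidence, and it also saves the generation tokens for that query.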

Requirements

Acceptance criteria

A before/after table on the SAME eval set: recall@k, final-stage precision/MRR, grounding/faithfulness rate, abstain rate on out-of-corpus questions, p99 latency, and per-query cost — measured, not estimated.
The pipeline correctly abstains ('I don't know') on at least one not-in-corpus question because the retrieval/rerank score fell below the gate, demonstrating that the confident-hallucination defense works.
A latency/cost breakdown by stage showing where the time and tokens go, with the choice of embedding dimensionality or retrieval k justified by its measured effect on recall versus p99 latency.
A short write-up naming which pipeline stage each tuning change targeted (chunking, embedding, retrieval, rerank, assembly, gate) and why that stage was the highest-leverage move for the metric it improved.
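Several of the metrics in the before/after table are simple to compute once each eval question has a labeled set of relevant chunk IDs. The sketch below assumes that labeling exists and uses hypothetical function names; it covers recall@k, MRR, and a nearest-rank p99. Grounding and abstain rates need per-answer judgments and are not shown.

```python
from typing import List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Sequence[str]], labels: Sequence[Set[str]]
) -> float:
    """Average reciprocal rank over all eval questions."""
    if not runs or len(runs) != len(labels):
        raise ValueError("runs and labels must be non-empty and equal length")
    return sum(reciprocal_rank(r, l) for r, l in zip(runs, labels)) / len(runs)


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile: the value at rank ceil(0.99 * n)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered: List[float] = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(99 * n / 100)
    return ordered[rank - 1]
```

Run the same functions over the same eval set before and after each change so the two columns of the table are directly comparable. With small sample counts, p99 is effectively the maximum, so collect enough queries for the tail to mean something.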

Senior stretch

Add hybrid retrieval (BM25/keyword + dense vectors with score fusion) and measure whether it lifts recall on exact-term and acronym queries where pure embeddings miss.
Inject a poisoned/contradictory chunk into the corpus and show your pipeline either abstains, surfaces the conflict, or is hardened by access-control/source-trust filtering — then quantify the impact on grounding.
Add a freshness/re-index job and a test that proves a document update is reflected in answers within the target lag, closing the stale-index failure mode.
Run an ablation on context ordering (best-at-edges vs raw retrieval order vs decisive-chunk-in-middle) on long contexts and quantify the lost-in-the-middle accuracy drop on your own eval set.
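For the hybrid-retrieval stretch, one common fusion choice is reciprocal rank fusion (RRF), which combines ranks rather than raw scores so BM25 and cosine-similarity scales never need normalizing. This is a swapped-in technique for illustration, not necessarily the fusion the lesson intends. The function name is hypothetical, and the constant `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse several best-first rankings of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the rankings it appears
    in, with ranks starting at 1. Ties are broken by document ID so the
    output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Usage: one keyword ranking and one dense ranking for the same query.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d2", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["d1", "d2", "d3", "d4"]
```

Documents found by both retrievers rise to the top, while a document only one retriever finds (such as an exact acronym match from BM25) still survives into the candidate set for the reranker.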

Recap

This is the loop you’ll run on every real RAG system: index with deliberate chunk size and overlap, retrieve wide then rerank narrow, assemble with best evidence at the edges, gate on score so a miss abstains instead of hallucinating, and verify on a held-out eval set with recall, grounding, latency, and cost measured before and after each change. Building and evaluating it once on a real corpus makes the production version — where a silent retrieval miss costs trust — muscle memory.
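The first step of that loop, chunking with a deliberate size and overlap, can be sketched as a sliding window. The version below splits on whitespace purely for illustration; real pipelines usually count tokens and respect sentence or section boundaries, and `chunk_words` is a hypothetical helper name.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size words sharing overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Usage: windows of 4 words, each sharing 2 words with its neighbor.
chunks = chunk_words("a b c d e f g", chunk_size=4, overlap=2)
# chunks == ["a b c d", "c d e f", "e f g"]
```

Chunk size and overlap are exactly the kind of parameters to vary one at a time against the eval set, recording recall@k before and after.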
