LLM evals: build an eval suite and CI gate

Hands-on project — build a golden-set eval suite plus a calibrated judge and a CI regression gate for one real LLM feature, with measurable pass criteria and an online drift check.

AI Senior ◷ 240 min

Reading about evals is not the same as having a gate that blocks your own regression. Take one real LLM feature, build the golden set, score it honestly, calibrate the judge, and wire a CI gate that fails the build when quality drops — then prove it catches a regression you deliberately inject.

Goal

Turn the unit’s model into a working pipeline: a golden set from real inputs, programmatic + calibrated-judge scoring, an offline regression gate in CI with measurable thresholds, and an online drift check — verified by a regression the gate actually blocks.
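As a starting point, the golden set and a programmatic scorer might look like the sketch below. It assumes a structured-extraction feature whose output is a JSON object; the `GoldenCase` field names, the JSONL layout, and the function names are illustrative choices, not something the lesson prescribes.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One golden-set record (hypothetical schema)."""
    case_id: str
    category: str    # tag used later for per-category reporting
    source: str      # provenance: where the real input came from
    input_text: str
    expected: dict   # fields the model output must reproduce


def load_golden_set(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per non-empty line into GoldenCase records."""
    return [
        GoldenCase(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def score_extraction(case: GoldenCase, model_output: str) -> bool:
    """Programmatic scorer: the output must be a JSON object that
    matches every expected field. Extra fields are tolerated."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(k in parsed and parsed[k] == v for k, v in case.expected.items())


def category_counts(cases: list[GoldenCase]) -> Counter:
    """Count cases per category, to spot thin or missing categories."""
    return Counter(case.category for case in cases)
```

A check like this is cheap and deterministic, which is why it comes before any judge: only the dimensions it cannot express (tone, faithfulness, helpfulness) need a calibrated judge.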

Project

Objective

Pick one LLM feature (a RAG Q&A endpoint, a classifier, a structured-extraction call, or a summarizer) and ship an eval suite plus a CI regression gate for it, such that a deliberately injected quality regression fails the build — and prove every claim with numbers.

Requirements

Acceptance criteria

A documented golden set of 50+ categorized cases with provenance, plus the per-dimension definition of 'good enough'.
A reported judge-calibration number (agreement with human labels) that clears your stated bar, with the position-bias check shown; if it doesn't clear the bar, the judge is not used as a gate and that decision is documented.
A CI run that PASSES on the baseline and a separate CI run that FAILS on the injected-regression branch, with the gate output showing the per-category delta that triggered the failure (not just an aggregate).
A short write-up: which scorer you chose per dimension and why, the threshold you gated on, and how the online sample would surface a regression the offline gate structurally cannot.
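The judge-calibration number and the position-bias check in the criteria above could be computed roughly as follows. This is a sketch under stated assumptions: labels are plain strings, Cohen's kappa is one possible agreement measure (the lesson only asks for "agreement with human labels"), pairwise verdicts are `"first"`, `"second"`, or `"tie"`, and the bars passed to `judge_usable_as_gate` are placeholders you would set yourself.

```python
def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of cases where the judge label equals the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def position_bias_rate(verdicts_ab: list[str], verdicts_ba: list[str]) -> float:
    """Fraction of pairs where the preferred answer changes when the
    same two answers are shown in swapped order.

    verdicts_ab[i] is the verdict with order (A, B);
    verdicts_ba[i] is the verdict with order (B, A).
    """
    if not verdicts_ab or len(verdicts_ab) != len(verdicts_ba):
        raise ValueError("verdict lists must be non-empty and equal length")
    winner_ab = {"first": "A", "second": "B", "tie": "tie"}
    winner_ba = {"first": "B", "second": "A", "tie": "tie"}
    flips = sum(
        winner_ab[ab] != winner_ba[ba]
        for ab, ba in zip(verdicts_ab, verdicts_ba)
    )
    return flips / len(verdicts_ab)


def judge_usable_as_gate(kappa: float, bias_rate: float,
                         min_kappa: float, max_bias_rate: float) -> bool:
    """Apply your stated bar; both thresholds are your own choice."""
    return kappa >= min_kappa and bias_rate <= max_bias_rate
```

If `judge_usable_as_gate` returns `False`, the judge stays out of the gate and the write-up records that decision along with the numbers that led to it.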

Senior stretch

Add an embedding-distribution drift alert on the online sample: flag when production inputs land meaningfully far from anything in the golden set, and feed those cases back into the set.
Run an A/B test: serve a candidate prompt to a traffic slice, score both arms with the suite, and tie the quality delta to one business metric (resolution rate, escalation rate, etc.).
Add a judge-robustness harness that runs each judged case in both answer orders and reports the position-bias rate, failing CI if it exceeds a threshold.
Make the gate widen automatically: a script that turns each new logged production failure into a golden case (with a category tag) as part of the incident workflow.
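For the embedding-distribution drift alert in the first stretch item, one simple approach is a nearest-neighbour distance check against the golden set. The sketch below assumes embeddings are already computed by whatever model you use and passed in as plain lists of floats; cosine distance, the function names, and both thresholds are illustrative assumptions, not the lesson's prescribed method.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; zero-length vectors count as maximally far (1.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_golden_distance(vec: list[float],
                            golden_vecs: list[list[float]]) -> float:
    """Distance from one production input to its closest golden case."""
    return min(cosine_distance(vec, g) for g in golden_vecs)


def find_drifted(online: dict[str, list[float]],
                 golden_vecs: list[list[float]],
                 max_distance: float) -> list[str]:
    """IDs of sampled production inputs that sit far from every golden case."""
    return [
        input_id
        for input_id, vec in online.items()
        if nearest_golden_distance(vec, golden_vecs) > max_distance
    ]


def drift_alert(online: dict[str, list[float]],
                golden_vecs: list[list[float]],
                max_distance: float,
                max_drift_rate: float) -> tuple[bool, list[str]]:
    """Return (alert, drifted_ids). The alert fires when the share of
    far-away inputs in the sample exceeds max_drift_rate."""
    if not online:
        return False, []
    drifted = find_drifted(online, golden_vecs, max_distance)
    return len(drifted) / len(online) > max_drift_rate, drifted
```

The returned IDs are the candidates to label and feed back into the golden set, each with a category tag.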

Recap

This is the loop you run for every real LLM feature: define the quality contract, build a golden set from real traffic, score with the cheapest honest method (programmatic before a calibrated judge), gate offline in CI on per-category deltas rather than the mean, and sample online for the drift that offline evals can’t see. The proof is not a green suite; it’s a suite that turns red on a regression you injected. Build it once on one feature and the production version becomes muscle memory.
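To see why the gate compares per-category deltas rather than the mean, consider this minimal sketch. The category names, pass rates, and the `max_drop` threshold are made-up illustration values, and `run_gate` is a hypothetical entry point whose return value a CI step would pass to `sys.exit`.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Turn (category, passed) pairs into a pass rate per category."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def gate(baseline: dict[str, float], candidate: dict[str, float],
         max_drop: float) -> list[tuple[str, float, float, float]]:
    """Return (category, baseline, candidate, delta) for every category
    whose pass rate fell by more than max_drop."""
    failures = []
    for category, base in sorted(baseline.items()):
        # A category missing from the candidate run counts as a total failure.
        cand = candidate.get(category, 0.0)
        delta = cand - base
        if delta < -max_drop:
            failures.append((category, base, cand, delta))
    return failures


def run_gate(baseline: dict[str, float], candidate: dict[str, float],
             max_drop: float = 0.05) -> int:
    """Print the triggering deltas and return a process exit code
    (0 = pass, 1 = fail) for CI to propagate."""
    failures = gate(baseline, candidate, max_drop)
    for category, base, cand, delta in failures:
        print(f"FAIL {category}: {base:.2f} -> {cand:.2f} (delta {delta:+.2f})")
    return 1 if failures else 0


# Illustration values only: the average of the category rates is the same
# in both runs, yet "refunds" regressed by 20 points.
baseline = {"billing": 0.90, "refunds": 0.90, "shipping": 0.80}
candidate = {"billing": 0.98, "refunds": 0.70, "shipping": 0.92}
exit_code = run_gate(baseline, candidate)
```

A gate on the average alone would pass this candidate, because gains in two categories cancel out the loss in the third. The per-category gate fails it and names the category that triggered the failure.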
