
AI / LLM Integration

LLM evals: build an eval suite and CI gate

Crux: a hands-on project. Build a golden-set eval suite, a calibrated judge, and a CI regression gate for one real LLM feature, with measurable pass criteria and an online drift check.
◷ 240 min

Reading about evals is not the same as having a gate that blocks your own regression. Take one real LLM feature, build the golden set, score it honestly, calibrate the judge, and wire a CI gate that fails the build when quality drops — then prove it catches a regression you deliberately inject.

Goal

Turn the unit’s model into a working pipeline: a golden set from real inputs, programmatic + calibrated-judge scoring, an offline regression gate in CI with measurable thresholds, and an online drift check — verified by a regression the gate actually blocks.
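The "calibrated judge" part of the goal comes down to comparing judge verdicts against human labels on the same cases. Below is a minimal sketch, assuming binary pass/fail labels. Raw agreement is what the acceptance criteria ask for. Cohen's kappa is an addition of this sketch (one common chance-corrected measure), and the labels are made-up illustration data, not results.

```python
from collections import Counter

def agreement(human, judge):
    """Fraction of cases where the judge verdict matches the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if p_expected == 1:
        return float("nan")  # single label everywhere: kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical labels for ten golden cases (illustration only).
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

raw = agreement(human, judge)       # 8 of 10 cases match
kappa = cohens_kappa(human, judge)  # lower than raw, since some agreement is chance
```

Whichever number you report, state the bar before you compute it, so the judge is accepted or rejected as a gate by a rule you set in advance.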

Project
Objective

Pick one LLM feature (a RAG Q&A endpoint, a classifier, a structured-extraction call, or a summarizer) and ship an eval suite plus a CI regression gate for it, such that a deliberately injected quality regression fails the build — and prove every claim with numbers.

Requirements
Acceptance criteria
  • A documented golden set of 50+ categorized cases with provenance, plus the per-dimension definition of 'good enough'.
  • A reported judge-calibration number (agreement with human labels) that clears your stated bar, with the position-bias check shown; if it doesn't clear the bar, the judge is not used as a gate and that decision is documented.
  • A CI run that PASSES on the baseline and a separate CI run that FAILS on the injected-regression branch, with the gate output showing the per-category delta that triggered the failure (not just an aggregate).
  • A short write-up: which scorer you chose per dimension and why, the threshold you gated on, and how the online sample would surface a regression the offline gate structurally cannot.
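The third criterion asks for a gate that reports per-category deltas rather than an aggregate. Here is a minimal sketch of that check, assuming each eval result is a `(category, passed)` pair. The function names, the `max_drop` threshold of 0.05, and the sample data are all hypothetical. In the sample, both runs pass 45 of 50 cases overall, so an aggregate-only gate would stay green while the `refund` category collapses.

```python
from collections import defaultdict

def pass_rates(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

def gate(baseline, candidate, max_drop=0.05):
    """Compare candidate to baseline per category.

    Returns (ok, deltas, failures). A category missing from the candidate
    run is treated as a pass rate of 0.0, so dropped cases fail loudly.
    """
    base = pass_rates(baseline)
    cand = pass_rates(candidate)
    deltas = {c: cand.get(c, 0.0) - base[c] for c in base}
    failures = {c: d for c, d in deltas.items() if d < -max_drop}
    return (not failures), deltas, failures

# Hypothetical runs: identical aggregate (45/50), very different categories.
baseline = ([("shipping", True)] * 36 + [("shipping", False)] * 4
            + [("refund", True)] * 9 + [("refund", False)] * 1)
candidate = ([("shipping", True)] * 40
             + [("refund", True)] * 5 + [("refund", False)] * 5)

ok, deltas, failures = gate(baseline, candidate)
for category, delta in sorted(deltas.items()):
    status = "FAIL" if category in failures else "ok"
    print(f"{category:10s} delta={delta:+.2f} {status}")
```

In a CI job, the calling script would exit non-zero when `ok` is false (for example with `sys.exit(1)`), so the printed per-category lines become the gate output the criterion asks for.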
Senior stretch
  • Add an embedding-distribution drift alert on the online sample: flag when production inputs land meaningfully far from anything in the golden set, and feed those cases back into the set.
  • Run an A/B test: serve a candidate prompt to a traffic slice, score both arms with the suite, and tie the quality delta to one business metric (resolution rate, escalation rate, etc.).
  • Add a judge-robustness harness that runs each judged case in both answer orders and reports the position-bias rate, failing CI if it exceeds a threshold.
  • Make the gate widen automatically: a script that turns each new logged production failure into a golden case (with a category tag) as part of the incident workflow.
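The first stretch item, the embedding-distribution drift alert, can be sketched as a nearest-neighbor check against the golden set. This sketch assumes embeddings are already computed by whatever model you use. The cosine-similarity measure, the `min_similarity` and `max_far_fraction` thresholds, and the toy two-dimensional vectors are illustrative assumptions, not recommended values.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_golden_similarity(vec, golden_vecs):
    """Similarity between vec and its closest golden-set embedding."""
    return max(cosine_similarity(vec, g) for g in golden_vecs)

def drift_report(online_vecs, golden_vecs, min_similarity=0.8, max_far_fraction=0.1):
    """Flag online inputs that are far from every golden case.

    An input is "far" when its best similarity is below min_similarity.
    The alert fires when the far fraction exceeds max_far_fraction.
    """
    far = [i for i, v in enumerate(online_vecs)
           if nearest_golden_similarity(v, golden_vecs) < min_similarity]
    fraction = len(far) / len(online_vecs)
    return {"far_indices": far, "far_fraction": fraction,
            "alert": fraction > max_far_fraction}

# Toy 2-D embeddings (real ones would have hundreds of dimensions).
golden = [[1.0, 0.0], [0.0, 1.0]]
online = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [-1.0, 0.0]]

report = drift_report(online, golden)
```

The indices in `far_indices` are the candidates to review and feed back into the golden set, each with a category tag.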
Recap

This is the loop you run for every real LLM feature: define the quality contract, build a golden set from real traffic, score with the cheapest honest method (programmatic before a calibrated judge), gate offline in CI on per-category deltas — not the mean — and sample online for the drift offline can’t see. The proof is not a green suite; it’s a suite that turns red on a regression you injected. Build it once on one feature and the production version becomes muscle memory.

