AI / LLM Integration
LLM evals: build an eval suite and CI gate
Reading about evals is not the same as having a gate that blocks your own regression. Take one real LLM feature, build the golden set, score it honestly, calibrate the judge, and wire a CI gate that fails the build when quality drops — then prove it catches a regression you deliberately inject.
Turn the unit’s model into a working pipeline: a golden set built from real inputs, programmatic and calibrated-judge scoring, an offline regression gate in CI with measurable thresholds, and an online drift check, all verified by an injected regression that the gate actually blocks.
Pick one LLM feature (a RAG Q&A endpoint, a classifier, a structured-extraction call, or a summarizer) and ship an eval suite plus a CI regression gate for it, such that a deliberately injected quality regression fails the build — and prove every claim with numbers.
- A documented golden set of 50+ categorized cases with provenance, plus the per-dimension definition of 'good enough'.
- A reported judge-calibration number (agreement with human labels) that clears your stated bar, with the position-bias check shown; if it doesn't clear the bar, the judge is not used as a gate and that decision is documented.
- A CI run that PASSES on the baseline and a separate CI run that FAILS on the injected-regression branch, with the gate output showing the per-category delta that triggered the failure (not just an aggregate).
- A short write-up: which scorer you chose per dimension and why, the threshold you gated on, and how the online sample would surface a regression the offline gate structurally cannot.
- Add an embedding-distribution drift alert on the online sample: flag when production inputs land meaningfully far from anything in the golden set, and feed those cases back into the set.
- Run an A/B test: serve a candidate prompt to a traffic slice, score both arms with the suite, and tie the quality delta to one business metric (resolution rate, escalation rate, etc.).
- Add a judge-robustness harness that runs each judged case in both answer orders and reports the position-bias rate, failing CI if it exceeds a threshold.
- Make the gate widen automatically: a script that turns each new logged production failure into a golden case (with a category tag) as part of the incident workflow.
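The judge-calibration number and the position-bias rate mentioned in the lists above can be computed in a few lines of plain Python. The sketch below is one possible shape, not a prescribed implementation: the function names, the label values (`pass`/`fail`, `first`/`second`), and the default thresholds of 0.85 and 0.10 are illustrative assumptions, so substitute the bar you actually stated for your feature.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of cases where the judge label equals the human label."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    judge, human = list(judge), list(human)
    n = len(judge)
    p_observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    p_chance = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if p_chance == 1.0:  # both raters used a single identical label
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


def position_bias_rate(picked_ab: Sequence[str], picked_ba: Sequence[str]) -> float:
    """Share of pairwise cases where the judge picked the same POSITION
    ("first" or "second") in both answer orders. A consistent judge picks
    the same answer, which means the opposite position after the swap."""
    if len(picked_ab) != len(picked_ba) or len(picked_ab) == 0:
        raise ValueError("need two non-empty verdict lists of equal length")
    same_position = sum(a == b for a, b in zip(picked_ab, picked_ba))
    return same_position / len(picked_ab)


def judge_may_gate(agreement: float, bias_rate: float,
                   min_agreement: float = 0.85,   # assumed bar, not from the brief
                   max_bias_rate: float = 0.10) -> bool:  # assumed bar
    """True only if the judge clears both bars; otherwise do not gate on it."""
    return agreement >= min_agreement and bias_rate <= max_bias_rate
```

Reporting chance-corrected agreement next to raw agreement is a design choice worth considering: when most cases are `pass`, a judge that always answers `pass` scores high raw agreement but a kappa near zero.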
This is the loop you run for every real LLM feature: define the quality contract, build a golden set from real traffic, score with the cheapest honest method (programmatic checks before a calibrated judge), gate offline in CI on per-category deltas rather than the mean, and sample online for the drift the offline gate can’t see. The proof is not a green suite; it’s a suite that turns red on a regression you injected. Build it once on one feature and the production version becomes muscle memory.
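To make the "per-category deltas, not the mean" point concrete, here is a minimal sketch of a gate function. Everything in it is an assumption for illustration: the category names, the scores, the 0.05 allowed drop, and the function names `gate` and `main` are hypothetical, and a real suite would load both score maps from its own eval runs.

```python
from typing import Dict, List


def gate(baseline: Dict[str, float], candidate: Dict[str, float],
         max_drop: float = 0.05) -> List[str]:
    """Return one failure message per category whose score dropped by more
    than max_drop, or that is missing from the candidate run."""
    failures = []
    for category, base in sorted(baseline.items()):
        cand = candidate.get(category)
        if cand is None:
            failures.append(f"{category}: missing from candidate run")
            continue
        delta = cand - base
        if delta < -max_drop:
            failures.append(f"{category}: {base:.2f} -> {cand:.2f} (delta {delta:+.2f})")
    return failures


def mean(scores: Dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)


# Hypothetical scores: two categories improve, one regresses badly.
BASELINE = {"billing": 0.90, "refunds": 0.80, "shipping": 0.90}
CANDIDATE = {"billing": 0.95, "refunds": 0.60, "shipping": 0.95}


def main() -> int:
    """Print the per-category failures and return a process exit code."""
    failures = gate(BASELINE, CANDIDATE)
    for line in failures:
        print("GATE FAIL", line)
    return 1 if failures else 0

# In a CI script you would end with: sys.exit(main())
```

With these made-up numbers the mean falls by only about 0.03, so an aggregate-only gate with the same 0.05 tolerance would pass, while the per-category gate fails on `refunds`. A non-zero exit code is what makes the CI step fail.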