
LLM evals: the regression test for non-deterministic features

You can’t ship an LLM feature without evals: the same input gives different output, and a model or prompt change silently regresses quality. Golden sets, programmatic checks, and a validated judge are how you catch it before users do.

A provider bumps your model from one snapshot to the next overnight — same model name, same prompt, same temperature. Nothing in your repo changed, so CI is green and you deploy on Friday. By Monday support is flooded: the assistant now answers in markdown tables the parser chokes on, and refuses a class of valid questions it used to handle. There was no exception, no failed test, no log line that said “worse.” The only signal was customers, three days late. The thing that would have caught it on Friday — a suite that re-ran your real prompts and scored the output against known-good answers — was never written, because “it worked when I tried it.”

Why “I tried it and it worked” is not testing

A normal unit test asserts f(x) === y. An LLM does not give you that contract: the same prompt at the same temperature can return different text on each call, and even at temperature: 0 you are not guaranteed identical output across model versions or even across requests, because providers batch and route non-deterministically. So the assertion you actually want is not “output equals this exact string” but “output is good enough on the dimensions I care about” — correct, on-format, on-policy, grounded in the retrieved context.

That shift breaks every habit from deterministic testing. You can’t pin one expected value, so you score a distribution of behavior over many examples. You can’t trust a single green run, because the next run may differ. And the failure mode is silent: nothing throws. A model swap, a prompt edit, a new system message, a changed retrieval index — any of these can quietly drop your answer quality by 15% with zero error signal. Evals are the only mechanism that turns “quietly worse” into a number that fails CI.
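As a minimal sketch of scoring a distribution rather than asserting one string, the snippet below runs each prompt several times and reports the fraction of outputs that pass a check. The names `pass_rate`, `fake_model`, and `mentions_paris` are hypothetical, and `fake_model` is a canned stand-in for a real model call.

```python
import itertools
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    check: Callable[[str, str], bool],
    prompts: list[str],
    runs_per_prompt: int = 3,
) -> float:
    """Fraction of (prompt, run) pairs whose output passes `check`."""
    passed = 0
    total = 0
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            # Re-run the same prompt: a single green run proves little.
            passed += bool(check(prompt, generate(prompt)))
            total += 1
    return passed / total if total else 0.0


# Hypothetical stand-in for a model call: alternates a good and a bad answer.
_canned = itertools.cycle(["Paris is the capital.", "I can't help with that."])


def fake_model(prompt: str) -> str:
    return next(_canned)


def mentions_paris(prompt: str, output: str) -> bool:
    return "paris" in output.lower()


rate = pass_rate(
    fake_model,
    mentions_paris,
    ["Capital of France?", "Where is the Louvre?"],
    runs_per_prompt=2,
)
print(rate)  # 0.5
```

The result is a rate, not a boolean, which is what lets a later gate compare it to a baseline instead of demanding an exact match.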

The golden dataset: real traffic, not happy paths

An eval is only as honest as its dataset. The golden set is a curated collection of inputs paired with known-good outputs (or with a rubric, when there’s no single right answer). The common mistake is to seed it with the examples you used while building: the easy, clean, on-distribution ones. Those are exactly the cases that already work. A useful golden set is built from production: real user queries, deliberately oversampling edge cases, past failures, and high-value flows. The discipline that pays for itself: every production failure becomes a golden case the same day, so a fix is provable and the same bug can never regress silently again. When you triage a production incident, ask yourself: is there already a golden case for this failure? If not, add one before closing the ticket.

Size is a judgment call, not a fixed number. Teams often start useful at 50–200 well-chosen cases and grow into the thousands as categories accumulate. Coverage beats raw count: 80 cases that span your real query categories catch more than 2,000 near-duplicate easy ones. The number that actually matters is not dataset size — it’s how much of your live input distribution the set represents, which is the failure mode we return to below.
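One way to make "coverage beats raw count" checkable is to tag each golden case with a category and compare against the categories seen in live traffic. The sketch below assumes you already have per-category traffic shares from your own logging; the `GoldenCase` fields, the `coverage_gaps` helper, and the sample categories are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    category: str    # e.g. "refund", "billing"
    source: str      # "production", "incident", or "synthetic"
    input_text: str
    expected: str    # known-good answer, or a rubric if there is no single answer


def coverage_gaps(
    cases: list[GoldenCase],
    live_share: dict[str, float],
    min_cases: int = 2,
) -> list[str]:
    """Live categories with fewer than `min_cases` golden cases, biggest traffic first."""
    counts = Counter(case.category for case in cases)
    gaps = [cat for cat in live_share if counts[cat] < min_cases]
    return sorted(gaps, key=lambda cat: live_share[cat], reverse=True)


golden = [
    GoldenCase("g1", "refund", "production", "Where is my refund?", "Explain refund status"),
    GoldenCase("g2", "refund", "incident", "Refund sent twice", "Escalate to billing"),
    GoldenCase("g3", "refund", "production", "Cancel my refund", "Explain cancellation"),
    GoldenCase("g4", "billing", "production", "Why was I charged?", "Explain the charge"),
]
# Assumed share of live traffic per category.
live = {"refund": 0.5, "billing": 0.3, "spanish": 0.2}

print(coverage_gaps(golden, live))  # ['billing', 'spanish']
```

Three refund cases look healthy by count, yet the set has one billing case and none for a category carrying a fifth of traffic.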

Exact checks vs LLM-as-judge

Score each case with the cheapest method that works. Two families:

Scorer	Cost / speed	Flaky?	Use when
Exact / programmatic (regex, JSON-match, schema, code exec)	~free, ms	No — deterministic	Output has structure: valid JSON, a number, a class label, parseable format
LLM-as-judge (a model grades the output against a rubric)	1 extra API call/case, seconds	Yes — noisy, biased	Open-ended quality: helpfulness, tone, faithfulness, “did it answer”

Prefer programmatic checks wherever the output has any structure: they’re nearly free, instant, and deterministic, so a changed result always means the output changed, not the scorer. Reach for a judge only for the genuinely subjective dimensions. A judge means you make one extra model call per case to grade it: hand the judge the input, the output, and a rubric, and have it return a parseable verdict (a yes/no or a multiple-choice score), as OpenAI’s model-graded eval template prescribes.
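A programmatic scorer can be as small as the sketch below, which checks that a classifier-style output is valid JSON with an allowed label and a confidence between 0 and 1. The output shape, the `ALLOWED_LABELS` set, and the function name are assumptions for illustration, not a schema from any particular product.

```python
import json

ALLOWED_LABELS = {"billing", "refund", "other"}  # hypothetical label set


def check_classification(raw_output: str) -> tuple[bool, str]:
    """Deterministic check: returns (passed, reason)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"

    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return False, "label missing or not allowed"

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False, "confidence is not a number"
    if not 0.0 <= confidence <= 1.0:
        return False, "confidence out of range"
    return True, "ok"


print(check_classification('{"label": "refund", "confidence": 0.9}'))  # (True, 'ok')
print(check_classification("| label | refund |"))  # (False, 'not valid JSON')
```

Returning a reason alongside the verdict makes a failing eval run readable without opening the raw outputs.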

The judge is itself a noisy instrument — calibrate it

Here is the trap that burns teams: an LLM judge is not a ground truth, it is another stochastic model with documented biases. A well-built judge can reach roughly 80% agreement with human raters — about the level two humans agree with each other — but only when calibrated; an uncalibrated one drifts well below that. The known failure modes are specific and reproducible:

Position bias. In pairwise comparison, swapping which answer comes first can flip the verdict. Studies of GPT-4 as judge show its preference reversing when answer positions are swapped; the judge model itself is the largest driver of this bias, more than task difficulty or output length. Mitigate by running both orders and averaging.
Self-preference. A model tends to rate its own generations higher, with a measured linear correlation between how well it recognizes its own text and how much it favors it. Don’t grade GPT-4 output with a GPT-4 judge if you can avoid it.
Verbosity bias. Judges systematically prefer longer answers, independent of correctness.
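The "run both orders and average" mitigation for position bias can be sketched as follows. The judge is assumed to be any callable that takes a question and two answers and returns "first", "second", or "tie"; that interface and the two stub judges are hypothetical, standing in for a real model call.

```python
from typing import Callable

# Assumed judge interface: (question, first_answer, second_answer) -> verdict
Judge = Callable[[str, str, str], str]

_SCORE_FOR_FIRST = {"first": 1.0, "tie": 0.5, "second": 0.0}


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Score for answer_a in [0, 1], averaged over both presentation orders."""
    # Order 1: A is shown first, so "first" is a vote for A.
    forward = _SCORE_FOR_FIRST[judge(question, answer_a, answer_b)]
    # Order 2: A is shown second, so "first" is a vote for B.
    backward = 1.0 - _SCORE_FOR_FIRST[judge(question, answer_b, answer_a)]
    return (forward + backward) / 2


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Stub that always prefers whatever it sees first."""
    return "first"


def length_judge(question: str, first: str, second: str) -> str:
    """Stub that prefers the longer answer regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(debiased_preference(position_biased_judge, "q", "short", "a longer answer"))  # 0.5
print(debiased_preference(length_judge, "q", "short", "a longer answer"))  # 0.0
```

A purely position-driven judge collapses to 0.5, which correctly reads as "no signal". The second stub shows the limit of the technique: swapping order does nothing against verbosity bias, which needs its own control.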

The non-negotiable mechanism: validate the judge against human labels before you trust it. OpenAI’s own guidance is to add a meta-eval with human-provided labels to check the model-graded eval. Hand-label a sample, measure the judge’s agreement against those labels, and only ship the judge once it clears your bar (teams target ~75–90% agreement). A judge you never measured is a number generator you’ve mistaken for a test.
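The meta-eval itself is simple arithmetic. The sketch below assumes pass/fail verdicts from the judge and from human labelers on the same sample; the function names and the ten sample labels are made up for illustration. It also reports how often the judge passes a case humans failed, since that is the direction that lets regressions through.

```python
def judge_agreement(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the judge matches the human label."""
    if len(judge_verdicts) != len(human_labels):
        raise ValueError("verdicts and labels must cover the same cases")
    if not human_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)


def false_pass_rate(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Of the cases humans failed, the fraction the judge passed anyway."""
    judged_on_failures = [j for j, h in zip(judge_verdicts, human_labels) if not h]
    if not judged_on_failures:
        return 0.0
    return sum(judged_on_failures) / len(judged_on_failures)


# Illustrative labels only: True means "acceptable answer".
human = [True, True, False, False, True, False, True, True, False, True]
judge = [True, True, True, False, True, True, True, True, False, True]

print(judge_agreement(judge, human))   # 0.8
print(false_pass_rate(judge, human))   # 0.5
```

Here 80% agreement would clear a typical bar, yet the judge waves through half of the answers humans rejected, so it is worth looking at both numbers before promoting a judge to a gate.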

Why this works

“The judge agreed with itself across runs” is not calibration — it’s just the judge being consistently wrong in the same direction. Consistency (low variance) and accuracy (agreement with humans) are different axes. A confidently biased judge is more dangerous than a noisy one, because its stability reads as trustworthiness while it steadily green-lights regressions humans would have caught.

Offline gates vs online eval

Two places evals run, and you need both:

Offline — the eval suite runs in CI against the golden set on every prompt edit, model bump, or retrieval change. It produces a score; a regression gate fails the build if the score drops below baseline (or below the previous release). This is what would have blocked the Friday deploy. Add new production failures to the golden set so the gate widens over time.
Online — you sample a slice of real production traffic (say 1–5%), score it with the same automated methods, and watch the trend. Online catches what offline structurally cannot: a silent provider model update, a shift in what users are actually asking, retrieval index drift. A/B tests compare a candidate against the live version on real users and tie quality to business metrics.
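Both halves can be sketched in a few lines. The gate below fails when the suite score drops more than a tolerance below the baseline, and the sampler picks a stable slice of requests for online scoring by hashing the request ID. The function names, the 0.02 tolerance, and the 2% default rate are illustrative assumptions, not recommended values.

```python
import hashlib


def regression_gate(current: float, baseline: float, tolerance: float = 0.02) -> tuple[bool, str]:
    """Pass unless the score dropped more than `tolerance` below the baseline."""
    drop = baseline - current
    if drop > tolerance:
        return False, f"FAIL: score {current:.3f} is {drop:.3f} below baseline {baseline:.3f}"
    return True, f"OK: score {current:.3f} vs baseline {baseline:.3f}"


def in_online_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select about `rate` of requests for online scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets; the same ID always lands in the same bucket.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < round(rate * 10_000)


ok, message = regression_gate(current=0.85, baseline=0.91)
print(message)  # FAIL: score 0.850 is 0.060 below baseline 0.910
# In a CI script you would exit non-zero on failure, for example:
#     sys.exit(0 if ok else 1)
```

Hashing the request ID rather than calling a random number generator keeps the sample reproducible, so a flagged request can be re-scored later and still be in the sample.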

Order the steps

Order the lifecycle of catching an LLM regression before users do:

1 Build a golden set from real production traffic, oversampling edge cases and past failures
2 Score each case: programmatic check where output has structure, LLM-judge for open-ended quality
3 Validate the judge against human labels; only trust it once agreement clears your bar
4 Wire the suite as a CI regression gate that fails the build if the score drops below baseline
5 Sample 1-5% of production online to catch provider drift and input-distribution shift the golden set misses

The failure that survives a green suite

The worst outcome is not a failing eval — it’s an eval suite that is green while real users hit failures. It happens two ways, and both are about trusting a number you shouldn’t.

The first: the golden set doesn’t cover the live distribution. Your suite passes at 95% because it’s full of the queries you anticipated, while production traffic has drifted toward a category you never sampled — a new language, a new question type, a new document format. The eval is measuring a world that no longer matches the one users live in. The defense is online monitoring: track the embedding distribution of real inputs and alert when production queries land meaningfully far from anything in the golden set, then pull those into it.
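A rough version of that defense: for each sampled production query, find its nearest neighbor in the golden set by cosine similarity and flag the ones that fall below a threshold. The sketch assumes embeddings are already computed by whatever model you use; the 2-dimensional vectors, the 0.8 threshold, and the function names are placeholders.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def uncovered_queries(
    production: dict[str, list[float]],
    golden: list[list[float]],
    min_similarity: float = 0.8,
) -> list[str]:
    """IDs of production queries whose nearest golden case is below the threshold."""
    flagged = []
    for query_id, vector in production.items():
        nearest = max((cosine_similarity(vector, g) for g in golden), default=0.0)
        if nearest < min_similarity:
            flagged.append(query_id)
    return flagged


# Toy vectors standing in for real embeddings.
golden_vectors = [[1.0, 0.0], [0.9, 0.1]]
production_vectors = {
    "q1": [1.0, 0.05],  # close to existing golden cases
    "q2": [0.0, 1.0],   # a direction the golden set never sampled
}

print(uncovered_queries(production_vectors, golden_vectors))  # ['q2']
```

The flagged IDs are candidates to label and pull into the golden set. A linear scan like this is fine for a sample of traffic; a large golden set would want a vector index instead.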

The second: you trusted a noisy judge. The suite is green because a biased judge keeps scoring a verbose, confidently-wrong answer as “good.” You shipped a number-generator’s opinion, not a measurement. The defense is the calibration step above — and treating a sudden jump in judge scores after a prompt change as a reason to re-check the judge, not to celebrate.

Quiz

A provider silently updates your model snapshot. Your CI eval suite is green. What most likely caught — or missed — the regression?

Quiz

You add an LLM judge and its scores look great. Before trusting it as a gate, the senior move is to:

Pick the best fit

Your feature returns a structured JSON object with a required status field and a free-text rationale. Pick the scoring strategy for the eval.

Build a golden set from real traffic, score each case (programmatic check where output has structure, a calibrated LLM-judge only for open-ended quality), validate that judge against human labels, wire the suite as a CI regression gate, then sample 1–5% of production online to catch provider drift and distribution shift the golden set misses.

Recall before you leave

01
A teammate says 'the eval suite is green, so the new prompt is safe to ship.' Give two distinct ways that green suite can still be hiding a real regression users will hit.
02
Why is an LLM-as-judge cheaper and faster to build than human review but riskier than a programmatic check, and what's the one step that makes it trustworthy?

Recap

An LLM feature with no eval suite is untested code, because non-determinism means the same prompt can return different output and a model or prompt change can quietly drop quality with no error, no failed test, and no log line — only customers, days late. Evals turn that silent regression into a number CI can fail. The honest version starts with a golden set built from real production traffic, oversampling edge cases and turning every production failure into a golden case the same day; coverage of the live distribution matters more than raw size. Score each case with the cheapest method that works: programmatic checks (regex, JSON-match, schema, code execution) wherever output has structure, since they’re free and never flaky, and an LLM-as-judge only for open-ended quality. But the judge is itself a noisy, biased instrument — position bias, self-preference, verbosity bias — so validate it against human labels and only trust it once agreement clears your bar (~75-90%, near human-human agreement). Run the suite offline as a CI regression gate that blocks a deploy when the score drops, and sample 1-5% of production online to catch the silent provider update and the input-distribution drift offline structurally can’t see. The two ways a green suite still lies: a golden set that no longer matches the live distribution, and a noisy judge you trusted without calibration — defend against both, or you ship regressions with a passing test. Now when you see a CI eval suite turn green on a Friday deploy, the first question to ask is not “did it pass?” but “does our golden set still cover what users are actually sending, and did we measure the judge against human labels?”

Apply this

Put this lesson to work on a real build.

Grounded RAG Service

A RAG demo that answers from a corpus is easy; a RAG service you'd trust in front of users is not. The hard part isn't retrieval, it's grounding: making the model say only what the retrieved text supports, attaching citations the reader can check, and proving with an eval set that the answers don't drift into confident fiction. You'll build the whole loop (chunk, embed, store, retrieve top-k, ground, cite, score) and feel exactly where it leaks.