LLM evals: multiple-choice review

Multiple-choice synthesis across the evals unit — golden sets, programmatic vs judge scoring, judge calibration, offline gates vs online sampling, and the green-suite failure modes.

AI Senior ◷ 13 min

Six questions that cut across the whole unit. Each mirrors a call you make shipping a real LLM feature — not a definition to recite, but a tradeoff to weigh when the suite is green and you have to decide whether you actually believe it.

Goal

Confirm you can connect golden-set design, scorer choice, judge calibration, and the offline/online split — and spot when a passing eval is still lying to you.

Quiz

A provider silently swaps your model snapshot overnight. Your repo is unchanged, CI is green, quality drops 15%. Which layer is structurally capable of catching this, and why?

Quiz

You seed a golden set from the clean examples you used while building the feature. The suite scores 96% but users hit failures. What is the design error?

Quiz

Your feature returns a JSON object with a required `status` enum plus a free-text `rationale`. What is the right scoring strategy?
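To make the scenario concrete, here is a minimal sketch of a hybrid scorer: the structured part is checked in code, and only the free text is handed to a judge. The enum values in `ALLOWED_STATUS`, the `score_output` helper, and the `stub_judge` stand-in are all assumptions for illustration; the lesson does not specify them, and a real judge would be a calibrated model call.

```python
import json
from typing import Callable, Optional

# Hypothetical enum values; the lesson does not list the real ones.
ALLOWED_STATUS = {"approved", "rejected", "needs_review"}


def score_output(
    raw: str,
    expected_status: str,
    judge_rationale: Callable[[str], float],
) -> dict:
    """Check structure programmatically; send only the free text to a judge."""
    invalid: dict = {"valid": False, "status_ok": False, "rationale_score": None}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return invalid
    if not isinstance(obj, dict):
        return invalid

    status = obj.get("status")
    rationale: Optional[str] = obj.get("rationale")
    if not (isinstance(status, str) and status in ALLOWED_STATUS):
        return invalid
    if not isinstance(rationale, str):
        return invalid

    return {
        "valid": True,
        # Exact match: cheap, deterministic, no model call needed.
        "status_ok": status == expected_status,
        # Open-ended quality: the only part that needs a judge.
        "rationale_score": judge_rationale(rationale),
    }


def stub_judge(text: str) -> float:
    """Placeholder for a calibrated LLM judge (assumption, not a real scorer)."""
    return 1.0 if text.strip() else 0.0


result = score_output(
    '{"status": "approved", "rationale": "meets policy"}', "approved", stub_judge
)
print(result)
```

The judge is never called when the structure is invalid, which keeps the expensive and noisy scorer off cases that a parser can already fail.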

Quiz

An LLM-as-judge returns the identical verdict across ten repeated runs of the same case. A teammate calls it 'calibrated.' Are they right?
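The distinction at stake can be shown with two separate measurements. The sketch below uses made-up verdicts and human labels (not data from the lesson) and two hypothetical helpers, `self_consistency` and `human_agreement`, to show that a judge can be perfectly repeatable and still disagree with human labels.

```python
def self_consistency(verdicts: list[str]) -> float:
    """Fraction of repeated runs on ONE case that match the most common verdict."""
    most_common = max(set(verdicts), key=verdicts.count)
    return verdicts.count(most_common) / len(verdicts)


def human_agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of DIFFERENT cases where the judge matches the human label."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must have the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


# Ten repeated runs of the same case: the judge never changes its mind.
repeated_runs = ["pass"] * 10

# Illustrative data: judge verdicts vs. human labels on five distinct cases.
judge_verdicts = ["pass", "pass", "pass", "pass", "pass"]
human_labels = ["pass", "fail", "fail", "pass", "fail"]

print(self_consistency(repeated_runs))                # 1.0
print(human_agreement(judge_verdicts, human_labels))  # 0.4
```

Raw agreement is the simplest possible calibration measure; a chance-corrected statistic may be preferable when one label dominates.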

Quiz

In pairwise LLM-as-judge comparison, swapping which candidate answer appears first sometimes flips the verdict. What is this, and the mitigation?
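One common way to handle order sensitivity is to run the comparison in both orders and accept a winner only when the two runs agree. The sketch below assumes a hypothetical judge callable that returns `"first"` or `"second"`; the `debiased_compare` name and the two stub judges are illustrative, not part of any library.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_compare(prompt: str, a: str, b: str, judge: Judge) -> str:
    """Run the pairwise comparison in both orders; disagreement becomes a tie."""
    forward = judge(prompt, a, b)   # A shown first
    backward = judge(prompt, b, a)  # B shown first

    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"

    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # verdict flipped with order, so it is not trusted


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: always prefers the first slot."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Stub judge whose preference depends on content, not position."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_compare("q", "short", "a longer answer", always_first))    # tie
print(debiased_compare("q", "short", "a longer answer", prefers_longer))  # B
```

The second stub is order-stable but shows a different problem: it rewards length, which is the verbosity bias that order swapping alone does not remove.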

Quiz

What is the precise mechanism of a regression gate, and what is the senior discipline that makes it tighten over time rather than rot?
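As a rough sketch of what such a gate can look like: compare the candidate's scores to a stored baseline, fail the build if any metric drops beyond a tolerance, and grow the golden set whenever production surfaces a new failure. The function names, the metric names, the `tolerance` default, and the case schema are assumptions made for illustration.

```python
def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return the metrics that regressed beyond tolerance; an empty list means pass."""
    failures = []
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        # A missing metric counts as a failure rather than a silent pass.
        if cand_score is None or cand_score < base_score - tolerance:
            failures.append(metric)
    return failures


def add_production_failure(
    golden_set: list[dict], case_id: str, input_text: str, expected: str
) -> bool:
    """Append a production failure as a golden case; skip duplicates by id."""
    if any(case["id"] == case_id for case in golden_set):
        return False
    golden_set.append(
        {"id": case_id, "input": input_text, "expected": expected, "source": "production"}
    )
    return True


# Illustrative numbers only.
baseline = {"status_accuracy": 0.96, "rationale_quality": 0.90}
candidate = {"status_accuracy": 0.97, "rationale_quality": 0.85}

failed = regression_gate(baseline, candidate)
print(failed)  # ['rationale_quality']
# In CI, a non-empty list would translate to a non-zero exit code.
```

Each case added through `add_production_failure` makes the next baseline harder to pass, which is what keeps the gate tightening instead of going stale.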

Recap

The through-line of the unit is one pipeline: a golden set built from real traffic (coverage over count), scored with the cheapest honest method — programmatic where output has structure, a calibrated judge only for open-ended quality — then gated offline in CI and sampled online for drift. The two ways a green suite still lies are a dataset that no longer matches the live distribution and a judge you never validated against humans. Consistency is not accuracy; position, self-preference, and verbosity bias are real; and the discipline that keeps the gate honest is turning every production failure into a golden case the same day. Now when you see a green eval suite before a deploy, you know to ask: does this golden set still match live traffic, and was the judge calibrated against human labels — or am I looking at a number generator with a pass threshold?
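The online half of that pipeline can be sketched in a few lines: score a deterministic sample of live requests and compare the mean to the offline baseline. The sampling rate, the `max_drop` and `min_samples` thresholds, and both function names are assumptions for illustration, not values from the unit.

```python
import hashlib
from statistics import mean


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def drift_alert(
    online_scores: list[float],
    offline_baseline: float,
    max_drop: float = 0.05,
    min_samples: int = 30,
) -> bool:
    """Alert when sampled live quality falls well below the offline baseline."""
    if len(online_scores) < min_samples:
        return False  # too few samples to distinguish drift from noise
    return mean(online_scores) < offline_baseline - max_drop


# Illustrative scores: the offline suite says 0.95, sampled live traffic says 0.80.
live_scores = [0.8] * 50
print(drift_alert(live_scores, offline_baseline=0.95))  # True
```

Because this check runs on live outputs rather than on the repository, it can fire even when no commit has changed, which an offline CI gate cannot do.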
