awesome-everything RU
↑ Back to the climb

AI / LLM Integration

LLM evals: multiple-choice review

Crux Multiple-choice synthesis across the evals unit — golden sets, programmatic vs judge scoring, judge calibration, offline gates vs online sampling, and the green-suite failure modes.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 13 min

Six questions that cut across the whole unit. Each mirrors a call you make shipping a real LLM feature — not a definition to recite, but a tradeoff to weigh when the suite is green and you have to decide whether you actually believe it.

Goal

Confirm you can connect golden-set design, scorer choice, judge calibration, and the offline/online split — and spot when a passing eval is still lying to you.

Quiz

A provider silently swaps your model snapshot overnight. Your repo is unchanged, CI is green, quality drops 15%. Which layer is structurally capable of catching this, and why?

Quiz

You seed a golden set from the clean examples you used while building the feature. The suite scores 96% but users hit failures. What is the design error?

Quiz

Your feature returns a JSON object with a required `status` enum plus a free-text `rationale`. What is the right scoring strategy?

Quiz

An LLM-as-judge returns the identical verdict across ten repeated runs of the same case. A teammate calls it 'calibrated.' Are they right?

Quiz

In pairwise LLM-as-judge comparison, swapping which candidate answer appears first sometimes flips the verdict. What is this, and the mitigation?

Quiz

What is the precise mechanism of a regression gate, and what is the senior discipline that makes it tighten over time rather than rot?

Recap

The through-line of the unit is one pipeline: a golden set built from real traffic (coverage over count), scored with the cheapest honest method — programmatic where output has structure, a calibrated judge only for open-ended quality — then gated offline in CI and sampled online for drift. The two ways a green suite still lies are a dataset that no longer matches the live distribution and a judge you never validated against humans. Consistency is not accuracy; position, self-preference, and verbosity bias are real; and the discipline that keeps the gate honest is turning every production failure into a golden case the same day.

Continue the climb ↑LLM evals: free-recall review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.