AI / LLM Integration
LLM evals: multiple-choice review
Six questions that cut across the whole unit. Each mirrors a call you make shipping a real LLM feature — not a definition to recite, but a tradeoff to weigh when the suite is green and you have to decide whether you actually believe it.
Confirm you can connect golden-set design, scorer choice, judge calibration, and the offline/online split — and spot when a passing eval is still lying to you.
A provider silently swaps your model snapshot overnight. Your repo is unchanged, CI is green, quality drops 15%. Which layer is structurally capable of catching this, and why?
You seed a golden set from the clean examples you used while building the feature. The suite scores 96% but users hit failures. What is the design error?
Your feature returns a JSON object with a required `status` enum plus a free-text `rationale`. What is the right scoring strategy?
An LLM-as-judge returns the identical verdict across ten repeated runs of the same case. A teammate calls it 'calibrated.' Are they right?
In pairwise LLM-as-judge comparison, swapping which candidate answer appears first sometimes flips the verdict. What is this, and the mitigation?
What is the precise mechanism of a regression gate, and what is the senior discipline that makes it tighten over time rather than rot?
The through-line of the unit is one pipeline: a golden set built from real traffic (coverage over count), scored with the cheapest honest method — programmatic where output has structure, a calibrated judge only for open-ended quality — then gated offline in CI and sampled online for drift. The two ways a green suite still lies are a dataset that no longer matches the live distribution and a judge you never validated against humans. Consistency is not accuracy; position, self-preference, and verbosity bias are real; and the discipline that keeps the gate honest is turning every production failure into a golden case the same day.