AI / LLM Integration AI · 07 · 08

LLM evals: free-recall review

Free-recall prompts across the evals unit. Answer each in your own words first, then reveal the model answer and compare.

AI Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the material stick.

Goal

Reconstruct the unit’s spine — why non-determinism breaks ordinary tests, how to build a golden set, when to use programmatic vs judge scoring, how to calibrate a judge, and the offline/online split — without looking back at the lesson.

Recall before you leave

01
Why is 'I tried it and it worked' not testing for an LLM feature, and what does an eval assert instead of f(x) === y?
02
How do you build a golden set that actually catches regressions, and what is the discipline that keeps it honest?
03
When do you use a programmatic check vs an LLM-as-judge, and why prefer programmatic?
04
Name the documented LLM-as-judge biases and the one step that makes a judge trustworthy.
05
What's the difference between offline and online evaluation, and what does each catch that the other can't?
06
Give the two distinct ways an eval suite can be green while real users hit failures, and the defense for each.

Recap

If you could reconstruct each answer from memory, you hold the unit’s spine: non-determinism means an eval scores a distribution of ‘good enough’ rather than asserting one exact value; a golden set is built from real traffic with coverage over count and fed by every production failure; programmatic checks score structure for free while a judge handles only open-ended quality; a judge is biased (position, self-preference, verbosity) and must be calibrated against human labels — consistency is not accuracy; and you gate offline in CI while sampling online for the drift offline can’t see. A green suite still lies two ways — a stale dataset and an uncalibrated judge — so defend against both. Now when you see a teammate skip calibration because “the judge looked consistent,” you know exactly where the trust breaks down — and what to check first.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.