AI / LLM Integration
LLM evals: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the material stick.
Reconstruct the unit’s spine — why non-determinism breaks ordinary tests, how to build a golden set, when to use programmatic vs judge scoring, how to calibrate a judge, and the offline/online split — without looking back at the lesson.
- 01 Why is 'I tried it and it worked' not testing for an LLM feature, and what does an eval assert instead of f(x) === y?
- 02 How do you build a golden set that actually catches regressions, and what is the discipline that keeps it honest?
- 03 When do you use a programmatic check vs an LLM-as-judge, and why prefer programmatic?
- 04 Name the documented LLM-as-judge biases and the one step that makes a judge trustworthy.
- 05 What's the difference between offline and online evaluation, and what does each catch that the other can't?
- 06 Give the two distinct ways an eval suite can be green while real users hit failures, and the defense for each.
If you could reconstruct each answer from memory, you hold the unit’s spine. Non-determinism means an eval scores a distribution of ‘good enough’ outputs rather than asserting one exact value. A golden set is built from real traffic, favors coverage over count, and is fed by every production failure. Programmatic checks score structure essentially for free, while a judge handles only open-ended quality. A judge is biased (position, self-preference, verbosity) and must be calibrated against human labels, because consistency is not accuracy. You gate offline in CI and sample online to catch the drift that offline evals can’t see. A green suite can still lie in two ways, through a stale dataset and through an uncalibrated judge, so defend against both.
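Two of these ideas are easier to hold onto once seen in code: scoring a pass rate over repeated samples instead of asserting one exact value, and checking a judge against human labels. The sketch below is illustrative only. Every name in it (`GoldenCase`, `isValidTicketJson`, `passRate`, `judgeAgreement`, the ticket JSON shape, the stub model) is an assumption made up for this example and does not come from the lesson or from any library.

```typescript
// All names and shapes here are hypothetical, chosen for illustration.
type GoldenCase = { id: string; input: string };

// Programmatic check: cheap and deterministic, scores structure only.
// Assumes the feature should emit JSON with a string `category` and a numeric `priority`.
function isValidTicketJson(output: string): boolean {
  try {
    const parsed: unknown = JSON.parse(output);
    if (typeof parsed !== "object" || parsed === null) return false;
    const record = parsed as Record<string, unknown>;
    return typeof record.category === "string" && typeof record.priority === "number";
  } catch {
    return false; // JSON.parse throws on malformed output
  }
}

// Score a distribution: sample several outputs per golden case and report
// the fraction that pass, rather than asserting f(x) === y once.
function passRate(
  cases: GoldenCase[],
  generate: (input: string, sample: number) => string,
  check: (output: string) => boolean,
  samplesPerCase: number,
): number {
  let passed = 0;
  let total = 0;
  for (const c of cases) {
    for (let s = 0; s < samplesPerCase; s++) {
      if (check(generate(c.input, s))) passed++;
      total++;
    }
  }
  return total === 0 ? 0 : passed / total;
}

// Calibration: fraction of items where the judge's verdict matches the human label.
function judgeAgreement(judge: boolean[], human: boolean[]): number {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error("judge and human labels must be non-empty and aligned");
  }
  let agree = 0;
  for (let i = 0; i < judge.length; i++) {
    if (judge[i] === human[i]) agree++;
  }
  return agree / judge.length;
}

// Usage with a stub standing in for a real, non-deterministic model call.
const demoCases: GoldenCase[] = [
  { id: "refund", input: "I was charged twice" },
  { id: "login", input: "I cannot sign in" },
];
const stubModel = (_input: string, sample: number): string =>
  sample < 3 ? '{"category":"billing","priority":2}' : "Sure! Here is the ticket:";

const rate = passRate(demoCases, stubModel, isValidTicketJson, 4);
const gateThreshold = 0.7; // hypothetical CI threshold
console.log(rate >= gateThreshold ? "gate: pass" : "gate: fail", rate);
```

A judge that answers "pass" to everything is perfectly consistent, yet `judgeAgreement` against human labels would expose it, which is the sense in which consistency is not accuracy. Raw agreement is the simplest possible measure; a chance-corrected statistic may be a better choice when the labels are heavily imbalanced.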