awesome-everything RU
↑ Back to the climb

AI / LLM Integration

LLM evals: free-recall review

Crux Free-recall prompts across the evals unit. Answer each in your own words first, then reveal the model answer and compare.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min

Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the material stick.

Goal

Reconstruct the unit’s spine — why non-determinism breaks ordinary tests, how to build a golden set, when to use programmatic vs judge scoring, how to calibrate a judge, and the offline/online split — without looking back at the lesson.

Recall before you leave
  1. 01
    Why is 'I tried it and it worked' not testing for an LLM feature, and what does an eval assert instead of f(x) === y?
  2. 02
    How do you build a golden set that actually catches regressions, and what is the discipline that keeps it honest?
  3. 03
    When do you use a programmatic check vs an LLM-as-judge, and why prefer programmatic?
  4. 04
    Name the documented LLM-as-judge biases and the one step that makes a judge trustworthy.
  5. 05
    What's the difference between offline and online evaluation, and what does each catch that the other can't?
  6. 06
    Give the two distinct ways an eval suite can be green while real users hit failures, and the defense for each.
Recap

If you could reconstruct each answer from memory, you hold the unit’s spine: non-determinism means an eval scores a distribution of ‘good enough’ rather than asserting one exact value; a golden set is built from real traffic with coverage over count and fed by every production failure; programmatic checks score structure for free while a judge handles only open-ended quality; a judge is biased (position, self-preference, verbosity) and must be calibrated against human labels — consistency is not accuracy; and you gate offline in CI while sampling online for the drift offline can’t see. A green suite still lies two ways — a stale dataset and an uncalibrated judge — so defend against both.

Continue the climb ↑LLM evals: code and harness reading
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.