Crux: Read real eval-harness code, a judge prompt and its scoring, a golden-set diff, and a CI regression gate, then pick the behaviour or the highest-leverage fix a senior engineer would make first.
Evals live in code: a harness that scores cases, a judge prompt that returns a verdict, a diff against the last release, and a CI gate that blocks the merge. Read each one and choose what a senior engineer fixes first.
Goal
Practise the loop you run when building an eval suite: read the harness, spot where the score lies or the judge is mis-prompted, and reach for the structural fix before tuning thresholds.
Snippet 1 — the eval harness
def run_eval(cases, model):
    passed = 0
    for c in cases:
        out = model(c["input"])
        if out.strip() == c["expected"].strip():  # exact-match scorer
            passed += 1
    return passed / len(cases)  # single accuracy number
Quiz
This harness grades a free-text Q&A feature with exact string match and reports one accuracy number. What is the highest-leverage fix?
Heads-up Retrying doesn't fix the scorer — exact match still rejects valid paraphrases on every attempt. It just multiplies cost. The defect is the scoring method, not the sample count per case.
Heads-up Moving the threshold on a broken scorer hides the problem instead of measuring quality. A judge or programmatic check that actually reflects correctness is the fix, not a looser cutoff on a meaningless number.
Heads-up The shape is fine; the scorer is wrong for the output type, and one global accuracy hides which category regressed. Per-category scores plus an output-appropriate scorer are what make it actionable.
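The structural fix described above can be sketched as a harness that takes the scorer as a parameter and reports one score per category. This is a minimal illustration, not the course's reference solution: the `category` field on each case and the `normalized_match` scorer are assumptions, and `normalized_match` is only a stand-in for whatever judge or programmatic check actually suits the output type.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def normalized_match(out: str, expected: str) -> bool:
    """Stand-in scorer (assumption): ignores case and whitespace differences.

    For free text you would swap in a validated judge or a programmatic check.
    """
    return " ".join(out.lower().split()) == " ".join(expected.lower().split())


def run_eval(
    cases: List[dict],
    model: Callable[[str], str],
    scorer: Callable[[str, str], bool] = normalized_match,
) -> Dict[str, float]:
    """Return the pass rate per category instead of one global accuracy."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for c in cases:
        # "category" is a hypothetical field; untagged cases are grouped together.
        cat = c.get("category", "uncategorized")
        total[cat] += 1
        if scorer(model(c["input"]), c["expected"]):
            passed[cat] += 1
    return {cat: passed[cat] / total[cat] for cat in total}
```

Because the scorer is injected, changing how an answer is graded no longer means rewriting the loop, and a drop in one category stays visible instead of being averaged away.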
Snippet 2 — the judge prompt and scoring
JUDGE = """Rate the assistant answer from 1-10 for overall quality.
Question: {q}
Answer: {a}
Score:"""

def judge_score(q, a, judge_model):
    text = judge_model(JUDGE.format(q=q, a=a))  # e.g. "I'd say about an 8/10 because..."
    return int(text.strip()[0])  # take the first char as the score
Quiz
Two things make this judge untrustworthy as a CI gate. Which pair is correct?
Heads-up Resolution isn't the issue: a finer scale on an unparseable, rubric-less, unvalidated judge is no more trustworthy. The defects are output parsing and a vague rubric, not the range.
Heads-up Concurrency is a performance concern, not a correctness one. The judge's verdict can be wrong or unparseable whether the call is sync or async.
Heads-up A judge is another stochastic model with documented biases, not ground truth. An uncalibrated, free-form rating parsed by its first character is exactly the 'number generator mistaken for a test' trap.
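To make the parsing defect concrete: with first-character parsing, a reply of "10/10" is read as 1, and a reply starting with "I'd say" raises `ValueError`. One possible shape for a more trustworthy judge is sketched below, assuming the judge can be asked for a JSON verdict and that each case carries a reference answer. The rubric wording, the `ref` parameter, and the PASS/FAIL schema are all illustrative assumptions, not the lesson's prescribed prompt.

```python
import json
from typing import Callable, Optional

# Hypothetical prompt: explicit rubric, a reference answer, and a fixed output schema.
JUDGE = """You are grading an answer against a reference.
Rubric: PASS only if the answer states the same facts as the reference
and adds nothing that contradicts it; otherwise FAIL.
Question: {q}
Reference: {ref}
Answer: {a}
Reply with JSON only: {{"verdict": "PASS" or "FAIL", "reason": "<one sentence>"}}"""


def judge_verdict(
    q: str, a: str, ref: str, judge_model: Callable[[str], str]
) -> Optional[bool]:
    """Return True/False for a parseable verdict, or None if the reply is unusable.

    None is deliberately distinct from False so that unparseable replies can be
    counted and investigated rather than silently scored as failures.
    """
    text = judge_model(JUDGE.format(q=q, a=a, ref=ref))
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    verdict = data.get("verdict")
    if verdict == "PASS":
        return True
    if verdict == "FAIL":
        return False
    return None
```

A parseable verdict only fixes the first defect. The judge still needs to be compared against human labels before its output is used to block a merge.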
Snippet 3 — the golden-set diff
eval run: candidate vs main (golden set, 180 cases)
  overall:                main 0.91 -> candidate 0.90  (-0.01)  PASS (threshold -0.02)
+ category json_format:   main 0.98 -> candidate 0.99  (+0.01)
- category refusals:      main 0.88 -> candidate 0.61  (-0.27)
- category long_context:  main 0.84 -> candidate 0.71  (-0.13)
Quiz
The aggregate moved -0.01 and PASSED the gate, but the per-category diff tells a different story. What is the correct read?
Heads-up A -0.27 drop in a real category is a regression users will hit, regardless of the mean. Aggregating over categories is exactly how a green gate ships a broken feature.
Heads-up One category improving doesn't offset two collapsing. Net-positive aggregates routinely hide severe per-category regressions; you must inspect the distribution, not the mean.
Heads-up A tighter aggregate threshold still launders the categories together — the refusals drop could be re-masked by a json_format gain. Gate per category to surface it.
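Gating per category can be sketched as a function that compares each category's score against main and lists every drop beyond a tolerance. The function name is hypothetical, and reusing the aggregate's 0.02 tolerance per category is an assumption; in practice each category may warrant its own limit.

```python
from typing import Dict, List, Tuple


def category_regressions(
    main: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.02,  # assumed: same tolerance as the aggregate gate
) -> List[Tuple[str, float]]:
    """Return (category, delta) for every category that dropped more than max_drop.

    Results are sorted worst-first. A category missing from the candidate run
    is scored as 0.0 (an assumption) so it cannot silently disappear.
    """
    regressions = []
    for cat, base in main.items():
        delta = candidate.get(cat, 0.0) - base
        if delta < -max_drop:
            regressions.append((cat, round(delta, 4)))
    return sorted(regressions, key=lambda pair: pair[1])


# Scores taken from the diff above.
main_scores = {"json_format": 0.98, "refusals": 0.88, "long_context": 0.84}
candidate_scores = {"json_format": 0.99, "refusals": 0.61, "long_context": 0.71}
failing = category_regressions(main_scores, candidate_scores)
```

On the numbers from the diff, `failing` contains refusals and long_context, even though the overall score stayed within its threshold.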
This CI 'gate' runs the eval on every PR and logs the score. Why is it not actually a gate, and what is the fix?
Heads-up Running post-merge is worse — the regression already landed. You want it on the PR so the bad change is blocked before merge. The defect is that it never fails, not its trigger.
Heads-up jq reads it fine — the score is logged. The job exits 0 regardless because there is no comparison and no failing exit code, so nothing blocks the merge.
Heads-up Manual review of a logged number doesn't scale and isn't enforced — reviewers miss it under load. A gate must fail the build automatically on a regression, which this never does.
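What turns a logged score into a gate is a comparison against a baseline and a non-zero exit code. A minimal sketch of that follows, assuming the baseline and candidate runs are each stored as a JSON object mapping category name to score. The file layout, function names, and the 0.02 tolerance are assumptions for illustration.

```python
import json
from typing import Dict, List


def gate(
    baseline: Dict[str, float], candidate: Dict[str, float], max_drop: float = 0.02
) -> int:
    """Return a process exit code: 0 if the worst category held, 1 otherwise."""
    if not baseline:
        # Without a baseline there is nothing to compare against, so fail closed.
        print("no baseline scores found")
        return 1
    # Find the category with the largest drop; missing categories score 0.0.
    worst_cat, worst_delta = min(
        ((cat, candidate.get(cat, 0.0) - base) for cat, base in baseline.items()),
        key=lambda pair: pair[1],
    )
    print(f"worst category: {worst_cat} ({worst_delta:+.2f})")
    return 1 if worst_delta < -max_drop else 0


def main(argv: List[str]) -> int:
    """argv holds two paths: baseline scores JSON, then candidate scores JSON."""
    baseline_path, candidate_path = argv
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)
    return gate(baseline, candidate)
```

In a CI job the script would end with `sys.exit(main(sys.argv[1:]))`, so that a regression produces a failing exit status and the merge is blocked rather than merely logged.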
Recap
An eval suite is read in code: the scorer must match the output type (exact match silently fails free text); a judge needs a parseable verdict, an explicit rubric and reference, and human-label validation before you trust it; a golden-set diff must be read per category because aggregates launder regressions; and a CI gate is only a gate if it compares to a baseline and exits non-zero to block the merge. Build the harness so the number can’t lie, then gate on the worst category, not the mean.
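The human-label validation step in the recap can be made measurable by comparing judge verdicts with human verdicts on the same cases. The sketch below computes Cohen's kappa for binary PASS/FAIL labels, which corrects raw agreement for the agreement expected by chance. Representing verdicts as booleans is an assumption, and the level of kappa that counts as trustworthy is a team decision this sketch does not settle.

```python
from typing import Sequence


def cohens_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Cohen's kappa for two binary raters over the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's PASS rate.
    """
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    p_o = sum(j == h for j, h in zip(judge, human)) / n
    pj = sum(judge) / n  # judge PASS rate
    ph = sum(human) / n  # human PASS rate
    p_e = pj * ph + (1 - pj) * (1 - ph)
    if p_e == 1.0:
        # Both raters gave one identical label to every case; kappa is undefined.
        # Returning 0.0 (a choice, not a standard) treats that as no evidence.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


judge_labels = [True, True, False, False]
human_labels = [True, True, False, True]
kappa = cohens_kappa(judge_labels, human_labels)
```

Here the judge and the human agree on three of four cases, but half of that agreement is expected by chance, so kappa comes out at 0.5 rather than 0.75.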