LLM evals: code and harness reading

Read real eval-harness code, a judge prompt and its scoring, a golden-set diff, and a CI regression gate — then pick the behaviour or the highest-leverage fix a senior engineer would make first.

Evals live in code: a harness that scores cases, a judge prompt that returns a verdict, a diff against the last release, and a CI gate that blocks the merge. Read each one and choose what a senior engineer fixes first.

Goal

Practise the loop you run when building an eval suite: read the harness, spot where the score lies or the judge is mis-prompted, and reach for the structural fix before tuning thresholds.

Snippet 1 — the eval harness

def run_eval(cases, model):
    passed = 0
    for c in cases:
        out = model(c["input"])
        if out.strip() == c["expected"].strip():   # exact-match scorer
            passed += 1
    return passed / len(cases)                      # single accuracy number

Quiz

This harness grades a free-text Q&A feature with exact string match and reports one accuracy number. What is the highest-leverage fix?
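
One possible direction for Snippet 1, as a minimal sketch rather than the lesson's reference answer: swap the exact-match check for a scorer that tolerates rewording, and report per-category results plus the failing cases instead of a single number. The token-overlap F1 scorer here is a stand-in chosen for illustration (a rubric-based judge could be plugged in through the same `scorer` parameter), and the `category` key, the `threshold` of 0.5, and the result layout are all assumptions that do not appear in the original harness.

```python
import re
from collections import Counter, defaultdict

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def token_f1(pred, ref):
    """Token-overlap F1 between prediction and reference, in [0, 1]."""
    p, r = normalize(pred).split(), normalize(ref).split()
    if not p or not r:
        return float(p == r)  # both empty counts as a match
    common = sum((Counter(p) & Counter(r)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def run_eval(cases, model, scorer=token_f1, threshold=0.5):
    """Return overall and per-category pass rates plus the failing cases."""
    totals, passes, failures = defaultdict(int), defaultdict(int), []
    for c in cases:
        out = model(c["input"])
        score = scorer(out, c["expected"])
        cat = c.get("category", "uncategorised")  # hypothetical field
        totals[cat] += 1
        if score >= threshold:
            passes[cat] += 1
        else:
            failures.append({"input": c["input"], "output": out, "score": score})
    n = sum(totals.values())
    return {
        "overall": sum(passes.values()) / n if n else 0.0,
        "by_category": {k: passes[k] / totals[k] for k in totals},
        "failures": failures,
    }
```

With this shape, an answer such as "The capital is Paris." scored against the reference "Paris is the capital" passes, whereas the exact-match harness would count it as a failure. The returned failures list also makes it possible to inspect what went wrong rather than only how often.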

Snippet 2 — the judge prompt and scoring

JUDGE = """Rate the assistant answer from 1-10 for overall quality.
Question: {q}
Answer: {a}
Score:"""

def judge_score(q, a, judge_model):
    text = judge_model(JUDGE.format(q=q, a=a))   # e.g. "I'd say about an 8/10 because..."
    return int(text.strip()[0])                  # take the first char as the score

Quiz

Two things make this judge untrustworthy as a CI gate. Which pair is correct?
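
A hedged sketch of how the judge in Snippet 2 could be tightened: give it an explicit rubric and a reference answer, ask for a machine-readable verdict, and refuse to guess when the reply does not parse. Taking the first character of the reply breaks on text like "I'd say about an 8/10" and turns "10" into 1, so the parser below returns `None` instead. The rubric wording, the `ref` parameter, and the names `judge_verdict` and `agreement` are hypothetical and not part of the original snippet.

```python
import json

JUDGE = """You are grading an answer against a reference.
Rubric:
- PASS: the answer states the same facts as the reference and adds nothing false.
- FAIL: the answer contradicts the reference, omits its key fact, or adds a false claim.
Question: {q}
Reference answer: {ref}
Assistant answer: {a}
Reply with JSON only: {{"verdict": "PASS" or "FAIL", "reason": "<one sentence>"}}"""

def judge_verdict(q, a, ref, judge_model):
    """Return "PASS", "FAIL", or None when the judge reply cannot be parsed."""
    text = judge_model(JUDGE.format(q=q, a=a, ref=ref))
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    verdict = data.get("verdict") if isinstance(data, dict) else None
    return verdict if verdict in ("PASS", "FAIL") else None

def agreement(judge_verdicts, human_labels):
    """Fraction of parseable judge verdicts that match the human labels."""
    pairs = [(j, h) for j, h in zip(judge_verdicts, human_labels) if j is not None]
    if not pairs:
        return 0.0
    return sum(j == h for j, h in pairs) / len(pairs)
```

Running `agreement` on a small human-labelled sample before wiring the judge into CI gives a rough check on whether its verdicts can be trusted. Unparseable replies are excluded from the agreement figure here, so their count is worth tracking separately.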

Snippet 3 — the golden-set diff

  eval run: candidate vs main (golden set, 180 cases)
  overall:        main 0.91  ->  candidate 0.90   (-0.01)   PASS (threshold -0.02)
+ category json_format:   main 0.98 -> candidate 0.99  (+0.01)
- category refusals:      main 0.88 -> candidate 0.61  (-0.27)
- category long_context:  main 0.84 -> candidate 0.71  (-0.13)

Quiz

The aggregate moved -0.01 and PASSED the gate, but the per-category diff tells a different story. What is the correct read?
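
To make the per-category reading concrete, the sketch below applies a drop threshold to each category rather than to the aggregate, using the three category scores printed in Snippet 3. Reusing the aggregate threshold of 0.02 per category is an assumption for illustration, as is the function name `category_regressions`; the code also assumes both runs report the same categories.

```python
def category_regressions(main, candidate, max_drop=0.02):
    """Return {category: delta} for categories that dropped by more than max_drop."""
    flagged = {}
    for cat, base in main.items():
        delta = round(candidate[cat] - base, 4)
        if delta < -max_drop:
            flagged[cat] = delta
    return flagged

# Category scores copied from the diff in Snippet 3.
main = {"json_format": 0.98, "refusals": 0.88, "long_context": 0.84}
candidate = {"json_format": 0.99, "refusals": 0.61, "long_context": 0.71}

print(category_regressions(main, candidate))
# {'refusals': -0.27, 'long_context': -0.13}
```

Both flagged drops are far larger than the -0.01 movement in the overall score, which is why a threshold on the aggregate alone lets them through.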

Snippet 4 — the CI regression gate

# .github/workflows/eval.yml
on: [pull_request]
jobs:
  eval:
    steps:
      - run: python run_eval.py --set golden --out score.json
      - run: |
          SCORE=$(jq .overall score.json)
          echo "eval score: $SCORE"   # logs the number, job always exits 0

Quiz

This 'gate' runs the eval on every PR and logs the score. Why is it not actually a gate, and what is the fix?
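
A minimal sketch of a gate script that could replace the logging step, assuming a stored baseline to compare against. The file `baseline.json`, the `by_category` key, and the `max_drop` value of 0.02 are hypothetical; Snippet 4 only shows that `score.json` carries an `overall` field.

```python
import json

def gate(score, baseline, max_drop=0.02):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if score["overall"] < baseline["overall"] - max_drop:
        failures.append(
            f"overall {baseline['overall']:.2f} -> {score['overall']:.2f}"
        )
    # Check every baseline category so one bad category cannot hide in the mean.
    for cat, base in baseline.get("by_category", {}).items():
        cand = score.get("by_category", {}).get(cat)
        if cand is None or cand < base - max_drop:
            cand_str = "missing" if cand is None else f"{cand:.2f}"
            failures.append(f"{cat} {base:.2f} -> {cand_str}")
    return failures

def main(score_path="score.json", baseline_path="baseline.json"):
    """Load both files, print regressions, and return a process exit code."""
    with open(score_path) as f:
        score = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    failures = gate(score, baseline)
    for msg in failures:
        print(f"REGRESSION: {msg}")
    return 1 if failures else 0

# When run as the CI step, end the script with:
# raise SystemExit(main())
```

The exit code is what turns the step into a gate: a non-zero status fails the job, and the merge is blocked only if the repository requires that check to pass.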

Recap

An eval suite is read in code: the scorer must match the output type (exact match silently fails free text); a judge needs a parseable verdict, an explicit rubric and reference, and human-label validation before you trust it; a golden-set diff must be read per category because aggregates launder regressions; and a CI gate is only a gate if it compares to a baseline and exits non-zero to block the merge. Build the harness so the number can’t lie, then gate on the worst category, not the mean.
