Statistical baselines: why one run is not a measurement

A single benchmark run cannot distinguish a real 15% speedup from natural noise. Report median, p95, p99 across multiple runs with confidence intervals — anything less is a guess wearing a number.

PERF Middle ◷ 12 min

“This PR makes the service 15% faster” — based on one staging run, posted to the review. Is it real? The noise floor in most systems is 5-10% between identical runs. Without more runs and percentiles, you cannot tell.

Why a single run is unreliable

Ask yourself: if your colleague claimed a 15% speedup based on one staging run, would you trust it enough to ship? The answer reveals why statistical rigour is not optional here.

The same workload run twice may differ by 5-10% from sources unrelated to the code:

GC timing differences (stop-the-world windows shift between runs)
OS scheduler decisions (which CPU core, when preempted)
Cache state at run start (L3 cold vs warm)
Transient network or disk contention
CPU frequency scaling on laptops (throttling under thermal load)

A “15% speedup” from one run cannot be distinguished from this noise. It might be a real win or it might be a coincidence; from a single run you cannot tell which.
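One way to see the noise floor for yourself is to time the same code several times and look at the spread. The sketch below is a minimal illustration, not a benchmarking harness: `workload` is a hypothetical stand-in for the code under test, and `noise_floor` is just one simple way to express run-to-run spread.

```python
import statistics
import time


def workload() -> int:
    # Hypothetical CPU-bound stand-in for the code under test.
    return sum(i * i for i in range(200_000))


def time_runs(fn, runs: int = 10) -> list[float]:
    """Time `fn` `runs` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def noise_floor(durations: list[float]) -> float:
    """Spread of identical runs, expressed as (max - min) / median."""
    return (max(durations) - min(durations)) / statistics.median(durations)


if __name__ == "__main__":
    durations = time_runs(workload)
    print(f"median: {statistics.median(durations) * 1e3:.2f} ms")
    print(f"run-to-run spread: {noise_floor(durations):.1%}")
```

If the spread between identical runs is already several percent, a single before/after comparison of similar size says very little.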

What to report instead

Minimum runs: 5-10 for macrobench, 30+ for short microbench.
Report: median, p95, and p99 — not mean. Mean is dominated by outliers; median is robust to them.
Include spread: standard deviation or confidence interval.
What a defensible result looks like: “p99 latency dropped from 480 ms to 320 ms across five 5-minute runs, p50 from 60 ms to 55 ms, with 95% CI of ±15 ms on each measurement.”
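The checklist above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions: each run is a list of latencies in milliseconds, percentiles use the nearest-rank definition, and the 95% interval is a percentile bootstrap over the per-run values (crude with only five runs, but it makes the uncertainty visible). The names `percentile`, `bootstrap_ci`, and `summarize` are hypothetical helpers, and the latencies in the demo are synthetic.

```python
import random
import statistics


def percentile(samples, p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100)
    return ordered[rank - 1]


def bootstrap_ci(values, resamples: int = 2000, seed: int = 0):
    """95% percentile-bootstrap interval for the median of `values`."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(resamples)
    )
    low = medians[resamples * 25 // 1000]
    high = medians[resamples * 975 // 1000 - 1]
    return low, high


def summarize(runs):
    """For p50/p95/p99: median across runs plus a bootstrap interval."""
    report = {}
    for p in (50, 95, 99):
        per_run = [percentile(run, p) for run in runs]
        low, high = bootstrap_ci(per_run)
        report[f"p{p}"] = (statistics.median(per_run), low, high)
    return report


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic long-tailed latencies: five runs of 1000 requests each.
    runs = [[rng.lognormvariate(4.0, 0.6) for _ in range(1000)] for _ in range(5)]
    for name, (center, low, high) in summarize(runs).items():
        print(f"{name}: {center:.1f} ms (95% CI {low:.1f} to {high:.1f} ms)")
```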

A “10% speedup” claim holds only if the median improved by at least 10%, the improvement is larger than the run-to-run spread, and p95/p99 did not regress.
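The median-and-tail part of this rule can be written down as a small check. The sketch below assumes two lists of latency samples (baseline and candidate) pooled across runs; `speedup_holds` is a hypothetical helper, and it does not test whether the difference exceeds the noise floor, so it belongs alongside confidence intervals rather than in place of them.

```python
def percentile(samples, p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100)
    return ordered[rank - 1]


def speedup_holds(baseline, candidate, claimed: float = 0.10) -> bool:
    """True if the median improved by at least `claimed` and p95/p99 did not regress."""
    base_p50 = percentile(baseline, 50)
    cand_p50 = percentile(candidate, 50)
    median_gain = (base_p50 - cand_p50) / base_p50
    tails_ok = all(
        percentile(candidate, p) <= percentile(baseline, p) for p in (95, 99)
    )
    return median_gain >= claimed and tails_ok


if __name__ == "__main__":
    baseline = list(range(1, 101))  # synthetic latencies, 1..100 ms
    faster_everywhere = [v * 0.8 for v in baseline]
    worse_tail = [v * 0.8 for v in baseline[:90]] + [v * 2 for v in baseline[90:]]
    print(speedup_holds(baseline, faster_everywhere))
    print(speedup_holds(baseline, worse_tail))
```

The second candidate improves the median by the same amount as the first, but its slowest requests got slower, so the check rejects it.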

A defensible claim reports both p50 and p99 across runs. Here the real win lives in the tail (p99 480→320 ms); the median moves only 60→55, so a median-only report would have buried it.

Reported as	Problem	Better alternative
Single run, single number	Indistinguishable from 5-10% natural noise	5+ runs, report median + p99 + CI
Mean latency	One outlier skews the average	Median + p95/p99 percentiles
Staging-only measurement	Synthetic load may not match production	Staging + production canary confirmation

Why mean is the wrong metric

Latency distributions have a long tail. A service that handles 99% of requests in 50 ms but 1% in 2000 ms has a mean around 70 ms — which hides the 2000 ms tail that users experience. Mean is dominated by outliers in either direction. Median (p50) is robust to outliers; p99 captures the tail experience.
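The arithmetic behind that example is easy to check. The sketch below builds the distribution from the paragraph above as 1000 synthetic requests (990 at 50 ms, 10 at 2000 ms) and compares mean and median; the request count is an assumption chosen only to make the 99%/1% split exact.

```python
import statistics

# 99% of requests at 50 ms, 1% at 2000 ms (synthetic, 1000 requests).
latencies_ms = [50] * 990 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)      # (990 * 50 + 10 * 2000) / 1000
median_ms = statistics.median(latencies_ms)
slowest_ms = max(latencies_ms)

# How many requests actually took roughly the "average" time?
near_mean = sum(1 for v in latencies_ms if abs(v - mean_ms) <= 10)

print(f"mean:    {mean_ms} ms")
print(f"median:  {median_ms} ms")
print(f"slowest: {slowest_ms} ms")
print(f"requests within 10 ms of the mean: {near_mean}")
```

The mean of 69.5 ms describes no request at all: the typical request takes 50 ms, the slow ones take 2000 ms, and nothing lands near 70 ms.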

The discipline: report numbers like a scientist, with distributions and uncertainty — not like a marketer, with single percentages.

Why this works

The discipline pays off most when a claimed “win” turns out to be noise. A team that catches a noise win before merging avoids three costs: the churn of reverting the change later when production shows no improvement, the time spent writing a postmortem for a deploy that did nothing, and the loss of trust when the team learns that perf numbers cannot be believed without methodology.

Quiz

A team reports 'this change made the service 15% faster' based on one staging run. What is the most important follow-up question?

Quiz

Why is mean latency the wrong metric for reporting service performance?

Quiz

You run a microbenchmark 1000 times and compute the mean. It shows a 12% improvement. What additional information do you need before claiming the improvement is real?

Latency is a long-tailed distribution. The mean is dragged right by outliers; report p50 (typical) and p99 (tail), not the mean.

Recall before you leave

01
Why is reporting a single 'X% speedup' number from one benchmark run a red flag, and what should be reported instead?
02
What are the five main sources of noise in performance measurements, and which can be controlled?

Recap

Performance measurements are distributions, not single numbers. The natural noise floor is 5-10% between identical runs of the same code, from GC timing, OS scheduling, cache state, transient contention, and CPU frequency scaling. A single-run “15% speedup” is statistically indistinguishable from noise. Defensible reporting requires median, p95, and p99 across at least five runs with explicit confidence intervals. Mean latency is the wrong metric because it is dominated by outliers; median and tail percentiles correctly describe user experience. The discipline of reporting ranges instead of single values catches noise wins before they are merged and builds the team’s trust in performance claims. Now when you see a PR description claiming a speedup from one run, you know the first question: how wide is the confidence interval across multiple runs?
