
Performance

Statistical baselines: why one run is not a measurement

Crux: A single benchmark run cannot distinguish a real 15% speedup from natural noise. Report median, p95, and p99 across multiple runs with confidence intervals; anything less is a guess wearing a number.

“This PR makes the service 15% faster”, posted to the review on the strength of one staging run. Is it real? Identical runs of the same code commonly differ by 5-10%, so without more runs and percentiles you cannot tell.

Why a single run is unreliable

The same workload run twice may differ by 5-10% from sources unrelated to the code:

  • GC timing differences (stop-the-world windows shift between runs)
  • OS scheduler decisions (which CPU core, when preempted)
  • Cache state at run start (L3 cold vs warm)
  • Transient network or disk contention
  • CPU frequency scaling on laptops (throttling under thermal load)

A “15% speedup” from one run cannot be distinguished from this noise. It might be a real win or a coincidence; one run cannot tell you which.
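One way to see the noise floor on your own machine is to time the same unchanged workload several times and look at how far the runs drift apart. The sketch below is a minimal illustration, not a benchmarking harness: `example_workload` is a hypothetical stand-in for the code under test, and "spread" is defined here, as an assumption, as (max - min) / median.

```python
import statistics
import time


def time_once(workload):
    """Return wall-clock seconds for a single call to workload()."""
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start


def noise_floor(workload, runs=10):
    """Time the same workload `runs` times; return (median, relative spread).

    Relative spread is (max - min) / median: a crude estimate of how far
    identical runs of identical code drift apart.
    """
    samples = [time_once(workload) for _ in range(runs)]
    median = statistics.median(samples)
    return median, (max(samples) - min(samples)) / median


def example_workload():
    # Hypothetical stand-in for the code under test.
    sum(i * i for i in range(200_000))


if __name__ == "__main__":
    median, spread = noise_floor(example_workload)
    print(f"median {median * 1e3:.2f} ms, run-to-run spread {spread:.1%}")
```

If the spread printed here is in the same range as the claimed improvement, a single before/after pair says nothing about the change.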

What to report instead

  • Minimum runs: 5-10 for macrobench, 30+ for short microbench.
  • Report: median, p95, and p99 — not mean. Mean is dominated by outliers; median is robust to them.
  • Include spread: standard deviation or confidence interval.
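  • What a defensible result looks like: “p99 latency dropped from 480 ms to 320 ms across five 5-minute runs, p50 from 60 ms to 55 ms, with 95% CI of ±15 ms on each measurement.” The p99 drop is far larger than the interval; the 5 ms p50 change is not, so p50 should be read as unchanged.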
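The reporting checklist above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than a prescribed method: percentiles use the nearest-rank definition, the interval is a percentile bootstrap over per-run values, and the names `percentile`, `bootstrap_ci`, and `summarize` are invented for this example. With only five runs the bootstrap interval is coarse, so treat it as a rough indication of spread.

```python
import math
import random
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def bootstrap_ci(values, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot))
    return estimates[lo_idx], estimates[hi_idx]


def summarize(runs):
    """Map each of p50/p95/p99 to (median across runs, 95% CI across runs).

    `runs` is a list of runs, each a list of latency samples in ms.
    """
    report = {}
    for name, q in (("p50", 50), ("p95", 95), ("p99", 99)):
        per_run = [percentile(run, q) for run in runs]
        report[name] = (
            statistics.median(per_run),
            bootstrap_ci(per_run, statistics.median),
        )
    return report


if __name__ == "__main__":
    # Synthetic long-tailed latencies in ms, standing in for real measurements.
    rng = random.Random(42)
    runs = [[rng.lognormvariate(4.0, 0.5) for _ in range(2000)] for _ in range(5)]
    for name, (value, (lo, hi)) in summarize(runs).items():
        print(f"{name}: {value:.1f} ms (95% CI {lo:.1f} to {hi:.1f} ms)")
```

Computing each percentile per run and then summarizing across runs keeps run-to-run variation visible instead of averaging it away in one pooled sample.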

A “10% speedup” is real only if median improved by ≥10% AND p95/p99 did not regress.
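That acceptance rule is simple enough to write down as a check. The function below is a hypothetical sketch: the dictionary keys, the name `is_real_speedup`, and the optional `tail_tolerance` parameter (an allowance for tail jitter, defaulting to none) are assumptions made for illustration.

```python
def is_real_speedup(before, after, min_gain=0.10, tail_tolerance=0.0):
    """Apply the rule: median improved by at least min_gain AND tails did not regress.

    before/after: dicts with 'p50', 'p95', 'p99' latencies (lower is better).
    tail_tolerance: fraction by which a tail percentile may grow and still
    count as "not regressed" (hypothetical knob; 0.0 means strict).
    """
    median_gain = (before["p50"] - after["p50"]) / before["p50"]
    tails_ok = all(
        after[key] <= before[key] * (1 + tail_tolerance)
        for key in ("p95", "p99")
    )
    return median_gain >= min_gain and tails_ok


if __name__ == "__main__":
    before = {"p50": 60.0, "p95": 200.0, "p99": 480.0}
    after = {"p50": 50.0, "p95": 190.0, "p99": 320.0}
    print(is_real_speedup(before, after))
```

The inputs should themselves be medians across several runs; feeding this check single-run numbers reintroduces the problem it is meant to guard against.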

| Reported as | Problem | Better alternative |
|---|---|---|
| Single run, single number | Indistinguishable from 5-10% natural noise | 5+ runs, report median + p99 + CI |
| Mean latency | One outlier skews the average | Median + p95/p99 percentiles |
| Staging-only measurement | Synthetic load may not match production | Staging + production canary confirmation |

Why mean is the wrong metric

Latency distributions have a long tail. A service that handles 99% of requests in 50 ms but 1% in 2000 ms has a mean around 70 ms — which hides the 2000 ms tail that users experience. Mean is dominated by outliers in either direction. Median (p50) is robust to outliers; p99 captures the tail experience.
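The arithmetic behind that example is 0.99 × 50 + 0.01 × 2000 = 69.5 ms. The short sketch below reproduces it on 1000 synthetic requests (an assumed sample size chosen to make the proportions exact). One detail worth noticing: because the slow requests are exactly 1% of traffic, they sit right at the p99 boundary, so a nearest-rank p99 still reads 50 ms here and the tail only appears at p99.9. That boundary effect is one reason to report more than one tail percentile.

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# 99% of requests take 50 ms, 1% take 2000 ms.
latencies_ms = [50] * 990 + [2000] * 10

mean = statistics.fmean(latencies_ms)     # 69.5: matches no real request
median = statistics.median(latencies_ms)  # 50.0: the typical request
p99 = percentile(latencies_ms, 99)        # 50: slow requests start just past p99
p999 = percentile(latencies_ms, 99.9)     # 2000: the tail users actually hit

print(f"mean={mean} ms, median={median} ms, p99={p99} ms, p99.9={p999} ms")
```

No request in this sample takes anywhere near 69.5 ms, which is why the mean describes neither the typical experience nor the worst one.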

The discipline: report numbers like a scientist, with distributions and uncertainty — not like a marketer, with single percentages.

Why this works

The discipline pays off most when a claimed “win” turns out to be noise. A team that catches a noise win before merging saves: the PR churn of reverting a “regression” later when production shows no improvement, the time spent writing a postmortem for a deploy that did nothing, and the loss of trust when the team learns perf numbers cannot be believed without methodology.

Quiz

A team reports 'this change made the service 15% faster' based on one staging run. What is the most important follow-up question?

Quiz

Why is mean latency the wrong metric for reporting service performance?

Quiz

You run a microbenchmark 1000 times and compute the mean. It shows a 12% improvement. What additional information do you need before claiming the improvement is real?

Recall before you leave
  1. Why is reporting a single 'X% speedup' number from one benchmark run a red flag, and what should be reported instead?
  2. What are the main sources of noise in performance measurements, and which can be controlled?
Recap

Performance measurements are distributions, not single numbers. The natural noise floor is 5-10% between identical runs of the same code, from GC timing, OS scheduler decisions, cache state, transient contention, and CPU frequency scaling. A single-run “15% speedup” is statistically indistinguishable from noise. Defensible reporting requires median, p95, and p99 across at least five runs with explicit confidence intervals. Mean latency is the wrong metric because it is dominated by outliers; median and tail percentiles describe what users actually experience. Reporting ranges instead of single values catches noise wins before they are merged and builds the team’s trust in performance claims.
