

Profiler history and microbenchmark pitfalls: Knuth to GWP

Crux: From Amdahl 1967 to Google-Wide Profiling 2010, the intellectual lineage explains why modern profiling works the way it does. Naive microbenchmarks are wrong by default: JIT warmup, dead-code elimination, and constant folding all corrupt the measurement.

JMH’s default warmup is 5 iterations. The warmup is not there to make the benchmark faster; it is there because without it the benchmark measures the interpreter, not the JIT-compiled code that runs in production. Most teams that write microbenchmarks from scratch make this mistake on their first run.

The intellectual lineage: Amdahl to GWP

The profile-first discipline has a 60-year lineage.

Gene Amdahl, 1967 — “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” Showed that the serial fraction of a workload bounds the speedup any number of parallel processors can deliver. The original argument for measuring the fraction before optimising.

Donald Knuth, 1974 — “Structured Programming with Go To Statements”, ACM Computing Surveys. Introduced the “premature optimisation is the root of all evil” framing, with the full sentence naming the 97/3 split: most code does not matter; the critical fraction does, and identifying it is the engineering work.

John Gustafson, 1988 — “Reevaluating Amdahl’s Law.” Argued that as problem sizes scale, the parallelisable fraction grows, so Amdahl’s pessimism understates achievable speedup on real workloads. Gustafson’s law: scaled_speedup = s + p × N, where N is the number of processors and the workload grows with N.

Google-Wide Profiling (GWP), 2010. Ren et al., IEEE Micro. Showed continuous profiling across an entire datacenter fleet at under 0.01% aggregate overhead, achieved by sampling across machines and over time rather than profiling every machine continuously. The approach inspired Pyroscope, Parca (from Polar Signals), and the continuous-profiling category as a whole.

The arc: identify the bottleneck (Amdahl 1967) → name it as discipline (Knuth 1974) → refine the model with scale (Gustafson 1988) → make the measurement free and always-on (GWP 2010, Pyroscope 2020s). Every layer added a constraint that is now standard practice.
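To make the difference between the two laws above concrete, here is a minimal sketch in Go. It assumes the usual textbook forms: Amdahl's fixed-size speedup 1 / (s + p/N) and Gustafson's scaled speedup s + p × N, with s the serial fraction and p = 1 − s. The function names and the 10% serial fraction are illustrative choices, not taken from either paper.

```go
package main

import "fmt"

// AmdahlSpeedup returns the fixed-size speedup 1 / (s + (1-s)/n)
// for serial fraction s on n processors.
func AmdahlSpeedup(s float64, n int) float64 {
	return 1 / (s + (1-s)/float64(n))
}

// GustafsonSpeedup returns the scaled speedup s + (1-s)*n,
// which assumes the parallel part of the workload grows with n.
func GustafsonSpeedup(s float64, n int) float64 {
	return s + (1-s)*float64(n)
}

func main() {
	const s = 0.10 // hypothetical serial fraction, chosen for illustration
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("N=%4d  Amdahl=%7.2f  Gustafson=%7.2f\n",
			n, AmdahlSpeedup(s, n), GustafsonSpeedup(s, n))
	}
}
```

With a 10% serial fraction, Amdahl's speedup can never exceed 1/s = 10 no matter how many processors are added, while Gustafson's scaled speedup keeps growing with N. Either way, the serial fraction has to be measured first, which is the point of the lineage.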

Microbenchmark pitfalls in JIT runtimes

Naive microbenchmarks are wrong almost by default in JIT-compiled runtimes (JVM, V8, .NET CLR, PyPy).

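JIT warmup problem. HotSpot compiles methods in tiers as they get hot. With default tiered-compilation settings, C1 (baseline) typically kicks in after a few hundred to a couple of thousand invocations, and C2 (optimised) after roughly 5k to 15k invocations and loop iterations; the exact thresholds depend on JVM version and flags. A microbench that calls a function 1k times measures the interpreter or baseline JIT, not the optimised code that runs in production. JMH handles this with explicit warmup iterations (default 5 iterations × 10 s each before measurement begins).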

Dead-code elimination. If the benchmark’s result is unused, the optimiser may delete the loop body entirely. The benchmark then finishes suspiciously fast and reports an impossible speedup. Go’s testing.B does not guard against this for you; the convention is to assign the result to a package-level sink variable (sink = result). JMH uses Blackhole.consume(result).
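The sink-variable convention can be sketched in Go as follows. This is an illustration under stated assumptions, not a definitive benchmark: fib is a hypothetical stand-in for the function under test, and whether the compiler actually removes the discarded call depends on the Go version and its inlining decisions. The program uses testing.Benchmark from the standard library so it runs without go test.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is a package-level variable. Storing the result here makes it
// observable, so the compiler cannot treat the computation as dead code.
var sink uint64

// fib is a hypothetical stand-in for the function under test.
func fib(n uint64) uint64 {
	x, y := uint64(0), uint64(1)
	for i := uint64(0); i < n; i++ {
		x, y = y, x+y
	}
	return x
}

// BenchmarkFibDiscarded ignores the result. If fib is inlined, the
// optimiser is free to drop the work, and the timing becomes meaningless.
func BenchmarkFibDiscarded(b *testing.B) {
	for i := 0; i < b.N; i++ {
		fib(30)
	}
}

// BenchmarkFibSink keeps the result alive by writing it to sink.
func BenchmarkFibSink(b *testing.B) {
	var r uint64
	for i := 0; i < b.N; i++ {
		r = fib(30)
	}
	sink = r
}

func main() {
	fmt.Println("discarded:", testing.Benchmark(BenchmarkFibDiscarded))
	fmt.Println("sink:     ", testing.Benchmark(BenchmarkFibSink))
}
```

If the two reported times differ wildly, the discarded variant was probably optimised away. Note that fib(30) is still a constant input here; the constant-folding pitfall is a separate concern.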

Constant folding. If loop inputs are compile-time constants, the optimiser can compute the answer once and replace the loop with a literal. A loop computing md5("fixed-string") 1M times may get folded to a single constant load. Solution: supply inputs at runtime from a non-constant data source. In JMH, read them from non-final @State fields, optionally populated via @Param; in Go, build the input data at run time and call b.ResetTimer() before the measured loop.
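A minimal Go sketch of the runtime-input fix described above, combined with an untimed setup and warmup phase. The helper makeInputs, the payload count and size, and the choice of MD5 are assumptions made for illustration; only the standard library is used.

```go
package main

import (
	"crypto/md5"
	"crypto/rand"
	"fmt"
	"testing"
)

// sink keeps the last digest observable so the loop body is not dead code.
var sink [md5.Size]byte

// makeInputs builds n random payloads of the given size at run time,
// so the compiler never sees the benchmark inputs as constants.
func makeInputs(n, size int) [][]byte {
	inputs := make([][]byte, n)
	for i := range inputs {
		inputs[i] = make([]byte, size)
		if _, err := rand.Read(inputs[i]); err != nil {
			panic(err)
		}
	}
	return inputs
}

func BenchmarkMD5RuntimeInput(b *testing.B) {
	inputs := makeInputs(64, 1024) // setup, excluded from timing below
	sink = md5.Sum(inputs[0])      // one untimed warmup call
	b.ResetTimer()                 // discard setup and warmup time
	for i := 0; i < b.N; i++ {
		sink = md5.Sum(inputs[i%len(inputs)])
	}
}

func main() {
	fmt.Println(testing.Benchmark(BenchmarkMD5RuntimeInput))
}
```

Rotating through several inputs also avoids measuring one permanently cached payload, although 64 KiB of data still fits comfortably in cache on most hardware, so this sketch does not model memory-bound behaviour.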

Inlining differences. A microbenchmark may inline a function that production would not (or vice versa), because the benchmark’s call tree is simpler. JMH’s @CompilerControl annotations let you force or prevent inlining to match production behaviour.

CPU frequency scaling. Laptop CPUs throttle aggressively: a function’s benchmark time varies by 30% between cool-start and throttled states. Production hardware has different frequency policies. Always benchmark on representative hardware with frequency scaling disabled or at a fixed clock.

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| JIT warmup | Benchmark is 3-10x slower than production | Explicit warmup (JMH), B.ResetTimer() after warmup in Go |
| Dead-code elimination | Benchmark finishes in nanoseconds, suspiciously fast | Consume result via sink variable or Blackhole |
| Constant folding | Runtime invariant across input sizes | Parameterise inputs at runtime, not compile time |
| Inlining differences | Bench is 2x faster than production | @CompilerControl to force/prevent inlining |
| CPU frequency scaling | High variance across runs on laptop | Fix CPU clock, benchmark on server hardware |
Why this works

Industry-grade harnesses (JMH for Java, criterion.rs for Rust, Go’s testing.B paired with benchstat for statistical comparison) standardise warmup, run multiple iterations, and report statistical summaries with variance warnings. A microbenchmark written from scratch in a hurry, as a hand-rolled timing loop, will usually exhibit one or more of the pitfalls above. Use the harness; do not reinvent it.

Which RFC?

Where was the canonical 'premature optimisation is the root of all evil' framing introduced, and what is the FULL quote?

Quiz

A microbenchmark calls a Java method 500 times and reports the mean time per call. What is the most likely defect in this setup?

Recall before you leave
  1. Walk through four microbenchmark pitfalls in JIT runtimes and the fix for each.
  2. Summarise the intellectual lineage from Amdahl 1967 to Google-Wide Profiling 2010 in four steps.
Recap

The profile-first discipline has a 60-year lineage. Amdahl (1967) quantified the ceiling of optimisation. Knuth (1974) named the discipline of identifying the critical 3%. Gustafson (1988) corrected Amdahl for scaling workloads. GWP (2010) made continuous profiling cheap enough for always-on production deployment. Naive microbenchmarks in JIT runtimes measure the wrong thing by default: JIT warmup means the interpreter is running, not optimised code; dead-code elimination removes the loop body; constant folding replaces loops with literals; CPU frequency scaling biases laptop measurements. Industry-grade harnesses (JMH, criterion.rs, Go’s testing.B with benchstat) handle the runtime-level pitfalls; use them rather than hand-rolling timing loops, and run them on representative hardware with a fixed clock.

