Crux Read a microbenchmark, a profiler diff, a perf hardware-counter readout, and an Amdahl calculation, then predict the behaviour and pick the highest-leverage fix.
Benchmarks, profiler diffs, and counter readouts are where profile-first lives or dies. Read each artefact and choose the move a senior engineer makes first — before touching a tuning knob or trusting a headline number.
Goal
Practise the loop you run in every performance investigation: read the benchmark or trace, spot the lie or the signal, and reach for the highest-leverage interpretation before changing anything.
Snippet 1 — the benchmark that is too fast
func BenchmarkHash(b *testing.B) {
	data := []byte("fixed-input-string")
	for i := 0; i < b.N; i++ {
		_ = fnv32(data) // result discarded
	}
}
// reported: 0.31 ns/op, faster than a single memory load
Quiz
0.31 ns/op is below the cost of one L1 load. What happened, and what is the fix?
Heads-up 0.31 ns/op is below a single L1 cache load (~1 ns); no real hash of a byte slice runs that fast. The compiler removed the work because the result is unused — the benchmark measures nothing.
Heads-up testing.B grows b.N until the run is long enough; precision is not the issue. The issue is dead-code elimination removing the body entirely, plus a constant input the optimiser can fold.
Heads-up A profiler does not fix a benchmark that measures eliminated code. The defect is in the benchmark: consume the result and use a non-constant input so the work actually runs.
Snippet 2 — the profiler diff vs the production metric
# Production (5-min window, after deploy)
checkout_p99_ms   580  (prev 820)   # 29% faster
cpu_pct            62  (prev 58)    # CPU went UP

# go tool pprof -diff_base baseline.cpu prod.cpu
Showing nodes accounting for -3.20s, 1.15% of -278.5s total
      flat  flat%      cum   cum%
    -1.80s  0.64%   -1.80s  0.64%  net/http.(*conn).serve
    -1.40s  0.50%   -1.40s  0.50%  encoding/json.Marshal
Quiz
p99 dropped 29% but the CPU profile diff shows only ~1% net CPU change and CPU% rose. How do you reconcile this, and what do you capture next?
Heads-up Wall-clock latency is not only CPU. For an I/O- or lock-bound service the saving is off-CPU wait time, which never appears in a CPU profile. The metrics and the tiny CPU diff are perfectly consistent.
Heads-up Higher CPU after a fix is often a good signal: the service does real work instead of waiting. p99 improved 29% — judge against the SLO, not the CPU number in isolation.
Heads-up The missing time was never on-CPU, so no sampling rate will surface it in a CPU profile. You need an off-CPU profile to see DB-wait or mutex-wait frames.
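For a Go service, one possible way to look at wait time is the runtime's block profile, sketched below. The helper `captureBlockProfile` and the workload `contendedWork` are hypothetical names invented for this illustration; only the `runtime` and `runtime/pprof` calls are standard library. Note the limits: the block profile records waits on synchronisation primitives (mutexes, channels, wait groups), so it is a partial view of off-CPU time and may not show every kind of I/O wait.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// captureBlockProfile records where goroutines waited on synchronisation
// while work ran, and returns the profile in pprof's binary format.
func captureBlockProfile(work func()) ([]byte, error) {
	runtime.SetBlockProfileRate(1) // record every blocking event; keep the window short
	defer runtime.SetBlockProfileRate(0)

	work()

	var buf bytes.Buffer
	if err := pprof.Lookup("block").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// contendedWork is a hypothetical stand-in for a lock-bound handler:
// goroutines spend most of their wall-clock time waiting, not computing.
func contendedWork() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			time.Sleep(5 * time.Millisecond) // holds the lock while off-CPU
			mu.Unlock()
		}()
	}
	wg.Wait()
}

func main() {
	prof, err := captureBlockProfile(contendedWork)
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	fmt.Printf("block profile: %d bytes (inspect with go tool pprof)\n", len(prof))
}
```

A workload like this would show almost nothing in a CPU profile, because the goroutines are parked rather than running; the wait only becomes visible in a profile that records blocking.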
Snippet 3 — the hardware-counter readout
$ perf stat -e cycles,instructions,cache-misses ./svc bench-json

  142,310,884,001  cycles
   61,994,210,773  instructions   # 0.44 insn per cycle
    1,902,544,118  cache-misses

# flame graph: parseJSON is 35% of CPU, single wide leaf
Quiz
parseJSON is 35% of CPU. IPC is 0.44 with a huge cache-miss count. What fix is most likely to help, and which one will not?
Heads-up IPC 0.44 says the function is memory-stalled, not compute-bound. A different parser that still chases the same scattered pointers sees the same stalls. The lever is data layout, not the algorithm.
Heads-up Vectorisation helps compute-bound code (high IPC). At IPC 0.44 the CPU is idle waiting for memory; SIMD has nothing to accelerate until the cache-misses drop.
Heads-up The readout flags cache-misses, not branch-misses. Low IPC plus a high cache-miss count points squarely at memory stalls; measure branch-misses separately before blaming branch misprediction.
Snippet 4 — the Amdahl decision
# Request total: 200 ms. Profile shows:
#   funcA  100 ms (50%)  -- option 1: rewrite for 2x -> 50 ms saved
#   funcB   40 ms (20%)  -- option 2: rewrite B and C for 4x each
#   funcC   20 ms (10%)  --           -> 45 ms saved combined
Quiz
You can do option 1 OR option 2, not both. Which delivers more total speedup, and what is the general rule?
Heads-up Magnitude of local speedup is not the deciding factor; share is. The 4x lands on only 30% of time, saving 45 ms, while the 2x on 50% saves 50 ms. Amdahl, not the multiplier, decides.
Heads-up They are close but not equal: 50 ms vs 45 ms saved, 1.33x vs 1.29x. The point is precisely that you must do the Amdahl arithmetic instead of eyeballing the multipliers.
Heads-up The profile already gave you the shares (p) and the proposed local speedups (s); that is everything Amdahl needs. A microbenchmark would only re-measure s in isolation and could not improve this decision.
Recap
Every artefact here hides a trap or a signal. A benchmark whose result is discarded measures eliminated code, not work. A 29% p99 win with a flat CPU diff means the saving was off-CPU — capture a wait profile to see it. IPC under 1.0 with heavy cache-misses means memory-stalled: fix the data layout, not the algorithm. And when two optimisations compete, Amdahl’s arithmetic on shares — not the size of the local multiplier — picks the winner. Read the number, find the lie, then act on the share.