
Profile first: code and trace reading

Read a microbenchmark, a profiler diff, a perf hardware-counter readout, and an Amdahl calculation, then predict the behaviour and pick the highest-leverage fix.

Benchmarks, profiler diffs, and counter readouts are where profile-first lives or dies. Read each artefact and choose the move a senior engineer makes first — before touching a tuning knob or trusting a headline number.

Goal

Practise the loop you run in every performance investigation: read the benchmark or trace, spot the lie or the signal, and reach for the highest-leverage interpretation before changing anything.

Snippet 1 — the benchmark that is too fast

func BenchmarkHash(b *testing.B) {
    data := []byte("fixed-input-string")
    for i := 0; i < b.N; i++ {
        _ = fnv32(data)   // result discarded
    }
}
// reported: 0.31 ns/op — faster than a single memory load

Quiz

0.31 ns/op is below the cost of one L1 load. What happened, and what is the fix?
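One common way to make a benchmark like this measure real work is to make every result observable. The sketch below is a minimal illustration, not the lesson's own code: `fnv32` is a stand-in FNV-1a implementation (the real one is not shown), and `sink` is a hypothetical package-level variable introduced only for this example.

```go
package main

import (
    "fmt"
    "testing"
)

// sink is a hypothetical package-level variable. Writing the result here
// makes it observable, so the compiler cannot treat the hash as dead code.
var sink uint32

// fnv32 is a stand-in 32-bit FNV-1a hash (assumption: the lesson's own
// fnv32 is not shown).
func fnv32(data []byte) uint32 {
    h := uint32(2166136261) // FNV offset basis
    for _, c := range data {
        h ^= uint32(c)
        h *= 16777619 // FNV prime
    }
    return h
}

func BenchmarkHash(b *testing.B) {
    data := []byte("fixed-input-string")
    var acc uint32
    for i := 0; i < b.N; i++ {
        data[0] = byte(i)  // vary the input so iterations are not identical
        acc += fnv32(data) // every result feeds the accumulator
    }
    sink = acc // publish once, after the loop
}

func main() {
    // testing.Benchmark lets the sketch run without `go test`.
    r := testing.Benchmark(BenchmarkHash)
    fmt.Printf("%d iterations, %.2f ns/op\n", r.N, float64(r.T.Nanoseconds())/float64(r.N))
}
```

With the result kept live, the reported time should land in the range of real hashing work rather than below a single load. Newer Go releases (1.24 and later) also provide `b.Loop()`, which is documented to keep the loop body's results alive; whether it is available depends on the toolchain in use.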

Snippet 2 — the profiler diff vs the production metric

# Production (5-min window, after deploy)
checkout_p99_ms        580   (prev 820)   # 29% faster
cpu_pct                 62   (prev 58)    # CPU went UP

# go tool pprof -diff_base baseline.cpu prod.cpu
Showing nodes accounting for -3.20s, 1.15% of -278.5s total
      flat  flat%        cum   cum%
    -1.80s  0.64%     -1.80s  0.64%  net/http.(*conn).serve
    -1.40s  0.50%     -1.40s  0.50%  encoding/json.Marshal

Quiz

p99 dropped 29% but the CPU profile diff shows only ~1% net CPU change and CPU% rose. How do you reconcile this, and what do you capture next?
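If the saving turns out to be off-CPU, a CPU profile cannot show it, and a wait-oriented profile is the natural next capture. The sketch below shows one way to do that in Go with the runtime's block and mutex profiles. The helper `captureWaitProfiles` and the workload `contendedWork` are hypothetical names made up for illustration; a real service would more likely expose these profiles through `net/http/pprof` than capture them in-process like this.

```go
package main

import (
    "bytes"
    "fmt"
    "runtime"
    "runtime/pprof"
    "sync"
    "time"
)

// captureWaitProfiles (hypothetical helper) runs work with block and mutex
// profiling enabled and returns both profiles in the debug=1 text format.
func captureWaitProfiles(work func()) (string, string, error) {
    runtime.SetBlockProfileRate(1)             // record every blocking event
    prev := runtime.SetMutexProfileFraction(1) // record every contention event
    defer runtime.SetBlockProfileRate(0)
    defer runtime.SetMutexProfileFraction(prev)

    work()

    var block, mutex bytes.Buffer
    if err := pprof.Lookup("block").WriteTo(&block, 1); err != nil {
        return "", "", err
    }
    if err := pprof.Lookup("mutex").WriteTo(&mutex, 1); err != nil {
        return "", "", err
    }
    return block.String(), mutex.String(), nil
}

// contendedWork is a stand-in workload: four goroutines queue on one lock
// while the holder sleeps, so most of the elapsed time is waiting, not CPU.
func contendedWork() {
    var mu sync.Mutex
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.Lock()
            time.Sleep(5 * time.Millisecond) // simulated slow call under lock
            mu.Unlock()
        }()
    }
    wg.Wait()
}

func main() {
    block, mutex, err := captureWaitProfiles(contendedWork)
    if err != nil {
        fmt.Println("profile capture failed:", err)
        return
    }
    fmt.Printf("block profile: %d bytes, mutex profile: %d bytes\n", len(block), len(mutex))
}
```

Sampling every event (rate 1) is convenient for a short experiment but adds overhead, so a production capture would usually use a coarser rate.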

Snippet 3 — the hardware-counter readout

$ perf stat -e cycles,instructions,cache-misses ./svc bench-json
   142,310,884,001  cycles
    61,994,210,773  instructions     #  0.44  insn per cycle
     1,902,544,118  cache-misses
# flame graph: parseJSON is 35% of CPU, single wide leaf

Quiz

parseJSON is 35% of CPU. IPC is 0.44, with about 1.9 billion cache misses (roughly 31 per 1,000 instructions). Which fix is most likely to help, and which one will not?
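The two ratios worth deriving from the readout are instructions per cycle (IPC) and cache misses per thousand instructions (MPKI). The sketch below recomputes both from the counters shown above. The type and method names are made up for illustration, and what `cache-misses` actually counts depends on the CPU, so read the MPKI figure as a rough indicator rather than a precise measure.

```go
package main

import "fmt"

// counters holds the three values from the perf stat readout.
type counters struct {
    cycles       float64
    instructions float64
    cacheMisses  float64
}

// ipc is instructions retired per cycle; low values suggest stalls.
func (c counters) ipc() float64 { return c.instructions / c.cycles }

// mpki is cache misses per 1,000 instructions.
func (c counters) mpki() float64 { return c.cacheMisses / c.instructions * 1000 }

func main() {
    c := counters{
        cycles:       142310884001,
        instructions: 61994210773,
        cacheMisses:  1902544118,
    }
    fmt.Printf("IPC  %.2f\n", c.ipc())  // prints "IPC  0.44"
    fmt.Printf("MPKI %.1f\n", c.mpki()) // prints "MPKI 30.7"
}
```

Where "low" and "high" begin for these ratios depends on the workload and the hardware, so the numbers are best compared against a baseline run of the same binary on the same machine.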

Snippet 4 — the Amdahl decision

# Request total: 200 ms. Profile shows:
#   funcA  100 ms (50%)   -- option 1: rewrite for 2x  -> 50 ms saved
#   funcB   40 ms (20%)   -- option 2: rewrite B and C for 4x each
#   funcC   20 ms (10%)   --            -> 45 ms saved combined

Quiz

You can do option 1 OR option 2, not both. Which delivers more total speedup, and what is the general rule?
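The comparison comes down to arithmetic on shares, which a few lines of code make explicit. The sketch below applies each option's local speedup factor to the profile numbers from the snippet and prints the resulting request time. The `part` type and `totalAfter` function are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// part is one profiled function: its time in ms and the local speedup
// factor a rewrite is expected to deliver (1 means untouched).
type part struct {
    name   string
    ms     float64
    factor float64
}

// totalAfter returns the request time after applying each part's factor.
// otherMs is time outside the listed parts, which stays unchanged.
func totalAfter(otherMs float64, parts []part) float64 {
    total := otherMs
    for _, p := range parts {
        total += p.ms / p.factor
    }
    return total
}

func main() {
    const before = 200.0                 // ms, request total from the profile
    const other = before - 100 - 40 - 20 // 40 ms not attributed to A, B, or C

    option1 := []part{{"funcA", 100, 2}, {"funcB", 40, 1}, {"funcC", 20, 1}}
    option2 := []part{{"funcA", 100, 1}, {"funcB", 40, 4}, {"funcC", 20, 4}}

    for i, opt := range [][]part{option1, option2} {
        after := totalAfter(other, opt)
        fmt.Printf("option %d: %.0f ms -> %.0f ms, speedup %.2fx\n",
            i+1, before, after, before/after)
    }
    // prints:
    // option 1: 200 ms -> 150 ms, speedup 1.33x
    // option 2: 200 ms -> 155 ms, speedup 1.29x
}
```

The 2x rewrite on a 50% share beats two 4x rewrites on a combined 30% share, because the time left untouched dominates the result.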

Recap

Every artefact here hides a trap or a signal. A benchmark whose result is discarded measures eliminated code, not work. A 29% p99 win with a nearly flat CPU diff suggests the saving was off-CPU (time spent waiting rather than computing), so capture a wait profile to see it. IPC well under 1.0 alongside a high cache-miss count points to a memory-stalled workload: fix the data layout before the algorithm. And when two optimisations compete, Amdahl’s arithmetic on shares, not the size of the local multiplier, picks the winner: 2x on a 50% share saves 50 ms, while 4x on a combined 30% share saves only 45 ms. Read the number, find the lie, then act on the share.
