Crux Read real pprof output, a collapsed flame graph, an allocation profile, and a profiler config — predict the behaviour and pick the highest-leverage diagnosis.
Profiles and configs are where profiling problems are actually diagnosed. Read the output and the setup, then choose the read a senior engineer would commit to before touching anything else.
Goal
Practise the loop you run in every profiling incident: read the profile or config, infer what the runtime is doing, and reach the correct diagnosis instead of the plausible-but-wrong one.
Snippet 1 — a collapsed flame-graph stack listing
Many tools (Brendan Gregg’s stackcollapse scripts, speedscope’s import) represent a flame graph as collapsed stacks: one line per unique stack, written as semicolon-separated frames followed by a sample count. Here are the top lines from a 30-second CPU profile of an HTTP service:
Total samples are ~3125. What does this CPU profile tell you, and where is the fix?
Heads-up queryDB is only ~210 samples (~7%) here. Even so, a CPU profile shows DB time only while the thread is on-CPU; the actual query wait is off-CPU and invisible. The CPU hotspot is serialization.
Heads-up Position and alphabetical order say nothing about cost; width (sample count) does. 95 of ~3125 is ~3% — negligible next to the ~70% under serializeResponse.
Heads-up Sample counts are CPU-time share, not call counts. 2180 of ~3125 samples in one subtree is a dominant hotspot worth fixing.
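To make the "width, not position" reading concrete, here is a minimal Go sketch that parses collapsed-stack lines and reports each frame's inclusive share of samples. It is an illustration, not the tooling behind this lesson: the function names inclusiveSamples and report are hypothetical, and the input in main is made up. Only the counts 2180, 210, 95 and the ~3125 total, plus the names serializeResponse and queryDB, come from the text above; every other frame name is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// inclusiveSamples parses collapsed-stack lines ("frameA;frameB;frameC 123")
// and returns the total sample count plus, for each frame, the number of
// samples in which that frame appears anywhere on the stack.
func inclusiveSamples(collapsed string) (total int, byFrame map[string]int, err error) {
	byFrame = make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(collapsed))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		// The count is the last space-separated field; frame names may contain spaces.
		i := strings.LastIndexByte(line, ' ')
		if i < 0 {
			return 0, nil, fmt.Errorf("malformed line %q", line)
		}
		n, convErr := strconv.Atoi(line[i+1:])
		if convErr != nil {
			return 0, nil, fmt.Errorf("bad count in %q: %w", line, convErr)
		}
		total += n
		seen := make(map[string]bool) // count a recursive frame once per stack
		for _, f := range strings.Split(line[:i], ";") {
			if !seen[f] {
				seen[f] = true
				byFrame[f] += n
			}
		}
	}
	return total, byFrame, sc.Err()
}

// report returns one "frame  pct%" line per frame, widest subtree first.
func report(total int, byFrame map[string]int) []string {
	names := make([]string, 0, len(byFrame))
	for f := range byFrame {
		names = append(names, f)
	}
	sort.Slice(names, func(a, b int) bool {
		if byFrame[names[a]] != byFrame[names[b]] {
			return byFrame[names[a]] > byFrame[names[b]]
		}
		return names[a] < names[b]
	})
	out := make([]string, 0, len(names))
	for _, f := range names {
		pct := 100 * float64(byFrame[f]) / float64(total)
		out = append(out, fmt.Sprintf("%-20s %5.1f%%", f, pct))
	}
	return out
}

func main() {
	// Illustrative input only: frame names other than serializeResponse and
	// queryDB are hypothetical, and the 640-sample remainder is a filler.
	const collapsed = `main;handleRequest;serializeResponse;json.Marshal 2180
main;handleRequest;queryDB 210
main;handleRequest;authCheck 95
main;otherWork 640`

	total, byFrame, err := inclusiveSamples(collapsed)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("total samples:", total)
	for _, line := range report(total, byFrame) {
		fmt.Println(line)
	}
}
```

Sorting by inclusive count rather than by name or input order mirrors how a flame graph should be read: the widest subtree is the candidate for the fix, wherever it appears in the listing.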
Snippet 2 — CPU vs wall-clock for one slow span
trace span: GET /report                 wall    = 612 ms
cpu profile (this span, by trace-id):   on-CPU  =  38 ms
off-cpu profile (this span):            waiting = 561 ms
    -> sql.(*DB).QueryContext   540 ms
    -> sync.(*Mutex).Lock        18 ms
Reading these two profiles together, what is the correct diagnosis and next step?
Heads-up Shaving 38 ms of compute when 561 ms is spent waiting is rounding error. The CPU/wall ratio here is ~6% (38 of 612 ms), far under the ~30% rule of thumb, which is the signal to leave CPU code alone and chase the off-CPU wait.
Heads-up 18 ms is ~3% of wall time. The dominant wait is the 540 ms DB query; the mutex is secondary.
Heads-up They are consistent: on-CPU (38) + off-CPU (561) ≈ wall (612). That is exactly how the two profile types partition a request's wall-clock time.
Snippet 3 — an allocation (heap) profile diff
go tool pprof -base baseline.heap current.heap    (45 min apart, same load)

      flat  flat%       cum   cum%
     410MB  61.2%     410MB  61.2%   (*logEntry).format -> strings.Join in hot log path
      95MB  14.1%      95MB  14.1%   json.Marshal
      42MB   6.3%      42MB   6.3%   bytes.growSlice
This is a base-diffed heap profile under steady load. What does it most likely indicate, and what is the highest-leverage fix?
Heads-up The base-diff isolates growth. 410MB of growth in the log path is more than 4x the next entry (95MB) and the obvious primary suspect; json.Marshal is secondary.
Heads-up A base-diff of two heap profiles taken under steady load is the standard leak-detection technique: functions whose live allocation grew are where memory is accumulating.
Heads-up GOGC changes when GC runs, not whether the log path retains memory. If it is a genuine retention leak, no GC tuning frees still-reachable objects — fix the allocation/retention.
Snippet 4 — a continuous-profiler agent config
profiler:
  cpu:
    enabled: true
    sample_rate_hz: 1000      # default for this service is 100
    symbolize: synchronous    # resolve symbols on the sampling thread
    max_stack_depth: 256
  upload_interval: 1s
The team reports this agent adds ~12% CPU instead of the expected 2-5%. Reading the config, what are the two changes that bring it back in budget?
Heads-up CPU profiling at the 100 Hz default with async symbolization runs at 2-5%. The overhead here is config: a 10x sample rate and synchronous in-thread symbol resolution, not the feature itself.
Heads-up Truncating to 8 frames would corrupt deep stacks and lose the hot path's context. The dominant costs here are the 10x rate and synchronous symbolization, not a 256-frame cap.
Heads-up Compressed uploads are cheap and infrequent. The CPU cost is in sampling and synchronous symbol resolution on the application thread, not in the network upload cadence.
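The same reading can be expressed as a small check, which is one way a team might keep a config like this from drifting over budget again. This is a sketch under stated assumptions: the struct cpuProfilerConfig, the function overheadFindings, and the "async" value for symbolize are hypothetical and do not correspond to any particular agent's schema. Only the field names, the 100 Hz default, and the synchronous setting come from the snippet.

```go
package main

import "fmt"

// cpuProfilerConfig mirrors the CPU fields of the snippet's config.
// It is a hypothetical type, not a real agent's schema.
type cpuProfilerConfig struct {
	SampleRateHz  int
	Symbolize     string // "synchronous" as in the snippet; "async" is an assumed alternative
	MaxStackDepth int
}

// defaultSampleRateHz is the service default quoted in the config comment.
const defaultSampleRateHz = 100

// overheadFindings lists the settings that the lesson identifies as the
// dominant overhead costs: an over-set sample rate and in-thread symbolization.
func overheadFindings(c cpuProfilerConfig) []string {
	var findings []string
	if c.SampleRateHz > defaultSampleRateHz {
		factor := float64(c.SampleRateHz) / defaultSampleRateHz
		findings = append(findings, fmt.Sprintf(
			"sample_rate_hz=%d is %.1fx the %d Hz default",
			c.SampleRateHz, factor, defaultSampleRateHz))
	}
	if c.Symbolize == "synchronous" {
		findings = append(findings,
			"symbolize=synchronous resolves symbols on the sampling thread")
	}
	return findings
}

func main() {
	cfg := cpuProfilerConfig{SampleRateHz: 1000, Symbolize: "synchronous", MaxStackDepth: 256}
	for _, f := range overheadFindings(cfg) {
		fmt.Println("over budget:", f)
	}
}
```

The check deliberately ignores max_stack_depth and the upload interval, since the heads-ups above treat neither as a dominant cost.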
Recap
Every profiling incident is read in profiles and configs: a CPU flame graph’s dominant subtree (here, reflection-driven JSON) is the on-CPU hotspot; the CPU/wall ratio decides whether to chase on-CPU code or an off-CPU wait; a base-diffed heap profile under steady load localises a leak to the function whose live memory grew; and an over-budget agent is almost always an over-set sample rate plus synchronous in-thread symbolization. Diagnose from the profile, fix the dominant cause, then re-profile to confirm.