
Profiling: profile and config reading

Read a collapsed flame graph, a CPU versus wall-clock comparison, a pprof heap profile diff, and a profiler config, then predict the behaviour and pick the highest-leverage diagnosis.

OBS Senior ◷ 14 min

Profiles and configs are where profiling problems are actually diagnosed. Read the output and the setup, then choose the interpretation a senior engineer would commit to before touching anything else.

Goal

Practise the loop you run in every profiling incident: read the profile or config, infer what the runtime is doing, and reach the correct diagnosis instead of the plausible-but-wrong one.

Snippet 1 — a collapsed flame-graph stack listing

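Many tools (Brendan Gregg’s stackcollapse scripts, speedscope import) represent a flame graph as one line per unique stack: the frames from root to leaf joined by semicolons, followed by the number of samples that ended in that stack. Here are the top lines from a 30-second CPU profile of an HTTP service: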

main;http.Serve;handler;serializeResponse;json.Marshal       2180
main;http.Serve;handler;serializeResponse;reflect.Value.Field  640
main;http.Serve;handler;queryDB;pq.(*conn).Query              210
main;http.Serve;handler;validateInput                          95

Quiz

Total samples are ~3125. What does this CPU profile tell you, and where is the fix?
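
One way to make the read concrete is to total the samples by frame rather than by leaf. The sketch below assumes the four lines above are the whole profile and that each sample counts once toward every frame on its stack; the function name inclusiveSamples is illustrative and not part of any tool.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// collapsed holds the four stack lines from the snippet (spacing normalised).
const collapsed = `main;http.Serve;handler;serializeResponse;json.Marshal 2180
main;http.Serve;handler;serializeResponse;reflect.Value.Field 640
main;http.Serve;handler;queryDB;pq.(*conn).Query 210
main;http.Serve;handler;validateInput 95`

// inclusiveSamples returns, for every frame, the number of samples in which
// that frame appears anywhere on the stack, plus the total sample count.
func inclusiveSamples(input string) (map[string]int, int, error) {
	perFrame := make(map[string]int)
	total := 0
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		// The count is the last whitespace-separated field; the stack is the rest.
		i := strings.LastIndexAny(line, " \t")
		if i < 0 {
			return nil, 0, fmt.Errorf("malformed line: %q", line)
		}
		count, err := strconv.Atoi(line[i+1:])
		if err != nil {
			return nil, 0, fmt.Errorf("bad count in %q: %w", line, err)
		}
		total += count
		seen := make(map[string]bool) // count a recursive frame once per stack
		for _, frame := range strings.Split(strings.TrimSpace(line[:i]), ";") {
			if !seen[frame] {
				seen[frame] = true
				perFrame[frame] += count
			}
		}
	}
	return perFrame, total, sc.Err()
}

func main() {
	perFrame, total, err := inclusiveSamples(collapsed)
	if err != nil {
		panic(err)
	}
	for _, f := range []string{"serializeResponse", "queryDB", "validateInput"} {
		share := 100 * float64(perFrame[f]) / float64(total)
		fmt.Printf("%-18s %5d samples  %5.1f%%\n", f, perFrame[f], share)
	}
}
```

Grouped this way, serializeResponse accounts for 2820 of 3125 samples (about 90%), while queryDB and validateInput together stay under 10%. That gap is why the serialization subtree, not the database call, is the place to look first in this CPU profile.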

Snippet 2 — CPU vs wall-clock for one slow span

trace span: GET /report  wall = 612 ms
  cpu profile (this span, by trace-id):  on-CPU = 38 ms
  off-cpu profile (this span):           waiting = 561 ms
    -> sql.(*DB).QueryContext            540 ms
    -> sync.(*Mutex).Lock                 18 ms

Quiz

Reading these two profiles together, what is the correct diagnosis and next step?

Snippet 3 — an allocation (heap) profile diff

go tool pprof -base baseline.heap current.heap   (45 min apart, same load)
  flat  flat%   cum   cum%
  410MB 61.2%  410MB 61.2%  (*logEntry).format  -> strings.Join in hot log path
   95MB 14.1%   95MB 14.1%  json.Marshal
   42MB  6.3%   42MB  6.3%  bytes.growSlice

Quiz

This is a base-diffed heap profile of in-use memory, taken under steady load, so each row shows growth since the baseline rather than a total. What does it most likely indicate, and what is the highest-leverage fix?

Snippet 4 — a continuous-profiler agent config

profiler:
  cpu:
    enabled: true
    sample_rate_hz: 1000        # default for this service is 100
  symbolize: synchronous        # resolve symbols on the sampling thread
  max_stack_depth: 256
  upload_interval: 1s

Quiz

The team reports that this agent adds ~12% CPU instead of the expected 2-5%. Reading the config, which two changes bring it back within budget?

Recap

Profiling incidents are diagnosed from profiles and configs. In a CPU flame graph, the dominant subtree (here, reflection-driven JSON serialization) is the on-CPU hotspot. The ratio of CPU time to wall-clock time decides whether to chase on-CPU code or an off-CPU wait. A base-diffed heap profile under steady load localises a leak to the function whose live memory grew. An over-budget agent is almost always an over-set sample rate combined with synchronous, in-thread symbolization. Diagnose from the profile, fix the dominant cause, then re-profile to confirm.
