

Reading flame graphs: shapes, per-language profilers, and the 60-second scan

Crux: Flame graphs have five recognisable shapes that a senior engineer reads before zooming into names. Per-language profiler picks ensure representative measurements, not debug-build artefacts.

A senior engineer opens a flame graph and says “volcano pattern — the fix is not in the wide leaf, it is in the dispatch layer two levels up.” They have not read a single function name yet. Shape literacy cuts the time from profile to hypothesis from minutes to seconds.

Reading a flame graph in 60 seconds

Brendan Gregg’s flame graph format is the canonical readout.

  • Y-axis: call stack depth. Entry point (main, HTTP handler) at the bottom; the leaf function on CPU at the top.
  • X-axis: alphabetical order of sibling stacks. Position left/right tells you nothing about call order.
  • Width: proportional to sample count = CPU time share.

Reading workflow:

  1. Scan the top of the graph for the widest leaf frames — those are the functions on CPU when sampled.
  2. Walk down from a wide leaf to see which parent called it, and whether the path is the same across most samples or fragments across many callers.
  3. If most samples converge on the leaf through one path: that is the hot path; fix it there.
  4. If the leaf is called from many parents each contributing a thin slice: the fix is the leaf itself, not its callers.

The most common rookie mistake: reading horizontal position as time order. It is not. Two frames side-by-side tell you nothing about which executed first.

Axis | Meaning | Common misreading
--- | --- | ---
Width (x) | Sample count, i.e. CPU time share | People read left-to-right as "time order"; it is NOT
Position (x) | Alphabetical grouping by parent | Left frame does NOT run before right frame
Height (y) | Call depth: entry at bottom, leaf at top | Taller stack = deeper nesting, not slower
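
The reading workflow above can be sketched in code. This is a minimal illustration, assuming input in the folded-stack text format produced by FlameGraph's stackcollapse scripts (one `frame;frame;leaf count` line per unique stack). The sample data, the function names, and the 0.8 dominance threshold are all hypothetical choices for the example, not part of any tool.

```python
from collections import Counter

# Hypothetical folded stacks: "root;...;leaf sample_count"
FOLDED = """\
main;handle_request;parse_json;decode_utf8 40
main;handle_request;render;escape_html 25
main;worker;flush;escape_html 20
main;worker;compact;escape_html 15
"""


def parse_folded(text):
    """Turn folded-stack text into a list of (frames_tuple, count)."""
    stacks = []
    for line in text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        stacks.append((tuple(stack.split(";")), int(count)))
    return stacks


def leaf_widths(stacks):
    """Step 1: share of all samples in which each leaf was on CPU."""
    total = sum(count for _, count in stacks)
    widths = Counter()
    for frames, count in stacks:
        widths[frames[-1]] += count
    return {leaf: count / total for leaf, count in widths.items()}


def paths_to_leaf(stacks, leaf):
    """Step 2: sample count per distinct caller path ending in `leaf`."""
    paths = Counter()
    for frames, count in stacks:
        if frames[-1] == leaf:
            paths[frames[:-1]] += count
    return paths


def diagnose(stacks, leaf, dominance=0.8):
    """Steps 3 and 4: one dominant path, or many thin callers?

    `dominance` is an arbitrary illustrative threshold.
    """
    paths = paths_to_leaf(stacks, leaf)
    total = sum(paths.values())
    top_path, top_count = paths.most_common(1)[0]
    if top_count / total >= dominance:
        return "single hot path via " + ";".join(top_path)
    return "many callers: look at the leaf itself"
```

With this sample data, `escape_html` is the widest leaf at 60% of samples, but no single caller path contributes more than 25 of its 60 samples, so the sketch points at the leaf rather than at any one caller. `decode_utf8` is narrower and arrives through exactly one path.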

Five flame graph shapes and what they mean

An experienced eye reads a flame graph by shape in seconds, before zooming into names.

1. Tall narrow column running to the top through one path — recursion or a deep middleware chain. One fix on the leaf or near it.

2. Wide plateau at one level with dozens of thin leaves — a dispatcher fanning out to many handlers. The fix is usually in the dispatcher or the shared path above it, not in any one handler.

3. Two adjacent wide columns with no shared parent — multiple hot paths needing separate fixes. Amdahl applies to each independently.

4. “Volcano” widening from leaf to root — large cum-time at one entry-point with work distributed below. The fix depends on which level stops being real work and becomes dispatch; walk the shape.

5. Thin spikes scattered across the width — no real hotspot. The latency problem is elsewhere (off-CPU, coordination, network). A CPU profile alone will not show it; try an off-CPU or allocation profile.

These shapes are the vocabulary senior engineers use to describe what a profile looks like before naming functions.
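
For shape 3, the remark that Amdahl applies to each column independently can be made concrete with a small calculation. The sketch below assumes two hypothetical hot columns that account for 40% and 30% of samples; the fractions and speedup factors are illustrative only.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime gets `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def combined_speedup(columns):
    """Overall speedup when several disjoint columns are each sped up.

    `columns` is a list of (fraction_of_runtime, local_speedup) pairs.
    """
    untouched = 1.0 - sum(fraction for fraction, _ in columns)
    remaining = untouched + sum(fraction / speedup for fraction, speedup in columns)
    return 1.0 / remaining


# Hypothetical profile: column A is 40% of samples, column B is 30%.
only_a = amdahl_speedup(0.40, 2.0)                      # 1.25x overall
only_b = amdahl_speedup(0.30, 2.0)                      # about 1.18x overall
both = combined_speedup([(0.40, 2.0), (0.30, 2.0)])     # about 1.54x overall
```

Doubling the speed of either column alone yields a modest overall gain, which is why two unrelated hot paths need two separate fixes.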

Per-language profiler picks

Every modern language has a recommended sampling profiler. Using the right one avoids debug-build artefacts and JVM stack-walk errors.

  • Go: built-in pprof via runtime/pprof and net/http/pprof — industry standard. Read by go tool pprof, exports pprof.proto.
  • Java: JFR (Java Flight Recorder, bundled with OpenJDK since JDK 11) and async-profiler, the production standard because it walks Java stacks correctly via AsyncGetCallTrace, avoiding safepoint bias.
  • Python: py-spy (external sampler, no app changes required) and scalene. cProfile is instrumentation-based — dev-only, not production.
  • Node.js: built-in --prof + clinic.js for flame graphs; 0x for production.
  • Rust: pprof-rs; perf with addr2line for symbol resolution.
  • Ruby: stackprof and rbspy.
  • PHP: tideways, Excimer.
  • C/C++: perf + perf-tools; Intel VTune for microarchitectural analysis.

Continuous-profile backends (Pyroscope, Parca, Polar Signals, Datadog Continuous Profiler) ingest pprof or JFR from any of these. eBPF agents capture stacks kernel-side and work language-agnostic via DWARF unwinding for compiled binaries or JIT-emitted perf maps for V8 and JVM.

The senior choice: one primary profiler per language, one universal eBPF baseline across infrastructure. A polyglot team should normalise all profiles into one backend and read them in one tool.
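
To illustrate the Python entry in the list above, here is a development-only sketch using the standard-library cProfile and pstats modules to separate self-time from cumulative time. Because cProfile instruments every call, the numbers include profiler overhead. The `leaf` and `dispatcher` functions are hypothetical stand-ins for real application code.

```python
import cProfile
import pstats


def leaf(n):
    """CPU-bound work: this is where the self-time should land."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def dispatcher(calls, n):
    """Does almost no work itself; its cumulative time comes from leaf()."""
    result = 0
    for _ in range(calls):
        result += leaf(n)
    return result


def profile_self_vs_cum(calls=50, n=2000):
    profiler = cProfile.Profile()
    profiler.enable()
    dispatcher(calls, n)
    profiler.disable()

    stats = pstats.Stats(profiler)
    report = {}
    # stats.stats maps (file, line, name) -> (primitive calls, total calls,
    # self time, cumulative time, callers)
    for (_file, _line, name), (_cc, nc, tt, ct, _callers) in stats.stats.items():
        if name in ("leaf", "dispatcher"):
            report[name] = {"calls": nc, "self": tt, "cum": ct}
    return report
```

In the resulting report, `dispatcher` should show a large cumulative time and a small self-time, while `leaf` carries most of the self-time. In a flame graph the same relationship appears as a wide parent frame with an equally wide child on top of it.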

Why this works

async-profiler is the canonical production choice for Java, rather than samplers built on JVMTI's safepoint-bound stack-trace calls, precisely because of AsyncGetCallTrace: it can walk Java stacks at any point, not only at safepoints. A safepoint-biased profiler records each sample at the next safepoint poll rather than where the CPU actually was, so it over-counts code that sits at a poll and under-counts code without one, typically tight JIT-compiled loops and inlined methods. The bias can point you at the wrong hotspot.
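
A toy model can show how this bias distorts a profile. The sketch below is not how the JVM works internally. It assumes a hypothetical timeline in which a tight loop contains no safepoint poll, and a "biased" sampler that can only record a sample at the next segment that does have one. All names and durations are made up for illustration.

```python
from collections import Counter

# Hypothetical timeline: (function, duration in ticks, has a safepoint poll?)
TIMELINE = [
    ("hot_loop", 8, False),  # assumed: tight compiled loop with no poll inside
    ("log_call", 1, True),
    ("hot_loop", 8, False),
    ("log_call", 1, True),
]


def sample(timeline, biased):
    """Take one sample per tick and count which function gets the blame."""
    counts = Counter()
    for index, (name, ticks, has_poll) in enumerate(timeline):
        for _tick in range(ticks):
            if not biased or has_poll:
                # Unbiased sampler: blame whatever is actually running.
                counts[name] += 1
                continue
            # Biased sampler: the sample is deferred to the next segment
            # that contains a safepoint poll.
            for later_name, _later_ticks, later_poll in timeline[index + 1:]:
                if later_poll:
                    counts[later_name] += 1
                    break
    return counts


unbiased = sample(TIMELINE, biased=False)
biased = sample(TIMELINE, biased=True)
```

In this model the unbiased sampler attributes 16 of 18 samples to `hot_loop`, while the biased sampler attributes all 18 to `log_call`. The flame graph built from the biased samples would show a wide frame over a function that used about 11% of the CPU time.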

Order the steps

Order the 60-second flame graph reading workflow:

  1. Scan the top edge: find the widest leaf frames (these are the functions on CPU)
  2. Identify the shape before reading names (tall column? wide plateau? scattered spikes?)
  3. Walk down from the widest leaf to see which parent called it most often
  4. Check whether most samples converge through one path or fan out from many callers
  5. If one path: the hot path is there; if many callers: the fix is in the leaf itself
  6. Read the function names and look up their self-time vs cum-time in the profile

Quiz

A flame graph shows thin spikes scattered across the full width with no dominant leaf frame. What does this tell you about the latency problem?

Quiz

Why is async-profiler the preferred production profiler for JVM applications rather than a JVMTI-based sampler?

Recall before you leave

  1. What are the five flame graph shapes and what fix does each suggest?
  2. What is safepoint bias in JVM profilers and why does it matter for reading flame graphs?

Recap

A flame graph's x-axis is alphabetical grouping, not time; reading it as time is the most common mistake. Width is sample count (CPU time share), and the widest leaf is the hottest function. Read the shape before the names: a tall column suggests a recursion or chain fix, a wide plateau suggests a dispatcher fix, and scattered spikes suggest an off-CPU problem that a CPU profile will not show. Per-language profiler choice is not aesthetic: async-profiler avoids JVM safepoint bias, py-spy avoids Python GIL interference, and pprof is Go's built-in standard. Continuous profiling backends (Pyroscope, Parca) normalise all formats into one view across a polyglot stack.
