
Reading flame graphs: shapes, per-language profilers, and the 60-second scan

Flame graphs have five recognisable shapes that a senior engineer reads before zooming into names. Per-language profiler picks ensure representative measurements, not debug-build artefacts.


A senior engineer opens a flame graph and says “volcano pattern — the fix is not in the wide leaf, it is in the dispatch layer two levels up.” They have not read a single function name yet. Shape literacy cuts the time from profile to hypothesis from minutes to seconds.

Reading a flame graph in 60 seconds

Brendan Gregg’s flame graph format is the canonical readout.

Y-axis: call stack depth. Entry point (main, HTTP handler) at the bottom; the leaf function on CPU at the top.
X-axis: alphabetical order of sibling stacks. Position left/right tells you nothing about call order.
Width: proportional to sample count = CPU time share.
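To make the three axes concrete, here is a minimal sketch that turns folded stacks (the semicolon-separated, one-line-per-stack text that flame graph tooling commonly consumes) into frame widths. The sample data and the helper names frame_widths and children are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical samples in folded-stack form: "root;...;leaf <count>" per line.
FOLDED = """\
main;handleRequest;render 20
main;handleRequest;loadRows;decodeRows 60
main;handleRequest;parseQuery 5
main;handleRequest;loadRows;readPage 15
"""


def frame_widths(folded):
    """Return (total, inclusive, self_) sample counts keyed by stack path."""
    inclusive = Counter()  # samples in which this path is on the stack
    self_ = Counter()      # samples in which this path's last frame is on CPU
    total = 0
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, _, count = line.rpartition(" ")
        frames = tuple(stack.split(";"))
        n = int(count)
        total += n
        self_[frames] += n
        # A flame graph draws one box per path prefix, not per function name.
        for depth in range(1, len(frames) + 1):
            inclusive[frames[:depth]] += n
    return total, inclusive, self_


def children(inclusive, parent):
    """Sibling frames under `parent`, in the alphabetical order they are drawn."""
    depth = len(parent) + 1
    return sorted(p[-1] for p in inclusive if len(p) == depth and p[:-1] == parent)


TOTAL, INCLUSIVE, SELF = frame_widths(FOLDED)

# Sorting the paths walks the graph depth-first with siblings in alphabetical order.
for path in sorted(INCLUSIVE):
    share = 100 * INCLUSIVE[path] / TOTAL
    print(f"{'  ' * (len(path) - 1)}{path[-1]:<14} {share:5.1f}% of width")
```

Note that the input lists render first, yet children() places loadRows to its left: horizontal position comes from sorting, not from the order in which samples arrived.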

Reading workflow:

Scan the top of the graph for the widest leaf frames — those are the functions on CPU when sampled.
Walk down from a wide leaf to see which parent called it, and whether the path is the same across most samples or fragments across many callers.
If most samples converge on the leaf through one path: that is the hot path; fix it there.
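If the leaf is called from many parents, each contributing a thin slice, the fix is in the leaf itself, not its callers. In a standard flame graph such a leaf appears as several separate thin frames, one per caller path, so add their widths together (or use a reversed, leaf-rooted view if your tool offers one) before judging how hot it is.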
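The last two steps of this workflow can be sketched as a small check on folded-stack samples. The data, the function names, and the 0.8 convergence threshold are all assumptions made for illustration, not values from any profiler.

```python
from collections import Counter

# Hypothetical (stack, sample count) pairs, root first and leaf last.
SAMPLES = [
    ("main;handleRequest;loadRows;decodeRows", 60),
    ("main;handleRequest;render;formatCell", 8),
    ("main;exportJob;writeCsv;formatCell", 7),
    ("main;handleRequest;audit;formatCell", 6),
    ("main;handleRequest;parseQuery", 19),
]


def leaf_totals(samples):
    """Sum samples per leaf function across every path that reaches it."""
    totals = Counter()
    for stack, n in samples:
        totals[stack.split(";")[-1]] += n
    return totals


def caller_paths(samples, leaf):
    """Samples per distinct caller path ending in `leaf`."""
    paths = Counter()
    for stack, n in samples:
        frames = stack.split(";")
        if frames[-1] == leaf:
            paths[";".join(frames[:-1])] += n
    return paths


def diagnose(samples, leaf, threshold=0.8):
    """Say whether a leaf's samples converge on one path (threshold is assumed)."""
    paths = caller_paths(samples, leaf)
    top_path, top_n = paths.most_common(1)[0]
    if top_n / sum(paths.values()) >= threshold:
        return f"one hot path: {top_path}"
    return f"{len(paths)} caller paths: look at {leaf} itself"


LEAVES = leaf_totals(SAMPLES)
print(LEAVES.most_common())
print(diagnose(SAMPLES, "decodeRows"))
print(diagnose(SAMPLES, "formatCell"))
```

In this made-up profile formatCell totals 21 samples, more than parseQuery, even though none of its three frames is individually wider than parseQuery's single frame.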

The most common rookie mistake: reading horizontal position as time order. It is not. Two frames side-by-side tell you nothing about which executed first.

Axis	Meaning	Common misreading
Width (x)	Sample count, i.e. share of CPU time	Reading a wide frame as one slow call; it may equally be many short calls
Position (x)	Alphabetical ordering of siblings under each parent	Reading left-to-right as time order; a left frame does NOT run before a right frame
Height (y)	Call depth: entry at bottom, leaf at top	Reading a tall stack as slow; it only means deeper nesting

Five flame graph shapes and what they mean

An experienced eye reads a flame graph by shape in seconds, before zooming into names.

1. Tall narrow column running to the top through one path — recursion or a deep middleware chain. One fix on the leaf or near it.

2. Wide plateau at one level with dozens of thin leaves — a dispatcher fanning out to many handlers. The fix is usually in the dispatcher or the shared path above it, not in any one handler.

3. Two adjacent wide columns with no shared parent — multiple hot paths needing separate fixes. Amdahl applies to each independently.

4. “Volcano” widening from leaf to root: large cum-time at one entry point, with the work split among the frames above it. The fix depends on which level stops being real work and becomes dispatch; walk the shape down from the leaves to find it.

5. Thin spikes scattered across the width — no real hotspot. The latency problem is elsewhere (off-CPU, coordination, network). A CPU profile alone will not show it; try an off-CPU or allocation profile.
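Shape 3 leans on Amdahl's law, which bounds the overall gain from speeding up one part of a program. A minimal sketch follows; the 40% and 30% column widths are made-up numbers, and amdahl_speedup is a hypothetical helper name.

```python
import math


def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of runtime is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Two hypothetical hot columns covering 40% and 30% of samples.
DOUBLE_A = amdahl_speedup(0.40, 2)         # make the 40% column twice as fast
DOUBLE_B = amdahl_speedup(0.30, 2)         # make the 30% column twice as fast
REMOVE_A = amdahl_speedup(0.40, math.inf)  # upper bound: the 40% column vanishes

print(f"2x on the 40% column:  {DOUBLE_A:.3f}x overall")
print(f"2x on the 30% column:  {DOUBLE_B:.3f}x overall")
print(f"40% column eliminated: {REMOVE_A:.3f}x overall")
```

Each column's width caps what its fix can deliver, which is why two unrelated wide columns have to be budgeted separately.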

These shapes are the vocabulary senior engineers use to describe what a profile looks like before naming functions.

The shape tells you where the fix lives before you read a single function name — leaf, dispatcher, or off-CPU entirely.
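As a rough numerical companion to shape reading, the sketch below flags a profile with no dominant leaf. It is a crude first-pass triage only: the 10% cut-off, the function names, and both sample profiles are assumptions made for illustration, not an established rule.

```python
def hot_leaves(leaf_samples, min_share=0.10):
    """Leaves holding at least min_share of all samples, widest first."""
    total = sum(leaf_samples.values())
    hot = [(name, n / total) for name, n in leaf_samples.items() if n / total >= min_share]
    return sorted(hot, key=lambda item: item[1], reverse=True)


def triage(leaf_samples, min_share=0.10):
    """Map leaf shares to a first hypothesis (min_share is an assumed cut-off)."""
    hot = hot_leaves(leaf_samples, min_share)
    if not hot:
        return "no CPU hotspot: try an off-CPU or allocation profile"
    if len(hot) == 1:
        return f"one dominant leaf: start at {hot[0][0]}"
    names = ", ".join(name for name, _ in hot)
    return f"several hot leaves: separate fixes for {names}"


# Hypothetical profiles: 50 equally thin leaves versus one wide leaf.
SCATTERED = {f"handler{i}": 2 for i in range(50)}
FOCUSED = {"decodeRows": 60, "readPage": 15, "render": 20, "parseQuery": 5}
FOCUSED_ONE = {"decodeRows": 85, "readPage": 5, "render": 5, "parseQuery": 5}

print(triage(SCATTERED))
print(triage(FOCUSED))
print(triage(FOCUSED_ONE))
```

A check like this cannot tell a plateau from a volcano; it only separates "something is wide" from "nothing is wide", which is the one distinction that sends you to a different kind of profile.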

Per-language profiler picks

When you see a team attach the wrong profiler to a service — an instrumentation-based tool to a JIT runtime, or cProfile to a production Python service — the flame graph they get will mislead rather than guide. Choosing the right profiler per language is not a minor detail.

Every modern language has a recommended sampling profiler. Using the right one avoids debug-build artefacts and JVM stack-walk errors.

Go: built-in pprof via runtime/pprof and net/http/pprof, the industry standard. Profiles are read with go tool pprof and stored in pprof's protobuf format (profile.proto).
Java: JFR (Java Flight Recorder, bundled with OpenJDK since JDK 11) and async-profiler, the production standard because it walks Java stacks via AsyncGetCallTrace and so avoids safepoint bias.
Python: py-spy (external sampler, no app changes required) and scalene. cProfile is instrumentation-based — dev-only, not production.
Node.js: built-in --prof + clinic.js for flame graphs; 0x for production.
Rust: pprof-rs; perf with addr2line for symbol resolution.
Ruby: stackprof and rbspy.
PHP: tideways, Excimer.
C/C++: perf + perf-tools; Intel VTune for microarchitectural analysis.
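The cProfile caveat in the Python entry comes down to instrumentation versus sampling. The sketch below uses only the standard-library cProfile and pstats modules to show that an instrumenting profiler records every single call, which is why its cost grows with call volume. The workload and the call_count helper are hypothetical.

```python
import cProfile
import pstats


def tiny(x):
    """A deliberately cheap function, so profiler overhead dominates its cost."""
    return x + 1


def workload(n):
    total = 0
    for _ in range(n):
        total = tiny(total)
    return total


def call_count(profiler, func_name):
    """Total recorded calls to functions named func_name."""
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) to
    # (primitive calls, total calls, total time, cumulative time, callers).
    return sum(
        nc for (_, _, name), (_cc, nc, _tt, _ct, _callers) in stats.stats.items()
        if name == func_name
    )


profiler = cProfile.Profile()
profiler.enable()
workload(10_000)
profiler.disable()

TINY_CALLS = call_count(profiler, "tiny")
print(f"cProfile recorded {TINY_CALLS} calls to tiny()")
```

An exact call count is something a sampling profiler cannot give you; the price is a hook on every call and return, which distorts timings for small, frequently called functions.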

Continuous-profile backends (Pyroscope, Parca, Polar Signals, Datadog Continuous Profiler) ingest pprof or JFR from any of these. eBPF agents capture stacks kernel-side and work language-agnostic via DWARF unwinding for compiled binaries or JIT-emitted perf maps for V8 and JVM.
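For JIT-compiled code, the perf map is a plain text file (conventionally /tmp/perf-<pid>.map) with one line per compiled function: start address and size in hexadecimal, then the symbol name. A minimal sketch of resolving a sampled address against such a file is shown below; the addresses and symbol names are invented, and parse_perf_map and resolve are hypothetical helpers.

```python
import bisect

# Hypothetical perf map contents: "<start hex> <size hex> <symbol name>".
PERF_MAP = """\
7f3a10000000 40 LazyCompile:*decodeRows /app/rows.js:12
7f3a10000040 80 LazyCompile:~render /app/view.js:3
7f3a10000100 20 Builtin:ArrayMap
"""


def parse_perf_map(text):
    """Return a sorted list of (start, size, name) entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Symbol names may contain spaces, so split at most twice.
        start_hex, size_hex, name = line.split(" ", 2)
        entries.append((int(start_hex, 16), int(size_hex, 16), name))
    entries.sort()
    return entries


def resolve(entries, addr):
    """Return the symbol whose [start, start + size) range contains addr, else None."""
    starts = [start for start, _, _ in entries]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = entries[i]
        if addr < start + size:
            return name
    return None


ENTRIES = parse_perf_map(PERF_MAP)
print(resolve(ENTRIES, 0x7F3A10000050))  # falls inside the render range
print(resolve(ENTRIES, 0x7F3A100000D0))  # falls in a gap: unresolved
```

An unresolved address is what shows up as an unknown or hexadecimal frame in a flame graph, usually a sign that the runtime was not started with perf map output enabled.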

The senior choice: one primary profiler per language, one universal eBPF baseline across infrastructure. A polyglot team should normalise all profiles into one backend and read them in one tool.

Why this works

async-profiler is the canonical production choice for Java, rather than JVMTI-based samplers, precisely because of AsyncGetCallTrace: it can walk Java stacks at any point, not only at safepoints. A safepoint-biased profiler can only record a sample once the thread reaches a safepoint poll, so it over-counts code that contains polls and under-counts code where the JIT has removed them, typically hot counted loops and inlined methods. The bias can point you at the wrong hotspot, most visibly in tight, heavily optimised code.
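A toy simulation can show how safepoint bias moves samples. It is not a model of any real JVM: it assumes a timeline of code segments, assumes the JIT has removed the safepoint poll from the hot loop, and assumes a biased sampler attributes each tick to the next segment that contains a poll. All names and durations are made up.

```python
from collections import Counter

# Hypothetical timeline: (function, duration in ms, contains a safepoint poll?)
TIMELINE = [("hotLoop", 80, False), ("logResult", 20, True)] * 10


def sample(timeline, interval_ms, safepoint_biased):
    """Count samples per function for an unbiased or a safepoint-biased sampler."""
    segments = []
    t = 0
    for name, duration, has_poll in timeline:
        segments.append((t, t + duration, name, has_poll))
        t += duration
    end = t

    counts = Counter()
    tick = interval_ms // 2
    while tick < end:
        for i, (start, stop, _name, _has_poll) in enumerate(segments):
            if start <= tick < stop:
                j = i
                if safepoint_biased:
                    # The sample is only taken once a poll is reached.
                    while j < len(segments) and not segments[j][3]:
                        j += 1
                if j < len(segments):
                    counts[segments[j][2]] += 1
                break
        tick += interval_ms
    return counts


UNBIASED = sample(TIMELINE, 10, safepoint_biased=False)
BIASED = sample(TIMELINE, 10, safepoint_biased=True)
print("unbiased:", dict(UNBIASED))
print("biased:  ", dict(BIASED))
```

Under these assumptions the unbiased sampler reports hotLoop at 80% of samples, while the biased one reports it at zero and blames logResult for everything, which would send the reader to the wrong frame.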

Order the steps

Order the 60-second flame graph reading workflow:

1 Scan the top edge — find the widest leaf frames (these are the functions on CPU)
2 Identify the shape before reading names (tall column? wide plateau? scattered spikes?)
3 Walk down from the widest leaf to see which parent called it most often
4 Check whether most samples converge through one path or fan out from many callers
5 If one path: the hot path is there; if many callers: the fix is in the leaf itself
6 Read the function names and look up their self-time vs cum-time in the profile

Quiz

A flame graph shows thin spikes scattered across the full width with no dominant leaf frame. What does this tell you about the latency problem?

Quiz

Why is async-profiler the preferred production profiler for JVM applications rather than a JVMTI-based sampler?

Width ∝ samples (CPU time); y = stack depth (entry at bottom, leaf at top). The widest leaf — decodeRows — is the hot frame; x-position is alphabetical, not time.

Recall before you leave

01 What are the five flame graph shapes and what fix does each suggest?
02 What is safepoint bias in JVM profilers and why does it matter for reading flame graphs?

Recap

A flame graph’s x-axis is alphabetical grouping, not time; reading it as time is the most common mistake. Width is sample count (CPU time share), and the widest leaf is the hottest function. Read the shape before the names: a tall column points to recursion or a deep call chain, a wide plateau points to the dispatcher, and scattered spikes point to an off-CPU problem that a CPU profile cannot show. Per-language profiler choice is not aesthetic: async-profiler avoids JVM safepoint bias, py-spy samples a Python process from outside without modifying or instrumenting it, and pprof is Go’s built-in standard. Continuous profiling backends (Pyroscope, Parca) normalise all formats into one view across a polyglot stack. When you open a flame graph, the shape tells you in seconds whether the fix is in the leaf, its dispatcher, or somewhere off-CPU entirely.
