Observability OBS · 07 · 01

Flame graphs: reading the picture that shows where time goes

A profiler samples your call stack 100 times per second and draws a flame graph — the widest frame at any level is the function eating the most CPU, discoverable in one glance.

OBS Junior ◷ 12 min


A trace says “1.2 seconds in inventory.” You have logs, metrics, dashboards — but none of them tell you which function inside inventory ate the time. Profiling answers that question in 60 seconds without a debugger or a guess. In the next ten minutes you will know exactly how to read the picture that shows where your time goes.

What a profiler does

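A sampling profiler interrupts your program about 100 times per second and captures the current call stack: the chain of functions from main down to whatever is running right now. After 30 seconds it has roughly 3,000 snapshots. The function that most often sits at the leaf of those stacks is the one consuming the most CPU itself; a function that merely appears somewhere in many stacks, as main does in all of them, is only on the path to that work.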
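The counting step can be sketched in a few lines of Go. This is a minimal illustration, not how a real profiler stores its data: the `sample` type, the function names in the stacks, and the five hand-written samples are all hypothetical. It counts only the leaf of each stack, because an entry point such as main appears in every sample without doing the work itself.

```go
package main

import "fmt"

// sample is one captured call stack, ordered from entry point to leaf.
type sample []string

// countLeaves tallies how many samples ended in each function, i.e. how
// often that function was the one actually running on the CPU.
func countLeaves(samples []sample) map[string]int {
	counts := make(map[string]int)
	for _, s := range samples {
		if len(s) == 0 {
			continue // an empty stack carries no information
		}
		counts[s[len(s)-1]]++
	}
	return counts
}

// hottest returns the most-sampled leaf function and its share of all
// samples. Ties are broken alphabetically so the result is deterministic.
func hottest(samples []sample) (string, float64) {
	if len(samples) == 0 {
		return "", 0
	}
	best, bestN := "", 0
	for fn, n := range countLeaves(samples) {
		if n > bestN || (n == bestN && fn < best) {
			best, bestN = fn, n
		}
	}
	return best, float64(bestN) / float64(len(samples))
}

func main() {
	// Hypothetical samples; a real profiler would collect thousands.
	samples := []sample{
		{"main", "handleRequest", "serializeResponse", "json.Marshal"},
		{"main", "handleRequest", "serializeResponse", "json.Marshal"},
		{"main", "handleRequest", "queryDB"},
		{"main", "handleRequest", "serializeResponse", "json.Marshal"},
		{"main", "gc"},
	}
	fn, share := hottest(samples)
	fmt.Printf("%s: %.0f%% of samples\n", fn, share*100)
}
```

With these five samples, json.Marshal is the leaf in three of them, so it is reported as the hottest function.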

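Sampling is statistical, not exhaustive: ~3,000 snapshots in 30 s are enough that the most-sampled function is, with high confidence, the real CPU hog. Because the profiler only takes a snapshot at each tick instead of instrumenting every call, its overhead stays around 2-5%, which is why it can run always-on.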
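A rough way to see why ~3,000 samples are enough is the standard error of a sampled proportion, sqrt(p(1-p)/n). The sketch below assumes the samples are independent, which real stack samples only approximate, so treat the numbers as an order-of-magnitude guide. The `stdErr` helper is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// stdErr returns the standard error of a proportion p estimated from n
// samples, under the simplifying assumption that samples are independent.
func stdErr(p float64, n int) float64 {
	return math.Sqrt(p * (1 - p) / float64(n))
}

func main() {
	const n = 3000 // 100 samples per second for 30 seconds
	for _, p := range []float64{0.05, 0.30, 0.70} {
		fmt.Printf("true share %.0f%%: about +/- %.2f percentage points (1 std err)\n",
			p*100, stdErr(p, n)*100)
	}
}
```

Under that assumption, a function with a true 30% share is measured to within less than one percentage point, so a frame that dominates the graph is very unlikely to be a sampling accident.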

Flame graphs visualise this: stacks are sorted alphabetically along the x-axis, width is proportional to sample count, depth on the y-axis goes from the entry point at the bottom to the leaf function at the top. The widest frame at any level is the busiest path.

Each row is one call-stack level (main at the bottom, the on-CPU leaf at the top). A frame's width is the share of samples spent in it — never less than the sum of its children. json.Marshal is the widest leaf, so it is the hot path; gc is a thin sliver.
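The layout rules above (merge identical stacks, sort siblings alphabetically, width equals sample count) can be sketched as a small tree builder. This is an illustrative model under assumed names: the `frame` type, `build`, `render`, and the sample stacks are hypothetical and are not the data structures of any particular profiler.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// frame is one box in the flame graph.
type frame struct {
	name     string
	width    int // number of samples whose stack passes through this frame
	children map[string]*frame
}

func newFrame(name string) *frame {
	return &frame{name: name, children: make(map[string]*frame)}
}

// build merges stacks (entry point first, leaf last) into a frame tree.
func build(stacks [][]string) *frame {
	root := newFrame("all")
	for _, stack := range stacks {
		root.width++
		node := root
		for _, fn := range stack {
			child, ok := node.children[fn]
			if !ok {
				child = newFrame(fn)
				node.children[fn] = child
			}
			child.width++
			node = child
		}
	}
	return root
}

// render returns one line per frame, indented by depth, with siblings in
// alphabetical order: the left-to-right order a flame graph draws them in.
func render(f *frame, depth int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s%s %d\n", strings.Repeat("  ", depth), f.name, f.width)
	names := make([]string, 0, len(f.children))
	for name := range f.children {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		b.WriteString(render(f.children[name], depth+1))
	}
	return b.String()
}

func main() {
	// Hypothetical samples, listed in the order they were captured.
	stacks := [][]string{
		{"main", "serve", "serializeResponse", "json.Marshal"},
		{"main", "gc"},
		{"main", "serve", "serializeResponse", "json.Marshal"},
		{"main", "serve", "queryDB"},
		{"main", "serve", "serializeResponse"},
	}
	fmt.Print(render(build(stacks), 0))
}
```

In the rendered tree, gc is listed before serve even though its sample was captured second, which is the alphabetical ordering at work. serializeResponse has width 3 while its only child json.Marshal has width 2: the remaining sample was taken while serializeResponse itself was on the CPU, which is why a parent is never narrower than the sum of its children.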

The stadium metaphor

Imagine a stadium with 100,000 people doing different things. A helicopter flies overhead 100 times per second and photographs who is doing what. After a minute you have 6,000 photos. Count which activity appears most across the photos — that is where the crowd’s “CPU time” goes.

The flame graph is the bar chart of those counts, with callers stacked beneath callees. Wide bars = popular activities. The helicopter is the profiler; the photos are stack samples; the chart is the flame graph.

Reading a flame graph in practice

Bea is on-call. Inventory service, p99 = 1.5 s. She opens the continuous-profile dashboard, filters by trace-id, and sees a flame graph dominated by one massive block whose width corresponds to about 1.1 seconds: json.Marshal inside serializeResponse. The fix is obvious: cache the marshalled response or pre-encode at write-time. Without profiling, the team would have guessed: DB? Cache? Network? With the flame graph there is no guessing.
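One possible shape for the "pre-encode at write-time" fix is sketched below. Everything in it is an assumption made for illustration: the `Item` type, the `encodedCache` type, and the `marshals` counter are hypothetical, and the real inventory service may need invalidation or size limits that this sketch leaves out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Item is a hypothetical inventory record.
type Item struct {
	SKU   string `json:"sku"`
	Stock int    `json:"stock"`
}

// encodedCache stores the JSON bytes for each item, encoded once at write time.
type encodedCache struct {
	mu       sync.RWMutex
	encoded  map[string][]byte
	marshals int // counts json.Marshal calls, only to make the effect visible
}

func newEncodedCache() *encodedCache {
	return &encodedCache{encoded: make(map[string][]byte)}
}

// Put encodes the item once and replaces any earlier encoding for its SKU.
func (c *encodedCache) Put(it Item) error {
	b, err := json.Marshal(it)
	if err != nil {
		return err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.encoded[it.SKU] = b
	c.marshals++
	return nil
}

// Get returns the pre-encoded bytes; the read path never calls json.Marshal.
// Callers must treat the returned slice as read-only.
func (c *encodedCache) Get(sku string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	b, ok := c.encoded[sku]
	return b, ok
}

func main() {
	cache := newEncodedCache()
	if err := cache.Put(Item{SKU: "A-1", Stock: 7}); err != nil {
		panic(err)
	}
	for i := 0; i < 3; i++ { // three reads, still only one marshal
		b, _ := cache.Get("A-1")
		fmt.Println(string(b))
	}
	fmt.Println("marshal calls:", cache.marshals)
}
```

The design choice is to move the encoding cost from every read to every write, which pays off when reads far outnumber writes. A follow-up profile should show the json.Marshal frame shrinking on the request path.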

Axis	Meaning	Common misreading
Width (x)	Sample count, i.e. share of CPU time	Reading the x-axis as elapsed time; width is a share of samples, NOT a timeline
Position (x)	Alphabetical grouping under each parent	Reading left-to-right as time order; a frame on the left does NOT run before one on the right
Height (y)	Call depth: main at bottom, leaf at top	Reading a taller stack as slower; it only means deeper nesting

How to capture a CPU profile with pprof

A Node API has a p99 jump. Tracing finds a slow span. The continuous-profile dashboard, filtered by trace-id, shows a flame graph dominated by a regex compile in a handler. The cause: a library upgrade introduced an O(n²) regex that is compiled inside the handler on every request. The fix is to precompile it once, outside the handler. Time to detection: 60 seconds.
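The case above is a Node service, but the "precompile outside the handler" fix looks the same in most languages. Here is a minimal sketch in Go, to match the pprof example that follows. The pattern and the function names are hypothetical. Note one difference: Go's regexp package guarantees linear-time matching, so only the per-call compile cost carries over, not the O(n²) matching behaviour.

```go
package main

import (
	"fmt"
	"regexp"
)

// skuPattern is a hypothetical pattern, used only for illustration.
const skuPattern = `^[A-Z]{2}-\d{4}$`

// skuRe is compiled once, when the package is initialised.
var skuRe = regexp.MustCompile(skuPattern)

// validSlow recompiles the pattern on every call. In a profile this shows
// up as a wide regexp compile frame under the handler.
func validSlow(sku string) bool {
	re := regexp.MustCompile(skuPattern)
	return re.MatchString(sku)
}

// validFast reuses the precompiled pattern; only matching remains per call.
func validFast(sku string) bool {
	return skuRe.MatchString(sku)
}

func main() {
	for _, sku := range []string{"AB-1234", "ab-1234", "AB-12"} {
		fmt.Println(sku, validSlow(sku), validFast(sku))
	}
}
```

Both functions return the same answers; the difference is visible only in a profile, where the compile frame disappears from the request path.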

For Go services, pprof is built-in:

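// 1. Expose pprof handlers: the blank import registers the
//    /debug/pprof/* routes on http.DefaultServeMux.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	// Start the debug server on localhost only, so the profiling endpoints
	// are not exposed publicly. A nil handler means http.DefaultServeMux.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // placeholder: run your real service here instead of blocking
}

// 2. Capture a 30-second CPU profile under load (quote the URL so the
//    shell does not interpret the "?"):
//    go tool pprof -http=:9090 \
//      "http://localhost:6060/debug/pprof/profile?seconds=30"
//
// 3. The pprof web UI opens at :9090; choose View > Flame Graph.
//    Widest leaf frame = hot path.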

You must run the profile under representative load — on an idle system, almost everything in the samples is the runtime’s idle loop, useless for finding hot paths.

Quiz

What does the WIDTH of a frame on a flame graph represent?

Quiz

A continuous profiler runs in production at 2-5% CPU overhead. Why doesn't it ruin performance?

Order the steps

Order the steps of CPU profiling a slow function with pprof:

1 Identify the suspicious workload (slow span, high CPU, slow endpoint)
2 Start profiling (pprof.StartCPUProfile or /debug/pprof/profile endpoint)
3 Run the suspicious workload for 30 seconds under load
4 Stop profiling and save the profile file
5 Open the profile in a flame graph viewer (go tool pprof, speedscope, Pyroscope)
6 Find the widest frame at the leaf level — that is the hot function
7 Walk up the parents to see who is calling the hot path, then apply the fix

Complete the analogy

Fill in the blank: a flame graph's vertical axis shows the call _______ — main is at the bottom, the function on the CPU is at the top.

Recall before you leave

01 In one paragraph: why is a flame graph almost always faster than a debugger or print statements for finding the slow part of a program?
02 What is the most common misreading of a flame graph and what does the x-axis actually mean?
03 Why must you run the profile under representative load, not on a quiet system?

Recap

A profiler interrupts the program ~100 times per second, captures the call stack, and after many samples draws a flame graph where width equals CPU share. The widest frame at any level is the hottest code path — no guessing required. The x-axis is alphabetical grouping of stacks, not time; misreading it as time order is the single most common rookie mistake. You must profile under representative load; an idle system only shows the runtime’s idle loop. With a continuous profiler always running at 2-5% overhead, the flame graph for any SLO-burning incident is already saved the moment the pager fires. Now when you see a wide frame dominating a flame graph, you know the exact question to ask: who calls it, and can its work be reduced or cached?

