
Observability

Flame graphs: reading the picture that shows where time goes

Crux: A profiler samples your call stack about 100 times per second, and a flame graph draws the result. The widest frame at any level is where the most CPU time goes, and you can spot it at a glance.

A trace says “1.2 seconds in inventory.” You have logs, metrics, dashboards — but none of them tell you which function inside inventory ate the time. Profiling answers that question in 60 seconds without a debugger or a guess.

What a profiler does

A profiler interrupts your program 100 times per second and captures the current call stack — the chain of functions from main down to whatever is running right now. After 30 seconds it has 3,000 snapshots. The function that appears most often across those snapshots is the one consuming the most CPU.
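The counting step can be sketched in a few lines. This is a minimal illustration, not how any real profiler is implemented: the Sample type, the leafCounts and hottest helpers, and the four hand-written stacks are all hypothetical stand-ins for the roughly 3,000 snapshots a 30-second run would collect.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one captured call stack, ordered from entry point to leaf.
// (Hypothetical type for illustration.)
type Sample []string

// leafCounts tallies how often each function is the leaf, i.e. the
// function actually on the CPU when the sample was taken.
func leafCounts(samples []Sample) map[string]int {
	counts := make(map[string]int)
	for _, s := range samples {
		if len(s) == 0 {
			continue
		}
		counts[s[len(s)-1]]++
	}
	return counts
}

// hottest returns the function with the highest count; ties break alphabetically.
func hottest(counts map[string]int) string {
	names := make([]string, 0, len(counts))
	for n := range counts {
		names = append(names, n)
	}
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	if len(names) == 0 {
		return ""
	}
	return names[0]
}

func main() {
	// Hand-written samples; a real profiler collects these by interrupting the program.
	samples := []Sample{
		{"main", "handle", "serializeResponse", "json.Marshal"},
		{"main", "handle", "serializeResponse", "json.Marshal"},
		{"main", "handle", "queryDB"},
		{"main", "handle", "serializeResponse", "json.Marshal"},
	}
	counts := leafCounts(samples)
	name := hottest(counts)
	fmt.Printf("%s: %d of %d samples\n", name, counts[name], len(samples))
}
```

The share of samples is the estimate of CPU share: a function that is the leaf in three of four samples is on the CPU roughly three quarters of the time.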

Flame graphs visualise this: stacks are sorted alphabetically along the x-axis, width is proportional to sample count, depth on the y-axis goes from the entry point at the bottom to the leaf function at the top. The widest frame at any level is the busiest path.
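To make the layout concrete, here is a rough sketch of the data a flame graph is drawn from: identical stacks are collapsed into one "root;child;leaf" key with a count, and the keys are sorted alphabetically to give the left-to-right order. The fold and sortedKeys helpers and the sample stacks are assumptions made for illustration; real tools build a merged tree rather than printing lines.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fold collapses identical stacks into "root;child;leaf" keys with a sample count.
func fold(samples [][]string) map[string]int {
	folded := make(map[string]int)
	for _, s := range samples {
		folded[strings.Join(s, ";")]++
	}
	return folded
}

// sortedKeys returns the folded stacks in alphabetical order, which is
// the left-to-right order along the x-axis.
func sortedKeys(folded map[string]int) []string {
	keys := make([]string, 0, len(folded))
	for k := range folded {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// queryDB was sampled first, yet it will be drawn to the right of encode.
	samples := [][]string{
		{"main", "handle", "queryDB"},
		{"main", "handle", "encode"},
		{"main", "handle", "encode"},
		{"main", "handle", "encode"},
	}
	folded := fold(samples)
	for _, k := range sortedKeys(folded) {
		share := 100 * folded[k] / len(samples)
		fmt.Printf("%-22s %d samples (%d%% of width)\n", k, folded[k], share)
	}
}
```

Note that the order in which samples were captured is discarded entirely; only the counts and the alphabetical sort survive, which is why the x-axis cannot be read as a timeline.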

The stadium metaphor

Imagine a stadium with 100,000 people doing different things. A helicopter flies overhead 100 times per second and photographs who is doing what. After a minute you have 6,000 photos. Count which activity appears most across the photos — that is where the crowd’s “CPU time” goes.

The flame graph is the bar chart of those counts, with callers stacked beneath callees. Wide bars = popular activities. The helicopter is the profiler; the photos are stack samples; the chart is the flame graph.

Reading a flame graph in practice

Bea is on-call. Inventory service, p99 = 1.5 s. She opens the continuous-profile dashboard, filters by trace-id, and sees a flame graph with one massive 1.1-second-wide block: json.Marshal inside serializeResponse. The fix is obvious: cache the marshalled response or pre-encode at write-time. Without profiling, the team would have guessed — DB? Cache? Network? With the flame graph there is no guessing.
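One way the "pre-encode at write-time" fix could look is sketched below. The Item and Store types and their fields are hypothetical, since the lesson does not show the inventory service's code; the point is only that json.Marshal moves from the hot read path to the much rarer write path.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Item is a hypothetical inventory record.
type Item struct {
	SKU   string `json:"sku"`
	Stock int    `json:"stock"`
}

// Store keeps pre-encoded JSON so that reads never call json.Marshal.
type Store struct {
	mu      sync.RWMutex
	encoded map[string][]byte
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{encoded: make(map[string][]byte)}
}

// Put encodes the item once, at write time.
func (s *Store) Put(it Item) error {
	b, err := json.Marshal(it)
	if err != nil {
		return err
	}
	s.mu.Lock()
	s.encoded[it.SKU] = b
	s.mu.Unlock()
	return nil
}

// Get returns the cached bytes; the read path does no serialisation.
func (s *Store) Get(sku string) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	b, ok := s.encoded[sku]
	return b, ok
}

func main() {
	s := NewStore()
	if err := s.Put(Item{SKU: "A-1", Stock: 7}); err != nil {
		panic(err)
	}
	b, _ := s.Get("A-1")
	fmt.Println(string(b))
}
```

The trade-off is memory and staleness: the encoded bytes must be refreshed on every write, so this suits data that is read far more often than it changes.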

Axis | Meaning | Common misreading
Width (x) | Sample count, i.e. share of CPU time | Reading left-to-right as "time order"; it is NOT
Position (x) | Alphabetical grouping by parent | The left frame does NOT run before the right frame
Height (y) | Call depth: main at the bottom, leaf at the top | A taller stack means deeper nesting, not slower code

How to capture a CPU profile with pprof

A Node API has a p99 jump. Tracing finds a slow span. The continuous-profile dashboard, filtered by trace-id, shows a flame graph dominated by a regex compile in a handler. A library upgrade introduced an O(n²) regex; the fix is to precompile it outside the handler. Time to detection: 60 seconds.
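The same "compile once, outside the handler" fix can be sketched in Go, the language used for the rest of this lesson. The pattern and the validSlow and validFast helpers are hypothetical; the original incident was in a Node service whose code is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// skuPattern is compiled once, at package initialisation.
// The pattern itself is a made-up example.
var skuPattern = regexp.MustCompile(`^[A-Z]+-\d+$`)

// validSlow recompiles the pattern on every call. In a profile this shows
// up as a wide regexp compile frame under the handler.
func validSlow(sku string) bool {
	re := regexp.MustCompile(`^[A-Z]+-\d+$`)
	return re.MatchString(sku)
}

// validFast reuses the precompiled pattern, so each call only pays for matching.
func validFast(sku string) bool {
	return skuPattern.MatchString(sku)
}

func main() {
	fmt.Println(validSlow("AB-42"), validFast("AB-42"), validFast("nope"))
}
```

Both functions return the same answers; only the amount of work per call differs, which is exactly the kind of difference a flame graph makes visible.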

For Go services, pprof is built-in:

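// 1. Expose pprof handlers and start a debug server.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve the debug endpoints on localhost only, in the background.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the real service here; select{} only keeps this sketch alive.
	select {}
}

// 2. Capture a 30-second CPU profile under load:
//    go tool pprof -http=:9090 \
//      "http://localhost:6060/debug/pprof/profile?seconds=30"
//
// 3. The pprof web UI opens at :9090; choose the flame graph view.
//    The widest leaf frames mark the hot path.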

You must run the profile under representative load — on an idle system, almost everything in the samples is the runtime’s idle loop, useless for finding hot paths.
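For a program with no HTTP server, the standard library's runtime/pprof package can write the same kind of CPU profile straight to a file. The sketch below assumes a hypothetical busy function standing in for the suspicious workload, and the file name cpu.pprof is arbitrary.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

// busy burns CPU for roughly d so the profiler has something to sample.
// It is a stand-in for the real workload.
func busy(d time.Duration) int {
	deadline := time.Now().Add(d)
	n := 0
	for time.Now().Before(deadline) {
		n++
	}
	return n
}

// profileTo runs work while a CPU profile is recorded into the file at path.
func profileTo(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	work()
	pprof.StopCPUProfile() // stops sampling and flushes the profile into f
	return nil
}

func main() {
	err := profileTo("cpu.pprof", func() { busy(2 * time.Second) })
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote cpu.pprof; open with: go tool pprof -http=:9090 cpu.pprof")
}
```

The resulting file opens in the same pprof web UI as a profile fetched over HTTP, so the reading technique is identical.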

Quiz

What does the WIDTH of a frame on a flame graph represent?

Quiz

A continuous profiler runs in production at 2-5% CPU overhead. Why doesn't it ruin performance?

Order the steps

Order the steps of CPU profiling a slow function with pprof:

  1. Identify the suspicious workload (slow span, high CPU, slow endpoint)
  2. Start profiling (pprof.StartCPUProfile or /debug/pprof/profile endpoint)
  3. Run the suspicious workload for 30 seconds under load
  4. Stop profiling and save the profile file
  5. Open the profile in a flame graph viewer (go tool pprof, speedscope, Pyroscope)
  6. Find the widest frame at the leaf level; that is the hot function
  7. Walk up the parents to see who is calling the hot path, then apply the fix

Complete the analogy

Fill in the blank: a flame graph's vertical axis shows the call _______ — main is at the bottom, the function on the CPU is at the top.

Recall before you leave
  1. In one paragraph: why is a flame graph almost always faster than a debugger or print statements for finding the slow part of a program?
  2. What is the most common misreading of a flame graph, and what does the x-axis actually mean?
  3. Why must you run the profile under representative load, not on a quiet system?
Recap

A profiler interrupts the program ~100 times per second, captures the call stack, and after many samples draws a flame graph where width equals CPU share. The widest frame at any level is the hottest code path — no guessing required. The x-axis is alphabetical grouping of stacks, not time; misreading it as time order is the single most common rookie mistake. You must profile under representative load; an idle system only shows the runtime’s idle loop. With a continuous profiler always running at 2-5% overhead, the flame graph for any SLO-burning incident is already saved the moment the pager fires.

