Profile types: CPU, memory, off-CPU, mutex — which one to reach for

CPU profiling only sees code that is running. Off-CPU, block, and mutex profiles cover the time a request spends waiting, which can be 96% of it; a CPU profile alone cannot diagnose a service that is slow because it waits.

A span takes 500 ms. You open the CPU profile — the service used 20 ms of CPU. Where did the other 480 ms go? CPU profiling is blind to waiting. If you stop there, you will never find the answer.

CPU profiles: what they see and what they miss

CPU profiling samples the call stack while the thread is on the CPU — running instructions. A request that spends 20 ms computing and 480 ms waiting on a database query will show 20 ms in the CPU profile and leave 480 ms invisible.

The CPU profile sees only the 20 ms of compute (4%); the 480 ms of off-CPU waiting (96%) is invisible to it. That blind spot is why off-CPU, block, and mutex profiles exist.

This is the most important constraint in profiling: CPU profiles only see functions that are consuming the processor. Anything the program waits for — I/O, network, locks, scheduling — is off-CPU and invisible to a CPU profiler.

Memory and allocation profiles

Heap profilers sample allocations, not CPU. Go’s heap profile samples one allocation per ~512 KiB (Poisson distributed) and records the stack at each sample. The result is a flame graph where width is allocated bytes, not CPU time. This finds memory hotspots: a function allocating 100 MB/s shows up wide.
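To get a feel for what the sampling interval means in practice, here is a small back-of-the-envelope sketch. `expectedSamples` is a hypothetical helper; the only runtime API it relies on is the `runtime.MemProfileRate` variable, which holds the average number of bytes between samples.

```go
package main

import (
	"fmt"
	"runtime"
)

// expectedSamples estimates how many heap-profile samples a given allocation
// volume produces when the profiler samples on average once per rate bytes.
func expectedSamples(allocatedBytes, rate int64) float64 {
	if rate <= 0 {
		return 0
	}
	return float64(allocatedBytes) / float64(rate)
}

func main() {
	// 512 KiB unless the program or its environment changed it.
	rate := int64(runtime.MemProfileRate)
	// Hypothetical function allocating 100 MiB per second.
	perSecond := int64(100 << 20)
	fmt.Printf("rate=%d bytes, ~%.0f samples/s\n",
		rate, expectedSamples(perSecond, rate))
}
```

At the default interval, 100 MiB of allocation yields on the order of 200 samples, which is enough to make a hot allocator stand out while keeping the overhead small.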

Memory leak detection with heap profiles:

1. Take a baseline heap profile.
2. Wait 30-60 minutes.
3. Take another heap profile.
4. Diff them (go tool pprof -base baseline.heap current.heap).
5. Functions whose in-use memory grew between the two profiles are the leak suspects.

Together these steps give you a before-and-after snapshot of live memory; without the baseline (step 1) you have no reference and the growing allocation is invisible in a single profile.

Allocation profiles capture short-lived allocations that GC reclaims; heap profiles snapshot live memory. Both are useful. JVM equivalents: async-profiler with -e alloc, JFR allocation events. Python: tracemalloc, memray.

Off-CPU profiles

Brendan Gregg’s 2013 work on off-CPU analysis identified the gap: CPU profiles miss everything a process waits for. eBPF implementations hook into the kernel scheduler’s switch events. When the scheduler removes a thread from the CPU (it blocks on I/O, sleeps, waits on a lock), the kernel captures the thread’s stack. That stack represents where the wait started. When the thread comes back, elapsed time is attributed to that stack.

The off-CPU flame graph shows wait time the same way a CPU flame graph shows running time. For an I/O-bound service, the off-CPU profile is the only profile that explains anything — the CPU profile just says “service was idle.”

Block and mutex profiles

Block profile (Go: runtime.SetBlockProfileRate): time spent waiting on synchronisation primitives — channels, condition variables, WaitGroups. More focused than off-CPU because it targets language-level synchronisation.

Mutex profile (runtime.SetMutexProfileFraction): lock contention specifically. Reports which code held a lock while others were waiting for it, attributed at unlock time.

CPU profile: on-CPU time. Where do cycles go?
Wall-clock profile: elapsed time, including waiting. Where does the clock go?
Heap / allocation profile: bytes allocated. Where does memory go?
Off-CPU / block / mutex profile: lock and I/O wait. Where does the waiting go?

Each profile answers a different 'where does X go?' question by attributing a different resource: CPU cycles, elapsed clock, allocated bytes, or time spent blocked off-CPU.

Profile type	Width measures	When to reach for it
CPU	CPU time consumed	CPU usage is high, slow response
Heap / Allocation	Bytes allocated	GC pressure, OOM, memory growth
Off-CPU	Time spent waiting (all causes)	Slow request but low CPU usage
Block	Time on sync primitives	Go goroutine contention suspected
Mutex	Lock contention time	High lock contention suspected

Choosing the right profile from the CPU/wall-time ratio

The diagnostic shortcut: look at CPU time vs wall-clock time for the slow request.

CPU/wall ≈ 100%: computation bottleneck — CPU profile.
CPU/wall < 30%: the bottleneck is off-CPU — off-CPU / block profile or trace spans.
Memory growing steadily: heap / allocation profile.
Threads contending on a lock: mutex profile.

A GC-thrashing Java service is a classic allocation-profile case. The symptom is a high heap allocation rate with frequent old-gen GC. In the allocation flame graph, the widest frame is the function allocating at the highest rate, often string concatenation in logging code that does not use parameterized formatting.

Quiz

A request spends 50 ms on CPU and 450 ms waiting on a DB query. Which profile type would show you the DB wait?

Quiz

A Java service is OOMing on certain endpoints. Its CPU profile looks normal. Which profile type should you reach for?

Recall before you leave

01 Why does Go's heap profiler sample one allocation per ~512 KiB instead of recording every allocation?
02 Explain why a CPU flame graph is not enough to diagnose an I/O-bound service.
03 What is the procedure to detect a memory leak with heap profiles?

Recap

Four profile types cover the full request lifecycle: CPU (what is running), heap or allocation (what is being allocated), off-CPU (what is waiting on I/O or scheduling), and block or mutex (what is waiting on locks). CPU profiling only sees code that is actively on the processor: a request waiting 480 ms on a DB query shows only its 20 ms of compute in a CPU profile. The CPU/wall-time ratio is the diagnostic signal: under 30% means the bottleneck is off-CPU. Go’s heap profiler samples on average once per 512 KiB allocated, which keeps always-on memory profiling affordable. Combining all four types gives a complete picture; relying on the CPU profile alone in an I/O-bound service points you at the wrong bottleneck. When you see a slow request with low CPU usage, the reflex is to check the CPU/wall-time ratio first, then reach for the off-CPU or block profile rather than adding more instrumentation.
