Performance PERF · 02 · 02

Five shapes of hotspot: CPU, alloc, cache, lock, syscall

Each of the five hotspot categories has a tell-tale signature in the profile and a matching family of fixes. Picking the wrong fix family wastes most of the engineering effort.

PERF Middle ◷ 18 min

Two wide leaves on a flame graph look identical at first glance. One needs a better algorithm. The other needs buffer reuse. Applying a better algorithm to an allocation-bound path moves the metric by 1.05x instead of the predicted 3x: two hours of work for five percent. The category determines the toolbox.

The five categories

By the end of this lesson you will be able to classify any hot leaf in under five minutes and select the fix family that actually moves the metric.

A wide leaf fits one of five categories. Reading the profile’s second layer — not just “which function” but “what is the function doing” — gives the classification.

1. CPU-bound algorithmic

The function executes a lot of instructions. The CPU is running the algorithm.

Signature: large self-time, narrow children, high instructions-per-second, IPC in the range 2–4. In a CPU flame graph the leaf occupies real width with no GC or kernel frames nearby.

Fix family: better algorithm, vectorisation (SIMD), inlining hints, hot-path specialisation for the common case.
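As one illustration of the "better algorithm" entry in this fix family, the sketch below replaces a linear membership scan with a binary search. The function names and the `ids` table are hypothetical, and the sketch assumes the slice is already sorted, which may not hold in a real hot path.

```go
package main

import (
	"fmt"
	"sort"
)

// containsLinear scans every element: O(n) comparisons per lookup.
func containsLinear(xs []int, target int) bool {
	for _, x := range xs {
		if x == target {
			return true
		}
	}
	return false
}

// containsSorted assumes xs is sorted ascending (an assumption of this sketch)
// and binary-searches it: O(log n) comparisons per lookup, no allocation.
func containsSorted(xs []int, target int) bool {
	i := sort.SearchInts(xs, target)
	return i < len(xs) && xs[i] == target
}

func main() {
	ids := []int{2, 3, 5, 7, 11, 13} // hypothetical sorted lookup table
	fmt.Println(containsLinear(ids, 7), containsSorted(ids, 7))
	fmt.Println(containsLinear(ids, 8), containsSorted(ids, 8))
}
```

Both functions return the same answers; only the instruction count per lookup changes, which is the kind of win that matters when the leaf is genuinely CPU-bound.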

2. Allocation-bound

The function (or its caller) allocates so much that garbage collection dominates wall-time.

Signature: runtime.scanobject, gc, mallocgc, or malloc appears wide near the hot leaf. The CPU profile blames GC machinery, not application logic. Switch to an allocation profile to name the application-side allocator.
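Switching to an allocation profile can be sketched in Go with the standard `runtime/pprof` package. The workload function `allocateALot` and the helper `allocProfileText` are hypothetical names for illustration; only `pprof.Lookup("allocs")` and `WriteTo` come from the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// sink keeps allocations reachable so the compiler cannot optimise them away.
var sink [][]byte

// allocateALot is a hypothetical stand-in for an allocation-heavy hot path.
func allocateALot(n int) {
	for i := 0; i < n; i++ {
		sink = append(sink, make([]byte, 64*1024))
	}
}

// allocProfileText returns the runtime's "allocs" profile in text form
// (debug=1). Allocation sampling is controlled by runtime.MemProfileRate,
// so a small or short workload may record few samples.
func allocProfileText() (string, error) {
	p := pprof.Lookup("allocs")
	if p == nil {
		return "", fmt.Errorf("allocs profile not available")
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	allocateALot(1000)
	text, err := allocProfileText()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("captured %d bytes of allocation profile\n", len(text))
}
```

In a long-running service the same profile is commonly exposed through `net/http/pprof` and inspected with `go tool pprof`; the in-process dump above is only the shortest self-contained form.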

Fix family: object pooling, buffer reuse (sync.Pool), in-place mutation, struct-of-arrays, pre-size containers to avoid repeated growth.
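Buffer reuse with `sync.Pool` might look like the following minimal sketch. The `renderNaive` and `renderPooled` functions and the greeting payload are hypothetical; a real path would have larger buffers and would need to confirm in the allocation profile that pooling actually reduces allocations.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers; New runs only when the pool is empty.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// renderNaive uses a fresh buffer on every call.
func renderNaive(name string) string {
	var buf bytes.Buffer
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}

// renderPooled borrows a buffer, resets it, and returns it to the pool.
// buf.String() copies the bytes, so the result stays valid after Put.
func renderPooled(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}

func main() {
	fmt.Println(renderNaive("ada"))
	fmt.Println(renderPooled("ada"))
	fmt.Println(renderPooled("bob")) // reuses a buffer; Reset clears old content
}
```

The pooled version trades a little bookkeeping for fewer short-lived buffers, which is the point of this fix family: less garbage for the collector to scan.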

3. Cache-bound

The function touches memory in a pattern the hardware prefetcher cannot predict. The CPU stalls waiting for data from RAM instead of L1/L2.

Signature: low IPC (<1), high cache-miss rate (15%+), low instructions-per-second despite a wide CPU frame. Hardware counters confirm the stall type (L3 miss, DRAM stall).

Fix family: data-layout change (contiguous arrays instead of pointer-chased linked lists, struct-of-arrays instead of array-of-structs), iteration-order change to improve spatial locality, prefetch hints.
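The array-of-structs versus struct-of-arrays change can be sketched as below. The particle types and the `makeBoth` helper are hypothetical; the sketch only shows the layout difference and does not measure cache behaviour, which would need hardware counters on the real workload.

```go
package main

import "fmt"

// ParticleAoS is array-of-structs: every element is 64 bytes, but the loop
// below reads only the 8-byte X field from each one.
type ParticleAoS struct {
	X, Y, Z float64
	Mass    float64
	Label   [32]byte // cold data the hot loop never touches
}

// sumXAoS walks the slice with a 64-byte stride between the values it needs.
func sumXAoS(ps []ParticleAoS) float64 {
	var sum float64
	for i := range ps {
		sum += ps[i].X
	}
	return sum
}

// ParticlesSoA is struct-of-arrays: all X values sit next to each other.
type ParticlesSoA struct {
	X, Y, Z, Mass []float64
}

// sumXSoA reads a contiguous []float64, so each cache line fetched is
// filled entirely with values the loop uses.
func sumXSoA(p ParticlesSoA) float64 {
	var sum float64
	for _, x := range p.X {
		sum += x
	}
	return sum
}

// makeBoth builds the same n particles in both layouts (X = 0, 1, 2, ...).
func makeBoth(n int) ([]ParticleAoS, ParticlesSoA) {
	aos := make([]ParticleAoS, n)
	soa := ParticlesSoA{X: make([]float64, n)}
	for i := 0; i < n; i++ {
		aos[i].X = float64(i)
		soa.X[i] = float64(i)
	}
	return aos, soa
}

func main() {
	aos, soa := makeBoth(1000)
	fmt.Println(sumXAoS(aos), sumXSoA(soa)) // same sum, different memory traffic
}
```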

4. Lock-bound

The function spends time waiting on a mutex or channel.

Signature: wide in the mutex/block/off-CPU profile, narrow in the CPU profile. The function is off-CPU, not running. Wall-clock time is high; CPU time is low.

Fix family: lock-free data structures, finer-grained locks, sharded state, read-write locks for read-heavy paths, eventual consistency to eliminate the shared state.
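Sharded state, one entry in this fix family, can be sketched as a counter split across several independently locked shards. `ShardedCounter`, `numShards` and `runWorkers` are hypothetical names, and the shard count of 16 is an arbitrary assumption rather than a recommendation.

```go
package main

import (
	"fmt"
	"sync"
)

const numShards = 16 // hypothetical; tune to core count and observed contention

type shard struct {
	mu sync.Mutex
	n  int64
}

// ShardedCounter spreads increments over several locks instead of one.
type ShardedCounter struct {
	shards [numShards]shard
}

// Inc adds 1 to the shard selected by key. Callers with different keys
// usually take different locks, so they rarely wait on each other.
func (c *ShardedCounter) Inc(key uint64) {
	s := &c.shards[key%numShards]
	s.mu.Lock()
	s.n++
	s.mu.Unlock()
}

// Total sums all shards. It locks one shard at a time, so the result is
// not an atomic snapshot while writers are still running.
func (c *ShardedCounter) Total() int64 {
	var sum int64
	for i := range c.shards {
		s := &c.shards[i]
		s.mu.Lock()
		sum += s.n
		s.mu.Unlock()
	}
	return sum
}

// runWorkers starts `workers` goroutines that each call Inc perWorker times.
func runWorkers(c *ShardedCounter, workers, perWorker int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id uint64) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Inc(id)
			}
		}(uint64(w))
	}
	wg.Wait()
}

func main() {
	var c ShardedCounter
	runWorkers(&c, 8, 10000)
	fmt.Println(c.Total()) // 80000
}
```

The design choice is that writers pay for one uncontended lock while readers pay for all of them, which suits write-heavy counters that are read rarely.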

5. Syscall-bound

The function spends time inside the kernel — reading, writing, network I/O, or waiting on futex.

Signature: kernel frames (read, write, recv, futex) visible in a flame graph with kernel-symbol support. Off-CPU time dominates. May appear as frequent narrow kernel entries rather than one wide leaf.

Fix family: batch syscalls (one writev instead of ten write calls), larger I/O buffers, io_uring for async kernel I/O, memory-mapped I/O, eliminate the syscall entirely where data can stay in user space.
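The batching idea can be sketched in Go with a user-space buffer. This swaps in `bufio.Writer` rather than `writev`, and uses a hypothetical `countingWriter` in place of a real file or socket so that the number of underlying `Write` calls (each of which would typically be one write syscall on an `*os.File`) is visible.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter counts Write calls. It stands in for an *os.File or socket,
// where each Write call would typically become one write syscall.
type countingWriter struct{ calls int }

func (w *countingWriter) Write(p []byte) (int, error) {
	w.calls++
	return len(p), nil
}

// writeUnbuffered issues one underlying Write per line.
func writeUnbuffered(w io.Writer, lines []string) error {
	for _, line := range lines {
		if _, err := io.WriteString(w, line); err != nil {
			return err
		}
	}
	return nil
}

// writeBuffered collects lines in a 64 KiB user-space buffer and hands them
// to the underlying writer only when the buffer fills or Flush is called.
func writeBuffered(w io.Writer, lines []string) error {
	bw := bufio.NewWriterSize(w, 64*1024)
	for _, line := range lines {
		if _, err := bw.WriteString(line); err != nil {
			return err
		}
	}
	return bw.Flush()
}

func main() {
	lines := make([]string, 10)
	for i := range lines {
		lines[i] = fmt.Sprintf("record %d\n", i)
	}
	var unbuffered, buffered countingWriter
	_ = writeUnbuffered(&unbuffered, lines)
	_ = writeBuffered(&buffered, lines)
	fmt.Println(unbuffered.calls, buffered.calls) // 10 1
}
```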

Category	Profile signature	Fix family
CPU-bound	High self-time, IPC 2–4	Better algorithm, SIMD, specialisation
Allocation-bound	GC frames wide (mallocgc, scanobject)	Pooling, buffer reuse, SoA
Cache-bound	IPC <1, high cache-miss rate	Data layout change, contiguous arrays
Lock-bound	Wide off-CPU, narrow on-CPU	Lock-free, sharding, finer granularity
Syscall-bound	Kernel frames in flame graph	Batch syscalls, io_uring, larger buffers
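The table can be read as a decision procedure. The sketch below encodes it as a small Go function; the `Signature` fields and the `Classify` name are hypothetical, and only the IPC and cache-miss thresholds come from this lesson. The share thresholds for kernel, off-CPU and GC frames are placeholder assumptions, and real profiles often need judgement (for example, futex frames can indicate lock contention rather than I/O).

```go
package main

import "fmt"

type Category string

const (
	CPUBound        Category = "CPU-bound"
	AllocationBound Category = "Allocation-bound"
	CacheBound      Category = "Cache-bound"
	LockBound       Category = "Lock-bound"
	SyscallBound    Category = "Syscall-bound"
	Unclassified    Category = "unclassified"
)

// Signature holds hypothetical measurements for one hot leaf.
type Signature struct {
	IPC           float64 // instructions per cycle
	CacheMissRate float64 // 0..1
	GCShare       float64 // share of CPU profile in GC/allocator frames, 0..1
	OffCPUShare   float64 // share of wall time spent off-CPU, 0..1
	KernelShare   float64 // share of samples in kernel frames, 0..1
}

// Classify applies the table's signatures in order. The 0.30, 0.50 and 0.25
// cut-offs are assumptions of this sketch, not values from the lesson.
func Classify(s Signature) Category {
	switch {
	case s.KernelShare >= 0.30:
		return SyscallBound
	case s.OffCPUShare >= 0.50:
		return LockBound
	case s.GCShare >= 0.25:
		return AllocationBound
	case s.IPC < 1 && s.CacheMissRate >= 0.15:
		return CacheBound
	case s.IPC >= 2:
		return CPUBound
	}
	return Unclassified
}

func main() {
	fmt.Println(Classify(Signature{IPC: 3.1}))                      // CPU-bound
	fmt.Println(Classify(Signature{IPC: 0.5, CacheMissRate: 0.22})) // Cache-bound
	fmt.Println(Classify(Signature{IPC: 1.2, OffCPUShare: 0.8}))    // Lock-bound
}
```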

Hot-path diagnosis numbers

Typical IPC of compute-bound code: 2–4 instructions/cycle
Typical IPC of memory-bound code: 0.3–0.8 instructions/cycle
L1 cache miss served from L2: ~10–15 cycles (an L1 hit costs ~4–5 cycles)
L3 cache miss to DRAM penalty: ~150–300 cycles
Branch mispredict penalty: ~15–25 cycles
Cost of one deopt + recompile (V8): ~10–100 μs
syscall round-trip cost: ~1–5 μs
futex lock contention wakeup: ~5–50 μs
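These numbers become useful once multiplied by event counts. The sketch below is a back-of-envelope estimate only: `stallTime` is a hypothetical helper, the 3 GHz clock is an assumption, and the result ignores overlap between outstanding misses, so it tends to overstate the real cost.

```go
package main

import (
	"fmt"
	"time"
)

// stallTime estimates time lost to events that each cost a fixed number of
// cycles on a core running at clockHz. It assumes stalls do not overlap.
func stallTime(events, cyclesPerEvent, clockHz float64) time.Duration {
	seconds := events * cyclesPerEvent / clockHz
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	const clockHz = 3e9 // hypothetical 3 GHz core

	// One million DRAM misses at ~200 cycles each: roughly 67 ms.
	fmt.Println(stallTime(1e6, 200, clockHz))

	// One million branch mispredicts at ~20 cycles each: roughly 6.7 ms.
	fmt.Println(stallTime(1e6, 20, clockHz))
}
```

An order-of-magnitude gap like this one is often enough to decide which secondary profile to capture first.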

Classifying a hotspot in practice

Same hotspot, same effort — the wrong fix family buys ~5%, the right one buys 4.7x locally and shrinks GC. Diagnose the category before picking the toolbox.

Quiz

A function shows IPC of 0.4 and cache-miss rate of 15%. What is the category, and what fix family does it suggest?

Quiz

After a fix, the local hotspot shrank by 60% but the service's p99 is unchanged. What does this most likely mean?

A wide leaf fits one of five categories — the profile signature names which one, and the category selects the fix family.

Recall before you leave

1. Walk through the five hot-path categories with one tell-tale signature for each in a profile, and the fix family that matches.
2. A Go API shows runtime.mallocgc at 18% and runtime.scanobject at 14% in the CPU profile. What is the category and what should the next diagnostic step be?

Recap

The five hotspot categories — CPU, allocation, cache, lock, syscall — each have a distinct profile signature: IPC and self-time for CPU-bound, GC frames for allocation-bound, low IPC with high cache-miss for cache-bound, wide off-CPU but narrow on-CPU for lock-bound, kernel frames for syscall-bound. The diagnosis takes minutes (capture the right secondary profile, read the IPC or miss-rate); the fix family follows mechanically. Picking the wrong family wastes the work entirely. The next lesson covers how to read the parent and child chains to locate the fix at the right layer of the call tree. Now when you see GC frames climbing wide alongside a hot leaf, you will reach for the allocation profile before touching a single line of application logic.

Connected lessons

builds on

What makes a hot path: symptom vs cause (junior)
