
Performance

Five shapes of hotspot: CPU, alloc, cache, lock, syscall

Crux: Each of the five hotspot categories has a tell-tale signature in the profile and a categorical fix. Picking the wrong fix family wastes the entire engineering effort.

Two wide leaves on a flame graph look identical at first glance. One needs a better algorithm; the other needs buffer reuse. Applying a better algorithm to an allocation-bound path moves the metric by 1.05x instead of the predicted 3x: two hours of work for five percent. The category determines the toolbox.
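One way to see where a figure like 1.05x can come from is Amdahl's law. The sketch below is illustrative only: the 7% split is an assumed number chosen to reproduce the result, not a measurement from the text, and `overallSpeedup` is a hypothetical helper name.

```go
package main

import "fmt"

// overallSpeedup applies Amdahl's law. fraction is the share of total time
// spent in the part being optimised (0..1); local is the speedup achieved
// on that part alone.
func overallSpeedup(fraction, local float64) float64 {
	return 1 / ((1 - fraction) + fraction/local)
}

func main() {
	// Assumed split: only 7% of the path is algorithmic work, the rest is
	// allocation and GC. A 3x better algorithm then barely moves the total.
	fmt.Printf("%.2fx\n", overallSpeedup(0.07, 3)) // 1.05x

	// The same 3x fix when the algorithm really is the whole path.
	fmt.Printf("%.2fx\n", overallSpeedup(1.0, 3)) // 3.00x
}
```

The predicted 3x only materialises when the fixed part is close to the whole cost, which is why classification comes before the fix.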

The five categories

A wide leaf fits one of five categories. Reading the profile’s second layer — not just “which function” but “what is the function doing” — gives the classification.

1. CPU-bound algorithmic

The function executes a lot of instructions. The CPU is running the algorithm.

Signature: large self-time, narrow children, high instructions-per-second, IPC in the range 2–4. In a CPU flame graph the leaf occupies real width with no GC or kernel frames nearby.

Fix family: better algorithm, vectorisation (SIMD), inline pragma, hot-path specialisation for the common case.
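As a small illustration of the "better algorithm" end of this fix family, the sketch below replaces a quadratic pairwise scan with a single pass over a set. Both function names are hypothetical, and the duplicate-detection task is an assumed example rather than one taken from the lesson.

```go
package main

import "fmt"

// hasDuplicateQuadratic compares every pair of elements: O(n^2) comparisons.
func hasDuplicateQuadratic(xs []int) bool {
	for i := range xs {
		for j := i + 1; j < len(xs); j++ {
			if xs[i] == xs[j] {
				return true
			}
		}
	}
	return false
}

// hasDuplicateSet does one map lookup per element: O(n) expected time.
func hasDuplicateSet(xs []int) bool {
	seen := make(map[int]struct{}, len(xs)) // pre-sized to avoid regrowth
	for _, x := range xs {
		if _, ok := seen[x]; ok {
			return true
		}
		seen[x] = struct{}{}
	}
	return false
}

func main() {
	xs := []int{4, 8, 15, 16, 23, 42, 8}
	fmt.Println(hasDuplicateQuadratic(xs), hasDuplicateSet(xs))
}
```

The set version trades instructions for a map allocation, so on small inputs or in an allocation-bound path it may not win. That is a reason to re-profile after the change rather than assume the complexity class settles it.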

2. Allocation-bound

The function (or its caller) allocates so much that garbage collection dominates wall-time.

Signature: runtime.scanobject, gc, mallocgc, or malloc appears wide near the hot leaf. The CPU profile blames GC machinery, not application logic. Switch to an allocation profile to name the application-side allocator.
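In Go, the allocation profile mentioned above can be captured with the standard `runtime/pprof` package. The following is a minimal sketch: `allocateALot` and `allocProfileSize` are hypothetical names, and writing to an in-memory buffer stands in for the file or HTTP endpoint a real service would more likely use.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// sink keeps the slices reachable so the allocations are not optimised away.
var sink [][]byte

// allocateALot stands in for an allocation-heavy application function.
func allocateALot() {
	for i := 0; i < 1000; i++ {
		sink = append(sink, make([]byte, 1024))
	}
}

// allocProfileSize runs the workload, then serialises the built-in "allocs"
// profile. With debug=0 the output is the compressed protobuf format that
// `go tool pprof` reads.
func allocProfileSize() (int, error) {
	allocateALot()
	var buf bytes.Buffer
	if err := pprof.Lookup("allocs").WriteTo(&buf, 0); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	n, err := allocProfileSize()
	if err != nil {
		fmt.Println("profile error:", err)
		return
	}
	fmt.Println("allocs profile bytes:", n)
}
```

Opened in `go tool pprof`, a profile like this attributes allocated bytes and object counts to application call sites, which the CPU profile's GC frames cannot do.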

Fix family: object pooling, buffer reuse (sync.Pool), in-place mutation, struct-of-arrays, pre-size containers to avoid repeated growth.
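Buffer reuse with `sync.Pool` can be sketched as follows. The `render` function and its greeting format are assumptions made for illustration; the pool pattern itself uses only the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that render does not allocate a
// fresh bytes.Buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render builds a greeting in a pooled buffer.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold bytes from a previous user
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	// String copies the bytes, so the result stays valid after Put.
	return buf.String()
}

func main() {
	fmt.Println(render("world"))
	fmt.Println(render("again"))
}
```

The pool may drop its contents at any garbage collection, so it reduces allocation pressure without guaranteeing reuse. The returned string is still one allocation per call; only the scratch buffer is recycled.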

3. Cache-bound

The function touches memory in a pattern the hardware prefetcher cannot predict. The CPU stalls waiting for data from RAM instead of L1/L2.

Signature: low IPC (<1), high cache-miss rate (15%+), low instructions-per-second despite a wide CPU frame. Hardware counters confirm the stall type (L3 miss, DRAM stall).

Fix family: data-layout change (contiguous arrays instead of pointer-chased linked lists, struct-of-arrays instead of array-of-structs), iteration-order change to improve spatial locality, prefetch hints.
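The array-of-structs versus struct-of-arrays change can be sketched like this. The particle types and field names are hypothetical; the point is only which bytes the hot loop has to pull through the cache.

```go
package main

import "fmt"

// particleAoS is the array-of-structs layout. On 64-bit platforms each
// element is 48 bytes, of which the mass loop reads only 8.
type particleAoS struct {
	x, y, z float64
	mass    float64
	label   string
}

func totalMassAoS(ps []particleAoS) float64 {
	var sum float64
	for i := range ps {
		sum += ps[i].mass // strides over the unused fields of every element
	}
	return sum
}

// particlesSoA is the struct-of-arrays layout: each field is contiguous.
type particlesSoA struct {
	x, y, z []float64
	mass    []float64
	label   []string
}

func totalMassSoA(ps particlesSoA) float64 {
	var sum float64
	for _, m := range ps.mass { // sequential reads of exactly the data needed
		sum += m
	}
	return sum
}

func main() {
	aos := []particleAoS{{mass: 1.5, label: "a"}, {mass: 2.5, label: "b"}}
	soa := particlesSoA{mass: []float64{1.5, 2.5}, label: []string{"a", "b"}}
	fmt.Println(totalMassAoS(aos), totalMassSoA(soa))
}
```

Both functions compute the same sum. Any speed difference comes from how many cache lines each loop touches, so it shows up only on data sets larger than the cache and should be confirmed with hardware counters.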

4. Lock-bound

The function spends time waiting on a mutex or channel.

Signature: wide in the mutex/block/off-CPU profile, narrow in the CPU profile. The function is off-CPU, not running. Wall-clock time is high; CPU time is low.
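In Go, the mutex profile is off by default and has to be enabled before contention is recorded. The sketch below assumes a deliberately contended workload (`contend`) and a hypothetical helper `mutexProfileSize`; the sampling fraction of 1 is chosen for the demo and would normally be higher in production.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend makes four goroutines fight over one mutex.
func contend() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 50; i++ {
				mu.Lock()
				time.Sleep(10 * time.Microsecond) // hold the lock while others wait
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
}

// mutexProfileSize enables mutex profiling, runs the workload, and
// serialises the built-in "mutex" profile for `go tool pprof`.
func mutexProfileSize() (int, error) {
	prev := runtime.SetMutexProfileFraction(1) // 1 = record every contention event
	defer runtime.SetMutexProfileFraction(prev)

	contend()

	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 0); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	n, err := mutexProfileSize()
	if err != nil {
		fmt.Println("profile error:", err)
		return
	}
	fmt.Println("mutex profile bytes:", n)
}
```

A CPU profile of the same program would show little, because the goroutines spend their time parked rather than running. `runtime.SetBlockProfileRate` plays the same role for the block profile, which also covers channel waits.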

Fix family: lock-free data structures, finer-grained locks, sharded state, read-write locks for read-heavy paths, eventual consistency to eliminate the shared state.
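Sharded state can be sketched as a counter split across several independently locked shards. The type names, the shard count of 16, and keying by goroutine id are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

const numShards = 16 // assumed; tune to the observed contention

type shard struct {
	mu sync.Mutex
	n  int64
}

// shardedCounter spreads updates over several locks so that writers with
// different keys rarely wait on each other.
type shardedCounter struct {
	shards [numShards]shard
}

// Add locks only the shard selected by key.
func (c *shardedCounter) Add(key uint64, delta int64) {
	s := &c.shards[key%numShards]
	s.mu.Lock()
	s.n += delta
	s.mu.Unlock()
}

// Total visits every shard; reads become slower so that writes get cheaper.
func (c *shardedCounter) Total() int64 {
	var total int64
	for i := range c.shards {
		s := &c.shards[i]
		s.mu.Lock()
		total += s.n
		s.mu.Unlock()
	}
	return total
}

func main() {
	var c shardedCounter
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func(id uint64) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Add(id, 1)
			}
		}(uint64(g))
	}
	wg.Wait()
	fmt.Println(c.Total()) // 8000
}
```

Total is not an atomic snapshot across shards, which is the kind of relaxed consistency this fix family accepts in exchange for less waiting. Adjacent shards can also share a cache line, which is a cache-bound concern to check separately.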

5. Syscall-bound

The function spends time inside the kernel — reading, writing, network I/O, or waiting on futex.

Signature: kernel frames (read, write, recv, futex) visible in a flame graph with kernel-symbol support. Off-CPU time dominates. May appear as frequent narrow kernel entries rather than one wide leaf.

Fix family: batch syscalls (one writev instead of ten write calls), larger I/O buffers, io_uring for async kernel I/O, memory-mapped I/O, eliminate the syscall entirely where data can stay in user space.
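The effect of larger I/O buffers can be sketched by counting calls to the underlying writer. In the example below, `countingWriter` and `writeCalls` are hypothetical helpers, and `io.Discard` stands in for a file or socket where each `Write` call would normally be one `write` syscall.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter counts how many times the underlying writer is called.
type countingWriter struct {
	w     io.Writer
	calls int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	return c.w.Write(p)
}

// writeCalls writes n short lines, optionally through a 64 KiB bufio.Writer,
// and returns the number of Write calls that reached the destination.
func writeCalls(n int, buffered bool) int {
	dst := &countingWriter{w: io.Discard}
	var w io.Writer = dst
	var bw *bufio.Writer
	if buffered {
		bw = bufio.NewWriterSize(dst, 64*1024)
		w = bw
	}
	for i := 0; i < n; i++ {
		fmt.Fprintf(w, "line %d\n", i)
	}
	if bw != nil {
		bw.Flush() // push whatever is still buffered in one final call
	}
	return dst.calls
}

func main() {
	fmt.Println(writeCalls(1000, false), writeCalls(1000, true)) // 1000 1
}
```

The thousand lines total under 9 KB, so the buffered version reaches the destination once, at Flush. The cost is that data sits in user space until the flush, so latency-sensitive paths need an explicit flush policy.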

Category | Profile signature | Fix family
CPU-bound | High self-time, IPC 2–4 | Better algorithm, SIMD, specialisation
Allocation-bound | GC frames wide (mallocgc, scanobject) | Pooling, buffer reuse, SoA
Cache-bound | IPC <1, high cache-miss rate | Data layout change, contiguous arrays
Lock-bound | Wide off-CPU, narrow on-CPU | Lock-free, sharding, finer granularity
Syscall-bound | Kernel frames in flame graph | Batch syscalls, io_uring, larger buffers

Hot-path diagnosis numbers

Typical IPC of compute-bound code: 2–4 instructions/cycle
Typical IPC of memory-bound code: 0.3–0.8 instructions/cycle
L1 cache miss penalty (served from L2): ~10–15 cycles
L3 cache miss to DRAM penalty: ~150–300 cycles
Branch mispredict penalty: ~15–25 cycles
Cost of one deopt + recompile (V8): ~10–100 μs
Syscall round-trip cost: ~1–5 μs
Futex lock contention wakeup: ~5–50 μs

Classifying a hotspot in practice

Classify a hotspot before picking the fix
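The decision order can be sketched as a small function over measurements taken from the secondary profiles. Everything here is an assumption made for illustration: the `sample` fields, the 50% off-CPU and 25% GC-share cut-offs, and the order of the checks. Only the IPC < 1 and 15% cache-miss thresholds come from the signatures above.

```go
package main

import "fmt"

// sample holds hypothetical per-hotspot measurements.
type sample struct {
	ipc           float64 // instructions per cycle, from hardware counters
	cacheMissRate float64 // 0..1, from hardware counters
	gcShare       float64 // share of CPU samples in GC/allocator frames, 0..1
	offCPUShare   float64 // share of wall-clock time spent off-CPU, 0..1
	kernelIO      bool    // kernel I/O frames (read, write, recv) dominate the wait
}

// classify maps a sample to one of the five categories. Off-CPU checks come
// first because a function that is not running has no meaningful IPC.
func classify(s sample) string {
	switch {
	case s.offCPUShare > 0.5 && s.kernelIO:
		return "syscall-bound"
	case s.offCPUShare > 0.5:
		return "lock-bound"
	case s.gcShare > 0.25:
		return "allocation-bound"
	case s.ipc < 1 && s.cacheMissRate >= 0.15:
		return "cache-bound"
	default:
		return "CPU-bound"
	}
}

func main() {
	fmt.Println(classify(sample{ipc: 3.1, cacheMissRate: 0.01}))
	fmt.Println(classify(sample{ipc: 0.5, cacheMissRate: 0.20}))
	fmt.Println(classify(sample{ipc: 1.2, offCPUShare: 0.8}))
}
```

Real hotspots can straddle categories, so a function like this is a checklist for which profile to read next, not a substitute for reading it.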

Quiz

A function shows IPC of 0.4 and cache-miss rate of 15%. What is the category, and what fix family does it suggest?

Quiz

After a fix, the local hotspot shrank by 60% but the service's p99 is unchanged. What does this most likely mean?

Recall before you leave
  1. Walk through the five hot-path categories with one tell-tale signature for each in a profile, and the fix family that matches.
  2. A Go API shows runtime.mallocgc at 18% and runtime.scanobject at 14% in the CPU profile. What is the category and what should the next diagnostic step be?
Recap

The five hotspot categories — CPU, allocation, cache, lock, syscall — each have a distinct profile signature: IPC and self-time for CPU-bound, GC frames for allocation-bound, low IPC with high cache-miss for cache-bound, wide off-CPU but narrow on-CPU for lock-bound, kernel frames for syscall-bound. The diagnosis takes minutes (capture the right secondary profile, read the IPC or miss-rate); the fix family follows mechanically. Picking the wrong family wastes the work entirely. The next lesson covers how to read the parent and child chains to locate the fix at the right layer of the call tree.
