Performance PERF · 08 · 02

Classify and fix: matching bottleneck families to remedies

Each bottleneck belongs to one of eight families. Naming the family takes seconds from the profile; picking the wrong family wastes days. Amdahl sets the ceiling on any fix before you write a line of code.

PERF Middle ◷ 12 min

Two engineers look at the same flame graph. One sees “GC is hot” and rewrites the allocator. The other sees “GC is hot because of a bulk upstream payload” and rolls back one deploy. The first spends a week; the second spends 35 minutes. The difference is classification — naming the bottleneck family before touching the code.

The eight bottleneck families

Why does naming the family matter before writing a single line? Because each family has its own fix set, and applying the wrong one wastes days without moving the metric. The pieces of this chapter map directly onto this vocabulary, so when you classify a hotspot correctly, the fix set follows immediately.

Family	Profile signal	Chapter piece	Primary fix
CPU-algorithmic	High self-time, low off-CPU	02 hot-paths	Better algorithm or data structure
Allocation-bound	mallocgc / scanobject high	04 GC	Reduce allocations, reuse, pools
Cache-bound	High LLC-miss perf counters	03 cache-vs-bigo	Data layout, access pattern
Lock-bound	High off-CPU on sync wait	02 hot-paths	Lock-free structures, sharding
I/O-bound (N+1)	Many short spans in trace	05 N+1	Batch queries, eager-load
Syscall-bound	High syscall overhead in profile	06 batching	Batch writes, vectorised I/O
JIT-deopt	Deopt frames in JS/JVM profile	02 hot-paths	Monomorphic shapes, typed arrays
Bundle-bound	RUM shows parse/compile time	07 bundle-budgets	Code-split, lazy-load, tree-shake
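The table can be read as a lookup from family to playbook. The sketch below encodes it that way; the type and function names (`playbook`, `families`, `primaryFix`) are hypothetical and exist only for illustration, while the strings are copied from the table.

```go
package main

import "fmt"

// playbook pairs the profile signal that identifies a family with its
// primary fix. Both strings are copied from the table above.
type playbook struct {
	signal string // what to look for in the profile
	fix    string // primary remedy
}

// families is a hypothetical lookup keyed by family name.
var families = map[string]playbook{
	"CPU-algorithmic":  {"High self-time, low off-CPU", "Better algorithm or data structure"},
	"Allocation-bound": {"mallocgc / scanobject high", "Reduce allocations, reuse, pools"},
	"Cache-bound":      {"High LLC-miss perf counters", "Data layout, access pattern"},
	"Lock-bound":       {"High off-CPU on sync wait", "Lock-free structures, sharding"},
	"I/O-bound (N+1)":  {"Many short spans in trace", "Batch queries, eager-load"},
	"Syscall-bound":    {"High syscall overhead in profile", "Batch writes, vectorised I/O"},
	"JIT-deopt":        {"Deopt frames in JS/JVM profile", "Monomorphic shapes, typed arrays"},
	"Bundle-bound":     {"RUM shows parse/compile time", "Code-split, lazy-load, tree-shake"},
}

// primaryFix returns the primary fix for a family, or false when the name
// is not one of the eight families.
func primaryFix(family string) (string, bool) {
	p, ok := families[family]
	return p.fix, ok
}

func main() {
	if fix, ok := primaryFix("Lock-bound"); ok {
		fmt.Println("Lock-bound ->", fix)
	}
	if _, ok := primaryFix("Network-bound"); !ok {
		fmt.Println("Network-bound is not one of the eight families; re-check the profile")
	}
}
```

A name that does not resolve is a prompt to go back to the profile signal, not to invent a ninth family.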

Amdahl before you code

Before you touch code, ask yourself: even if this fix is perfect, does it actually reach the SLO target? That is what Amdahl’s law answers in 30 seconds.

Amdahl’s law: if fraction f of execution time is spent in the hotspot, the maximum speedup from fixing it is 1 / (1 - f).

If the hotspot is 40% of CPU (f = 0.4), the best possible speedup is 1 / 0.6 = 1.67x. If your SLO needs a 3x improvement, this hotspot cannot deliver it on its own. Go back to the profile and look for additional contributors.
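The same check can be written as a few lines of code. This is a minimal sketch: the helper names `amdahlCeiling` and `canReach` are made up for this example, and the inputs are the 40% hotspot and 3x requirement from the paragraph above.

```go
package main

import "fmt"

// amdahlCeiling returns the best-case overall speedup when a hotspot that
// takes fraction f of total time (0 <= f < 1) is eliminated entirely.
func amdahlCeiling(f float64) float64 {
	return 1 / (1 - f)
}

// canReach reports whether eliminating the hotspot could, even in the best
// case, deliver the speedup the SLO requires.
func canReach(f, requiredSpeedup float64) bool {
	return amdahlCeiling(f) >= requiredSpeedup
}

func main() {
	f := 0.4        // hotspot share of CPU, read off the profile
	required := 3.0 // speedup the SLO needs
	fmt.Printf("ceiling: %.2fx, can reach %.0fx: %v\n",
		amdahlCeiling(f), required, canReach(f, required))
	// prints: ceiling: 1.67x, can reach 3x: false
}
```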

This calculation takes 30 seconds. It prevents weeks of work on the wrong target.

The ceiling rises steeply as the hotspot grows: a fix to a 25% hotspot can never beat 1.33x, and reaching a 3x SLO needs a hotspot of roughly 67% or more. Below that, you must combine families.
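The hotspot share needed for a 3x improvement comes from inverting the ceiling formula. A short derivation, where S_target is a symbol introduced here for the speedup the SLO requires:

```latex
\[
\frac{1}{1 - f} \ge S_{\text{target}}
\quad\Longleftrightarrow\quad
f \ge 1 - \frac{1}{S_{\text{target}}},
\qquad
S_{\text{target}} = 3 \;\Rightarrow\; f \ge \tfrac{2}{3} \approx 0.67 .
\]
```

The same inversion gives f ≥ 0.75 for a 4x requirement and f ≥ 0.5 for 2x.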

Example: A checkout service has p99 = 800 ms and a target of under 200 ms. The profile shows:

json.Marshal: 28% CPU
runtime.scanobject: 22% CPU
pgx.Query: 18% CPU (via trace spans — actually I/O)

Amdahl on Marshal alone: 1 / (1 - 0.28) = 1.39x. Treating p99 as proportional to profiled time, that brings 800 ms to ~576 ms, still above target. Amdahl on Marshal + scanobject combined (50%): 1 / 0.5 = 2x. That brings it to ~400 ms, still above target. Adding the I/O path (68% total): 1 / 0.32 = 3.1x. That brings it to ~256 ms, closer, but still short of the 200 ms target, which needs 4x.
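The arithmetic above can be replayed in code. This sketch assumes, as the example does, that p99 latency scales with the profiled share of time; the `frame` type and `projectedLatency` helper are hypothetical names. The projected latencies land within a few milliseconds of the rounded figures in the text.

```go
package main

import "fmt"

// frame is one hot entry from the profile: a name and its share of total time.
type frame struct {
	name     string
	fraction float64
}

// projectedLatency returns the best-case latency if every listed frame were
// eliminated entirely, assuming latency scales with profiled time.
func projectedLatency(baselineMs float64, frames []frame) float64 {
	remaining := 1.0
	for _, fr := range frames {
		remaining -= fr.fraction
	}
	return baselineMs * remaining
}

func main() {
	const baselineMs, targetMs = 800.0, 200.0
	frames := []frame{
		{"json.Marshal", 0.28},
		{"runtime.scanobject", 0.22},
		{"pgx.Query", 0.18},
	}
	// Grow the set of fixed frames one at a time and compare with the target.
	for i := 1; i <= len(frames); i++ {
		ms := projectedLatency(baselineMs, frames[:i])
		fmt.Printf("fix first %d frame(s): %.0f ms, meets target: %v\n",
			i, ms, ms < targetMs)
	}
}
```

None of the three rows meets the 200 ms target, which is the signal to look for a shared cause rather than a fourth frame.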

Real root cause: all three hot frames share an upstream that suddenly returns 10x more data. Fix the upstream; all three frames shrink together. Total gain: 6x. The Amdahl calculation made it clear that fixing any single family in isolation was insufficient.

Cross-layer compounding

Real bottlenecks rarely sit in one layer. A slow checkout page might be 30% JS bundle parse on the client, 20% N+1 queries in the backend, 15% GC allocation pressure, and 35% backend compute. Fixing any one layer in isolation gives only Amdahl-bounded wins on the total.
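To make the per-layer bounds concrete, the sketch below applies the ceiling formula to the four shares in this example. The helper name `ceiling` and the layer labels are illustrative only.

```go
package main

import "fmt"

// ceiling returns the Amdahl bound on overall speedup when work that takes
// fraction f of the total is removed entirely.
func ceiling(f float64) float64 {
	return 1 / (1 - f)
}

func main() {
	// Shares of total page time, taken from the example above.
	names := []string{"bundle parse", "N+1 queries", "GC pressure", "backend compute"}
	shares := []float64{0.30, 0.20, 0.15, 0.35}

	for i, name := range names {
		fmt.Printf("%-16s %.0f%% of total, ceiling %.2fx\n",
			name, shares[i]*100, ceiling(shares[i]))
	}

	// Removing the first three layers together raises the bound well above
	// what any one of them allows on its own.
	combined := shares[0] + shares[1] + shares[2]
	fmt.Printf("first three combined: %.0f%% of total, ceiling %.2fx\n",
		combined*100, ceiling(combined))
}
```

No single layer here can exceed about 1.54x, while the first three together allow up to about 2.86x.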

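The compound effect is the real win. Two engineers working in parallel, one on bundle, one on queries and GC, can together move the headline metric further than either could alone, because each extra layer raises the Amdahl ceiling. The arithmetic needs care, though: two 2x improvements multiply into 4x on the headline metric only when each one halves the end-to-end time. If each 2x applies only to its own slice, the gain is set by the slices: with the split above, halving bundle parse (30%) and halving queries plus GC (35%) leaves 67.5% of the original time, about 1.5x. Without cross-layer reasoning, the conversation stalls: “we already optimised our part.”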

The chapter’s classification vocabulary lets engineers from frontend, backend, and infra describe their bottleneck in a way the other layers understand. Without the shared vocabulary, “the backend is slow” and “no, the frontend is slow” is the whole conversation.

Why this works

The fix-family table above is not a ranking. Fixing a cache-bound bottleneck in a tight compute loop (piece 03) can give a 10x improvement on CPU workloads. Allocation-bound regressions (piece 04) are often the sneakiest because GC pauses add tail latency variance that Amdahl underestimates. When classifying, check whether the bottleneck contributes to p50 (the typical user) or p99 (the worst-case user); the fix priority differs.

Classification pays off

Time to classify from a flame graph: 30–90 seconds
Time lost fixing the wrong family: 1–5 days
Amdahl ceiling if bottleneck is 25% of CPU: 1.33x max
Amdahl ceiling if bottleneck is 80% of CPU: 5x max
Typical multi-family compound gain: 4–8x
Single-family fix typical gain: 1.3–2.5x

Quiz

A service is allocation-bound (GC pressure 25%) AND has an 800 KB JS bundle. Which should be fixed first?

Quiz

A flame graph shows runtime.scanobject at 22% and runtime.mallocgc at 11%. Amdahl on the combined GC frames (33%) gives a maximum 1.49x speedup. The SLO requires 3x. What is the correct next step?

Order the steps

Order the steps for classifying and fixing a bottleneck from profile to verified result:

1 Open the profile — identify the hottest frame by self-time
2 Name the family: CPU, allocation, cache, lock, I/O, syscall, JIT, bundle
3 Apply Amdahl to check whether fixing this family can reach the SLO target
4 If Amdahl ceiling is insufficient, return to profile and find additional contributors
5 Pick the fix technique from the family's playbook
6 Re-profile under same load; confirm both local frame and headline metric improved
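Steps 3 and 4 above amount to a small loop: add contributors, largest first, until the combined Amdahl ceiling reaches the required speedup. A sketch under stated assumptions: `hotspot` and `contributorsNeeded` are hypothetical names, and the fractions are illustrative values reused from the checkout example.

```go
package main

import (
	"fmt"
	"sort"
)

// hotspot is a classified contributor and its share of total time.
type hotspot struct {
	family   string
	fraction float64
}

// contributorsNeeded sorts hotspots by share, largest first, and returns the
// shortest prefix whose combined Amdahl ceiling reaches the required speedup.
// The bool is false when even fixing every listed hotspot falls short.
func contributorsNeeded(hs []hotspot, required float64) ([]hotspot, bool) {
	sorted := append([]hotspot(nil), hs...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].fraction > sorted[j].fraction
	})

	covered := 0.0
	for i, h := range sorted {
		covered += h.fraction
		if covered >= 1 || 1/(1-covered) >= required {
			return sorted[:i+1], true
		}
	}
	return sorted, false
}

func main() {
	profile := []hotspot{
		{"allocation", 0.22},
		{"cpu-algorithmic", 0.28},
		{"io", 0.18},
	}
	for _, required := range []float64{1.3, 3.0, 4.0} {
		picked, ok := contributorsNeeded(profile, required)
		fmt.Printf("need %.1fx: %d contributor(s), reachable: %v\n",
			required, len(picked), ok)
	}
}
```

When the function reports that the target is unreachable, the listed hotspots are not enough and the profile needs another pass, which is step 4.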

Eight families (three shown) each map to a distinct fix playbook; Amdahl tells you whether one family's fix can reach the SLO or whether you must combine layers.

Recall before you leave

01
Why does naming the bottleneck family matter before writing any fix?
02
A checkout service has three hot families: CPU-algorithmic 28%, allocation 22%, and I/O 18%. What does Amdahl say about each, and what does that imply?
03
What is cross-layer compounding, and why does the shared classification vocabulary enable it?

Recap

Every bottleneck belongs to one of eight families: CPU-algorithmic, allocation, cache, lock, I/O (N+1), syscall (batching), JIT-deopt, or bundle. Naming the family from the profile takes under two minutes and directs you to the correct fix playbook immediately. Amdahl’s law (maximum speedup equals 1 divided by 1 minus the hotspot fraction) sets a hard ceiling on any single fix before you write a line of code. If the Amdahl ceiling is below the SLO target, fixing that hotspot alone is not enough; return to the profile and find additional contributors. Real production bottlenecks typically span multiple layers; fixing two or three families in parallel typically compounds to a 4–8x gain, against 1.3–2.5x for a single family in isolation. The shared vocabulary of families lets frontend, backend, and infra engineers describe and quantify their layer’s contribution so parallel work can be coordinated without confusion. Now when you open a flame graph and feel the urge to start coding immediately, run the Amdahl calculation first: 30 seconds that can save a week of work on the wrong target.

Connected lessons

builds on

Five shapes of hotspot: CPU, alloc, cache, lock, syscall (Middle)
