Crux: Read real Go snippets, a perf stat block, and an N-API hot path; predict the hotspot shape from the evidence and pick the highest-leverage fix.
The shape of a hotspot lives in the code and the counters, not in the flame graph’s width. Read each snippet, classify it into one of the unit’s shapes, and pick the fix a senior engineer would make first — before touching a tuning knob or a library.
Goal
Practise the loop you run in every incident: read the hot path, decide which of the shapes it is from the evidence in front of you (code, perf stat, profile), and reach for the matching fix family instead of guessing.
Snippet 1 — the per-row build
func renderCSV(rows []Row) string {
	out := "" // empty string
	for _, r := range rows {
		out += fmt.Sprintf("%d,%s\n", r.ID, r.Name) // Sprintf + concat each row
	}
	return out
}
Quiz
renderCSV shows 30% self-time on 200k rows and GC frames (mallocgc) are wide nearby. What is the shape, and what is the single highest-leverage fix?
Heads-up The work is allocation and copying, confirmed by the wide GC frames. A 'better algorithm' over the same string concatenation still reallocates on every iteration.
Heads-up GOGC changes when GC runs, not how much the loop allocates. The fix is to allocate less — pre-size the buffer and stop concatenating immutable strings.
Heads-up The signature here is wide GC frames, not low IPC with high miss rate. This is allocation pressure, diagnosed from the alloc profile, not a memory-layout problem.
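A minimal sketch of the "pre-size the buffer and stop concatenating" fix, assuming Row has an integer ID and a string Name as the snippet implies. The name renderCSVBuilder and the 32-byte average row estimate are illustrative assumptions, not taken from the original service.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Row mirrors the two fields renderCSV reads; the exact definition is assumed.
type Row struct {
	ID   int
	Name string
}

// renderCSVBuilder appends every row into one pre-sized buffer instead of
// building a new string on each iteration.
func renderCSVBuilder(rows []Row) string {
	var b strings.Builder
	// Assumed average row width of 32 bytes; a low guess only costs a few regrowths.
	b.Grow(len(rows) * 32)
	var scratch [20]byte // enough for any int64 in base 10, including the sign
	for _, r := range rows {
		// AppendInt formats the ID into the scratch array, avoiding Sprintf.
		b.Write(strconv.AppendInt(scratch[:0], int64(r.ID), 10))
		b.WriteByte(',')
		b.WriteString(r.Name)
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	rows := []Row{{ID: 1, Name: "ada"}, {ID: 2, Name: "linus"}}
	fmt.Print(renderCSVBuilder(rows))
}
```

The output is byte-for-byte what the concatenating version produces, so the change can be checked by re-running the alloc profile rather than by re-validating the CSV.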
Snippet 2 — the shared counter line
type Stats struct {
	hits   uint64
	misses uint64 // adjacent field: same 64-byte cache line as hits
}

var s Stats

// goroutine A, hot loop: atomic.AddUint64(&s.hits, 1)
// goroutine B, hot loop: atomic.AddUint64(&s.misses, 1)
Quiz
Under load, both atomic adds show up as wide frames in the CPU profile, with IPC ~0.4 and a 70%+ cache-miss rate that gets worse as more cores are added. What is happening and what is the fix?
Heads-up There is no lock here and a mutex would be slower. The IPC collapse + high miss rate + worsening-with-cores signature is cache-line bouncing, not scheduler blocking.
Heads-up Two uint64 is 16 bytes; size is not the issue. The issue is two independently-written fields sharing one coherency unit (the 64-byte line).
Heads-up Stats is a single global; nothing is being allocated in the hot loop. The cost is coherency traffic on writes, not GC.
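One way to separate the two counters is explicit padding, sketched below. The type PaddedStats, the helpers fieldGap and countBoth, and the 64-byte constant are assumptions for illustration; the line size is taken from the snippet's comment and can differ on other hardware.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"unsafe"
)

// cacheLine is the assumed coherency-unit size in bytes.
const cacheLine = 64

// PaddedStats places hits and misses exactly one cache line apart, so two
// writers can never touch the same 64-byte line.
type PaddedStats struct {
	hits   uint64
	_      [cacheLine - 8]byte // pushes misses onto the next line
	misses uint64
	_      [cacheLine - 8]byte // keeps whatever follows the struct off the misses line
}

// fieldGap reports the byte distance between the two counters.
func fieldGap() uintptr {
	return unsafe.Offsetof(PaddedStats{}.misses) - unsafe.Offsetof(PaddedStats{}.hits)
}

// countBoth runs one writer goroutine per counter, as in the snippet.
func countBoth(n int) (hits, misses uint64) {
	var s PaddedStats
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddUint64(&s.hits, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddUint64(&s.misses, 1)
		}
	}()
	wg.Wait()
	return atomic.LoadUint64(&s.hits), atomic.LoadUint64(&s.misses)
}

func main() {
	h, m := countBoth(1_000_000)
	fmt.Println(fieldGap(), h, m)
}
```

Padding trades 112 bytes of memory for independent lines. The counts are identical to the unpadded version; only the coherency traffic changes, which is why the confirmation comes from re-reading IPC and miss rate rather than from the program's output.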
Snippet 3 — the perf stat block
# perf stat -e cycles,instructions,cache-misses,LLC-load-misses ./svc --bench score
  8,400,000,000  cycles
  3,360,000,000  instructions      # 0.40 insns per cycle (IPC)
    900,000,000  cache-misses      # 10.7% of all memory refs
    700,000,000  LLC-load-misses   # 78% of cache misses also miss L3 → DRAM
# hot leaf from flame graph: score_embeddings(), 42% self-time
Quiz
Reading this perf stat block for score_embeddings, which statement is correct?
Heads-up Low IPC means the CPU is stalled, not busy. Rewriting math that touches the same scattered memory keeps the same DRAM stalls; the bottleneck is access pattern, not arithmetic.
Heads-up Self-time is share of samples; it does not say whether those cycles retired instructions or stalled. The counters show stalling — memory-bound — despite the wide CPU frame.
Heads-up 10.7% of all refs missing, with 78% of those reaching DRAM, plus IPC 0.40, is the opposite of healthy — it is the defining memory-bound stall pattern.
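The two ratios quoted in the block's annotations can be recomputed from the raw counters, which is a useful habit when a tool prints only the counts. The sketch below uses the four values from the block; the Counters type and its method names are hypothetical.

```go
package main

import "fmt"

// Counters holds the four raw values from the perf stat block.
type Counters struct {
	Cycles        float64
	Instructions  float64
	CacheMisses   float64
	LLCLoadMisses float64
}

// IPC is instructions retired per cycle.
func (c Counters) IPC() float64 { return c.Instructions / c.Cycles }

// DRAMShare is the fraction of cache misses that also missed the last-level
// cache and therefore went to DRAM.
func (c Counters) DRAMShare() float64 { return c.LLCLoadMisses / c.CacheMisses }

func main() {
	c := Counters{
		Cycles:        8_400_000_000,
		Instructions:  3_360_000_000,
		CacheMisses:   900_000_000,
		LLCLoadMisses: 700_000_000,
	}
	fmt.Printf("IPC %.2f, DRAM share of misses %.0f%%\n", c.IPC(), 100*c.DRAMShare())
	// prints: IPC 0.40, DRAM share of misses 78%
}
```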
Snippet 4 — the native bridge
// Node service, called ~10,000 times/second
function hashAll(items) {
  return items.map(item => nativeHmac(item)) // one N-API crossing per item
}
// nativeHmac (Rust): ~40 ns of actual work
// N-API stub overhead: ~160 ns per crossing
// CPU profile: 40% in napi_call_function, 8% in the Rust HMAC
Quiz
The native HMAC is fast (40 ns) but the N-API stub (160 ns) dominates the CPU profile. What is the diagnosis and the right fix?
Heads-up The Rust routine exists for speed/safety; rewriting it in JS removes that value and is usually slower. The diagnosis is crossing overhead, fixed by batching, not by deleting the native code.
Heads-up Concurrency raises aggregate throughput but each call still pays the full 160 ns crossing. Per-item overhead is unchanged; only batching reduces crossings per unit of work.
Heads-up The crypto is only 8% of CPU; the stub is 40%. Optimising a function that is already cheap relative to the bridge is the wrong target.
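The case for batching follows from simple arithmetic on the two figures in the snippet. The sketch below is a cost model only, not the service's code: it assumes the 160 ns crossing cost is fixed per call and does not grow with batch size, which ignores the cost of marshalling a larger argument. The function names are hypothetical.

```go
package main

import "fmt"

const (
	crossingNs = 160.0 // N-API stub overhead per crossing, from the snippet
	workNs     = 40.0  // native HMAC work per item, from the snippet
)

// perItemNs is the amortised cost of one item when a single crossing carries
// `batch` items. Assumes a fixed crossing cost regardless of batch size.
func perItemNs(batch int) float64 {
	return crossingNs/float64(batch) + workNs
}

// overheadShare is the fraction of per-item time spent in the bridge.
func overheadShare(batch int) float64 {
	return (crossingNs / float64(batch)) / perItemNs(batch)
}

func main() {
	for _, b := range []int{1, 8, 64} {
		fmt.Printf("batch %2d: %5.1f ns/item, %2.0f%% bridge overhead\n",
			b, perItemNs(b), 100*overheadShare(b))
	}
	// prints:
	// batch  1: 200.0 ns/item, 80% bridge overhead
	// batch  8:  60.0 ns/item, 33% bridge overhead
	// batch 64:  42.5 ns/item,  6% bridge overhead
}
```

Under this model the per-item cost approaches the 40 ns of real work as the batch grows, so most of the gain arrives within the first few dozen items per crossing.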
Recap
Every shape is read from the evidence in front of you, not the flame-graph width alone: wide GC frames + per-iteration Sprintf/concat is allocation-bound (build into a pre-sized buffer); IPC ~0.4 with high miss rate plus worsening-with-cores on adjacent atomic fields is false sharing (pad to separate cache lines); IPC 0.4 with DRAM-dominated misses is memory-bound (data layout, AoS→SoA); and a bridge stub wider than its cheap native callee is FFI overhead (batch per crossing). Classify first, fix the matching cause, then re-profile to confirm.