Hot paths: code and counter reading

Read real Go snippets, a perf stat block, and an N-API hot path; predict the hotspot shape from the evidence and pick the highest-leverage fix.

The shape of a hotspot lives in the code and the counters, not in the flame graph’s width. Read each snippet, classify it into one of the unit’s shapes, and pick the fix a senior engineer would make first — before touching a tuning knob or a library.

Goal

Practise the loop you run in every incident: read the hot path, decide which of the shapes it is from the evidence in front of you (code, perf stat, profile), and reach for the matching fix family instead of guessing.

Snippet 1 — the per-row build

func renderCSV(rows []Row) string {
    out := ""                                 // empty string
    for _, r := range rows {
        out += fmt.Sprintf("%d,%s\n", r.ID, r.Name)  // Sprintf + concat each row
    }
    return out
}

Quiz

On 200k rows, renderCSV shows 30% self-time and wide GC frames (mallocgc) sit right next to it. What is the shape, and what is the single highest-leverage fix?
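If the evidence points to an allocation-bound loop, one possible fix is to write every row into a single pre-sized buffer. The sketch below is a minimal illustration, not the lesson's reference answer: the Row field types, the name renderCSVBuilder, and the 32-bytes-per-row estimate are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Row mirrors the two fields the snippet reads; the field types are assumed.
type Row struct {
	ID   int
	Name string
}

// renderCSVBuilder (hypothetical name) appends each row to one buffer
// instead of building a new string per iteration.
func renderCSVBuilder(rows []Row) string {
	var b strings.Builder
	// Assumed average of ~32 bytes per row; measure real rows before tuning.
	b.Grow(len(rows) * 32)
	for _, r := range rows {
		b.WriteString(strconv.Itoa(r.ID))
		b.WriteByte(',')
		b.WriteString(r.Name)
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	rows := []Row{{ID: 1, Name: "ada"}, {ID: 2, Name: "bob"}}
	fmt.Print(renderCSVBuilder(rows))
}
```

The original loop copies the whole accumulated string on every concat, so its cost grows with the square of the output size. The builder copies each byte a small, bounded number of times, and replacing Sprintf with strconv.Itoa avoids the per-row formatting allocation as well.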

Snippet 2 — the shared counter line

type Stats struct {
    hits   uint64
    misses uint64   // adjacent field — same 64-byte cache line as hits
}
var s Stats

// goroutine A, hot loop:   atomic.AddUint64(&s.hits, 1)
// goroutine B, hot loop:   atomic.AddUint64(&s.misses, 1)

Quiz

Under load, both atomic adds show up wide in the CPU profile, with IPC ~0.4 and a 70%+ cache-miss rate, and the numbers get worse as cores are added. What is happening and what is the fix?
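If the diagnosis is two cores fighting over one cache line, one common remedy is to pad the struct so each counter lands on its own line. The sketch below assumes a 64-byte line, as the snippet's comment does; PaddedStats, counterGap, and bump are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"unsafe"
)

// cacheLine is an assumption: 64 bytes is common on x86-64, but check the target CPU.
const cacheLine = 64

// PaddedStats (hypothetical name) pads after each counter so the two
// fields sit one assumed cache line apart.
type PaddedStats struct {
	hits   uint64
	_      [cacheLine - 8]byte
	misses uint64
	_      [cacheLine - 8]byte
}

// counterGap reports the byte distance between the two counters.
func counterGap() uintptr {
	var s PaddedStats
	return unsafe.Offsetof(s.misses) - unsafe.Offsetof(s.hits)
}

// bump runs one goroutine per counter, each adding n times, and returns the totals.
func bump(n int) (hits, misses uint64) {
	var s PaddedStats
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddUint64(&s.hits, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddUint64(&s.misses, 1)
		}
	}()
	wg.Wait()
	return atomic.LoadUint64(&s.hits), atomic.LoadUint64(&s.misses)
}

func main() {
	h, m := bump(1_000_000)
	fmt.Println("gap bytes:", counterGap(), "hits:", h, "misses:", m)
}
```

Because the counters are a full line apart, a write to one no longer invalidates the line holding the other, whatever the struct's starting address. Whether the padding pays off is something to confirm by re-running the same profile with more cores.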

Snippet 3 — the perf stat block

# perf stat -e cycles,instructions,cache-misses,LLC-load-misses ./svc --bench score
   8,400,000,000  cycles
   3,360,000,000  instructions     #  0.40 insns per cycle (IPC)
     900,000,000  cache-misses     # 10.7% of all memory refs
     700,000,000  LLC-load-misses  # 78% of cache misses also miss L3 → DRAM
# hot leaf from flame graph: score_embeddings()  — 42% self-time

Quiz

Reading this perf stat block for score_embeddings, which statement is correct?
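The derived figures in the block can be checked by hand before choosing a statement. The sketch below only redoes the arithmetic on the counters shown above; the helper names ipc and sharePct are made up for this example.

```go
package main

import "fmt"

// ipc returns instructions retired per cycle.
func ipc(instructions, cycles float64) float64 {
	return instructions / cycles
}

// sharePct returns part as a percentage of whole.
func sharePct(part, whole float64) float64 {
	return 100 * part / whole
}

func main() {
	// Counter values copied from the perf stat block.
	cycles := 8.4e9
	instructions := 3.36e9
	cacheMisses := 900e6
	llcLoadMisses := 700e6

	fmt.Printf("IPC: %.2f\n", ipc(instructions, cycles))
	fmt.Printf("cycles per instruction: %.1f\n", cycles/instructions)
	fmt.Printf("LLC-load-misses as share of cache-misses: %.0f%%\n",
		sharePct(llcLoadMisses, cacheMisses))
}
```

The results match the annotations: 0.40 instructions per cycle (2.5 cycles per instruction) and roughly 78% of cache misses continuing on to DRAM. A core that retires well under one instruction per cycle while most misses reach DRAM is spending much of its time waiting on memory rather than computing.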

Snippet 4 — the native bridge

// Node service, called ~10,000 times/second
function hashAll(items) {
  return items.map(item => nativeHmac(item))  // one N-API crossing per item
}
// nativeHmac (Rust): ~40 ns of actual work
// N-API stub overhead: ~160 ns per crossing
// CPU profile: 40% in napi_call_function, 8% in the Rust HMAC

Quiz

The native HMAC is fast (40 ns) but the N-API stub (160 ns) dominates the CPU profile. What is the diagnosis and the right fix?
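A quick cost model shows why batching is the usual lever when a fixed crossing cost dwarfs the work behind it. The sketch below uses the 160 ns and 40 ns figures from the snippet and assumes the stub cost is paid once per crossing regardless of batch size. That is a simplification, since marshalling a larger array is not free. perItemNs is a hypothetical helper, written in Go only to keep the arithmetic runnable.

```go
package main

import "fmt"

// perItemNs models the cost of one item when `batch` items share a single
// bridge crossing. It assumes a fixed stub cost per crossing (a simplification).
func perItemNs(stubNs, workNs float64, batch int) float64 {
	return stubNs/float64(batch) + workNs
}

func main() {
	const stubNs, workNs = 160.0, 40.0 // figures from the snippet's comments
	for _, n := range []int{1, 10, 100} {
		fmt.Printf("batch %3d: %.1f ns per item\n", n, perItemNs(stubNs, workNs, n))
	}
}
```

Under these assumptions, one crossing per item costs 200 ns, of which 80% is stub overhead. A batch of 100 brings the per-item cost down to about 41.6 ns, close to the 40 ns of real work. The actual gain depends on how the native side accepts a batch, so re-profile after the change.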

Recap

Every shape is read from the evidence in front of you, not from the flame-graph width alone. Wide GC frames plus a per-iteration Sprintf and concat is allocation-bound: build into a pre-sized buffer. IPC ~0.4 with a high miss rate that worsens with core count, on adjacent atomic fields, is false sharing: pad the fields onto separate cache lines. IPC ~0.4 with DRAM-dominated misses is memory-bound: change the data layout (AoS→SoA). A bridge stub wider than its cheap native callee is FFI overhead: batch the work per crossing. Classify first, fix the matching cause, then re-profile to confirm.
