Performance capstone: reading profiles, traces, and stats
Crux Read real artifacts from across the track — a pprof profile, a hot-path snippet, an N+1 query log, and a bundle report — predict the dominant cost, and pick the highest-leverage fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
Every unit in the track leaves a different artifact behind: a profile, a hot path, a query log, a bundle report. The senior skill is reading the artifact, naming the dominant cost, and reaching for the right layer — not the most familiar one.
Goal
Practise the cross-track loop on real evidence: read the artifact, locate the cost that dominates, and choose the fix at the layer where that cost lives — before touching any knob.
Reading this `top -cum` output, where is the time actually spent and what does Amdahl's law say about optimising json.Marshal?
Heads-up 92% is cumulative — it is the root frame, so almost everything sits under it by definition. Cumulative at the top tells you nothing; flat% is where work is actually done, and that is cosineSim.
Heads-up Its cumulative share is ~6%, so Amdahl's law caps the page-level payoff at 6% no matter how fast you make it. The 69%-flat cosineSim is where the leverage is.
Heads-up One frame holds 69% flat. That is the textbook signature of a single dominant hot path — the opposite of evenly spread.
Artifact 2 — the hot path
// app.scoreAll — called once per search request, ~10k candidatesfunc scoreAll(q []float32, candidates []Doc) []Result { var results []Result // nil slice for _, d := range candidates { v := append([]float32{}, d.Vector...) // fresh copy per candidate results = append(results, Result{d.ID, cosineSim(q, v)}) } return results}
Quiz
Completed
Given Artifact 1 pointed here, what is the highest-leverage fix in this loop, and which is a distraction?
Heads-up GOGC changes when GC runs, not how much this loop allocates. The profile shows the cost is in the path itself; eliminate the per-candidate copy and pre-size the slice instead of deferring collection.
Heads-up Parallelism multiplies the same wasteful per-candidate copies across cores and adds coordination overhead. Remove the wasted allocation first; only then consider parallelism if cosineSim is genuinely CPU-bound.
Heads-up Feeding cosineSim a freshly copied slice per candidate adds allocation and cache misses on top of the math. Eliminating the copy reduces both the allocation the profile shows and the data cosineSim must stream.
Artifact 3 — the query log
SELECT id, total FROM orders WHERE user_id = $1 -- 1 row set, 50 rowsSELECT name FROM customers WHERE id = $1 -- params: 11SELECT name FROM customers WHERE id = $1 -- params: 12SELECT name FROM customers WHERE id = $1 -- params: 13... (47 more identical-shape queries) ...-- total: 51 queries, 1 request
Quiz
Completed
This log is one request. Name the pattern and the fix that removes it at the source.
Heads-up The 51 queries run on one request and reuse the connection; the cost is the round-trip count, not pool size. A bigger pool lets more requests issue the same wasteful 51-query pattern in parallel.
Heads-up customers.id is the primary key; the lookups are already fast. The waste is doing 50 separate round-trips instead of one batched query — an index does not change that.
Heads-up A replica spreads the same 50 round-trips to another machine; the structural N+1 remains and the latency of 50 sequential trips persists. Batch the access instead of relocating it.
Artifact 4 — the bundle report
Route /product/[id] first-load JS framework chunk .............. 48.2 KB page chunk ................... 19.4 KB moment + moment-tz ........... 71.9 KB <-- date formatting lodash (full) ................ 72.3 KB <-- used: groupBy, debounce ------------------------------------------ total (gzip) ................. 211.8 KB budget: 130 KB ❌ OVER by 81.8 KB
Quiz
Completed
This route is 82 KB over budget. What is the cost these bytes impose, and what is the highest-leverage trim?
Heads-up Transfer is the smaller cost; parse/compile/execute on the device dominates TTI and gzip does not shrink that CPU work. The fix is shipping fewer bytes, not relying on compression.
Heads-up A full `import _ from 'lodash'` (CommonJS) defeats tree-shaking; the whole 72 KB ships. You must import the specific functions or use lodash-es / native equivalents for the bytes to drop.
Heads-up Raising the budget to match the bloat is surrendering the enforce step. Budgets exist to force the trim; the moment + lodash swap removes ~100 KB without touching the budget.
Recap
Four artifacts, one discipline. A top -cum profile distinguishes the root frame (huge cumulative, no leverage) from the real hotspot (high flat); the hot-path snippet shows allocation waste the profile predicted; the query log exposes N+1 by its repeated single-row shape; the bundle report turns bytes into device-CPU cost. In each case the fix lives at the cost’s own layer — eliminate the allocation, batch the round-trips, trim the bytes — never one knob removed from where the evidence points.