Performance

GC: tame a death-spiral

Crux: a hands-on project. Instrument, diagnose, and tame a GC death-spiral in a small allocation-heavy service, then prove the fix with before/after numbers.
Estimated time: 240 min

Reading about death-spirals is not the same as pulling a service out of one. Build a small allocation-heavy server, drive it into GC trouble, and apply the unit’s fix ladder until the numbers come back — with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: instrument allocation and GC, diagnose the hotspot from a profile, reduce allocations, defend the memory bound, and verify the fix with before/after metrics.
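
As a starting point for the "instrument" step, here is a minimal sketch that samples the Go runtime's own counters before and after a workload and derives an allocation rate and GC figures from the difference. The workload loop, the `sink` variable, and the helper name `allocRateMiBps` are illustrative assumptions, not part of the unit's starter. Note that `runtime.ReadMemStats` briefly stops the world, so in a real service you would sample it sparingly.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sink keeps each buffer reachable so the allocation is not optimized away.
var sink []byte

// allocRateMiBps converts a byte delta over an interval into MiB per second.
func allocRateMiBps(deltaBytes uint64, elapsedSec float64) float64 {
	if elapsedSec <= 0 {
		return 0
	}
	return float64(deltaBytes) / (1 << 20) / elapsedSec
}

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	start := time.Now()

	// Stand-in workload (assumption): 200k short-lived 4 KiB buffers.
	for i := 0; i < 200_000; i++ {
		sink = make([]byte, 4096)
	}

	elapsed := time.Since(start).Seconds()
	runtime.ReadMemStats(&after)

	fmt.Printf("alloc rate:   %.1f MiB/s\n",
		allocRateMiBps(after.TotalAlloc-before.TotalAlloc, elapsed))
	fmt.Printf("GC cycles:    %d\n", after.NumGC-before.NumGC)
	fmt.Printf("GC pause sum: %v\n",
		time.Duration(after.PauseTotalNs-before.PauseTotalNs))
	// GCCPUFraction is cumulative since process start, not per interval.
	fmt.Printf("GC CPU share: %.2f%%\n", after.GCCPUFraction*100)
}
```

Diffing cumulative counters keeps the measurement cheap and makes the "before" and "after" runs directly comparable, provided both are taken under the same load.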

Project
Objective

Take a deliberately allocation-heavy HTTP service (your own or the starter below) and, without switching collectors, bring its GC CPU share under 5% and its p99 request latency under your target, proving each step with measurements.
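
If you do not have a service of your own, the following is a hypothetical stand-in (it is not the unit's starter, which is not reproduced here). It is a minimal sketch of a handler that allocates wastefully on purpose: no capacity hints, string concatenation in a loop, and a fresh map per item. The route `/items`, the `item` type, the `n` query parameter, and the `ADDR` environment variable are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
	"os"
	"strconv"
)

type item struct {
	ID   int               `json:"id"`
	Name string            `json:"name"`
	Tags map[string]string `json:"tags"`
}

// buildItems is deliberately wasteful so the profile has clear hotspots.
func buildItems(n int) []item {
	var items []item // no capacity hint: append reallocates as it grows
	for i := 0; i < n; i++ {
		name := ""
		for j := 0; j < 8; j++ {
			name += strconv.Itoa(i) + "-" // each += builds a new string
		}
		items = append(items, item{
			ID:   i,
			Name: name,
			Tags: map[string]string{"shard": strconv.Itoa(i % 16)},
		})
	}
	return items
}

func itemsHandler(w http.ResponseWriter, r *http.Request) {
	n, err := strconv.Atoi(r.URL.Query().Get("n"))
	if err != nil || n < 0 {
		n = 1000
	}
	// n is intentionally unbounded here; capping it is a later exercise.
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(buildItems(n)); err != nil {
		log.Printf("encode: %v", err)
	}
}

func main() {
	http.HandleFunc("/items", itemsHandler)
	if addr := os.Getenv("ADDR"); addr != "" {
		log.Fatal(http.ListenAndServe(addr, nil))
	}
	// No ADDR set: run one in-process request as a smoke test and exit.
	rec := httptest.NewRecorder()
	itemsHandler(rec, httptest.NewRequest(http.MethodGet, "/items?n=100", nil))
	fmt.Println("status:", rec.Code, "bytes:", rec.Body.Len())
}
```

Running it with `ADDR=:8080 GODEBUG=gctrace=1 go run .` prints one gctrace line per collection to stderr, and the blank `net/http/pprof` import exposes the allocation profile at `/debug/pprof/allocs` for `go tool pprof`.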

Requirements
Acceptance criteria
  • A before/after table of alloc rate, GC CPU %, p99 GC pause, and p99 request latency, all measured under the same load rather than estimated.
  • The allocation profile clearly shows the top hotspots shrinking after the fix (re-profiled, not assumed).
  • GC CPU share holds under ~5% and the death-spiral signature is gone from gctrace at sustained load.
  • A one-paragraph write-up naming the lever used for each hotspot and why it ranked above tuning the collector.
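
To make "re-profiled, not assumed" concrete at the function level, here is a minimal sketch that counts allocations per call before and after applying the "eliminate" lever to a hypothetical hotspot (string concatenation in a loop). The function names are illustrative assumptions; `testing.AllocsPerRun` is the standard-library helper and can be called outside of a test. Exact counts depend on the Go version, so the sketch prints them rather than asserting specific values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"testing"
)

// sink keeps results reachable so the calls are not optimized away.
var sink string

// nameConcat is the "before": every += allocates a new, longer string.
func nameConcat(id int) string {
	name := ""
	for j := 0; j < 8; j++ {
		name += strconv.Itoa(id) + "-"
	}
	return name
}

// nameBuilder is the "after": format the id once and size the buffer up front.
func nameBuilder(id int) string {
	s := strconv.Itoa(id)
	var b strings.Builder
	b.Grow(8 * (len(s) + 1))
	for j := 0; j < 8; j++ {
		b.WriteString(s)
		b.WriteByte('-')
	}
	return b.String()
}

// allocsPerCall reports the average number of heap allocations per call.
func allocsPerCall(f func(int) string) float64 {
	return testing.AllocsPerRun(1000, func() { sink = f(12345) })
}

func main() {
	fmt.Printf("before: %.0f allocs/call\n", allocsPerCall(nameConcat))
	fmt.Printf("after:  %.0f allocs/call\n", allocsPerCall(nameBuilder))
}
```

A per-function count like this complements the service-level table: it shows which lever moved the number, while the load test shows whether it mattered.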
Senior stretch
  • Add a one-page on-call runbook: quick triage from the four panels, common allocation causes for your runtime, the fix-priority ladder, and a verification checklist.
  • Add an allocation-driven DoS guard — request-body size limit and result-size cap — and show the service stays bounded under an oversized-payload flood.
  • Add a CI gate that load-tests a canary, diffs the allocation profile against main, and fails the build if any function's allocation share grows more than 20%.
  • Repeat the experiment on a second runtime (e.g. add a JVM or Node version) and compare how the same allocation pattern manifests under a different collector.
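
For the allocation-driven DoS guard, one possible shape is sketched below: cap the request body with `http.MaxBytesReader` and clamp the number of results a single request may ask for. The limits, the `query` type, the handler name, and the helper `statusForPaddedBody` are assumptions for illustration; `*http.MaxBytesError` requires Go 1.19 or later.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

const (
	maxBodyBytes = 1 << 20 // assumed cap: 1 MiB per request body
	maxResults   = 1000    // assumed cap on items returned per request
)

type query struct {
	Limit int `json:"limit"`
}

// clampLimit bounds the caller-supplied result count.
func clampLimit(n int) int {
	if n <= 0 || n > maxResults {
		return maxResults
	}
	return n
}

func searchHandler(w http.ResponseWriter, r *http.Request) {
	// Reads beyond maxBodyBytes fail instead of growing the heap.
	r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes)
	var q query
	if err := json.NewDecoder(r.Body).Decode(&q); err != nil {
		var tooBig *http.MaxBytesError
		if errors.As(err, &tooBig) {
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	results := make([]int, 0, clampLimit(q.Limit))
	for i := 0; i < cap(results); i++ {
		results = append(results, i)
	}
	json.NewEncoder(w).Encode(results)
}

// statusForPaddedBody sends a JSON body with padBytes of filler through the
// handler in-process and returns the HTTP status code.
func statusForPaddedBody(padBytes int) int {
	body := `{"limit":5,"pad":"` + strings.Repeat("a", padBytes) + `"}`
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/search", strings.NewReader(body))
	searchHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("small body:", statusForPaddedBody(10))
	fmt.Println("oversized body:", statusForPaddedBody(2<<20))
}
```

Both caps bound the memory a single request can force the service to allocate, which is what keeps an oversized-payload flood from turning into a GC problem.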
Recap

This is the loop you will run in every real GC incident: instrument first, diagnose from a profile, fix at the top of the ladder (eliminate before pool before tune before switch), defend the memory bound with GOMEMLIMIT, and verify with before/after numbers under identical load. Doing it once on a toy service makes the production version muscle memory.
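
The memory bound can be set through the `GOMEMLIMIT` environment variable (for example `GOMEMLIMIT=900MiB`) or programmatically. The sketch below takes the programmatic route with `debug.SetMemoryLimit` (Go 1.19 or later) and derives the limit from a container size minus headroom. The 1 GiB container size, the 10% headroom, and the helper name `memLimitFor` are assumptions for illustration, not recommendations from the unit.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// memLimitFor leaves headroomPercent of the container's memory for
// non-heap use (goroutine stacks, cgo, the OS) and returns the rest.
func memLimitFor(containerBytes, headroomPercent int64) int64 {
	return containerBytes * (100 - headroomPercent) / 100
}

func main() {
	const containerBytes = 1 << 30 // assumed 1 GiB container limit
	limit := memLimitFor(containerBytes, 10)

	// SetMemoryLimit installs a soft limit and returns the previous one.
	prev := debug.SetMemoryLimit(limit)
	fmt.Printf("soft memory limit: %d bytes (previous: %d)\n", limit, prev)
}
```

The limit is soft: it makes the collector work harder as the heap approaches the bound, so it defends against out-of-memory kills but does not replace reducing allocations first.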

