GC: tame a death-spiral

Hands-on project — instrument, diagnose, and tame a GC death-spiral in a small allocation-heavy service, then prove the fix with before/after numbers.

PERF · Senior · 240 min

Reading about death-spirals is not the same as pulling a service out of one. Build a small allocation-heavy server, drive it into GC trouble, and apply the unit’s fix ladder until the numbers come back — with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: instrument allocation and GC, diagnose the hotspot from a profile, reduce allocations, defend the memory bound, and verify the fix with before/after metrics.
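The "instrument" step can start from the runtime's own counters. The following is a minimal sketch, assuming a Go service; the `sample` type and the `take` and `allocRate` helpers are hypothetical names, not part of the unit's starter. It derives an allocation rate from two `runtime.MemStats` snapshots. Note that `GCCPUFraction` is cumulative since process start, so it is a coarse signal rather than a windowed one.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sample holds the cumulative runtime counters needed for one table row.
type sample struct {
	at         time.Time
	totalAlloc uint64  // bytes allocated since process start
	numGC      uint32  // completed GC cycles
	pauseNs    uint64  // total stop-the-world pause time, in nanoseconds
	gcCPU      float64 // GC share of available CPU since process start
}

// take reads the counters once. ReadMemStats briefly stops the world,
// so sample it on a timer rather than on every request.
func take() sample {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return sample{time.Now(), m.TotalAlloc, m.NumGC, m.PauseTotalNs, m.GCCPUFraction}
}

// allocRate converts two cumulative byte counters into bytes per second.
func allocRate(before, after uint64, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 || after < before {
		return 0
	}
	return float64(after-before) / elapsedSeconds
}

// sink keeps the demo allocations on the heap.
var sink [][]byte

func main() {
	a := take()
	for i := 0; i < 100_000; i++ {
		sink = append(sink[:0], make([]byte, 1024)) // stand-in for request work
	}
	b := take()

	secs := b.at.Sub(a.at).Seconds()
	fmt.Printf("alloc rate: %.1f MiB/s\n", allocRate(a.totalAlloc, b.totalAlloc, secs)/(1<<20))
	fmt.Printf("GC cycles: %d, pause total: %v, GC CPU since start: %.2f%%\n",
		b.numGC-a.numGC, time.Duration(b.pauseNs-a.pauseNs), b.gcCPU*100)
}
```

For the death-spiral signature itself, running the service with `GODEBUG=gctrace=1` prints one line per GC cycle and complements these counters.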

Project

Objective

Take a deliberately allocation-heavy HTTP service (your own or the starter below) and bring its GC CPU share under 5% and its p99 request latency under your target, without switching collectors, proving each step with measurements.

Requirements

Acceptance criteria

A before/after table of alloc rate, GC CPU %, p99 GC pause, and p99 request latency, measured under the same load, not estimated.
The allocation profile clearly shows the top hotspots shrinking after the fix (re-profiled, not assumed).
GC CPU share holds under ~5% and the death-spiral signature is gone from gctrace at sustained load.
A one-paragraph write-up naming the lever used for each hotspot and why it ranked above tuning the collector.
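To illustrate "re-profiled, not assumed" at the scale of a single hotspot, here is a small sketch that measures allocations per call before and after an "eliminate" fix. The `joinNaive` and `joinLean` functions are invented stand-ins for a hotspot in your own service, and `allocsPerCall` is a hypothetical wrapper around the standard `testing.AllocsPerRun`. It does not replace the load-tested table; it only shows the habit of measuring both variants the same way.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"testing"
)

// joinNaive is the "before" hotspot: every += builds a new string,
// so allocations grow with the number of ids.
func joinNaive(ids []int) string {
	s := ""
	for i, id := range ids {
		if i > 0 {
			s += ","
		}
		s += strconv.Itoa(id)
	}
	return s
}

// joinLean is the "after": one pre-sized builder and a reused scratch
// buffer, so the per-element allocations disappear.
func joinLean(ids []int) string {
	var b strings.Builder
	b.Grow(len(ids) * 8) // assumes ids render to at most 7 digits plus a comma
	var scratch [20]byte
	for i, id := range ids {
		if i > 0 {
			b.WriteByte(',')
		}
		b.Write(strconv.AppendInt(scratch[:0], int64(id), 10))
	}
	return b.String()
}

// sink keeps results alive so the compiler cannot drop the work.
var sink string

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(100, f)
}

func main() {
	ids := make([]int, 1000)
	for i := range ids {
		ids[i] = 1000 + i
	}
	before := allocsPerCall(func() { sink = joinNaive(ids) })
	after := allocsPerCall(func() { sink = joinLean(ids) })

	fmt.Printf("%-10s %12s\n", "variant", "allocs/call")
	fmt.Printf("%-10s %12.0f\n", "before", before)
	fmt.Printf("%-10s %12.0f\n", "after", after)
}
```

The same comparison belongs in the write-up for each hotspot: name the lever, then show the number that moved.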

Senior stretch

Add a one-page on-call runbook: quick triage from the four panels, common allocation causes for your runtime, the fix-priority ladder, and a verification checklist.
Add an allocation-driven DoS guard — request-body size limit and result-size cap — and show the service stays bounded under an oversized-payload flood.
Add a CI gate that load-tests a canary, diffs the allocation profile against main, and fails the build if any function's allocation share grows more than 20%.
Repeat the experiment on a second runtime (e.g. add a JVM or Node version) and compare how the same allocation pattern manifests under a different collector.
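The allocation-driven DoS guard from the stretch list could look roughly like the sketch below, assuming Go 1.19 or later (for `http.MaxBytesError`) and a JSON endpoint. The limits, the `handle` and `capResults` functions, the request shape, and the `statusForBodySize` helper are all hypothetical choices for illustration. The handler is exercised in-process with `httptest` so the example terminates on its own.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

const (
	maxBodyBytes = 1 << 20 // hypothetical request-body limit: 1 MiB
	maxResults   = 1000    // hypothetical result-size cap
)

// capResults bounds how many items a single response may carry.
func capResults(items []string, limit int) []string {
	if len(items) > limit {
		return items[:limit]
	}
	return items
}

// handle rejects oversized bodies before they are fully buffered
// and caps the size of what it sends back.
func handle(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes)

	var req struct {
		Items []string `json:"items"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		var tooBig *http.MaxBytesError
		if errors.As(err, &tooBig) {
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(capResults(req.Items, maxResults))
}

// statusForBodySize sends one in-process request whose single item
// is n bytes long and returns the HTTP status the handler chose.
func statusForBodySize(n int) int {
	body := `{"items":["` + strings.Repeat("a", n) + `"]}`
	req := httptest.NewRequest(http.MethodPost, "/items", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handle(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("small body status:", statusForBodySize(10))
	fmt.Println("oversized body status:", statusForBodySize(2<<20))
}
```

To show the service stays bounded, drive the real listener with an oversized-payload flood and record heap size alongside the rejection rate.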

Recap

This is the loop you will run in every real GC incident: instrument first, diagnose from a profile, fix at the top of the ladder (eliminate allocations before pooling, pool before tuning the collector, tune before switching collectors), defend the memory bound with GOMEMLIMIT, and verify with before/after numbers under identical load. Doing it once on a toy service makes the production version muscle memory.
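As a reminder of what "defend the memory bound" looks like in code, here is a minimal sketch assuming Go 1.19 or later. `debug.SetMemoryLimit` is the programmatic counterpart of the `GOMEMLIMIT` environment variable; the `softLimit` helper, the 512 MiB container size, and the 20% headroom are hypothetical values chosen for illustration. The limit is soft: the runtime collects more aggressively as the heap approaches it, so it is a backstop behind the allocation fixes, not a substitute for them.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// softLimit leaves headroomPercent of the container's memory for
// non-heap use (stacks, cgo, kernel buffers) and returns the rest.
func softLimit(containerBytes, headroomPercent int64) int64 {
	if headroomPercent < 0 || headroomPercent > 100 {
		headroomPercent = 0 // ignore nonsensical headroom values
	}
	return containerBytes / 100 * (100 - headroomPercent)
}

func main() {
	const containerBytes = 512 << 20 // hypothetical container memory limit

	limit := softLimit(containerBytes, 20)
	prev := debug.SetMemoryLimit(limit) // returns the previously set limit
	fmt.Printf("memory limit changed from %d to %d bytes\n", prev, limit)

	// A negative argument reads the current limit without changing it.
	fmt.Println("limit in effect:", debug.SetMemoryLimit(-1))
}
```

Setting `GOMEMLIMIT=410MiB` in the environment has a comparable effect without a code change, which is often easier to roll out during an incident.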
