Performance PERF · 04 · 06

GC in production: observability, security, edge cases, and fleet governance

GC pressure shows up in production as SLO burns before it shows up as OOMs. Building alloc-rate dashboards, finalizer-safe code, and a one-page runbook accessible to on-call SREs is what separates teams that chase GC fires from teams that prevent them.

PERF Senior ◷ 20 min

A Go service starts at 1% GC CPU. Sixty seconds later, gctrace shows 39% of CPU time since startup spent in the collector. Concurrent mark time has grown from 1.3 ms to 352 ms. The service is in a GC death-spiral, and it will OOM before on-call finishes reading the alert.

Reading gctrace: the Go death-spiral

GODEBUG=gctrace=1 writes one line per GC cycle:

gc N @Ts X%: Apre+Aconc+Apost ms clock, ... ms cpu, Bstart->Bend->Blive MB, G MB goal, ... P

Key fields:

X%: percentage of time spent in GC since program start.
Aconc ms: wall-clock time of the concurrent mark and scan phase.
Bstart->Bend->Blive: heap size when the cycle starts, heap size when it ends, and the live heap that survives it.
G MB goal: the heap size the pacer is targeting.
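The fields listed above can be pulled out of a trace line programmatically, which is handy when gctrace output is the only signal available. The following is a minimal sketch, not a complete parser: the type `GCLine`, the function `ParseGCTrace`, and the CPU-time portion of the sample line are illustrative assumptions, and the regular expression skips every field the text does not discuss.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// GCLine holds the gctrace fields discussed in the text.
type GCLine struct {
	Cycle        int
	GCPercent    int     // X%: share of time spent in GC since program start
	ConcurrentMs float64 // Aconc: wall-clock time of the concurrent mark phase
	HeapStartMB  int     // heap size when the cycle started
	HeapEndMB    int     // heap size when the cycle ended
	HeapLiveMB   int     // live heap after the cycle
	GoalMB       int     // heap size the pacer was targeting
}

// gcLineRE matches the leading part of a gctrace line. The "ms cpu" section
// between the clock times and the heap sizes is skipped, not parsed.
var gcLineRE = regexp.MustCompile(
	`^gc\s+(\d+)\s+@[\d.]+s\s+(\d+)%:\s+[\d.]+\+([\d.]+)\+[\d.]+\s+ms clock,.*?\s(\d+)->(\d+)->(\d+)\s+MB,\s+(\d+)\s+MB goal`)

// ParseGCTrace extracts the fields from one line. The bool is false when the
// line does not look like a gctrace cycle line.
func ParseGCTrace(line string) (GCLine, bool) {
	m := gcLineRE.FindStringSubmatch(line)
	if m == nil {
		return GCLine{}, false
	}
	n := make([]int, 0, 6)
	for _, i := range []int{1, 2, 4, 5, 6, 7} {
		v, err := strconv.Atoi(m[i])
		if err != nil {
			return GCLine{}, false
		}
		n = append(n, v)
	}
	ms, err := strconv.ParseFloat(m[3], 64)
	if err != nil {
		return GCLine{}, false
	}
	return GCLine{
		Cycle: n[0], GCPercent: n[1], ConcurrentMs: ms,
		HeapStartMB: n[2], HeapEndMB: n[3], HeapLiveMB: n[4], GoalMB: n[5],
	}, true
}

func main() {
	// Sample line: the cpu section and trailing fields are made-up filler.
	line := "gc 488 @62.1s 39%: 0.41+352+1.1 ms clock, 3.2+2700/700/0+8.8 ms cpu, " +
		"1900->2980->1490 MB, 1990 MB goal, 0 MB stacks, 0 MB globals, 8 P"
	if g, ok := ParseGCTrace(line); ok {
		fmt.Printf("%+v\n", g)
	}
}
```

Feeding successive lines through a parser like this makes the two death-spiral signals (GC% and concurrent mark time) easy to plot side by side.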

A death-spiral signature:

gc 1   @0.012s  1%: 0.011+1.3+0.018 ms ...  4->4->1 MB,  5 MB goal
gc 2   @0.045s  3%: 0.024+5.8+0.032 ms ...  5->8->4 MB,  7 MB goal
...
gc 487 @61.2s  38%: 0.45+340+1.2 ms  ... 1820->2890->1450 MB, 1900 MB goal
gc 488 @62.1s  39%: 0.41+352+1.1 ms  ... 1900->2980->1490 MB, 1990 MB goal

Concurrent mark growing from 1.3 ms → 352 ms while GC% rises from 1% to 39% means the pacer cannot finish each cycle before the next allocation wave arrives. The collector falls behind; the pacer responds by scheduling GC more aggressively, which steals CPU, which reduces application throughput, which makes allocation pressure worse relative to available CPU. Left unchecked, this ends in OOM.

The death-spiral in one picture: GC CPU stays near 1–3% early, then runs away to 38–39% by the 60-second mark, far past the 10% alert line. The pacer steals more CPU the further behind it falls.

Fix priorities:

Immediate: find the allocation hotspot via /debug/pprof/allocs and reduce the allocation rate. Even a 50% cut should drop GC CPU below 15%.
Short-term: set GOMEMLIMIT to ~90% of the container’s memory limit. It is a soft limit: the pacer collects more aggressively as the heap approaches it, which bounds memory but does not by itself lower GC CPU.
Tuning: GOGC=200 lets the heap grow to roughly three times the live set before the next cycle (the default GOGC=100 allows roughly double), trading memory for lower cycle frequency. Apply it only after allocation reduction; it masks the problem rather than fixing it.
Architecture: if the workload genuinely needs lots of live data, consider off-heap stores (Redis, mmap’d files) instead of in-heap caches.
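As a rough illustration of the short-term step, the sketch below derives a soft memory limit at 90% of the container cap and applies it with `debug.SetMemoryLimit`, the programmatic equivalent of the GOMEMLIMIT environment variable. It assumes the service runs under cgroup v2, where the cap is exposed at `/sys/fs/cgroup/memory.max`; the helper `softLimitFrom` is a hypothetical name, and cgroup v1 hosts would need a different path.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// cgroupMemoryMax is the cgroup v2 file holding the container memory cap.
// Assumption: cgroup v2. Under cgroup v1 the file lives elsewhere.
const cgroupMemoryMax = "/sys/fs/cgroup/memory.max"

// softLimitFrom converts the raw contents of memory.max into a soft limit at
// the given percentage. It reports false for "max" (no cap) or invalid input.
func softLimitFrom(raw string, percent int64) (int64, bool) {
	s := strings.TrimSpace(raw)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	// Divide first so that very large caps cannot overflow.
	return n / 100 * percent, true
}

func main() {
	raw, err := os.ReadFile(cgroupMemoryMax)
	if err != nil {
		fmt.Println("no cgroup v2 memory file found; leaving the limit unchanged")
		return
	}
	limit, ok := softLimitFrom(string(raw), 90)
	if !ok {
		fmt.Println("container has no memory cap; leaving the limit unchanged")
		return
	}
	prev := debug.SetMemoryLimit(limit) // returns the previously set limit
	fmt.Printf("memory limit changed from %d to %d bytes\n", prev, limit)
}
```

Setting the environment variable in the deployment manifest achieves the same thing without code; the in-process version is mainly useful when the container cap is not known at deploy time.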

Production observability per runtime

Runtime	Alloc-rate metric	Pause metric	GC CPU metric
Go	`runtime/metrics: /gc/heap/allocs:bytes`	`PauseTotalNs` rate	`gctrace X%`
JVM	Micrometer `jvm_gc_memory_allocated_bytes_total`	`jvm_gc_pause_seconds` histogram	JFR GCCPUTime events
Node	`v8.getHeapStatistics()` delta	`PerformanceObserver 'gc'`	No built-in; derive from pause total
.NET	`dotnet-counters` `alloc-rate`	EventCounters `time-in-gc`	`time-in-gc`

The senior dashboard pattern — four panels per service:

Allocation rate (bytes/s) over time.
GC pause distribution (p50/p99/max histogram).
GC CPU share (%).
Heap size vs live-set trend.
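For a Go service, the first and third panels can be fed from `runtime/metrics` without parsing gctrace. The sketch below samples the cumulative allocation counter and the runtime's CPU-class estimates, then derives a rate and a share from two snapshots. The `snapshot` type and helper names are illustrative, and the CPU-class metrics are assumed to be available (Go 1.20 or later); they are estimates rather than exact accounting.

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

const (
	metricAllocBytes = "/gc/heap/allocs:bytes"             // cumulative heap bytes allocated
	metricGCCPU      = "/cpu/classes/gc/total:cpu-seconds" // estimated CPU time spent in GC
	metricTotalCPU   = "/cpu/classes/total:cpu-seconds"    // estimated total available CPU time
)

// snapshot is one reading of the counters behind the alloc-rate and GC CPU panels.
type snapshot struct {
	at         time.Time
	allocBytes uint64
	gcCPU      float64
	totalCPU   float64
}

func takeSnapshot() snapshot {
	s := []metrics.Sample{{Name: metricAllocBytes}, {Name: metricGCCPU}, {Name: metricTotalCPU}}
	metrics.Read(s)
	snap := snapshot{at: time.Now()}
	// Kind checks guard against a metric that this Go version does not export.
	if s[0].Value.Kind() == metrics.KindUint64 {
		snap.allocBytes = s[0].Value.Uint64()
	}
	if s[1].Value.Kind() == metrics.KindFloat64 {
		snap.gcCPU = s[1].Value.Float64()
	}
	if s[2].Value.Kind() == metrics.KindFloat64 {
		snap.totalCPU = s[2].Value.Float64()
	}
	return snap
}

// allocRate returns bytes allocated per second between two snapshots.
func allocRate(prev, cur snapshot) float64 {
	dt := cur.at.Sub(prev.at).Seconds()
	if dt <= 0 {
		return 0
	}
	return float64(cur.allocBytes-prev.allocBytes) / dt
}

// gcCPUShare returns the fraction (0..1) of CPU time spent in GC between two snapshots.
func gcCPUShare(prev, cur snapshot) float64 {
	total := cur.totalCPU - prev.totalCPU
	if total <= 0 {
		return 0
	}
	return (cur.gcCPU - prev.gcCPU) / total
}

var sink []byte // keeps the demo allocations on the heap

func main() {
	prev := takeSnapshot()
	for i := 0; i < 1000; i++ {
		sink = make([]byte, 64<<10)
	}
	time.Sleep(100 * time.Millisecond)
	cur := takeSnapshot()
	fmt.Printf("alloc rate: %.0f bytes/s, GC CPU share: %.2f%%\n",
		allocRate(prev, cur), 100*gcCPUShare(prev, cur))
}
```

In a real exporter the two snapshots would be taken one scrape interval apart, and the results published as gauges alongside the pause histogram.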

Tie to SLO burns: GC pause regressions are a leading indicator for tail-latency SLO violations. Alert on alloc rate crossing a per-service threshold (default 300–500 MB/s/core) for more than 5 minutes; alert on p99 pause above 100 ms (G1) / 5 ms (ZGC) / 50 ms (Go); alert on GC CPU > 10%.
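The alert rules in the paragraph above reduce to three comparisons. The sketch below encodes them as a pure function so the thresholds can be unit-tested before they are translated into a real alerting system; the `Reading` type, the rule names, and the idea of passing the per-service allocation threshold as a parameter are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Collector identifies which p99 pause threshold applies.
type Collector string

const (
	G1   Collector = "g1"
	ZGC  Collector = "zgc"
	GoGC Collector = "go"
)

// p99PauseLimit holds the per-collector p99 pause thresholds from the text.
var p99PauseLimit = map[Collector]time.Duration{
	G1:   100 * time.Millisecond,
	ZGC:  5 * time.Millisecond,
	GoGC: 50 * time.Millisecond,
}

const (
	allocSustain  = 5 * time.Minute // alloc rate must stay high for longer than this
	gcCPUMaxShare = 0.10            // GC CPU share alert line
)

// Reading is a hypothetical per-service sample assembled from the dashboard panels.
type Reading struct {
	Collector      Collector
	AllocMBPerCore float64       // allocation rate in MB/s per core
	AllocAboveFor  time.Duration // how long the rate has exceeded the threshold
	P99Pause       time.Duration
	GCCPUShare     float64 // fraction of CPU spent in GC, 0..1
}

// Alerts returns the names of the rules that r violates.
func Alerts(r Reading, allocThresholdMB float64) []string {
	var fired []string
	if r.AllocMBPerCore > allocThresholdMB && r.AllocAboveFor > allocSustain {
		fired = append(fired, "alloc-rate")
	}
	if limit, ok := p99PauseLimit[r.Collector]; ok && r.P99Pause > limit {
		fired = append(fired, "p99-pause")
	}
	if r.GCCPUShare > gcCPUMaxShare {
		fired = append(fired, "gc-cpu")
	}
	return fired
}

func main() {
	r := Reading{
		Collector:      GoGC,
		AllocMBPerCore: 620,
		AllocAboveFor:  7 * time.Minute,
		P99Pause:       80 * time.Millisecond,
		GCCPUShare:     0.39,
	}
	fmt.Println(Alerts(r, 500))
}
```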

Security: allocation-driven DoS

An attacker who can cause the server to allocate large objects can drive it to OOM or crippling GC overhead. Unbounded resource consumption of this kind appears in the OWASP API Security Top 10 (API4:2023, Unrestricted Resource Consumption). Ask yourself about every public endpoint: what is the maximum amount of heap this request can cause my service to allocate, and is that amount bounded?

Attack vectors:

Oversized request bodies: parse a 100 MB JSON to discover one bad field.
Unbounded query results: return all rows when pagination was expected.
Regex bombs: backtracking allocates intermediate matching state.
Zip-bomb decompression: small input → huge expansion.
Deep JSON nesting: recursive parsers allocate call-stack-equivalent objects.

Mitigations:

Enforce request body size limits at the gateway (default 1 MB; larger on specific endpoints with explicit auth).
Cap query result sizes server-side; never SELECT * without a LIMIT.
Use RE2-based regex engines (no backtracking; linear time).
Validate compression ratios before decompressing.
Set per-request memory tracking with hard limits and explicit overflow handling.
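The gateway limit is the first line of defence, but the same bound can be repeated inside the service so that a misconfigured gateway does not expose it. The following is a minimal Go sketch using `http.MaxBytesReader` with the 1 MB default mentioned above; the handler, the `payload` type, and the `statusForSize` helper (which drives the handler through `httptest` purely for demonstration) are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

const maxBodyBytes = 1 << 20 // 1 MB default from the mitigation list

// payload is a hypothetical request shape.
type payload struct {
	Name string `json:"name"`
}

func handle(w http.ResponseWriter, r *http.Request) {
	// Reads beyond maxBodyBytes fail, so the decoder cannot buffer more than that.
	r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes)

	var p payload
	if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
		var tooBig *http.MaxBytesError
		if errors.As(err, &tooBig) {
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "invalid JSON", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "hello %s", p.Name)
}

// statusForSize sends a JSON body whose string field holds n filler bytes
// and returns the HTTP status the handler produced.
func statusForSize(n int) int {
	body := `{"name":"` + strings.Repeat("a", n) + `"}`
	req := httptest.NewRequest(http.MethodPost, "/", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handle(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("small body:", statusForSize(10))
	fmt.Println("oversized body:", statusForSize(2<<20))
}
```

Endpoints that legitimately need larger payloads would wrap the body with a higher, still explicit, limit rather than removing the wrapper.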

Why this works

Every allocation site that scales with attacker-controlled input needs an explicit bound. The Linux kernel uses kmem_cache limits and cgroup memory caps; application code should mirror this discipline. A single unguarded endpoint that accepts multi-MB payloads can bring down a service by triggering GC pressure that propagates to every request.
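Decompression is a clear case of an allocation site that scales with attacker-controlled input. One way to bound it, sketched below under the assumption that the payload is gzip, is to cap the number of decompressed bytes read rather than trusting any declared size. The names `decompressCapped`, `gzipZeros`, and `errTooLarge` are illustrative.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

var errTooLarge = errors.New("decompressed payload exceeds limit")

// decompressCapped inflates gz but refuses to produce more than max bytes.
func decompressCapped(gz []byte, max int64) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(gz))
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	// Read at most max+1 bytes: one extra byte is enough to detect overflow
	// without ever holding the full expansion in memory.
	out, err := io.ReadAll(io.LimitReader(zr, max+1))
	if err != nil {
		return nil, err
	}
	if int64(len(out)) > max {
		return nil, errTooLarge
	}
	return out, nil
}

// gzipZeros builds a highly compressible test payload of n zero bytes.
func gzipZeros(n int) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	_, _ = zw.Write(make([]byte, n))
	_ = zw.Close()
	return buf.Bytes()
}

func main() {
	bomb := gzipZeros(10 << 20) // 10 MB of zeros compresses to a few KB
	fmt.Println("compressed size:", len(bomb))

	_, err := decompressCapped(bomb, 1<<20)
	fmt.Println("capped at 1 MB:", err)
}
```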

Edge cases

Finalizer storms: registering many objects with finalizers (Object.finalize in Java, runtime.SetFinalizer in Go, FinalizationRegistry in JS) forces the GC to queue them for separate finalizer processing: a dedicated finalizer thread in Java, a single goroutine in Go, a later event-loop task in JS. A burst of finalisable objects created in a tight loop can outpace that queue, so reclamation stalls while it drains. File handles, sockets, and native memory held by those objects remain open until the queue clears.

Fix: avoid finalizers entirely. Use explicit close() / Closeable / defer patterns. In Java, java.lang.ref.Cleaner (JDK 9+) is a safer backstop than finalize(). In Go, prefer defer over SetFinalizer.
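In Go, the explicit pattern looks roughly like the sketch below: the resource is released by `defer` when the function returns, so its lifetime never depends on when the collector runs. `Handle` is a hypothetical stand-in for a file, socket, or native buffer, and the `openHandles` counter is an assumed metric that makes leaks visible.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// openHandles counts live handles so that a leak shows up as a rising gauge.
var openHandles atomic.Int64

// Handle is a hypothetical resource that must be released explicitly.
type Handle struct {
	closed bool
}

// Open acquires a handle and records it.
func Open() *Handle {
	openHandles.Add(1)
	return &Handle{}
}

// Close releases the handle. Closing twice is reported as an error.
func (h *Handle) Close() error {
	if h.closed {
		return errors.New("handle already closed")
	}
	h.closed = true
	openHandles.Add(-1)
	return nil
}

// process uses a handle and releases it deterministically on return,
// with no finalizer involved.
func process() error {
	h := Open()
	defer h.Close()

	// ... work with h here ...
	return nil
}

func main() {
	for i := 0; i < 1000; i++ {
		if err := process(); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("open handles after 1000 calls:", openHandles.Load())
}
```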

Pinned objects: objects that cannot move (DMA buffers, JNI-pinned arrays, V8 typed-array external memory) prevent the collector from compacting around them. A sustained leak of pinned objects fragments the heap and causes OOM at low utilisation.

Fix: give pinning an explicit lifecycle; audit all JNI/native interop for code paths that never unpin. Alert on heap fragmentation metrics (Go: HeapInuse - HeapAlloc; JVM: HeapUsed - LiveSet).
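The Go metric named above can be read from `runtime.MemStats`, where the field is spelled `HeapInuse`. The sketch below computes the difference and its ratio to in-use span memory; the function name `fragmentation` is illustrative, and the result is best read as an upper bound on fragmentation rather than an exact figure.

```go
package main

import (
	"fmt"
	"runtime"
)

// fragmentation returns HeapInuse - HeapAlloc in bytes, plus that slack as a
// fraction of HeapInuse. Both are 0 if the runtime reports no slack.
func fragmentation() (slack uint64, ratio float64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // briefly stops the world, so sample sparingly

	if m.HeapInuse == 0 || m.HeapInuse <= m.HeapAlloc {
		return 0, 0
	}
	slack = m.HeapInuse - m.HeapAlloc
	return slack, float64(slack) / float64(m.HeapInuse)
}

func main() {
	slack, ratio := fragmentation()
	fmt.Printf("slack: %d bytes (%.1f%% of in-use spans)\n", slack, 100*ratio)
}
```

A ratio that keeps climbing while the live set stays flat is the trend worth alerting on; a single high reading after a burst of short-lived allocations is usually harmless.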

Reference loops with finalizers: mutual strong references between objects that also have finalizers can prevent reclamation even with a cycle collector, because finalizers must run in a defined order the GC cannot always determine. Fix: WeakRef where appropriate; never combine finalizers with circular strong references.

History: 1960 to 2024

Five steps in 64 years:

1960: McCarthy’s Lisp introduces mark-sweep. First software GC, batch and slow.
1970: Cheney’s copying collector. Splits the heap, copies live objects, allocates by bump pointer. Still influences V8’s Scavenger.
1984: Ungar’s generational hypothesis (Berkeley Smalltalk). Most objects die young; exploit it.
1990s: Incremental and concurrent GCs mature, building on Dijkstra’s tri-color abstraction and Baker’s incremental copying (both 1978) and Yuasa’s snapshot barrier (1990). Pauses drop from seconds to tens of ms.
2010s–2020s: Low-pause concurrent collectors (G1, ZGC, Shenandoah, Go’s tri-color, V8 Orinoco). Sub-ms pauses on multi-GB heaps. Closed-loop pacers, generational ZGC, energy-aware tuning for cloud workloads.

Each generation lowered pause cost by an order of magnitude. Senior engineers know enough of this lineage to read modern collector documentation and recognise which generation’s tradeoffs the docs describe.

Production stories

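Discord 2020: the Go-based Read States service showed latency spikes roughly every two minutes, traced to the Go runtime’s forced periodic GC scanning a large in-memory cache. The team removed the spikes by rewriting the service in Rust.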

LinkedIn 2018: migrated a large Kafka cluster from CMS to G1. p99 latency dropped 25–50% and operator burden fell.

Netflix 2022: deployed ZGC across the Cassandra fleet. p99 read latency improved 5–10x with no application code changes.

Twitter 2019: a finalizer storm caused OOM in a JVM service. Replaced with explicit Closeable.

Stripe 2023: a Go service hit GOMEMLIMIT during a traffic spike. The pacer kept memory bounded but throughput dropped 15% — diagnosed and fixed by reducing allocations in the hot path.

Pattern: every large production service has a GC story. Senior engineers operate on the assumption that GC will be a question; the goal is to know when it’s an answer.

Quiz

A Java service uses Object.finalize() on resources that hold file handles. Under load, open file-handle counts spike unpredictably. Most likely cause?

Quiz

An API endpoint accepts a JSON body with no size limit. How does this create an allocation-driven DoS vector?

Order the steps

Order the steps in diagnosing a Go GC death-spiral from first symptom to verified fix:

1 p99 latency alert fires — SLO burn rate elevated
2 Check gctrace or Prometheus GC CPU share — confirm GC% is rising
3 Capture allocation profile via /debug/pprof/allocs
4 Identify the top-N allocation hotspots by cumulative bytes
5 Apply targeted fix (pre-size slice, add sync.Pool, stream or postpone JSON encoding)
6 Re-profile to confirm allocation rate dropped ≥50%
7 Confirm GC CPU% and p99 returned to baseline
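Steps 5 and 6 can be rehearsed on a small scale. The sketch below applies one of the targeted fixes named in step 5 (pre-sizing a slice) and uses `testing.AllocsPerRun` to compare allocation counts before and after, which is a lightweight stand-in for re-profiling. The function names and the workload are hypothetical.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so results escape to the heap and allocations are counted.
var sink []int

// buildNaive grows the slice through repeated append, reallocating as it goes.
func buildNaive(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// buildPresized allocates the full capacity once, up front.
func buildPresized(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

// allocsPerCall reports the average number of heap allocations per call of f(n).
func allocsPerCall(f func(int) []int, n int) float64 {
	return testing.AllocsPerRun(100, func() { sink = f(n) })
}

func main() {
	const n = 1000
	fmt.Printf("naive:    %.0f allocs/call\n", allocsPerCall(buildNaive, n))
	fmt.Printf("presized: %.0f allocs/call\n", allocsPerCall(buildPresized, n))
}
```

In production the same comparison is made with two allocation profiles taken before and after the deploy, looking at cumulative bytes for the frames identified in step 4.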

Production GC operations: pause and alloc-rate dashboards surface a regression, the allocation profile pinpoints the hotspot, a targeted fix lands, and re-profiling confirms the return to baseline.

Recall before you leave

01
Walk through diagnosing and resolving a finalizer storm in a production Java service — metrics, structural fix, and how to prevent recurrence.
02
Design a GC observability programme for a 20-service polyglot fleet (Go + JVM + Node). What metrics, alerts, and runbook structure gives on-call SREs the signal to diagnose and fix a GC regression within one hour?

Recap

GC death-spirals appear in gctrace as rising GC% and growing concurrent-mark time; the fix order is allocation reduction first, GOMEMLIMIT second, tuning third.
Production observability requires four panels per service (alloc rate, pause histogram, GC CPU share, heap vs live-set), wired to Prometheus and alerting before SLO burns.
Finalizers are not for resource management: use explicit close() / try-with-resources / defer instead; finalizer storms keep file handles, sockets, and native memory open until the queue drains.
Every allocation site that scales with attacker-controlled input is a DoS vector; enforce body size limits at the gateway and result-size limits at the query layer.
Pinned objects fragment the heap and can cause OOM at low utilisation; audit JNI and native interop.
The one-page runbook (quick triage, common causes per language, fix priority, verification checklist) is what separates teams that prevent GC fires from teams that chase them.
Now when GC% starts climbing in your gctrace, you have the full map: read the signal, grab the alloc profile, and target the widest leaf, not the collector flag.
