Performance
GC internals: tri-color invariant, write barriers, and per-runtime deep-dives
A JVM service migrates from G1 to ZGC. Pauses drop from 60 ms to sub-millisecond on a 16 GB heap — but throughput drops 12% and memory use climbs 18%. Understanding why requires knowing what colored pointers are and what load barriers cost.
Tri-color marking and the write barrier
Tri-color abstraction (Dijkstra 1978) is the formal foundation of concurrent GC. Objects are classified into three colors:
- White — not yet visited; candidate for collection if marking ends while still white.
- Grey — visited, but children not yet fully scanned.
- Black — visited, all children scanned; considered live.
Marking moves grey objects to black by scanning their children and greying each unvisited child. When no grey objects remain, all white objects are unreachable and may be reclaimed.
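The marking loop described above can be sketched in a few lines. This is a minimal, single-threaded illustration, assuming a toy heap modeled as a dictionary of object names; the names `mark`, `heap`, and `roots` are hypothetical and do not come from any runtime.

```python
from collections import deque

WHITE, GREY, BLACK = "white", "grey", "black"

def mark(heap, roots):
    """Return the color of every object after one full marking pass.

    heap maps an object name to the list of object names it references.
    """
    color = {obj: WHITE for obj in heap}
    worklist = deque()
    for root in roots:
        if color[root] == WHITE:
            color[root] = GREY          # roots start grey
            worklist.append(root)
    while worklist:
        obj = worklist.popleft()
        for child in heap[obj]:
            if color[child] == WHITE:   # grey each unvisited child
                color[child] = GREY
                worklist.append(child)
        color[obj] = BLACK              # all children scanned
    return color

# "orphan" points at a live object but nothing points at "orphan".
heap = {"root": ["a", "b"], "a": ["c"], "b": [], "c": ["a"], "orphan": ["c"]}
colors = mark(heap, ["root"])
garbage = sorted(obj for obj, c in colors.items() if c == WHITE)
print(garbage)
```

With no mutator running, everything still white when the worklist empties is unreachable. The barriers discussed next exist because a real mutator keeps changing `heap` while this loop runs.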
The fundamental invariant: a black object must never directly reference a white object. Suppose a mutator thread stores a reference to a white object into a field of a black object after that black object was scanned, and then removes the remaining paths to the white object from grey objects. The collector never revisits the black object, so the white object is unreachable in the GC’s view but reachable in the program’s. The collector would reclaim live memory, which is silent heap corruption.
SATB vs incremental-update barriers
The write barrier prevents invariant violations by intercepting every reference write:
Snapshot-at-the-beginning (SATB): the barrier marks the old reference about to be overwritten, ensuring it survives this cycle. The collector behaves as if it snapshotted the heap at GC start. Used by G1, Shenandoah, ZGC, and Go’s hybrid Yuasa-style barrier.
Incremental-update (Dijkstra-style): the barrier marks the new reference being written into a black object, ensuring the newly pointed-to object is scanned before the cycle ends. Used by CMS and classic V8 mark-compact.
SATB is more conservative — it may preserve objects that became garbage during the cycle (floating garbage, reclaimed in the next cycle). But it gives stronger guarantees about marking termination and is simpler to reason about. Incremental-update may require a re-marking phase to fix up changes missed during concurrent marking.
Both barriers add work to every reference write. The overhead is commonly quoted at 2–10% of CPU, and it is the price of marking concurrently rather than inside one long stop-the-world pause.
| Barrier type | What it marks | Used by | Side effect |
|---|---|---|---|
| SATB | Old reference (pre-write) | G1, Shenandoah, ZGC, Go | Floating garbage (one cycle delay) |
| Incremental-update | New reference (post-write) | CMS, classic V8 | May need re-mark phase |
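The two barrier styles in the table can be compared on a toy heap. The sketch below is a simplified model, not any runtime's implementation: the `Heap` class, its `write` method, and the four-object graph are assumptions made for illustration. Marking is paused halfway, the mutator moves the only reference to `c` from a grey object to a black one, and marking then resumes.

```python
WHITE, GREY, BLACK = "white", "grey", "black"

class Heap:
    def __init__(self, barrier):
        self.refs = {"root": ["a", "b"], "a": [], "b": ["c"], "c": []}
        self.color = {name: WHITE for name in self.refs}
        self.worklist = []
        self.barrier = barrier  # "none", "satb", or "incremental"

    def shade(self, name):
        """Turn a white object grey and queue it for scanning."""
        if self.color[name] == WHITE:
            self.color[name] = GREY
            self.worklist.append(name)

    def scan_one(self):
        obj = self.worklist.pop(0)
        for child in self.refs[obj]:
            self.shade(child)
        self.color[obj] = BLACK

    def write(self, src, old, new):
        """Mutator store on src: drop reference `old`, add `new` (either may be None)."""
        if self.barrier == "satb" and old is not None:
            self.shade(old)   # keep the value that is about to be overwritten
        if (self.barrier == "incremental" and new is not None
                and self.color[src] == BLACK):
            self.shade(new)   # keep the value stored into a black object
        if old is not None:
            self.refs[src].remove(old)
        if new is not None:
            self.refs[src].append(new)

def survivors(barrier):
    heap = Heap(barrier)
    heap.shade("root")
    heap.scan_one()              # root is black; a and b are grey
    heap.scan_one()              # a is black; b is grey; c is still white
    heap.write("a", None, "c")   # black a now references white c
    heap.write("b", "c", None)   # the only grey path to c is removed
    while heap.worklist:
        heap.scan_one()
    return sorted(n for n, col in heap.color.items() if col != WHITE)

for kind in ("none", "satb", "incremental"):
    print(kind, survivors(kind))
```

Without a barrier, `c` stays white although `a` still references it, which is exactly the invariant violation described earlier. Either barrier greys `c` in time: SATB when the old reference is deleted from `b`, incremental-update when the new reference is stored into `a`.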
Why this works
Write barriers matter for write-heavy hot paths. A service that writes millions of references per second (e.g. updating a large in-memory graph) pays the barrier cost on every write. On most CRUD services this is negligible; on graph-mutation or event-sourcing workloads it shows up in profiles as runtime.wbBufFlush (Go) or similar GC frame names. Know your write pattern before claiming the barrier is free.
Go’s pacer redesign
Go 1.18’s GC pacer rewrite (proposal 44167, by Michael Knyszek) replaced heuristics with a closed-loop control system. The old pacer estimated when to start the next GC cycle so that it would finish just before the heap doubled; it was unstable at high allocation rates and made poor decisions on cgo-heavy workloads.
The new pacer uses a PI controller (proportional-integral) on two signals: heap-growth rate and GC CPU utilisation. The controller targets GC finishing just before the heap reaches the goal (GOGC-derived), with integral feedback preventing sustained drift.
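To see why the integral term matters, here is a generic PI loop on a made-up system with a constant bias. It is not the Go runtime's pacer code: the `PIController` class, the gains, the `bias` value, and the one-line plant model are all assumptions chosen to show the behavior, with the target standing in for "finish exactly at the heap goal".

```python
class PIController:
    """Textbook discrete PI controller: output = kp*error + ki*sum(errors)."""

    def __init__(self, kp, ki):
        self.kp = kp
        self.ki = ki
        self.integral = 0.0

    def update(self, error):
        self.integral += error
        return self.kp * error + self.ki * self.integral

def simulate(kp, ki, target=1.0, bias=0.2, steps=200):
    """Drive a toy system whose next value is control output plus a fixed bias."""
    controller = PIController(kp, ki)
    value = 0.0
    for _ in range(steps):
        error = target - value
        value = controller.update(error) + bias
    return value

p_only = simulate(kp=0.5, ki=0.0)   # proportional term alone
with_i = simulate(kp=0.5, ki=0.5)   # proportional plus integral
print(round(p_only, 4), round(with_i, 4))
```

In this toy model the proportional-only loop settles short of the target and stays there, while the PI loop converges on it because the accumulated error keeps pushing until the offset is gone. That is the "sustained drift" the integral feedback is meant to prevent.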
GOMEMLIMIT (added Go 1.19) integrates into the pacer: as the process approaches the limit, the pacer pulls GC forward — accepting higher GC CPU — to prevent OOM. When the limit is respected, the pacer backs off.
Senior production advice: set GOMEMLIMIT to ~90% of the container’s memory limit; leave GOGC at the default 100 unless profiling shows a specific reason to change it. GOGC=off is only safe for memory-bounded batch jobs that deallocate via process exit.
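The 90% rule is simple arithmetic, sketched below. The helper name `gomemlimit_for` and the 2 GiB container size are hypothetical. The only runtime fact relied on is that `GOMEMLIMIT` accepts a byte count with a unit suffix such as `MiB`.

```python
MIB = 1024 * 1024

def gomemlimit_for(container_limit_bytes, percent=90):
    """Return a GOMEMLIMIT string that leaves headroom under the container limit."""
    limit_bytes = container_limit_bytes * percent // 100  # integer math, rounds down
    return f"{limit_bytes // MIB}MiB"

# A container capped at 2 GiB:
value = gomemlimit_for(2 * 1024**3)
print(f"GOMEMLIMIT={value}")  # prints GOMEMLIMIT=1843MiB
```

The remaining 10% is headroom for memory the Go runtime does not manage, such as cgo allocations, which is why the limit is set below the container cap rather than at it.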
The redesign reduced pause variance by ~50% on real workloads. Reading: Knyszek’s GopherCon 2022 talk on the pacer redesign.
ZGC and colored pointers
ZGC (JEP 333, JDK 11 experimental; production in JDK 15 via JEP 377) achieves sub-millisecond pauses on heaps up to 16 TB using two innovations:
Colored pointers pack metadata bits into the 64-bit pointer itself. In its original layout, ZGC uses bits 0–41 for the address (capping the heap at 4 TB) and bits 42–45 for marking state: “good” colors vs “bad” colors indicating relocation or pending work. JDK 13 widened the address field to reach the 16 TB maximum.
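The bit packing can be illustrated with plain integer masks. The sketch assumes the 42-bit address layout described above. The flag names follow the commonly documented ZGC metadata bits, but the helper functions are hypothetical and this is not HotSpot code.

```python
ADDRESS_BITS = 42
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1   # bits 0-41

# Metadata ("color") bits 42-45.
MARKED0 = 1 << 42
MARKED1 = 1 << 43
REMAPPED = 1 << 44
FINALIZABLE = 1 << 45
COLOR_MASK = MARKED0 | MARKED1 | REMAPPED | FINALIZABLE

def color_pointer(address, color):
    """Combine an object address and a color into one 64-bit value."""
    if address & ~ADDRESS_MASK:
        raise ValueError("address does not fit in 42 bits")
    return address | color

def address_of(pointer):
    return pointer & ADDRESS_MASK

def color_of(pointer):
    return pointer & COLOR_MASK

ptr = color_pointer(0x1234_5000, REMAPPED)
print(hex(address_of(ptr)), color_of(ptr) == REMAPPED)
print((ADDRESS_MASK + 1) // 1024**4, "TiB addressable")
```

Because the color travels with the pointer, the collector can change which colors count as "good" at a phase change and every stale pointer in the heap becomes detectable without touching the objects themselves.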
Load barriers intercept every load of an object reference from the heap (loads of primitive fields are not affected). If the color is “bad,” the barrier triggers a slow path to update the pointer in-place. Because the barrier runs inline on every such load, the application participates in GC’s work incrementally instead of waiting for a big stop-the-world phase.
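A load barrier with in-place healing can be sketched as follows. This is a toy model under stated assumptions: the `Ref` class, the `forwarding` dictionary, and the two-state color are invented for illustration, and a real barrier is a few machine instructions emitted by the JIT rather than a function call.

```python
GOOD, BAD = "good", "bad"

class Ref:
    """A reference slot in the heap: an address plus a color tag."""

    def __init__(self, address, color):
        self.address = address
        self.color = color

# Old address -> new address for objects the collector has relocated.
forwarding = {0x1000: 0x8000}
slow_path_calls = 0

def load_barrier(slot):
    """Return a usable address for the slot, healing it if the color is bad."""
    global slow_path_calls
    if slot.color == GOOD:
        return slot.address        # fast path: a single check
    slow_path_calls += 1           # slow path: consult the forwarding table
    slot.address = forwarding.get(slot.address, slot.address)
    slot.color = GOOD              # heal in place so later loads stay fast
    return slot.address

field = Ref(0x1000, BAD)
first = load_barrier(field)    # takes the slow path and heals the slot
second = load_barrier(field)   # fast path on the healed slot
print(hex(first), hex(second), slow_path_calls)
```

The slow path runs once per stale slot. After healing, later loads of the same slot pay only the color check, which is why the steady-state cost is a throughput tax rather than a pause.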
The result: marking, relocation, and reference processing all happen concurrently. STW phases are limited to root scanning — sub-millisecond even on multi-TB heaps.
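The tradeoff: load barriers cost ~5–15% CPU on read-heavy workloads. ZGC also maps the heap at multiple virtual addresses so that colored pointers can be dereferenced without masking off the color bits, which inflates virtual memory significantly (though not physical RSS). Physical memory still grows, because ZGC does not use compressed object pointers and needs spare heap to relocate into while the application keeps allocating. The 12% throughput drop and 18% memory increase in the opening scenario are expected ZGC costs, not bugs.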
Generational ZGC (JEP 439, JDK 21+) adds a young generation, closing most of the throughput gap with G1. Production teams on JDK 21+ should evaluate generational ZGC when migrating.
V8 Orinoco
V8’s Orinoco project (2017+) moved V8’s GC from mostly stop-the-world to mostly concurrent. Key pieces:
- Concurrent marking: marking runs on background threads alongside JavaScript execution. A Dijkstra-style (incremental-update) write barrier maintains consistency with the mutator.
- Parallel compaction: multiple threads move objects in parallel during the STW compaction phase, reducing its duration.
- Young-gen scavenger parallelism: multiple threads evacuate the young heap in parallel.
Result: typical web workloads see pauses ≤10 ms, with most marking work hidden in the background. Memory overhead: ~5–15% for marking infrastructure (write barriers, marking worklist).
Node.js inherits Orinoco by default. Tuning is via --max-old-space-size (old heap cap) and --max-semi-space-size (young heap, affects minor GC frequency). Major Orinoco changes can shift performance characteristics across Node versions — engineering teams should track V8 release notes when upgrading Node.
A service migrated from G1 to ZGC sees pauses drop from 60 ms to <1 ms but throughput drops 12% and RSS grows 18%. Is this expected?
Why does Go's GC use a SATB write barrier instead of an incremental-update barrier?
Order the steps a ZGC load barrier takes when reading a pointer with a 'bad' color:
- 1 Mutator reads a heap reference (pointer dereference)
- 2 Inline load barrier checks the pointer's color bits
- 3 Color is 'bad' — object has been relocated or is pending work
- 4 Barrier triggers slow path: looks up the forwarding table
- 5 Barrier updates the pointer in-place to the new address
- 6 Mutator proceeds with the corrected (healed) pointer
- 01 Explain the tri-color invariant and the role of the write barrier in maintaining it during concurrent marking.
- 02 What problem did Go 1.18's pacer redesign solve, and what is GOMEMLIMIT's role?
Tri-color marking classifies objects as white, grey, or black and maintains the invariant that no black object directly references a white one. The write barrier enforces this invariant during concurrent marking by intercepting every reference write: SATB marks the old reference (used by Go, G1, ZGC); incremental-update marks the new one (used by CMS, classic V8). Both cost 2–10% CPU. ZGC extends this with colored pointers — metadata bits packed into 64-bit pointers — and load barriers that heal stale pointers inline, achieving sub-millisecond pauses at the cost of ~5–15% throughput and elevated memory. Go’s pacer redesign (1.18) replaced heuristics with a PI controller; GOMEMLIMIT (1.19) gives containerised services a soft memory cap the pacer respects. V8 Orinoco brought concurrent marking and parallel compaction to reduce JavaScript GC pauses to ≤10 ms. Knowing which barrier your runtime uses shapes how you write write-heavy hot paths.