Databases
Real-world MVCC failures, deployment patterns, and distributed snapshots
Sentry 2017: autovacuum fell behind, XID wraparound triggered, the cluster refused new connections, and the team did manual freezing under fire. GitLab 2017: a misclick during maintenance deleted the wrong directory, and the surrounding bloat made pg_dump recovery take orders of magnitude longer. Both incidents trace back to the same root: MVCC’s storage tax collected without discipline.
Real-world failures: long-transaction bloat and orphan replication slots
The canonical production failure pattern:
- A 6-hour ETL or analytical
SELECTruns against the OLTP cluster - Its
backend_xminpins the global oldest-xmin - Every
orderstable autovacuum logs:cannot remove tuples, oldest xmin: XXXXXXXX - Overnight,
ordersgrows from 80 GB to 180 GB - Morning team discovers a disk-full alert at 7 AM
Replication slots that go orphan when a downstream consumer dies produce an identical failure. The slot’s xmin column holds the global oldest-xmin until someone notices and drops the slot.
Diagnosing the pin:
-- Find the oldest snapshot holder
SELECT pid, backend_xmin,
now() - xact_start AS age,
state, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin
LIMIT 5;
-- Find orphan replication slot pins
SELECT slot_name, xmin, confirmed_flush_lsn
FROM pg_replication_slots
WHERE xmin IS NOT NULL;The smallest backend_xmin or slot xmin matches what autovacuum logs as “oldest xmin.” Fix: pg_terminate_backend(pid) or pg_drop_replication_slot('name'). The very next autovacuum cycle then reclaims the dead tuples.
Mitigation strategies:
- Route analytics to a replica with
hot_standby_feedback = offso the primary’s autovacuum is never held back - Use a logical replica with its own independent autovacuum policy
- Set
SET TRANSACTION ISOLATION LEVEL REPEATABLE READwith astatement_timeoutof 30 minutes — forces long queries to fail loud instead of pinning silently
The hot_standby_feedback = off setting deserves emphasis: when it is on (the default), the replica sends its xmin back to the primary, which prevents the primary from vacuuming rows the replica still needs. A 6-hour analytics query on the replica with hot_standby_feedback = on is indistinguishable from a 6-hour query on the primary — both pin bloat. Turn it off on analytics replicas and let the replica accept its own bloat.
Misconception: “Postgres has snapshot-too-old like Oracle”
Deployment patterns: pg_repack, pgbouncer, idle_in_transaction_session_timeout
pg_repack rewrites a bloated table online without a long ACCESS EXCLUSIVE lock:
- Creates a shadow copy of the table
- Applies WAL changes incrementally as the copy is built
- Atomically swaps the shadow copy for the original (brief lock for the swap only)
The catch: transient 2x disk usage during the rewrite. Production discipline: run pg_repack nightly on the top-10 most-bloated tables.
pgbouncer in transaction mode multiplexes thousands of client connections onto a small pool of Postgres backends. The MVCC interaction: session-mode SET commands leak across clients in transaction mode.
-- This leaks in transaction mode — wrong:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Use SET LOCAL — scoped to the current transaction:
SET LOCAL TRANSACTION ISOLATION LEVEL REPEATABLE READ;SET LOCAL is automatically rolled back at transaction end, so the next client picks up a clean connection.
idle_in_transaction_session_timeout (default: off; production recommendation: 5–15 minutes) kills sessions that have an open transaction without activity. Many production incidents trace to an application that opens a transaction at request start and forgets to close it on an error path — idle_in_transaction_session_timeout is the safety net.
-- Set cluster-wide in postgresql.conf:
idle_in_transaction_session_timeout = '10min'
-- Or per role:
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';Numbers
- HOT chain length (typical)
- 4–8 versions before break
- Default fillfactor
- 100 (tables), 90 (indexes)
- MultiXact wraparound
- 2B, separate from XID
- Freeze trigger
- autovacuum_freeze_max_age 200M
- SSI false-positive rate
- 0.1–1% healthy; >5% tune
- pg_repack online cost
- transient 2x disk
- idle_in_transaction_session_timeout
- 5–15 min recommended
- Logical slot lag risk
- orphan slot pins xmin
Autovacuum log diagnosis
Diagnose the autovacuum log line — what is wrong and how do you fix it?
LOG: automatic vacuum of table "production.public.orders": index scans: 1
pages: 0 removed, 18475821 remain, 1854000 skipped due to pins
tuples: 0 removed, 412809321 remain, 287163445 are dead but not yet removable, oldest xmin: 28391456
index scan needed: 32140 pages from table (0.17% of total) had 281422 dead item identifiers removed
avg read rate: 28.412 MB/s, avg write rate: 5.211 MB/s
buffer usage: 4189213 hits, 8923214 misses, 21430 dirtied
WAL usage: 41203 records, 0 full page images, 9241221 bytes
system usage: CPU: user: 312.18 s, system: 18.91 s, elapsed: 1842.21 s The vacuum ran for 30 minutes, removed 0 dead tuples, and 287 million are not-yet-removable. What is the root cause and how do you confirm + fix?
Postgres version history: 8.3 → 18
| Version | Year | MVCC-relevant feature |
|---|---|---|
| 8.3 | 2008 | HOT updates |
| 9.1 | 2011 | SSI (true SERIALIZABLE) |
| 9.5 | 2016 | INSERT ... ON CONFLICT (upsert, interacts subtly with MVCC via internal retry on unique-violation) |
| 12 | 2019 | Declarative partitioning — multi-billion-row tables finally feasible |
| 13 | 2021 | Incremental sort, lower VACUUM cost on heavy-update tables |
| 14 | 2021 | Bottom-up index deletion — dramatically reduces index bloat for UPDATE-heavy patterns where indexed columns rarely change |
| 16 | 2023 | Cold-start freezing performance improvements |
| 17 | 2024 | Vacuum re-architected to use TID stores instead of arrays — removes the maintenance_work_mem ceiling on dead tuples per pass (eliminates “VACUUM has reached the maintenance_work_mem limit, restarting”) |
| 18 | 2025 | Better skip-scan during freezing |
Cross-protocol: logical replication, BDR, Citus
Logical replication uses MVCC under the hood. The WAL decoder reads tuple changes from WAL and produces a stream the subscriber applies. The subscriber runs its own independent autovacuum. Logical slots pin the publisher’s xmin so the subscriber can catch up — an idle subscriber is a slow-motion bloat bomb. Monitor pg_replication_slots.xmin and alert on orphaned slots.
BDR (BiDirectional Replication, EDB) implements eventual consistency on top of logical replication with conflict resolution that explicitly understands MVCC: last-update-wins by tuple xmin timestamp.
Citus shards a table across worker nodes. Each shard is a real Postgres table with its own MVCC. The coordinator’s snapshot is propagated to workers as a “global snapshot” passed in the worker connection. Distributed serializable transactions on Citus require all worker nodes to agree on SSI dependencies — implemented via a coordinator-managed predicate-lock table.
Distributed snapshots and global serializability
Single-node Postgres has a global snapshot per backend, computed in shared memory. When sharding across multiple Postgres instances (Citus, YugabyteDB, Postgres-XC ancestors), no single shared-memory snapshot exists.
The standard approach is hybrid logical clocks (HLC): each instance maintains a logical timestamp that monotonically advances, transactions are tagged with HLC values at start, and visibility decisions on remote shards use HLC ordering. HLC needs to be consistent with wall-clock ordering enough that humans can reason about “the transaction at 12:00:01” — TrueTime in Spanner is one solution; HLC in YugabyteDB is another.
SSI across shards requires either:
- A centralized predicate-lock table (correct but slow)
- A deterministic concurrency-control protocol (Calvin-style; FaunaDB; CockroachDB’s hybrid) that linearizes at start instead of commit
Citus chooses pragmatism: distributed transactions default to read-committed-with-prepared-commit, and applications that need stronger isolation co-locate dependent rows on the same shard to keep them on the same backend.
Quizzes
An orphan replication slot is causing runaway bloat. Which system view confirms the slot's xmin and tells you how far it has fallen behind?
In pgbouncer transaction mode, why must you use SET LOCAL instead of SET for isolation level changes?
pg_repack rewrites a table online. What is the primary operational cost compared to VACUUM FULL?
- 01A 200 GB orders table is not being vacuumed despite autovacuum running. The log says 287M tuples are not-yet-removable. Walk through the diagnosis and fix.
- 02Why is hot_standby_feedback=on dangerous for the primary's bloat, and when is it acceptable?
- 03Explain why Citus defaults to read-committed-with-prepared-commit for distributed transactions rather than distributed SSI.
- Long transactions and orphan replication slots pin
oldest-xminand block autovacuum — same failure shape, different source - Diagnose with
pg_stat_activity WHERE backend_xmin IS NOT NULLandpg_replication_slots WHERE xmin IS NOT NULL idle_in_transaction_session_timeout(5–15 min) kills forgotten open transactions before they accumulate bloathot_standby_feedback = offon analytics replicas decouples primary autovacuum from replica query lifetimespg_repackrewrites bloated tables online at the cost of transient 2x disk;VACUUM FULLrequires a long lock- In
pgbouncertransaction mode, useSET LOCALnotSET— session-mode settings leak across pooled clients - Postgres 17 removed the
maintenance_work_memceiling on vacuum by replacing arrays with TID stores - Distributed MVCC uses HLC for cross-shard visibility; SSI across shards requires centralized predicate locks or deterministic CC
appears again in140
- Why GraphQL gets N+1junior
- DataLoader mechanics: tick-boundary batchingmiddle
- Batch function contracts: ordering, shapes, errorsmiddle
- Federation and lookahead: batching beyond DataLoadermiddle
- Query complexity defences: depth, cost, persisted queriesmiddle
- Senior GraphQL API: scheduling contract, tenant isolation, observabilitysenior
- Why idempotency: making retries safejunior
- Server-side state machine: four states of an idempotency keymiddle
- Outbox and inbox: effectively-once across the dual-write boundarymiddle
- Concurrency and cache architecture for idempotency at scalesenior
- Observability, production failures, and global-scale designsenior
- The event loop: one thread, three queuesjunior
- Tasks, microtasks, and scheduler.yield()middle
- Microtask starvation, Long Tasks, and LoAFsenior
- Node.js event loop: phases, nextTick, and loop lagsenior
- React, Vue, and INP observability in productionsenior
- The render pipeline: six stages from bytes to pixelsjunior
- Stage costs and the renderer process modelmiddle
- Invalidation, dirty bits, and containmiddle
- Compositor layers: promotion, overlap, and GPU memorymiddle
- DevTools flame strip and the frame lifecyclemiddle
- Layout thrash: forced synchronous layoutsenior
- BeginMainFrame, compositor-driven animations, and GPU memorysenior
- Production observability: LoAF, INP, and the full attack surfacesenior
- What V8 is and why performance varies 100×junior
- V8''''s four-tier JIT pipeline and profile-guided tieringmiddle
- Hidden classes, transition trees, and memory layoutmiddle
- Inline caches, IC states, and deoptimizationmiddle
- Orinoco GC: parallel scavenger, concurrent marking, and write barriersmiddle
- TurboFan''''s speculative engine and the deopt-loop trapsenior
- V8 in production: isolates, pointer compression, and real failuressenior
- Service worker lifecycle and cache strategiesmiddle
- Service worker edge cases: version skew, durability, and navigation trapssenior
- What the reconciler does: render vs commitjunior
- The fiber object and the double-buffer treemiddle
- Render phase purity and commit phase sub-stepsmiddle
- Reconciliation: diffing heuristics and the key trapmiddle
- Priority lanes, time-slicing, and useTransitionmiddle
- Bailout, memoisation, and tearingsenior
- React Profiler, the Compiler, and production observabilitysenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Hydration mismatch: causes, detection, and the determinism rulesenior
- RSC, per-route strategy, and production observabilitysenior
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- CLS: why layout shifts happen and how to stop themmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- What is a cache stampede and why it makes things worsejunior
- Lock and single-flight: bounding concurrent rebuildsmiddle
- XFetch: coordination-free probabilistic early expirationmiddle
- Stale-while-revalidate and CDN request coalescingmiddle
- Detecting stampedes and designing TTL for productionmiddle
- Metastable failure, fencing tokens, and production postmortemssenior
- Raft roles, terms, and why majority quorums prevent split brainjunior
- How Raft replicates a log entry and decides it is safe to commitmiddle
- Raft leader election: timeouts, voting rules, and the four safety propertiesmiddle
- Raft in the real world: partitions, slow disks, and client routingmiddle
- Raft extensions: pre-vote, learners, snapshots, and linearizable readssenior
- Raft in production: membership changes, Multi-Raft, and observabilitysenior
- Where data fetching happens — and why it decides LCPjunior
- Fetch waterfalls — diagnosis and the Promise.all curemiddle
- React Server Components and Suspense streamingmiddle
- Client-side cache: TanStack Query, SWR, and stale-while-revalidatemiddle
- LCP, prefetch, and race conditions in interactive fetchingmiddle
- Senior internals: RSC payload, caching layers, and production failure modessenior
- The three-way handshakejunior
- Sequence numbers and connection statemiddle
- DNS: what it does and why it existsjunior
- The resolver walk: referrals, record types, and gluemiddle
- TTL, caching, and DNS propagationmiddle
- The 1-RTT handshake: key shares and ECDHEmiddle
- Session resumption and 0-RTTmiddle
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket frame format: opcodes, masking, fragmentationmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Health checks, connection draining, and slow startmiddle
- Session affinity, consistent hashing, and the right fixmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- Connection IDs and network migrationmiddle
- 0-RTT resumption and packet encryptionsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- PII redaction and log injectionsenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- SLI, SLO, and the error budget: reliability by the numbersjunior
- Error budget policy, latency SLOs, and composite journeysmiddle
- Production SLO failures, self-observability, security, and the big picturesenior
- The incident loop: from pager to postmortem to preventionmiddle
- Cache lines, struct layout, and false sharingmiddle
- SIMD, SoA vs AoS, and memory bandwidthmiddle
- Cache-oblivious algorithms, PGO, and production failuressenior
- GC in production: observability, security, edge cases, and fleet governancesenior
- Batching: amortize fixed cost per operationjunior
- The batching window: size and wait timemiddle
- Batching in Kafka and Postgresmiddle
- io_uring and observability of batchingmiddle
- From Nagle to io_uring: evolution of batchingmiddle
- Backpressure, failure isolation, and batch security in productionsenior
- CI enforcement and RUM: making budgets stickmiddle
- V8 JIT pipeline, HTTP priorities, and bundle securitysenior
- The performance loop: discipline, not a projectjunior
- Classify and fix: matching bottleneck families to remediesmiddle
- Observability stack and CI gates: catching regressions before they shipmiddle
- Incident to enforcement: SLO burn to verified fix in 35 minutesmiddle
- Culture, economics, and org-scale performancesenior
- At-most-once, at-least-once, exactly-once: the three delivery contractsjunior
- The three failure legs — where duplicates and losses actually happenmiddle
- Consumer-side dedup: the cheapest path to exactly-once processingmiddle
- Kafka exactly-once semantics: idempotent producer and transactionsmiddle
- SQS visibility timeout, DLQ, and the outbox patternmiddle
- Exactly-once in production: impossibility proof, hybrid patterns, and real incidentssenior
- What OAuth is and why passwords are not the answerjunior
- Authorization code flow with PKCEmiddle
- ID token validation and JWKS cache managementmiddle
- Refresh token rotation and scope-based least privilegemiddle
- Sender-constrained tokens: DPoP and mTLSsenior
- OAuth in production: audience attacks, observability, and real failuressenior