Databases DB · 04 · 07

Real-world MVCC failures, deployment patterns, and distributed snapshots

Long transactions, orphan replication slots, pg_repack, pgbouncer SET LOCAL, idle_in_transaction_session_timeout, Postgres version history 8.3–18, and how distributed systems extend or replace single-node MVCC.

DB Senior ◷ 20 min

Level

FoundationsJuniorMiddleSenior

Sentry 2017: autovacuum fell behind, XID wraparound triggered, the cluster refused new connections, and the team did manual freezing under fire. GitLab 2017: a misclick during maintenance deleted the wrong directory, and the surrounding bloat made pg_dump recovery take orders of magnitude longer. Both incidents trace back to the same root: MVCC’s storage tax collected without discipline.

Real-world failures: long-transaction bloat and orphan replication slots

The canonical production failure pattern:

A 6-hour ETL or analytical SELECT runs against the OLTP cluster
Its backend_xmin pins the global oldest-xmin
Every orders table autovacuum logs: cannot remove tuples, oldest xmin: XXXXXXXX
Overnight, orders grows from 80 GB to 180 GB
Morning team discovers a disk-full alert at 7 AM

Replication slots that go orphan when a downstream consumer dies produce an identical failure. The slot’s xmin column holds the global oldest-xmin until someone notices and drops the slot.

Diagnosing the pin:

-- Find the oldest snapshot holder
SELECT pid, backend_xmin,
       now() - xact_start AS age,
       state, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin
LIMIT 5;

-- Find orphan replication slot pins
SELECT slot_name, xmin, confirmed_flush_lsn
FROM pg_replication_slots
WHERE xmin IS NOT NULL;

The smallest backend_xmin or slot xmin matches what autovacuum logs as “oldest xmin.” Fix: pg_terminate_backend(pid) or pg_drop_replication_slot('name'). The very next autovacuum cycle then reclaims the dead tuples.

Mitigation strategies:

Route analytics to a replica with hot_standby_feedback = off so the primary’s autovacuum is never held back
Use a logical replica with its own independent autovacuum policy
Set SET TRANSACTION ISOLATION LEVEL REPEATABLE READ with a statement_timeout of 30 minutes — forces long queries to fail loud instead of pinning silently

Together these three approaches share one goal: ensure no long-running query ever touches the primary’s global oldest-xmin. Without at least one of them, a single overnight analytics job can turn an 80 GB table into 180 GB by morning.

The trifecta is independent: each lever closes a different pin source — forgotten open transactions, replica feedback, and orphan slots — so any one of them on its own still leaves a gap.

The hot_standby_feedback = off setting deserves emphasis: when it is on (the default), the replica sends its xmin back to the primary, which prevents the primary from vacuuming rows the replica still needs. A 6-hour analytics query on the replica with hot_standby_feedback = on is indistinguishable from a 6-hour query on the primary — both pin bloat. Turn it off on analytics replicas and let the replica accept its own bloat.

Misconception: “Postgres has snapshot-too-old like Oracle”

Deployment patterns: pg_repack, pgbouncer, idle_in_transaction_session_timeout

pg_repack rewrites a bloated table online without a long ACCESS EXCLUSIVE lock:

Creates a shadow copy of the table
Applies WAL changes incrementally as the copy is built
Atomically swaps the shadow copy for the original (brief lock for the swap only)

The catch: transient 2x disk usage during the rewrite. Production discipline: run pg_repack nightly on the top-10 most-bloated tables.

pgbouncer in transaction mode multiplexes thousands of client connections onto a small pool of Postgres backends. The MVCC interaction: session-mode SET commands leak across clients in transaction mode.

-- This leaks in transaction mode — wrong:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Use SET LOCAL — scoped to the current transaction:
SET LOCAL TRANSACTION ISOLATION LEVEL REPEATABLE READ;

SET LOCAL is automatically rolled back at transaction end, so the next client picks up a clean connection.

idle_in_transaction_session_timeout (default: off; production recommendation: 5–15 minutes) kills sessions that have an open transaction without activity. Many production incidents trace to an application that opens a transaction at request start and forgets to close it on an error path — idle_in_transaction_session_timeout is the safety net.

-- Set cluster-wide in postgresql.conf:
idle_in_transaction_session_timeout = '10min'
-- Or per role:
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';

Numbers

Operational MVCC numbers

HOT chain length (typical): 4–8 versions before break
Default fillfactor: 100 (tables), 90 (indexes)
MultiXact wraparound: 2B, separate from XID
Freeze trigger: autovacuum_freeze_max_age 200M
SSI false-positive rate: 0.1–1% healthy; >5% tune
pg_repack online cost: transient 2x disk
idle_in_transaction_session_timeout: 5–15 min recommended
Logical slot lag risk: orphan slot pins xmin

Autovacuum log diagnosis

Debug this

Diagnose the autovacuum log line — what is wrong and how do you fix it?

log

LOG:  automatic vacuum of table "production.public.orders": index scans: 1
  pages: 0 removed, 18475821 remain, 1854000 skipped due to pins
  tuples: 0 removed, 412809321 remain, 287163445 are dead but not yet removable, oldest xmin: 28391456
  index scan needed: 32140 pages from table (0.17% of total) had 281422 dead item identifiers removed
  avg read rate: 28.412 MB/s, avg write rate: 5.211 MB/s
  buffer usage: 4189213 hits, 8923214 misses, 21430 dirtied
  WAL usage: 41203 records, 0 full page images, 9241221 bytes
  system usage: CPU: user: 312.18 s, system: 18.91 s, elapsed: 1842.21 s

The vacuum ran for 30 minutes, removed 0 dead tuples, and 287 million are not-yet-removable. What is the root cause and how do you confirm + fix?

Postgres version history: 8.3 → 18

Version	Year	MVCC-relevant feature
8.3	2008	HOT updates
9.1	2011	SSI (true SERIALIZABLE)
9.5	2016	`INSERT ... ON CONFLICT` (upsert, interacts subtly with MVCC via internal retry on unique-violation)
12	2019	Declarative partitioning — multi-billion-row tables finally feasible
13	2021	Incremental sort, lower VACUUM cost on heavy-update tables
14	2021	Bottom-up index deletion — dramatically reduces index bloat for UPDATE-heavy patterns where indexed columns rarely change
16	2023	Cold-start freezing performance improvements
17	2024	Vacuum re-architected to use TID stores instead of arrays — removes the `maintenance_work_mem` ceiling on dead tuples per pass (eliminates “VACUUM has reached the maintenance_work_mem limit, restarting”)
18	2025	Better skip-scan during freezing

Cross-protocol: logical replication, BDR, Citus

Logical replication uses MVCC under the hood. The WAL decoder reads tuple changes from WAL and produces a stream the subscriber applies. The subscriber runs its own independent autovacuum. Logical slots pin the publisher’s xmin so the subscriber can catch up — an idle subscriber is a slow-motion bloat bomb. Monitor pg_replication_slots.xmin and alert on orphaned slots.

BDR (BiDirectional Replication, EDB) implements eventual consistency on top of logical replication with conflict resolution that explicitly understands MVCC: last-update-wins by tuple xmin timestamp.

Citus shards a table across worker nodes. Each shard is a real Postgres table with its own MVCC. The coordinator’s snapshot is propagated to workers as a “global snapshot” passed in the worker connection. Distributed serializable transactions on Citus require all worker nodes to agree on SSI dependencies — implemented via a coordinator-managed predicate-lock table.

Distributed snapshots and global serializability

Single-node Postgres has a global snapshot per backend, computed in shared memory. When sharding across multiple Postgres instances (Citus, YugabyteDB, Postgres-XC ancestors), no single shared-memory snapshot exists.

The standard approach is hybrid logical clocks (HLC): each instance maintains a logical timestamp that monotonically advances, transactions are tagged with HLC values at start, and visibility decisions on remote shards use HLC ordering. HLC needs to be consistent with wall-clock ordering enough that humans can reason about “the transaction at 12:00:01” — TrueTime in Spanner is one solution; HLC in YugabyteDB is another.

SSI across shards requires either:

A centralized predicate-lock table (correct but slow)
A deterministic concurrency-control protocol (Calvin-style; FaunaDB; CockroachDB’s hybrid) that linearizes at start instead of commit

Citus chooses pragmatism: distributed transactions default to read-committed-with-prepared-commit, and applications that need stronger isolation co-locate dependent rows on the same shard to keep them on the same backend.

Time flows along the arrows. Terminating the transaction or dropping the orphan slot releases oldest-xmin; the next autovacuum cycle then reclaims the pinned dead tuples. idle_in_transaction_session_timeout prevents the pin in the first place.

Quizzes

Quiz

An orphan replication slot is causing runaway bloat. Which system view confirms the slot's xmin and tells you how far it has fallen behind?

Quiz

In pgbouncer transaction mode, why must you use SET LOCAL instead of SET for isolation level changes?

Quiz

pg_repack rewrites a table online. What is the primary operational cost compared to VACUUM FULL?

Recall before you leave

01
A 200 GB orders table is not being vacuumed despite autovacuum running. The log says 287M tuples are not-yet-removable. Walk through the diagnosis and fix.
02
Why is hot_standby_feedback=on dangerous for the primary's bloat, and when is it acceptable?
03
Explain why Citus defaults to read-committed-with-prepared-commit for distributed transactions rather than distributed SSI.

Recap

Long transactions and orphan replication slots pin oldest-xmin and block autovacuum — same failure shape, different source
Diagnose with pg_stat_activity WHERE backend_xmin IS NOT NULL and pg_replication_slots WHERE xmin IS NOT NULL
idle_in_transaction_session_timeout (5–15 min) kills forgotten open transactions before they accumulate bloat
hot_standby_feedback = off on analytics replicas decouples primary autovacuum from replica query lifetimes
pg_repack rewrites bloated tables online at the cost of transient 2x disk; VACUUM FULL requires a long lock
In pgbouncer transaction mode, use SET LOCAL not SET — session-mode settings leak across pooled clients
Postgres 17 removed the maintenance_work_mem ceiling on vacuum by replacing arrays with TID stores
Distributed MVCC uses HLC for cross-shard visibility; SSI across shards requires centralized predicate locks or deterministic CC
Now when you find autovacuum running for 30 minutes and removing zero tuples, you can reproduce Sentry’s diagnosis in under two minutes: one query on pg_stat_activity, one on pg_replication_slots, and a single pg_terminate_backend call to unlock the backlog.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

appears again in166

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.