awesome-everything RU
↑ Back to the climb

Databases

Real-world MVCC failures, deployment patterns, and distributed snapshots

Crux Long transactions, orphan replication slots, pg_repack, pgbouncer SET LOCAL, idle_in_transaction_session_timeout, Postgres version history 8.3–18, and how distributed systems extend or replace single-node MVCC.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 20 min

Sentry 2017: autovacuum fell behind, XID wraparound triggered, the cluster refused new connections, and the team did manual freezing under fire. GitLab 2017: a misclick during maintenance deleted the wrong directory, and the surrounding bloat made pg_dump recovery take orders of magnitude longer. Both incidents trace back to the same root: MVCC’s storage tax collected without discipline.

Real-world failures: long-transaction bloat and orphan replication slots

The canonical production failure pattern:

  1. A 6-hour ETL or analytical SELECT runs against the OLTP cluster
  2. Its backend_xmin pins the global oldest-xmin
  3. Every orders table autovacuum logs: cannot remove tuples, oldest xmin: XXXXXXXX
  4. Overnight, orders grows from 80 GB to 180 GB
  5. Morning team discovers a disk-full alert at 7 AM

Replication slots that go orphan when a downstream consumer dies produce an identical failure. The slot’s xmin column holds the global oldest-xmin until someone notices and drops the slot.

Diagnosing the pin:

-- Find the oldest snapshot holder
SELECT pid, backend_xmin,
       now() - xact_start AS age,
       state, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin
LIMIT 5;

-- Find orphan replication slot pins
SELECT slot_name, xmin, confirmed_flush_lsn
FROM pg_replication_slots
WHERE xmin IS NOT NULL;

The smallest backend_xmin or slot xmin matches what autovacuum logs as “oldest xmin.” Fix: pg_terminate_backend(pid) or pg_drop_replication_slot('name'). The very next autovacuum cycle then reclaims the dead tuples.

Mitigation strategies:

  • Route analytics to a replica with hot_standby_feedback = off so the primary’s autovacuum is never held back
  • Use a logical replica with its own independent autovacuum policy
  • Set SET TRANSACTION ISOLATION LEVEL REPEATABLE READ with a statement_timeout of 30 minutes — forces long queries to fail loud instead of pinning silently

The hot_standby_feedback = off setting deserves emphasis: when it is on (the default), the replica sends its xmin back to the primary, which prevents the primary from vacuuming rows the replica still needs. A 6-hour analytics query on the replica with hot_standby_feedback = on is indistinguishable from a 6-hour query on the primary — both pin bloat. Turn it off on analytics replicas and let the replica accept its own bloat.

Misconception: “Postgres has snapshot-too-old like Oracle”

Deployment patterns: pg_repack, pgbouncer, idle_in_transaction_session_timeout

pg_repack rewrites a bloated table online without a long ACCESS EXCLUSIVE lock:

  1. Creates a shadow copy of the table
  2. Applies WAL changes incrementally as the copy is built
  3. Atomically swaps the shadow copy for the original (brief lock for the swap only)

The catch: transient 2x disk usage during the rewrite. Production discipline: run pg_repack nightly on the top-10 most-bloated tables.

pgbouncer in transaction mode multiplexes thousands of client connections onto a small pool of Postgres backends. The MVCC interaction: session-mode SET commands leak across clients in transaction mode.

-- This leaks in transaction mode — wrong:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Use SET LOCAL — scoped to the current transaction:
SET LOCAL TRANSACTION ISOLATION LEVEL REPEATABLE READ;

SET LOCAL is automatically rolled back at transaction end, so the next client picks up a clean connection.

idle_in_transaction_session_timeout (default: off; production recommendation: 5–15 minutes) kills sessions that have an open transaction without activity. Many production incidents trace to an application that opens a transaction at request start and forgets to close it on an error path — idle_in_transaction_session_timeout is the safety net.

-- Set cluster-wide in postgresql.conf:
idle_in_transaction_session_timeout = '10min'
-- Or per role:
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';

Numbers

Operational MVCC numbers
HOT chain length (typical)
4–8 versions before break
Default fillfactor
100 (tables), 90 (indexes)
MultiXact wraparound
2B, separate from XID
Freeze trigger
autovacuum_freeze_max_age 200M
SSI false-positive rate
0.1–1% healthy; >5% tune
pg_repack online cost
transient 2x disk
idle_in_transaction_session_timeout
5–15 min recommended
Logical slot lag risk
orphan slot pins xmin

Autovacuum log diagnosis

Debug this

Diagnose the autovacuum log line — what is wrong and how do you fix it?

log
LOG:  automatic vacuum of table "production.public.orders": index scans: 1
  pages: 0 removed, 18475821 remain, 1854000 skipped due to pins
  tuples: 0 removed, 412809321 remain, 287163445 are dead but not yet removable, oldest xmin: 28391456
  index scan needed: 32140 pages from table (0.17% of total) had 281422 dead item identifiers removed
  avg read rate: 28.412 MB/s, avg write rate: 5.211 MB/s
  buffer usage: 4189213 hits, 8923214 misses, 21430 dirtied
  WAL usage: 41203 records, 0 full page images, 9241221 bytes
  system usage: CPU: user: 312.18 s, system: 18.91 s, elapsed: 1842.21 s

The vacuum ran for 30 minutes, removed 0 dead tuples, and 287 million are not-yet-removable. What is the root cause and how do you confirm + fix?

Postgres version history: 8.3 → 18

VersionYearMVCC-relevant feature
8.32008HOT updates
9.12011SSI (true SERIALIZABLE)
9.52016INSERT ... ON CONFLICT (upsert, interacts subtly with MVCC via internal retry on unique-violation)
122019Declarative partitioning — multi-billion-row tables finally feasible
132021Incremental sort, lower VACUUM cost on heavy-update tables
142021Bottom-up index deletion — dramatically reduces index bloat for UPDATE-heavy patterns where indexed columns rarely change
162023Cold-start freezing performance improvements
172024Vacuum re-architected to use TID stores instead of arrays — removes the maintenance_work_mem ceiling on dead tuples per pass (eliminates “VACUUM has reached the maintenance_work_mem limit, restarting”)
182025Better skip-scan during freezing

Cross-protocol: logical replication, BDR, Citus

Logical replication uses MVCC under the hood. The WAL decoder reads tuple changes from WAL and produces a stream the subscriber applies. The subscriber runs its own independent autovacuum. Logical slots pin the publisher’s xmin so the subscriber can catch up — an idle subscriber is a slow-motion bloat bomb. Monitor pg_replication_slots.xmin and alert on orphaned slots.

BDR (BiDirectional Replication, EDB) implements eventual consistency on top of logical replication with conflict resolution that explicitly understands MVCC: last-update-wins by tuple xmin timestamp.

Citus shards a table across worker nodes. Each shard is a real Postgres table with its own MVCC. The coordinator’s snapshot is propagated to workers as a “global snapshot” passed in the worker connection. Distributed serializable transactions on Citus require all worker nodes to agree on SSI dependencies — implemented via a coordinator-managed predicate-lock table.

Distributed snapshots and global serializability

Single-node Postgres has a global snapshot per backend, computed in shared memory. When sharding across multiple Postgres instances (Citus, YugabyteDB, Postgres-XC ancestors), no single shared-memory snapshot exists.

The standard approach is hybrid logical clocks (HLC): each instance maintains a logical timestamp that monotonically advances, transactions are tagged with HLC values at start, and visibility decisions on remote shards use HLC ordering. HLC needs to be consistent with wall-clock ordering enough that humans can reason about “the transaction at 12:00:01” — TrueTime in Spanner is one solution; HLC in YugabyteDB is another.

SSI across shards requires either:

  • A centralized predicate-lock table (correct but slow)
  • A deterministic concurrency-control protocol (Calvin-style; FaunaDB; CockroachDB’s hybrid) that linearizes at start instead of commit

Citus chooses pragmatism: distributed transactions default to read-committed-with-prepared-commit, and applications that need stronger isolation co-locate dependent rows on the same shard to keep them on the same backend.

Quizzes

Quiz

An orphan replication slot is causing runaway bloat. Which system view confirms the slot's xmin and tells you how far it has fallen behind?

Quiz

In pgbouncer transaction mode, why must you use SET LOCAL instead of SET for isolation level changes?

Quiz

pg_repack rewrites a table online. What is the primary operational cost compared to VACUUM FULL?

Recall before you leave
  1. 01
    A 200 GB orders table is not being vacuumed despite autovacuum running. The log says 287M tuples are not-yet-removable. Walk through the diagnosis and fix.
  2. 02
    Why is hot_standby_feedback=on dangerous for the primary's bloat, and when is it acceptable?
  3. 03
    Explain why Citus defaults to read-committed-with-prepared-commit for distributed transactions rather than distributed SSI.
Recap
  • Long transactions and orphan replication slots pin oldest-xmin and block autovacuum — same failure shape, different source
  • Diagnose with pg_stat_activity WHERE backend_xmin IS NOT NULL and pg_replication_slots WHERE xmin IS NOT NULL
  • idle_in_transaction_session_timeout (5–15 min) kills forgotten open transactions before they accumulate bloat
  • hot_standby_feedback = off on analytics replicas decouples primary autovacuum from replica query lifetimes
  • pg_repack rewrites bloated tables online at the cost of transient 2x disk; VACUUM FULL requires a long lock
  • In pgbouncer transaction mode, use SET LOCAL not SET — session-mode settings leak across pooled clients
  • Postgres 17 removed the maintenance_work_mem ceiling on vacuum by replacing arrays with TID stores
  • Distributed MVCC uses HLC for cross-shard visibility; SSI across shards requires centralized predicate locks or deterministic CC
Connected lessons
appears again in140
Continue the climb ↑MVCC and isolation: multiple-choice review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources6
expand
  1. 01
  2. 02
  3. 03
  4. 04
  5. 05
  6. 06

Trademarks belong to their respective owners. Editorial reference only.