Databases DB · 08 · 03

Acts 4–6 in depth: MVCC bloat, connection pooling, and safe migrations

One long transaction can grow a 200 MB table to 80 GB. A cold-start pod surge can turn 4 ms queries into 4 s timeouts. A single ALTER TABLE can freeze production for eight minutes. All three failures share a root cause: resource starvation from an unbounded operation.

DB Middle ◷ 15 min

Level

FoundationsJuniorMiddleSenior

A nightly report runner holds one transaction open for four hours. It reads a snapshot — no writes, nothing harmful. But VACUUM cannot reclaim dead tuples below the xmin horizon. The table swells from 200 MB to 80 GB. No new rows were inserted. The report runner is the only culprit.

By the end of this lesson you will be able to trace any of these three failure chains from the unbounded root operation to the production symptom — and name the lever that caps it.

Act 4 — MVCC and the xmin horizon

Every row in Postgres carries xmin (creation transaction id) and xmax (deletion transaction id). A DELETE does not free the row — it sets xmax. VACUUM reclaims rows whose xmax is older than the global xmin horizon — the oldest active transaction ID visible across the cluster.

If one analyst session leaves a transaction open for hours, the horizon does not advance. No dead tuple created after that horizon can be reclaimed. The table swells; queries that scan the table get slower because they read dead rows alongside live ones.

The fix is operational:

Kill the long transaction (SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND query_start < now() - interval '5 minutes').
Set idle_in_transaction_session_timeout = 60s — any session that holds a transaction open without executing a query for 60 seconds is terminated.
Run pg_repack to reclaim disk without an AccessExclusiveLock (VACUUM FULL takes one).
Permanent fix: move reporting to a read replica; the xmin horizon on a replica does not block vacuum on the primary when hot_standby_feedback = off.

Together these steps break the bloat loop: step 1 stops the bleeding, step 2 prevents recurrence, step 3 cleans up the damage, and step 4 removes the structural cause. Without step 2, the analyst session reconnects overnight and the cycle restarts.

At 1B rows, one idle session can bloat the table by 10% in a single day. Bloat slows every subsequent query because the heap pages contain dead tuples alongside live ones, forcing the sequentialscanner to read and skip them.

Act 5 — Connection pool sizing math

Each Postgres backend is a forked OS process with its own memory footprint (~5–10 MB RSS). A server with 32 GB RAM can support roughly 2000–3000 backends before the kernel scheduler thrashes and query latency degrades.

At 50 app pods with 20 workers each, a cold-start or pod-rollout event opens 1000 backends simultaneously. Queries that previously took 4 ms take 4 s not because of data volume, but because the OS scheduler cannot efficiently time-slice 1000 processes.

PgBouncer in transaction-mode:

100 server-side backends (fits in kernel scheduler’s sweet spot).
10000 client connections multiplexed onto those 100 backends.
Each backend is used only during active transaction time, then returned to the pool.

Pool sizing math:

pool_size = active_concurrent_transactions × safety_factor

Not max_app_workers. The relevant number is how many transactions hit the database simultaneously, not the total worker count. For a 16-core SSD-only server, 30–60 server-side backends is typically the sweet spot.

The 1000-vs-30–60 gap is the whole point: size the pool on active concurrent transactions, not max_app_workers — PgBouncer collapses 10000 clients onto ~100 backends so the scheduler never thrashes.

The transaction-mode prepared statement trap: PgBouncer in transaction-mode rebinds connections after each transaction, so server-side prepared statements (PREPARE foo AS SELECT ...) are lost between transactions. PgBouncer 1.21+ ships server_prepared_statements = on which caches plan IDs server-side and replays them — without it, app frameworks that auto-prepare statements pay re-prepare overhead on every transaction.

Act 6 — The lock matrix and expand-contract

ALTER TABLE takes AccessExclusiveLock by default — incompatible with every other lock mode including reads. While it waits for the lock, every new query waits behind it. While it holds the lock, every query waits.

A short ALTER under low load completes in milliseconds. An ALTER behind a long-running query (as in Act 4) waits, builds a queue, and the queue freezes the database. A single 60-second lock on a hot table can cascade into a 10-minute incident due to connection timeout and app-level retry storms.

The expand-contract recipe for adding a tenant_id BIGINT NOT NULL column safely:

ALTER TABLE users ADD COLUMN tenant_id BIGINT — nullable, no default. No rewrite.
Backfill: UPDATE users SET tenant_id = 0 WHERE tenant_id IS NULL LIMIT 10000 — batched, low lock pressure.
ALTER TABLE users ADD CONSTRAINT chk_tenant_id CHECK (tenant_id IS NOT NULL) NOT VALID — does not scan the table.
ALTER TABLE users VALIDATE CONSTRAINT chk_tenant_id — scans the table under a ShareUpdateExclusiveLock, compatible with reads and writes.
ALTER TABLE users ALTER COLUMN tenant_id SET NOT NULL — metadata-only after a successful validation, milliseconds.
ALTER TABLE users DROP CONSTRAINT chk_tenant_id — cleanup.

Tools like pgroll and Atlas codify this recipe. Without it, the naive ADD COLUMN ... NOT NULL DEFAULT on Postgres 10 and earlier causes a full table rewrite; on Postgres 11+ it is a metadata change for constant defaults, but the lock still queues all queries.

Acts 4–6: costs and thresholds

Bloat rate on write-heavy table, no autovacuum tuning: 1–10% per month
Bloat rate with idle-in-transaction bug: 50%+ per week
pg_repack time for 80 GB table: 30–90 min online
PgBouncer transaction-mode sweet spot, 16-core SSD server: 30–60 backends
AccessExclusiveLock on hot table → incident duration: up to 10 min
expand-contract: SET NOT NULL after VALIDATE CONSTRAINT: milliseconds

▸Why this works

Why does idle_in_transaction_session_timeout matter more than statement_timeout? A session in idle in transaction state is not executing a query — statement_timeout would never fire. But the session holds a transaction and therefore pins the xmin horizon. Setting a transaction-idle timeout terminates the session before it can freeze vacuum progress for hours.

Quiz

A long-running analytics session holds a transaction open for four hours. No data has been inserted or deleted in that time. Why is the table growing on disk?

Quiz

50 app pods each open 20 workers. At pod-rollout time, query latency jumps from 4 ms to 4 s. Root cause?

Quiz

A migration runs ALTER TABLE ADD COLUMN ... NOT NULL during a period when a pg_dump session holds AccessShareLock on the table. What happens?

One shared shape across all three acts: an unbounded operation (left) starves a resource (middle) and surfaces as a production symptom (right). The lever in each row caps the unbounded operation.

Recall before you leave

01
Explain the xmin horizon: what is it, how does a long transaction pin it, and what is the operational consequence?
02
Why is pool_size = active_concurrent_transactions × safety_factor the correct sizing formula rather than max_app_workers?
03
Walk through the expand-contract recipe for adding a NOT NULL column and explain why each step is lock-cheap.

Recap

Acts 4–6 each fail from one unbounded operation — a long transaction pins the xmin horizon and prevents VACUUM from reclaiming dead tuples, causing unlimited table growth; too many OS-level Postgres backends flood the kernel scheduler and turn 4 ms queries into 4 s timeouts; and an ALTER TABLE that waits for AccessExclusiveLock queues every subsequent query behind it. The levers are: idle_in_transaction_session_timeout = 60s and pg_repack for bloat; a connection pool sized on active concurrent transactions (not max workers) for connection storms; and the expand-contract recipe for every schema change that would otherwise hold an exclusive lock. These three acts require operational discipline — monitoring, alerting, runbooks — not just one-time fixes. Teams that skip them discover the failure modes at the worst possible time: peak traffic, mid-deploy, or during a large batch migration. Now when you see a table growing on disk without new rows, your first action is pg_stat_activity to find the idle-in-transaction session — not VACUUM FULL.

Connected lessons

builds on

deepens into

appears again in287

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.