awesome-everything RU
↑ Back to the climb

Databases

Vacuum and bloat: keeping the storage tax bounded

Crux Autovacuum reclaims dead tuple slots but cannot shrink the file; a long transaction or orphan replication slot pins everything it touched. Understanding the mechanics is the difference between 20 GB and 200 GB tables.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 15 min

An autovacuum worker ran for 30 minutes on your orders table, removed zero dead tuples, and logged “287 million are not yet removable.” The table keeps growing. This is not an autovacuum configuration problem — it is a snapshot pinning problem, and knowing which is which determines the fix.

Where the cost is paid

Each UPDATE creates one fresh tuple plus marks the old tuple dead — both on disk, both in the heap, both eligible for index entries. Until VACUUM clears the dead tuples, every sequential scan walks past them and every page is bigger than the strict minimum.

A common rule of thumb at production scale: target dead-tuple ratio below 20% for tables above 50 GB. Smaller tables tolerate higher bloat (30–50%) because the absolute waste is small.

VACUUM never physically shrinks the file — it just marks space reusable inside the file. Reclaiming disk back to the operating system requires:

  • VACUUM FULL — rewrites the whole table while holding an ACCESS EXCLUSIVE lock, blocking every reader and writer
  • pg_repack extension — rewrites concurrently, swaps at the end (covered in lesson 07)
OperationWhat it reclaimsBlocks reads/writes?Shrinks file?
Regular VACUUM / autovacuumDead tuple slots → reusableNo (SHARE UPDATE EXCLUSIVE)No
VACUUM FULLRewrites whole tableYes (ACCESS EXCLUSIVE)Yes
pg_repackRewrites concurrentlyBrief swap onlyYes

How autovacuum actually decides to run

Autovacuum is a background process pool (3 workers by default) that wakes every minute and inspects every table for two threshold conditions:

  • Dead-tuple threshold: autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × n_live_tup (default: 50 + 20% of live tuples)
  • Insert threshold: similar formula for fresh inserts that need ANALYZE

When a table crosses a threshold, autovacuum dispatches a worker. The worker takes a SHARE UPDATE EXCLUSIVE lock — strong enough to block other VACUUMs and schema changes, weak enough to let regular reads and writes proceed in parallel.

The worker computes the global oldest xmin across all sessions (the cap on reclaimable tuples), scans the heap, identifies dead tuples whose t_xmax is older than oldest xmin, marks their slots reusable, walks each index to remove stale pointers, and rebuilds the table’s free-space map.

Throughout, autovacuum is rate-limited by a cost-based delay: every page read costs a few units, every page write costs more units, and once the worker exceeds autovacuum_vacuum_cost_limit, it sleeps for autovacuum_vacuum_cost_delay milliseconds.

Order the steps

Put the steps of an autovacuum run on a bloated table in order:

  1. 1 Scheduler wakes up; checks pg_stat_all_tables.n_dead_tup vs autovacuum_vacuum_threshold + scale_factor × n_live_tup
  2. 2 Worker connects to the database; acquires a SHARE UPDATE EXCLUSIVE lock on the table (concurrent reads + writes continue)
  3. 3 Worker computes the oldest xmin across all sessions (pg_stat_activity.backend_xmin) — only tuples older than this can be removed
  4. 4 Heap scan: visit each page, identify dead tuples whose t_xmax < oldest xmin
  5. 5 Mark dead tuple slots reusable inside each page; rebuild the page's free-space map entry
  6. 6 Index cleanup: walk each index and remove entries pointing at reclaimed heap slots
  7. 7 Update pg_class.relfrozenxid + pg_stat_all_tables counters; release lock

What pins the oldest xmin

The most common cause of bloat that autovacuum cannot reclaim: a long-running transaction or orphan replication slot is holding back the global oldest xmin.

Diagnose with:

SELECT pid, backend_xmin, now() - xact_start AS duration
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin LIMIT 5;

SELECT slot_name, xmin FROM pg_replication_slots WHERE xmin IS NOT NULL;

Fix: terminate the long transaction (SELECT pg_terminate_backend(pid)) or drop the unused slot (SELECT pg_drop_replication_slot('name')). The very next autovacuum cycle then reclaims the pinned bloat.

Prevention: set idle_in_transaction_session_timeout to 5–15 minutes in production. This kills sessions that have an open transaction without activity, preventing them from pinning xmin indefinitely.

The hot_standby_feedback wrinkle

Streaming replicas can be configured with hot_standby_feedback = on, which sends the replica’s oldest active xmin back to the primary so the primary’s autovacuum knows not to reclaim tuples a replica still needs. Without it, a long analytical query on a replica can fail with canceling statement due to conflict with recovery.

With it, the primary’s bloat is held hostage by the replica’s longest-running query. Most production setups choose off on the replica and accept the occasional query cancellation, because pinning primary bloat indefinitely is the worse failure mode. The alternative is to route long analytics to a logical replica that maintains its own snapshot policy independently.

Why this works

The cost model for autovacuum’s IO throttle was designed for spinning disks. On NVMe storage the cost delay can safely be set to 0 — the disk can absorb continuous VACUUM IO without impacting foreground queries. On spinning disks, keep cost_delay > 0.

Quiz

A long-running batch job has accumulated 10 GB of dead tuples on an orders table. Autovacuum has been running but n_dead_tup is not dropping. What is the likely cause and how do you confirm?

Quiz

After VACUUM (not VACUUM FULL) runs successfully on a 200 GB table with 40% bloat, what happens to the file size?

Quiz

hot_standby_feedback = on on a replica sends what information to the primary, and what is the risk?

Recall before you leave
  1. 01
    What is the global oldest xmin, where does it come from, and why does it matter for autovacuum?
  2. 02
    What is the autovacuum dead-tuple threshold formula, and what does each parameter do?
  3. 03
    What lock does autovacuum take, and what does it block?
Recap

Every UPDATE leaves a dead tuple on disk; autovacuum reclaims those slots by marking them reusable — but never shrinks the file. VACUUM FULL or pg_repack are needed to return bytes to the OS, with their respective downtime costs. Autovacuum triggers when n_dead_tup exceeds a per-table threshold, takes a non-blocking SHARE UPDATE EXCLUSIVE lock, and can only reclaim tuples older than the cluster’s global oldest xmin. A long-running transaction or orphan replication slot pins that xmin and makes autovacuum’s work silently futile; diagnose with pg_stat_activity.backend_xmin and fix by terminating the session or dropping the slot. Setting idle_in_transaction_session_timeout prevents this at the infrastructure level.

Connected lessons
appears again in140
Continue the climb ↑CLOG, XID wraparound, and MultiXact: deep visibility internals
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.