Databases DB · 04 · 04

Vacuum and bloat: keeping the storage tax bounded

Autovacuum reclaims dead tuple slots but cannot shrink the file; a long transaction or orphan replication slot pins everything it touched. Understanding the mechanics is the difference between 20 GB and 200 GB tables.

DB Middle ◷ 15 min

Level

FoundationsJuniorMiddleSenior

An autovacuum worker ran for 30 minutes on your orders table, removed zero dead tuples, and logged “287 million are not yet removable.” The table keeps growing. This is not an autovacuum configuration problem — it is a snapshot pinning problem, and knowing which is which determines the fix.

Where the cost is paid

Each UPDATE creates one fresh tuple plus marks the old tuple dead — both on disk, both in the heap, both eligible for index entries. Until VACUUM clears the dead tuples, every sequential scan walks past them and every page is bigger than the strict minimum.

A common rule of thumb at production scale: target dead-tuple ratio below 20% for tables above 50 GB. Smaller tables tolerate higher bloat (30–50%) because the absolute waste is small.

VACUUM never physically shrinks the file — it just marks space reusable inside the file. Reclaiming disk back to the operating system requires:

VACUUM FULL — rewrites the whole table while holding an ACCESS EXCLUSIVE lock, blocking every reader and writer
pg_repack extension — rewrites concurrently, swaps at the end (covered in lesson 07)

Operation	What it reclaims	Blocks reads/writes?	Shrinks file?
Regular VACUUM / autovacuum	Dead tuple slots → reusable	No (SHARE UPDATE EXCLUSIVE)	No
VACUUM FULL	Rewrites whole table	Yes (ACCESS EXCLUSIVE)	Yes
pg_repack	Rewrites concurrently	Brief swap only	Yes

Regular VACUUM frees the dead slots for reuse but the file stays 200 GB; only VACUUM FULL or pg_repack rewrite the table and return ~80 GB to the OS.

How autovacuum actually decides to run

Autovacuum is a background process pool (3 workers by default) that wakes every minute and inspects every table for two threshold conditions:

Dead-tuple threshold: autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × n_live_tup (default: 50 + 20% of live tuples)
Insert threshold: similar formula for fresh inserts that need ANALYZE

When a table crosses a threshold, autovacuum dispatches a worker. The worker takes a SHARE UPDATE EXCLUSIVE lock — strong enough to block other VACUUMs and schema changes, weak enough to let regular reads and writes proceed in parallel.

The worker computes the global oldest xmin across all sessions (the cap on reclaimable tuples), scans the heap, identifies dead tuples whose t_xmax is older than oldest xmin, marks their slots reusable, walks each index to remove stale pointers, and rebuilds the table’s free-space map.

Throughout, autovacuum is rate-limited by a cost-based delay: every page read costs a few units, every page write costs more units, and once the worker exceeds autovacuum_vacuum_cost_limit, it sleeps for autovacuum_vacuum_cost_delay milliseconds.

Order the steps

Put the steps of an autovacuum run on a bloated table in order:

1 Scheduler wakes up; checks pg_stat_all_tables.n_dead_tup vs autovacuum_vacuum_threshold + scale_factor × n_live_tup
2 Worker connects to the database; acquires a SHARE UPDATE EXCLUSIVE lock on the table (concurrent reads + writes continue)
3 Worker computes the oldest xmin across all sessions (pg_stat_activity.backend_xmin) — only tuples older than this can be removed
4 Heap scan: visit each page, identify dead tuples whose t_xmax < oldest xmin
5 Mark dead tuple slots reusable inside each page; rebuild the page's free-space map entry
6 Index cleanup: walk each index and remove entries pointing at reclaimed heap slots
7 Update pg_class.relfrozenxid + pg_stat_all_tables counters; release lock

What pins the oldest xmin

The most common cause of bloat that autovacuum cannot reclaim: a long-running transaction or orphan replication slot is holding back the global oldest xmin.

Diagnose with:

SELECT pid, backend_xmin, now() - xact_start AS duration
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin LIMIT 5;

SELECT slot_name, xmin FROM pg_replication_slots WHERE xmin IS NOT NULL;

Fix: terminate the long transaction (SELECT pg_terminate_backend(pid)) or drop the unused slot (SELECT pg_drop_replication_slot('name')). The very next autovacuum cycle then reclaims the pinned bloat.

Prevention: set idle_in_transaction_session_timeout to 5–15 minutes in production. This kills sessions that have an open transaction without activity, preventing them from pinning xmin indefinitely.

The hot_standby_feedback wrinkle

Streaming replicas can be configured with hot_standby_feedback = on, which sends the replica’s oldest active xmin back to the primary so the primary’s autovacuum knows not to reclaim tuples a replica still needs. Without it, a long analytical query on a replica can fail with canceling statement due to conflict with recovery.

With it, the primary’s bloat is held hostage by the replica’s longest-running query. Most production setups choose off on the replica and accept the occasional query cancellation, because pinning primary bloat indefinitely is the worse failure mode. The alternative is to route long analytics to a logical replica that maintains its own snapshot policy independently.

▸Why this works

The cost model for autovacuum’s IO throttle was designed for spinning disks. On NVMe storage the cost delay can safely be set to 0 — the disk can absorb continuous VACUUM IO without impacting foreground queries. On spinning disks, keep cost_delay > 0.

Quiz

A long-running batch job has accumulated 10 GB of dead tuples on an orders table. Autovacuum has been running but n_dead_tup is not dropping. What is the likely cause and how do you confirm?

Quiz

After VACUUM (not VACUUM FULL) runs successfully on a 200 GB table with 40% bloat, what happens to the file size?

Quiz

hot_standby_feedback = on on a replica sends what information to the primary, and what is the risk?

A worker only frees tuples whose t_xmax is older than the global oldest xmin. A long transaction or orphan replication slot holds that xmin back, so newer dead tuples stay pinned and the file keeps growing.

Recall before you leave

01
What is the global oldest xmin, where does it come from, and why does it matter for autovacuum?
02
What is the autovacuum dead-tuple threshold formula, and what does each parameter do?
03
What lock does autovacuum take, and what does it block?

Recap

Every UPDATE leaves a dead tuple on disk; autovacuum reclaims those slots by marking them reusable — but never shrinks the file. VACUUM FULL or pg_repack are needed to return bytes to the OS, with their respective downtime costs. Autovacuum triggers when n_dead_tup exceeds a per-table threshold, takes a non-blocking SHARE UPDATE EXCLUSIVE lock, and can only reclaim tuples older than the cluster’s global oldest xmin. A long-running transaction or orphan replication slot pins that xmin and makes autovacuum’s work silently futile; diagnose with pg_stat_activity.backend_xmin and fix by terminating the session or dropping the slot. Setting idle_in_transaction_session_timeout prevents this at the infrastructure level. Now when you see autovacuum running repeatedly but dead-tuple count not dropping, your first query is not a config knob — it is SELECT pid, backend_xmin FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin LIMIT 5.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 6 done

Connected lessons

builds on

Row versions and snapshots: the on-disk mechanicsmiddle

unlocks

deepens into

appears again in166

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Crash-safe key-value store with a WALBuild a tiny on-disk KV store that survives a kill -9 mid-write by appending to a write-ahead log before touching the main file.