Databases DB · 04 · 06

SSI internals and production autovacuum tuning

SSI detects write skew via predicate locks and rw-antidependency cycles; production autovacuum needs per-table overrides, cost_delay=0 on NVMe, and a separate replica analytics policy to keep bloat bounded.

DB Senior ◷ 18 min

Level

FoundationsJuniorMiddleSenior

A production OLTP cluster suddenly logs hundreds of “could not serialize access” errors per minute on the orders table. You have 15 minutes before the on-call engineer escalates. The diagnosis path — isolation level confirmation, predicate-lock count, workload genuineness — is a step-by-step procedure, not guesswork.

Serializable Snapshot Isolation (SSI) internals

When you see a wave of SQLSTATE 40001 errors and the on-call engineer is asking questions, knowing the underlying mechanism is the difference between a 15-minute diagnosis and a 2-hour escalation.

Postgres 9.1 introduced SSI per Cahill, Röhm, and Fekete’s 2008 paper. The mechanism:

Every read tracks “predicate locks” — lightweight, in-memory references to “I read this set”
Every write tracks rw-antidependencies — “this transaction wrote something my predicate matched after I read”
At commit time, if a transaction is part of a “dangerous structure” — a cycle of rw-antidependencies — SSI aborts it with SQLSTATE 40001 could not serialize access due to read/write dependencies

The math is conservative: SSI sometimes aborts transactions that would have committed safely (false positives are correct but cost throughput), but it never allows a non-serializable schedule.

Predicate-lock memory is bounded by max_pred_locks_per_transaction × max_connections. Under memory pressure the locks coarsen from row → page → relation, raising false-positive rate further.

False-positive rate	Likely cause	Fix
0.1–1%	Normal SSI behavior	No action needed
1–5%	Predicate-lock coarsening	Raise max_pred_locks_per_transaction
>5%	Genuine workload conflict or memory pressure	App-level partitioning or switch to RC+FOR UPDATE

Trace it

1/5

A production OLTP cluster suddenly logs hundreds of 'could not serialize access' errors per minute on the orders table. Trace the diagnosis.

Step 1 of 5

Step 1: What does the error message tell you about the isolation level in use?

Locked

Step 2: How do you confirm the isolation level the application is using?

Locked

Step 3: Confirmed SERIALIZABLE. What metric tells you whether this is workload pressure or a config issue?

Locked

Step 4: pg_locks shows predicate-lock coarsening from row to page. What do you tune?

Locked

Step 5: After tuning, error rate stays high. What is the next hypothesis?

Heap fillfactor and HOT chain length

pg_stat_all_tables.n_tup_hot_upd divided by n_tup_upd gives the HOT ratio. On UPDATE-heavy schemas you want this above 80%. Below 50% means fillfactor is too aggressive or the schema has the wrong index set for the workload.

The pgstattuple extension exposes tuple_count, dead_tuple_count, and free_space per table for deeper introspection.

Production tuning of autovacuum

The default knobs are too lax for any high-write table. Per-table overrides via storage parameters:

ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.05,
  autovacuum_vacuum_cost_limit = 2000,
  autovacuum_vacuum_cost_delay = 2
);

This turns on autovacuum at 5% dead-tuple ratio (instead of 20%), with higher cost-budget per cycle and shorter delay between IO chunks. On NVMe-backed clusters the cost_delay can be zero.

autovacuum_vacuum_scale_factor sets the dead-tuple ratio that triggers a vacuum. The global default of 0.2 (20%) is too lax for high-write tables — per-table overrides of 0.05 or 0.02 vacuum sooner and keep bloat bounded.

For tables with bursty write patterns, an explicit VACUUM (PARALLEL 4) orders; overnight via cron is common in addition to autovacuum.

autovacuum_max_workers bumps the worker pool from 3 to 6 or 8 on large hosts.

Watch pg_stat_progress_vacuum.phase to see live progress:

scanning heap
vacuuming indexes
vacuuming heap
cleaning up indexes
truncating heap
performing final cleanup

Observability: the metrics that actually move

Minimum-viable Postgres bloat dashboard:

-- Dead tuple ratios
SELECT relname, n_dead_tup, n_live_tup,
       n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables ORDER BY dead_ratio DESC LIMIT 20;

-- Snapshot pinners
SELECT pid, backend_xmin, now() - xact_start AS duration, state, query
FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin LIMIT 5;

-- Orphan replication slots
SELECT slot_name, xmin, confirmed_flush_lsn FROM pg_replication_slots WHERE xmin IS NOT NULL;

-- Wraparound ages
SELECT datname, age(datfrozenxid), age(datminmxid) FROM pg_database;

Pick the best fit

A bank's transfer service moves money between accounts. Pick the isolation strategy that best balances correctness, throughput, and operational complexity.

Design challenge

Design an autovacuum policy for a high-write OLTP table receiving 5,000 row inserts + updates per second, with a target dead-tuple ratio below 10% and zero impact on customer-facing latency.

Table is 200 GB, growing 2-3 GB per day.
Underlying storage is NVMe, single-tenant.
Application opens 5-minute analytics transactions intermittently from a replica.
Customer-facing p99 must stay under 50 ms during autovacuum runs.
No downtime windows.

Quiz

Postgres logs 'cannot remove tuples deleted by transactions in progress' during autovacuum on a 200 GB orders table. What is the most likely root cause?

Quiz

A bank insists on SERIALIZABLE isolation for transfers. Throughput at peak is 5x lower than under READ COMMITTED. The DBA proposes READ COMMITTED + SELECT FOR UPDATE. The compliance officer worries about correctness. Who is right and why?

Each transaction reads a set, then writes into the set the other one read. Those two rw-antidependencies close a cycle — a non-serializable schedule — so SSI aborts the second committer with 40001. The app must retry it.

Recall before you leave

01
Explain how SSI detects write skew — what data structure does it use and when does it abort?
02
Explain how a HOT chain breaks and what observable symptoms tell you it is happening too often.
03
Why is hot_standby_feedback = OFF the recommended setting on analytics replicas, and what is the risk of turning it ON?

Recap

SSI (Serializable Snapshot Isolation, Postgres 9.1) detects write skew by tracking predicate locks (what I read) and rw-antidependencies (what I wrote that others were watching). A cycle means a dangerous schedule; SSI aborts one participant. False-positive rate of 0.1–1% is normal; above 5% indicates predicate-lock coarsening due to memory pressure or genuine workload conflict. Production autovacuum on high-write tables requires per-table overrides: scale_factor = 0.02–0.05, cost_delay = 0 on NVMe, max_workers = 6–8. Analytics replicas should run with hot_standby_feedback = OFF to decouple primary bloat from replica query duration. The minimum-viable bloat dashboard is dead_tuple ratio + backend_xmin age + 40001 error rate, plus replication slot xmin monitoring. Now when a 40001 storm hits in production, you have a procedure: confirm SERIALIZABLE, check predicate-lock coarsening in pg_locks, compare the rate against max_pred_locks_per_transaction × max_connections — and only then decide whether to tune or restructure the workload.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

unlocks

appears again in166

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.