awesome-everything RU
↑ Back to the climb

Databases

SSI internals and production autovacuum tuning

Crux SSI detects write skew via predicate locks and rw-antidependency cycles; production autovacuum needs per-table overrides, cost_delay=0 on NVMe, and a separate replica analytics policy to keep bloat bounded.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 18 min

A production OLTP cluster suddenly logs hundreds of “could not serialize access” errors per minute on the orders table. You have 15 minutes before the on-call engineer escalates. The diagnosis path — isolation level confirmation, predicate-lock count, workload genuineness — is a step-by-step procedure, not guesswork.

Serializable Snapshot Isolation (SSI) internals

Postgres 9.1 introduced SSI per Cahill, Röhm, and Fekete’s 2008 paper. The mechanism:

  1. Every read tracks “predicate locks” — lightweight, in-memory references to “I read this set”
  2. Every write tracks rw-antidependencies — “this transaction wrote something my predicate matched after I read”
  3. At commit time, if a transaction is part of a “dangerous structure” — a cycle of rw-antidependencies — SSI aborts it with SQLSTATE 40001 could not serialize access due to read/write dependencies

The math is conservative: SSI sometimes aborts transactions that would have committed safely (false positives are correct but cost throughput), but it never allows a non-serializable schedule.

Predicate-lock memory is bounded by max_pred_locks_per_transaction × max_connections. Under memory pressure the locks coarsen from row → page → relation, raising false-positive rate further.

False-positive rateLikely causeFix
0.1–1%Normal SSI behaviorNo action needed
1–5%Predicate-lock coarseningRaise max_pred_locks_per_transaction
>5%Genuine workload conflict or memory pressureApp-level partitioning or switch to RC+FOR UPDATE
Trace it
1/5

A production OLTP cluster suddenly logs hundreds of 'could not serialize access' errors per minute on the orders table. Trace the diagnosis.

1
Step 1 of 5
Step 1: What does the error message tell you about the isolation level in use?
2
Locked
Step 2: How do you confirm the isolation level the application is using?
3
Locked
Step 3: Confirmed SERIALIZABLE. What metric tells you whether this is workload pressure or a config issue?
4
Locked
Step 4: pg_locks shows predicate-lock coarsening from row to page. What do you tune?
5
Locked
Step 5: After tuning, error rate stays high. What is the next hypothesis?

Heap fillfactor and HOT chain length

pg_stat_all_tables.n_tup_hot_upd divided by n_tup_upd gives the HOT ratio. On UPDATE-heavy schemas you want this above 80%. Below 50% means fillfactor is too aggressive or the schema has the wrong index set for the workload.

The pgstattuple extension exposes tuple_count, dead_tuple_count, and free_space per table for deeper introspection.

Production tuning of autovacuum

The default knobs are too lax for any high-write table. Per-table overrides via storage parameters:

ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.05,
  autovacuum_vacuum_cost_limit = 2000,
  autovacuum_vacuum_cost_delay = 2
);

This turns on autovacuum at 5% dead-tuple ratio (instead of 20%), with higher cost-budget per cycle and shorter delay between IO chunks. On NVMe-backed clusters the cost_delay can be zero.

For tables with bursty write patterns, an explicit VACUUM (PARALLEL 4) orders; overnight via cron is common in addition to autovacuum.

autovacuum_max_workers bumps the worker pool from 3 to 6 or 8 on large hosts.

Watch pg_stat_progress_vacuum.phase to see live progress:

  • scanning heap
  • vacuuming indexes
  • vacuuming heap
  • cleaning up indexes
  • truncating heap
  • performing final cleanup

Observability: the metrics that actually move

Minimum-viable Postgres bloat dashboard:

-- Dead tuple ratios
SELECT relname, n_dead_tup, n_live_tup,
       n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables ORDER BY dead_ratio DESC LIMIT 20;

-- Snapshot pinners
SELECT pid, backend_xmin, now() - xact_start AS duration, state, query
FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin LIMIT 5;

-- Orphan replication slots
SELECT slot_name, xmin, confirmed_flush_lsn FROM pg_replication_slots WHERE xmin IS NOT NULL;

-- Wraparound ages
SELECT datname, age(datfrozenxid), age(datminmxid) FROM pg_database;
Pick the best fit

A bank's transfer service moves money between accounts. Pick the isolation strategy that best balances correctness, throughput, and operational complexity.

Design challenge

Design an autovacuum policy for a high-write OLTP table receiving 5,000 row inserts + updates per second, with a target dead-tuple ratio below 10% and zero impact on customer-facing latency.

  • Table is 200 GB, growing 2-3 GB per day.
  • Underlying storage is NVMe, single-tenant.
  • Application opens 5-minute analytics transactions intermittently from a replica.
  • Customer-facing p99 must stay under 50 ms during autovacuum runs.
  • No downtime windows.
Quiz

Postgres logs 'cannot remove tuples deleted by transactions in progress' during autovacuum on a 200 GB orders table. What is the most likely root cause?

Quiz

A bank insists on SERIALIZABLE isolation for transfers. Throughput at peak is 5x lower than under READ COMMITTED. The DBA proposes READ COMMITTED + SELECT FOR UPDATE. The compliance officer worries about correctness. Who is right and why?

Recall before you leave
  1. 01
    Explain how SSI detects write skew — what data structure does it use and when does it abort?
  2. 02
    Explain how a HOT chain breaks and what observable symptoms tell you it is happening too often.
  3. 03
    Why is hot_standby_feedback = OFF the recommended setting on analytics replicas, and what is the risk of turning it ON?
Recap

SSI (Serializable Snapshot Isolation, Postgres 9.1) detects write skew by tracking predicate locks (what I read) and rw-antidependencies (what I wrote that others were watching). A cycle means a dangerous schedule; SSI aborts one participant. False-positive rate of 0.1–1% is normal; above 5% indicates predicate-lock coarsening due to memory pressure or genuine workload conflict. Production autovacuum on high-write tables requires per-table overrides: scale_factor = 0.02–0.05, cost_delay = 0 on NVMe, max_workers = 6–8. Analytics replicas should run with hot_standby_feedback = OFF to decouple primary bloat from replica query duration. The minimum-viable bloat dashboard is dead_tuple ratio + backend_xmin age + 40001 error rate, plus replication slot xmin monitoring.

Connected lessons
appears again in140
Continue the climb ↑Real-world MVCC failures, deployment patterns, and distributed snapshots
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.