awesome-everything RU
↑ Back to the climb

Databases

MVCC and isolation: diagnose bloat and a write-skew anomaly

Crux Hands-on project — reproduce MVCC bloat and a write-skew anomaly on a real Postgres, diagnose the oldest-xmin pin, fix isolation and autovacuum, and prove it with before/after numbers.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 240 min

Reading about runaway bloat and write skew is not the same as watching a table double overnight because someone left a transaction open, or seeing two transactions break an invariant and both commit. Build a small Postgres workload, drive it into both failures, then fix them with the unit’s levers — proving each step with system-view numbers.

Goal

Turn the unit’s mental model into a reproducible loop: instrument MVCC state, reproduce oldest-xmin bloat and a write-skew anomaly on a real database, diagnose each from the catalog views, fix with the minimal isolation and autovacuum change, and verify with before/after measurements.

Project
0 of 7
Objective

On a real PostgreSQL instance, reproduce two MVCC failures — oldest-xmin bloat and a write-skew anomaly — then diagnose and fix both, proving each fix with measurements from pg_stat views, not assertions.

Requirements
Acceptance criteria
  • A before/after table for the bloat scenario: dead-tuple ratio, table size, and oldest backend_xmin age — measured under the same workload, showing the ratio falling after the pin is released.
  • The autovacuum 'not yet removable' log line captured, with the matching pid/slot identified from a catalog query (not guessed).
  • A transcript of the write-skew anomaly committing under REPEATABLE READ, plus a transcript of SERIALIZABLE aborting one transaction with 40001 and the retry preserving the invariant.
  • A one-paragraph write-up: which isolation strategy you would ship for the invariant and why, and which infrastructure setting (timeout, scale_factor) is the durable defence against the bloat failure.
Senior stretch
  • Add a one-page on-call runbook: the three queries to run when a table is bloating (backend_xmin, slot xmin, wraparound ages), how to read the autovacuum log, and the fix decision tree.
  • Demonstrate the HOT path: add a non-indexed column, run an UPDATE-only loop, and show n_tup_hot_upd / n_tup_upd above 80%; then change fillfactor to 100 (or update an indexed column) and show the HOT ratio collapse and WAL per update roughly double.
  • Push the write-skew fix under load: benchmark SERIALIZABLE vs RC + FOR UPDATE on a hot row at high concurrency, and chart the 40001 retry rate against throughput to show where SSI false positives dominate.
  • Reproduce MultiXact growth: run a foreign-key-heavy workload (many child INSERTs against one parent) and watch age(datminmxid) advance, confirming shared locks pressure a separate wraparound clock from XIDs.
Recap

This is the loop you will run in every real MVCC incident: instrument the catalog views first, reproduce the failure deliberately, diagnose from backend_xmin / slot xmin / wraparound ages rather than guessing, fix with the minimal lever (release the pin, pick the weakest isolation level that protects the invariant, tune autovacuum per table), and verify with before/after numbers under the same workload. Doing it once on a toy database makes the production version — where a table is doubling and the pager is going off — muscle memory instead of panic.

Continue the climb ↑Connection pools: amortising the cost of a Postgres backend
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources2
expand
  1. 01
  2. 02

Trademarks belong to their respective owners. Editorial reference only.