awesome-everything RU
↑ Back to the climb

Databases

Row versions and snapshots: the on-disk mechanics

Crux Every Postgres tuple carries a 23-byte header with xmin/xmax; a snapshot is three numbers that decide which versions are visible — this is the entire mechanism.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

You know that MVCC gives each transaction a snapshot. The question is: what is a snapshot, exactly, and how does Postgres decide — for every single row on every single page scan — whether that row belongs in your snapshot or not?

Every row carries a header

When Postgres stores a tuple on a heap page, it prefixes the user columns with a 23-byte system header containing:

  • t_xmin — the transaction id that inserted this version
  • t_xmax — the transaction id that deleted or updated this version (zero if still live)
  • t_cid — the command id within a transaction
  • t_ctid — a pointer to the next version of this row (used for chaining updates)
  • t_infomask / t_infomask2 — bitmaps flagging commit status, lock state, freeze state, and HOT bits

These fields are how visibility decisions get made on every single tuple Postgres looks at.

FieldSizeMeaning
t_xmin4 BTransaction id that created this version
t_xmax4 BTransaction id that deleted/updated; 0 = still live
t_ctid6 BPointer to next version of this row
t_infomask2 BCommit status hint bits, lock flags

How a snapshot decides what to read

When a transaction begins, Postgres builds a snapshot — a small structure with three numbers:

  • xmin — the oldest still-running transaction id
  • xmax — one past the newest committed id at snapshot time
  • xip — the list of in-progress transactions between xmin and xmax

The visibility rule applied to every tuple is roughly: a tuple is visible if its t_xmin is committed and not in the in-progress list (xip), and its t_xmax is either zero, rolled back, or in the in-progress list. That is it. No locks involved.

  • Under READ COMMITTED (the Postgres default), a fresh snapshot is taken at the start of each statement.
  • Under REPEATABLE READ, the snapshot is taken once at the first statement of the transaction.

INSERT, UPDATE, DELETE — what actually happens on disk

INSERT places a fresh tuple with t_xmin = current_txid and t_xmax = 0.

UPDATE is two operations: marks the old tuple’s t_xmax = current_txid and inserts a new tuple with the new values (t_xmin = current_txid, t_xmax = 0). The old tuple’s t_ctid is chained to the new tuple’s page-and-offset.

DELETE just marks the existing tuple’s t_xmax = current_txid. Nothing is physically removed. All operations are logically reversible until commit and physically present until VACUUM.

A visibility walkthrough

Connection A runs BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 42; — does not commit. Connection B runs BEGIN; SELECT balance FROM accounts WHERE id = 42; at the same time.

Row id=42 now has two heap versions:

  • v1: balance=100, t_xmin = prior committed tx, t_xmax = A
  • v2: balance=0, t_xmin = A, t_xmax = 0

B’s snapshot was taken at B’s BEGIN. A is in B’s xip.

  • v1: t_xmin is committed (passes lower bound), t_xmax = A (in xip, deletion not visible) → v1 is visible
  • v2: t_xmin = A (in xip, fails lower bound) → v2 is invisible

B reads balance=100. No locks, no waiting.

Trace one row across an UPDATE and a concurrent SELECT

1/3
Trace it
1/6

Trace what happens to a row's MVCC state under SELECT then UPDATE then DELETE in three separate transactions, with a fourth long-running SELECT keeping an old snapshot alive.

1
Step 1 of 6
T1 inserts (id=5, balance=100). T1 commits. What is the heap state?
2
Locked
T2 begins (snapshot includes T1), reads id=5 (sees balance=100), and stays open without committing. Meanwhile T3 begins.
3
Locked
T3 runs UPDATE accounts SET balance = 50 WHERE id = 5, then commits.
4
Locked
T4 begins, runs DELETE FROM accounts WHERE id = 5, commits.
5
Locked
What does pg_stat_all_tables.n_dead_tup show for this table at this moment?
6
Locked
T2 finally commits. Now what happens?
Quiz

An UPDATE sets the row's t_xmax to the current transaction id and inserts a new tuple. What happens to the old tuple immediately after the transaction commits?

Quiz

Under READ COMMITTED, two SELECT statements in the same transaction query the same row. Transaction A updates and commits between the two SELECTs. What does the second SELECT return?

Recall before you leave
  1. 01
    What are t_xmin and t_xmax in a Postgres tuple header, and what do they each mean?
  2. 02
    What three numbers make up a Postgres snapshot, and what is each for?
  3. 03
    Why is READ COMMITTED surprising if you do two SELECTs on the same row in one transaction?
Recap

Every Postgres tuple carries a 23-byte system header with t_xmin (who inserted it), t_xmax (who deleted it; zero if live), t_ctid (pointer to the next version), and t_infomask hint bits. A snapshot captures three numbers — xmin, xmax, xip — and the visibility rule is applied tuple-by-tuple: is this tuple’s creation transaction committed and not in xip? Is its deletion transaction absent, rolled back, or in xip? No locks are taken. Under READ COMMITTED a fresh snapshot is taken per statement; under REPEATABLE READ once per transaction. Every UPDATE creates a new tuple and marks the old one dead; dead tuples persist until VACUUM confirms no snapshot still references them.

Connected lessons
appears again in140
Continue the climb ↑HOT updates and isolation levels: what you gain and what you pay
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.