Databases DB · 04 · 02

Row versions and snapshots: the on-disk mechanics

Every Postgres tuple carries a 23-byte header with xmin/xmax; a snapshot is three numbers that decide which versions are visible — this is the entire mechanism.

DB Middle ◷ 14 min

Level

FoundationsJuniorMiddleSenior

You know that MVCC gives each transaction a snapshot. The question is: what is a snapshot, exactly, and how does Postgres decide — for every single row on every single page scan — whether that row belongs in your snapshot or not?

Every row carries a header

When you wonder why two concurrent transactions see different values of the same row — or why a SELECT right after a bulk INSERT is slower than it looks — the answer lives in these 23 bytes.

When Postgres stores a tuple on a heap page, it prefixes the user columns with a 23-byte system header containing:

t_xmin — the transaction id that inserted this version
t_xmax — the transaction id that deleted or updated this version (zero if still live)
t_cid — the command id within a transaction
t_ctid — a pointer to the next version of this row (used for chaining updates)
t_infomask / t_infomask2 — bitmaps flagging commit status, lock state, freeze state, and HOT bits

These fields are how visibility decisions get made on every single tuple Postgres looks at.

Field	Size	Meaning
`t_xmin`	4 B	Transaction id that created this version
`t_xmax`	4 B	Transaction id that deleted/updated; 0 = still live
`t_ctid`	6 B	Pointer to next version of this row
`t_infomask`	2 B	Commit status hint bits, lock flags

This fixed 23-byte header rides ahead of every row's user columns. The visibility math runs against these bytes, which is why every UPDATE that writes a new tuple also pays the header tax again.

How a snapshot decides what to read

When a transaction begins, Postgres builds a snapshot — a small structure with three numbers:

xmin — the oldest still-running transaction id
xmax — one past the newest committed id at snapshot time
xip — the list of in-progress transactions between xmin and xmax

The visibility rule applied to every tuple is roughly: a tuple is visible if its t_xmin is committed and not in the in-progress list (xip), and its t_xmax is either zero, rolled back, or in the in-progress list. That is it. No locks involved.

Under READ COMMITTED (the Postgres default), a fresh snapshot is taken at the start of each statement.
Under REPEATABLE READ, the snapshot is taken once at the first statement of the transaction.

INSERT, UPDATE, DELETE — what actually happens on disk

INSERT places a fresh tuple with t_xmin = current_txid and t_xmax = 0.

UPDATE is two operations: marks the old tuple’s t_xmax = current_txid and inserts a new tuple with the new values (t_xmin = current_txid, t_xmax = 0). The old tuple’s t_ctid is chained to the new tuple’s page-and-offset.

DELETE just marks the existing tuple’s t_xmax = current_txid. Nothing is physically removed. All operations are logically reversible until commit and physically present until VACUUM.

A visibility walkthrough

Connection A runs BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 42; — does not commit. Connection B runs BEGIN; SELECT balance FROM accounts WHERE id = 42; at the same time.

Row id=42 now has two heap versions:

v1: balance=100, t_xmin = prior committed tx, t_xmax = A
v2: balance=0, t_xmin = A, t_xmax = 0

B’s snapshot was taken at B’s BEGIN. A is in B’s xip.

v1: t_xmin is committed (passes lower bound), t_xmax = A (in xip, deletion not visible) → v1 is visible
v2: t_xmin = A (in xip, fails lower bound) → v2 is invisible

B reads balance=100. No locks, no waiting.

Trace one row across an UPDATE and a concurrent SELECT

1/3

Trace it

1/6

Trace what happens to a row's MVCC state under SELECT then UPDATE then DELETE in three separate transactions, with a fourth long-running SELECT keeping an old snapshot alive.

Step 1 of 6

T1 inserts (id=5, balance=100). T1 commits. What is the heap state?

Locked

T2 begins (snapshot includes T1), reads id=5 (sees balance=100), and stays open without committing. Meanwhile T3 begins.

Locked

T3 runs UPDATE accounts SET balance = 50 WHERE id = 5, then commits.

Locked

T4 begins, runs DELETE FROM accounts WHERE id = 5, commits.

Locked

What does pg_stat_all_tables.n_dead_tup show for this table at this moment?

Locked

T2 finally commits. Now what happens?

Quiz

An UPDATE sets the row's t_xmax to the current transaction id and inserts a new tuple. What happens to the old tuple immediately after the transaction commits?

Quiz

Under READ COMMITTED, two SELECT statements in the same transaction query the same row. Transaction A updates and commits between the two SELECTs. What does the second SELECT return?

A updated id=42 but has not committed, so A is in B's in-progress list (xip). v1's deletion by A is not yet visible, so B reads v1 (balance=100); v2 was created by A, so it is invisible to B.

Recall before you leave

01
What are t_xmin and t_xmax in a Postgres tuple header, and what do they each mean?
02
What three numbers make up a Postgres snapshot, and what is each for?
03
Why is READ COMMITTED surprising if you do two SELECTs on the same row in one transaction?

Recap

Every Postgres tuple carries a 23-byte system header with t_xmin (who inserted it), t_xmax (who deleted it; zero if live), t_ctid (pointer to the next version), and t_infomask hint bits. A snapshot captures three numbers — xmin, xmax, xip — and the visibility rule is applied tuple-by-tuple: is this tuple’s creation transaction committed and not in xip? Is its deletion transaction absent, rolled back, or in xip? No locks are taken. Under READ COMMITTED a fresh snapshot is taken per statement; under REPEATABLE READ once per transaction. Every UPDATE creates a new tuple and marks the old one dead; dead tuples persist until VACUUM confirms no snapshot still references them. Now when you see a query return a “stale” value while another session is mid-update, you can trace it directly to xip: that transaction is in the in-progress list, and visibility math keeps the old version alive for you.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

MVCC: why readers and writers never wait for each otherjunior

unlocks

deepens into

appears again in166

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.