Databases DB · 01 · 05

Heap storage, TOAST, and column alignment

How Postgres stores rows in 8KB pages, what TOAST does to wide values, why column declaration order affects storage size, and how to pick Postgres data types at the byte level.

DB Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

A team trims a table from 40 columns to 35. Storage drops by 18%. They did not delete any data — they reordered the remaining columns to eliminate alignment padding. The database never changed its rules. The team finally read them.

The heap: how Postgres stores rows

Postgres stores table rows in fixed-size 8KB pages on disk. The layout matters for storage and query performance:

Each page has a header (24 bytes) and an item pointer array.
Each row (tuple) starts with a 23-byte tuple header containing transaction visibility information (xmin, xmax, ctid — covered in the MVCC lesson).
If any column is nullable, a null bitmap follows (1 bit per column, rounded to a byte).
Then the column data in declaration order, each padded to its alignment boundary.

The total row size affects how many rows fit on one page, which affects how many pages must be read for a sequential scan — one of the most direct connections between schema design and query cost.

Postgres type	Storage size	Alignment	Notes
SMALLINT	2 bytes	2 bytes	Range: ±32767
INTEGER	4 bytes	4 bytes	Range: ±2.1B
BIGINT / BIGSERIAL	8 bytes	8 bytes	Range: ±9.2×10¹⁸
UUID	16 bytes	4 bytes	Native type, not TEXT
BOOLEAN	1 byte	1 byte	True/false/null
TIMESTAMPTZ	8 bytes	8 bytes	Microsecond precision, stored UTC
NUMERIC(p,s)	Variable (2-1000 bytes)	4 bytes	Exact; use for money
TEXT / VARCHAR	Variable (1 byte + content)	4 bytes	No storage difference; TEXT preferred
JSONB	Variable (binary)	4 bytes	TOAST-able above ~2KB

Column alignment padding

Before you declare another column, ask yourself: does its type force a wide gap before the next column? Alignment padding is inserted between columns to ensure each value starts at its alignment boundary. A column layout that alternates between wide and narrow types wastes bytes:

-- Bad order: [SMALLINT(2), BIGINT(8), SMALLINT(2)]
-- Layout: 2 bytes + 6 bytes padding + 8 bytes + 2 bytes + 6 bytes padding = 24 bytes

-- Good order: [BIGINT(8), SMALLINT(2), SMALLINT(2)]
-- Layout: 8 bytes + 2 bytes + 2 bytes + 4 bytes padding = 16 bytes (struct alignment at end)

On a wide table (40 columns of mixed types), poor column ordering can waste 10-20% of storage. ORM-generated schemas rarely optimise for this; it is a real production lever on tables above ~100M rows.

Rule of thumb: declare columns from widest alignment to narrowest (8-byte types first, then 4-byte, then 2-byte, then 1-byte). Variable-length types (TEXT, JSONB) go last — they have internal length headers and their alignment cost is fixed regardless of position relative to fixed-width columns.

TOAST: handling wide values

Postgres pages are 8KB. Rows that exceed approximately 2KB trigger TOAST (The Oversized-Attribute Storage Technique). What happens:

The wide value is compressed (using LZ4 or pglz by default).
If still over the threshold, it is sliced into ~2KB chunks and stored in a separate TOAST table alongside the main table.
The main row holds a pointer (18 bytes) referencing the TOAST chunks.

TOAST is transparent — a SELECT returns the full value as if it were inline — but has performance implications:

Reading a TOASTed column requires an extra disk fetch from the TOAST table (not in the main page).
Updating a TOASTed value rewrites the side-table chunks plus the pointer in the main row.
Listing rows without reading TOASTed columns is cheap — the pointer is inline, the chunks are not fetched unless you SELECT that column.

Implication for schema design: if a wide TEXT/JSONB/BYTEA column is rarely read (it is attached metadata that you only display on a detail page), excluding it from queries that list rows is important. SELECT * FROM events LIMIT 1000 fetches the TOASTed payload for every event — a SELECT id, event_type, created_at FROM events LIMIT 1000 does not.

▸Why this works

TOAST was introduced to preserve the 8KB-page model without hard-limiting row size. It is largely invisible to application code, but visible in EXPLAIN output (the TOAST table appears as a separate relation in the plan if the query reads TOASTed columns). The design trade: page uniformity (all pages are 8KB, predictable I/O) at the cost of occasional extra fetches for wide columns. The alternative would be variable-page-size storage, which Postgres’s buffer manager was not designed for.

Type selection: the production defaults

Integer types. Pick the narrowest range that exceeds your domain by 100x. A user_id that will never exceed 10 million can be INTEGER (4 bytes, saves 4 bytes per index entry). A payment ID that could reach billions should be BIGINT. Never SMALLINT for anything that could grow.

String types. TEXT is the production default. VARCHAR(n) adds a length-check constraint but uses the same storage. CHAR(n) pads to length (trailing spaces cause subtle bugs) — almost always wrong.

Money. NUMERIC(p,s) (exact arithmetic) or BIGINT storing cents. Never REAL or DOUBLE PRECISION — IEEE 754 floats accumulate rounding errors on financial arithmetic. A REAL money column drifts cents after months of transactions.

Timestamps. TIMESTAMPTZ (with time zone) is the production default — stores UTC, displays in the session timezone. TIMESTAMP (without time zone) is a footgun when the application spans multiple timezones. DATE for date-only data.

UUID vs BIGSERIAL. BIGSERIAL is 8 bytes, sequential (B-tree-friendly, good index locality), simple. UUID is 16 bytes, globally unique (good for distributed inserts, offline-first clients, merging datasets from multiple sources). UUIDv4 is random — B-tree indexes fragment badly on large tables. UUIDv7 (time-ordered, RFC 9562) fixes the locality issue with a timestamp prefix; it is the modern default when UUID is needed. Pick BIGSERIAL for single-region services; UUIDv7 for multi-region or offline-first.

The decision is index locality first, bytes second: BIGSERIAL and UUIDv7 keep B-tree inserts append-only; UUIDv4's randomness bloats the index 3-5x; ULID costs 26 bytes as TEXT with no native type.

Quiz

A team runs `SELECT * FROM events ORDER BY created_at DESC LIMIT 100` on a 50M-row table and notices the query is 3x slower than expected even with an index on created_at. The events table has a large JSONB payload column averaging 8KB. The most likely cause?

Pick the best fit

A new service needs a globally-unique row identifier for an orders table. The service is currently single-region but may expand to multi-region in 18 months.

Quiz

A wide events table declared columns in this order: (id BIGSERIAL, created_at TIMESTAMPTZ, type SMALLINT, flags BOOLEAN, user_id BIGINT, details TEXT). Which reordering reduces alignment padding?

A value over ~2KB is compressed, then sliced into ~2KB chunks in the TOAST side table; the main row keeps only an 18-byte pointer. SELECT * follows the pointer and fetches every chunk; a targeted SELECT skips it.

Recall before you leave

01
Describe the TOAST mechanism: when it triggers, what it does to the data, and the two performance implications for schema design.
02
Why is UUIDv4 a poor choice for a primary key on a large table, and how does UUIDv7 fix the problem?
03
State three type choices that are production defaults in Postgres and explain what the wrong alternative costs.

Recap

Postgres rows are stored in 8KB pages: a 23-byte tuple header, an optional null bitmap, then column data padded to alignment boundaries. Column declaration order affects padding — declare fixed-width columns widest-first (8-byte, then 4-byte, then 2-byte, then 1-byte, then variable-length) to minimise wasted bytes. Values wider than ~2KB TOAST to a side table; SELECT * always fetches them. For type selection: TIMESTAMPTZ (never TIMESTAMP), NUMERIC or BIGINT-cents (never REAL/FLOAT for money), TEXT (never CHAR(n)), BIGSERIAL for single-region PKs, UUIDv7 for globally-unique distributed keys. These are schema decisions that are hard to change on large tables — make them right at design time. Now when you see a SELECT * on a table with large JSONB or TEXT columns, you will know to reach for a targeted column list instead.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

JSONB, arrays, and when a side table winsmiddle

unlocks

Schema integrity: deferral, versioning, and production failure modessenior

deepens into

Schema integrity: deferral, versioning, and production failure modessenior

appears again in166

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.