awesome-everything RU
↑ Back to the climb

Databases

Heap storage, TOAST, and column alignment

Crux How Postgres stores rows in 8KB pages, what TOAST does to wide values, why column declaration order affects storage size, and how to pick Postgres data types at the byte level.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min

A team trims a table from 40 columns to 35. Storage drops by 18%. They did not delete any data — they reordered the remaining columns to eliminate alignment padding. The database never changed its rules. The team finally read them.

The heap: how Postgres stores rows

Postgres stores table rows in fixed-size 8KB pages on disk. The layout matters for storage and query performance:

  • Each page has a header (24 bytes) and an item pointer array.
  • Each row (tuple) starts with a 23-byte tuple header containing transaction visibility information (xmin, xmax, ctid — covered in the MVCC lesson).
  • If any column is nullable, a null bitmap follows (1 bit per column, rounded to a byte).
  • Then the column data in declaration order, each padded to its alignment boundary.

The total row size affects how many rows fit on one page, which affects how many pages must be read for a sequential scan — one of the most direct connections between schema design and query cost.

Postgres typeStorage sizeAlignmentNotes
SMALLINT2 bytes2 bytesRange: ±32767
INTEGER4 bytes4 bytesRange: ±2.1B
BIGINT / BIGSERIAL8 bytes8 bytesRange: ±9.2×10¹⁸
UUID16 bytes4 bytesNative type, not TEXT
BOOLEAN1 byte1 byteTrue/false/null
TIMESTAMPTZ8 bytes8 bytesMicrosecond precision, stored UTC
NUMERIC(p,s)Variable (2-1000 bytes)4 bytesExact; use for money
TEXT / VARCHARVariable (1 byte + content)4 bytesNo storage difference; TEXT preferred
JSONBVariable (binary)4 bytesTOAST-able above ~2KB

Column alignment padding

Alignment padding is inserted between columns to ensure each value starts at its alignment boundary. A column layout that alternates between wide and narrow types wastes bytes:

-- Bad order: [SMALLINT(2), BIGINT(8), SMALLINT(2)]
-- Layout: 2 bytes + 6 bytes padding + 8 bytes + 2 bytes + 6 bytes padding = 24 bytes

-- Good order: [BIGINT(8), SMALLINT(2), SMALLINT(2)]
-- Layout: 8 bytes + 2 bytes + 2 bytes + 4 bytes padding = 16 bytes (struct alignment at end)

On a wide table (40 columns of mixed types), poor column ordering can waste 10-20% of storage. ORM-generated schemas rarely optimise for this; it is a real production lever on tables above ~100M rows.

Rule of thumb: declare columns from widest alignment to narrowest (8-byte types first, then 4-byte, then 2-byte, then 1-byte). Variable-length types (TEXT, JSONB) go last — they have internal length headers and their alignment cost is fixed regardless of position relative to fixed-width columns.

TOAST: handling wide values

Postgres pages are 8KB. Rows that exceed approximately 2KB trigger TOAST (The Oversized-Attribute Storage Technique). What happens:

  1. The wide value is compressed (using LZ4 or pglz by default).
  2. If still over the threshold, it is sliced into ~2KB chunks and stored in a separate TOAST table alongside the main table.
  3. The main row holds a pointer (18 bytes) referencing the TOAST chunks.

TOAST is transparent — a SELECT returns the full value as if it were inline — but has performance implications:

  • Reading a TOASTed column requires an extra disk fetch from the TOAST table (not in the main page).
  • Updating a TOASTed value rewrites the side-table chunks plus the pointer in the main row.
  • Listing rows without reading TOASTed columns is cheap — the pointer is inline, the chunks are not fetched unless you SELECT that column.

Implication for schema design: if a wide TEXT/JSONB/BYTEA column is rarely read (it is attached metadata that you only display on a detail page), excluding it from queries that list rows is important. SELECT * FROM events LIMIT 1000 fetches the TOASTed payload for every event — a SELECT id, event_type, created_at FROM events LIMIT 1000 does not.

Why this works

TOAST was introduced to preserve the 8KB-page model without hard-limiting row size. It is largely invisible to application code, but visible in EXPLAIN output (the TOAST table appears as a separate relation in the plan if the query reads TOASTed columns). The design trade: page uniformity (all pages are 8KB, predictable I/O) at the cost of occasional extra fetches for wide columns. The alternative would be variable-page-size storage, which Postgres’s buffer manager was not designed for.

Type selection: the production defaults

Integer types. Pick the narrowest range that exceeds your domain by 100x. A user_id that will never exceed 10 million can be INTEGER (4 bytes, saves 4 bytes per index entry). A payment ID that could reach billions should be BIGINT. Never SMALLINT for anything that could grow.

String types. TEXT is the production default. VARCHAR(n) adds a length-check constraint but uses the same storage. CHAR(n) pads to length (trailing spaces cause subtle bugs) — almost always wrong.

Money. NUMERIC(p,s) (exact arithmetic) or BIGINT storing cents. Never REAL or DOUBLE PRECISION — IEEE 754 floats accumulate rounding errors on financial arithmetic. A REAL money column drifts cents after months of transactions.

Timestamps. TIMESTAMPTZ (with time zone) is the production default — stores UTC, displays in the session timezone. TIMESTAMP (without time zone) is a footgun when the application spans multiple timezones. DATE for date-only data.

UUID vs BIGSERIAL. BIGSERIAL is 8 bytes, sequential (B-tree-friendly, good index locality), simple. UUID is 16 bytes, globally unique (good for distributed inserts, offline-first clients, merging datasets from multiple sources). UUIDv4 is random — B-tree indexes fragment badly on large tables. UUIDv7 (time-ordered, RFC 9562) fixes the locality issue with a timestamp prefix; it is the modern default when UUID is needed. Pick BIGSERIAL for single-region services; UUIDv7 for multi-region or offline-first.

Quiz

A team runs `SELECT * FROM events ORDER BY created_at DESC LIMIT 100` on a 50M-row table and notices the query is 3x slower than expected even with an index on created_at. The events table has a large JSONB payload column averaging 8KB. The most likely cause?

Pick the best fit

A new service needs a globally-unique row identifier for an orders table. The service is currently single-region but may expand to multi-region in 18 months.

Quiz

A wide events table declared columns in this order: (id BIGSERIAL, created_at TIMESTAMPTZ, type SMALLINT, flags BOOLEAN, user_id BIGINT, details TEXT). Which reordering reduces alignment padding?

Recall before you leave
  1. 01
    Describe the TOAST mechanism: when it triggers, what it does to the data, and the two performance implications for schema design.
  2. 02
    Why is UUIDv4 a poor choice for a primary key on a large table, and how does UUIDv7 fix the problem?
  3. 03
    State three type choices that are production defaults in Postgres and explain what the wrong alternative costs.
Recap

Postgres rows are stored in 8KB pages: a 23-byte tuple header, an optional null bitmap, then column data padded to alignment boundaries. Column declaration order affects padding — declare fixed-width columns widest-first (8-byte, then 4-byte, then 2-byte, then 1-byte, then variable-length) to minimise wasted bytes. Values wider than ~2KB TOAST to a side table; SELECT * always fetches them. For type selection: TIMESTAMPTZ (never TIMESTAMP), NUMERIC or BIGINT-cents (never REAL/FLOAT for money), TEXT (never CHAR(n)), BIGSERIAL for single-region PKs, UUIDv7 for globally-unique distributed keys. These are schema decisions that are hard to change on large tables — make them right at design time.

Connected lessons
appears again in140
Continue the climb ↑Schema integrity: deferral, versioning, and production failure modes
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.