Databases
Heap storage, TOAST, and column alignment
A team trims a table from 40 columns to 35. Storage drops by 18%. They did not delete any data — they reordered the remaining columns to eliminate alignment padding. The database never changed its rules. The team finally read them.
The heap: how Postgres stores rows
Postgres stores table rows in fixed-size 8KB pages on disk. The layout matters for storage and query performance:
- Each page has a header (24 bytes) and an item pointer array.
- Each row (tuple) starts with a 23-byte tuple header containing transaction visibility information (xmin, xmax, ctid — covered in the MVCC lesson).
- If any column is nullable, a null bitmap follows (1 bit per column, rounded to a byte).
- Then the column data in declaration order, each padded to its alignment boundary.
The total row size affects how many rows fit on one page, which affects how many pages must be read for a sequential scan — one of the most direct connections between schema design and query cost.
| Postgres type | Storage size | Alignment | Notes |
|---|---|---|---|
| SMALLINT | 2 bytes | 2 bytes | Range: ±32767 |
| INTEGER | 4 bytes | 4 bytes | Range: ±2.1B |
| BIGINT / BIGSERIAL | 8 bytes | 8 bytes | Range: ±9.2×10¹⁸ |
| UUID | 16 bytes | 4 bytes | Native type, not TEXT |
| BOOLEAN | 1 byte | 1 byte | True/false/null |
| TIMESTAMPTZ | 8 bytes | 8 bytes | Microsecond precision, stored UTC |
| NUMERIC(p,s) | Variable (2-1000 bytes) | 4 bytes | Exact; use for money |
| TEXT / VARCHAR | Variable (1 byte + content) | 4 bytes | No storage difference; TEXT preferred |
| JSONB | Variable (binary) | 4 bytes | TOAST-able above ~2KB |
Column alignment padding
Alignment padding is inserted between columns to ensure each value starts at its alignment boundary. A column layout that alternates between wide and narrow types wastes bytes:
-- Bad order: [SMALLINT(2), BIGINT(8), SMALLINT(2)]
-- Layout: 2 bytes + 6 bytes padding + 8 bytes + 2 bytes + 6 bytes padding = 24 bytes
-- Good order: [BIGINT(8), SMALLINT(2), SMALLINT(2)]
-- Layout: 8 bytes + 2 bytes + 2 bytes + 4 bytes padding = 16 bytes (struct alignment at end)On a wide table (40 columns of mixed types), poor column ordering can waste 10-20% of storage. ORM-generated schemas rarely optimise for this; it is a real production lever on tables above ~100M rows.
Rule of thumb: declare columns from widest alignment to narrowest (8-byte types first, then 4-byte, then 2-byte, then 1-byte). Variable-length types (TEXT, JSONB) go last — they have internal length headers and their alignment cost is fixed regardless of position relative to fixed-width columns.
TOAST: handling wide values
Postgres pages are 8KB. Rows that exceed approximately 2KB trigger TOAST (The Oversized-Attribute Storage Technique). What happens:
- The wide value is compressed (using LZ4 or pglz by default).
- If still over the threshold, it is sliced into ~2KB chunks and stored in a separate TOAST table alongside the main table.
- The main row holds a pointer (18 bytes) referencing the TOAST chunks.
TOAST is transparent — a SELECT returns the full value as if it were inline — but has performance implications:
- Reading a TOASTed column requires an extra disk fetch from the TOAST table (not in the main page).
- Updating a TOASTed value rewrites the side-table chunks plus the pointer in the main row.
- Listing rows without reading TOASTed columns is cheap — the pointer is inline, the chunks are not fetched unless you SELECT that column.
Implication for schema design: if a wide TEXT/JSONB/BYTEA column is rarely read (it is attached metadata that you only display on a detail page), excluding it from queries that list rows is important. SELECT * FROM events LIMIT 1000 fetches the TOASTed payload for every event — a SELECT id, event_type, created_at FROM events LIMIT 1000 does not.
Why this works
TOAST was introduced to preserve the 8KB-page model without hard-limiting row size. It is largely invisible to application code, but visible in EXPLAIN output (the TOAST table appears as a separate relation in the plan if the query reads TOASTed columns). The design trade: page uniformity (all pages are 8KB, predictable I/O) at the cost of occasional extra fetches for wide columns. The alternative would be variable-page-size storage, which Postgres’s buffer manager was not designed for.
Type selection: the production defaults
Integer types. Pick the narrowest range that exceeds your domain by 100x. A user_id that will never exceed 10 million can be INTEGER (4 bytes, saves 4 bytes per index entry). A payment ID that could reach billions should be BIGINT. Never SMALLINT for anything that could grow.
String types. TEXT is the production default. VARCHAR(n) adds a length-check constraint but uses the same storage. CHAR(n) pads to length (trailing spaces cause subtle bugs) — almost always wrong.
Money. NUMERIC(p,s) (exact arithmetic) or BIGINT storing cents. Never REAL or DOUBLE PRECISION — IEEE 754 floats accumulate rounding errors on financial arithmetic. A REAL money column drifts cents after months of transactions.
Timestamps. TIMESTAMPTZ (with time zone) is the production default — stores UTC, displays in the session timezone. TIMESTAMP (without time zone) is a footgun when the application spans multiple timezones. DATE for date-only data.
UUID vs BIGSERIAL. BIGSERIAL is 8 bytes, sequential (B-tree-friendly, good index locality), simple. UUID is 16 bytes, globally unique (good for distributed inserts, offline-first clients, merging datasets from multiple sources). UUIDv4 is random — B-tree indexes fragment badly on large tables. UUIDv7 (time-ordered, RFC 9562) fixes the locality issue with a timestamp prefix; it is the modern default when UUID is needed. Pick BIGSERIAL for single-region services; UUIDv7 for multi-region or offline-first.
A team runs `SELECT * FROM events ORDER BY created_at DESC LIMIT 100` on a 50M-row table and notices the query is 3x slower than expected even with an index on created_at. The events table has a large JSONB payload column averaging 8KB. The most likely cause?
A new service needs a globally-unique row identifier for an orders table. The service is currently single-region but may expand to multi-region in 18 months.
A wide events table declared columns in this order: (id BIGSERIAL, created_at TIMESTAMPTZ, type SMALLINT, flags BOOLEAN, user_id BIGINT, details TEXT). Which reordering reduces alignment padding?
- 01Describe the TOAST mechanism: when it triggers, what it does to the data, and the two performance implications for schema design.
- 02Why is UUIDv4 a poor choice for a primary key on a large table, and how does UUIDv7 fix the problem?
- 03State three type choices that are production defaults in Postgres and explain what the wrong alternative costs.
Postgres rows are stored in 8KB pages: a 23-byte tuple header, an optional null bitmap, then column data padded to alignment boundaries. Column declaration order affects padding — declare fixed-width columns widest-first (8-byte, then 4-byte, then 2-byte, then 1-byte, then variable-length) to minimise wasted bytes. Values wider than ~2KB TOAST to a side table; SELECT * always fetches them. For type selection: TIMESTAMPTZ (never TIMESTAMP), NUMERIC or BIGINT-cents (never REAL/FLOAT for money), TEXT (never CHAR(n)), BIGSERIAL for single-region PKs, UUIDv7 for globally-unique distributed keys. These are schema decisions that are hard to change on large tables — make them right at design time.
appears again in140
- Why GraphQL gets N+1junior
- DataLoader mechanics: tick-boundary batchingmiddle
- Batch function contracts: ordering, shapes, errorsmiddle
- Federation and lookahead: batching beyond DataLoadermiddle
- Query complexity defences: depth, cost, persisted queriesmiddle
- Senior GraphQL API: scheduling contract, tenant isolation, observabilitysenior
- Why idempotency: making retries safejunior
- Server-side state machine: four states of an idempotency keymiddle
- Outbox and inbox: effectively-once across the dual-write boundarymiddle
- Concurrency and cache architecture for idempotency at scalesenior
- Observability, production failures, and global-scale designsenior
- The event loop: one thread, three queuesjunior
- Tasks, microtasks, and scheduler.yield()middle
- Microtask starvation, Long Tasks, and LoAFsenior
- Node.js event loop: phases, nextTick, and loop lagsenior
- React, Vue, and INP observability in productionsenior
- The render pipeline: six stages from bytes to pixelsjunior
- Stage costs and the renderer process modelmiddle
- Invalidation, dirty bits, and containmiddle
- Compositor layers: promotion, overlap, and GPU memorymiddle
- DevTools flame strip and the frame lifecyclemiddle
- Layout thrash: forced synchronous layoutsenior
- BeginMainFrame, compositor-driven animations, and GPU memorysenior
- Production observability: LoAF, INP, and the full attack surfacesenior
- What V8 is and why performance varies 100×junior
- V8''''s four-tier JIT pipeline and profile-guided tieringmiddle
- Hidden classes, transition trees, and memory layoutmiddle
- Inline caches, IC states, and deoptimizationmiddle
- Orinoco GC: parallel scavenger, concurrent marking, and write barriersmiddle
- TurboFan''''s speculative engine and the deopt-loop trapsenior
- V8 in production: isolates, pointer compression, and real failuressenior
- Service worker lifecycle and cache strategiesmiddle
- Service worker edge cases: version skew, durability, and navigation trapssenior
- What the reconciler does: render vs commitjunior
- The fiber object and the double-buffer treemiddle
- Render phase purity and commit phase sub-stepsmiddle
- Reconciliation: diffing heuristics and the key trapmiddle
- Priority lanes, time-slicing, and useTransitionmiddle
- Bailout, memoisation, and tearingsenior
- React Profiler, the Compiler, and production observabilitysenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Hydration mismatch: causes, detection, and the determinism rulesenior
- RSC, per-route strategy, and production observabilitysenior
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- CLS: why layout shifts happen and how to stop themmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- What is a cache stampede and why it makes things worsejunior
- Lock and single-flight: bounding concurrent rebuildsmiddle
- XFetch: coordination-free probabilistic early expirationmiddle
- Stale-while-revalidate and CDN request coalescingmiddle
- Detecting stampedes and designing TTL for productionmiddle
- Metastable failure, fencing tokens, and production postmortemssenior
- Raft roles, terms, and why majority quorums prevent split brainjunior
- How Raft replicates a log entry and decides it is safe to commitmiddle
- Raft leader election: timeouts, voting rules, and the four safety propertiesmiddle
- Raft in the real world: partitions, slow disks, and client routingmiddle
- Raft extensions: pre-vote, learners, snapshots, and linearizable readssenior
- Raft in production: membership changes, Multi-Raft, and observabilitysenior
- Where data fetching happens — and why it decides LCPjunior
- Fetch waterfalls — diagnosis and the Promise.all curemiddle
- React Server Components and Suspense streamingmiddle
- Client-side cache: TanStack Query, SWR, and stale-while-revalidatemiddle
- LCP, prefetch, and race conditions in interactive fetchingmiddle
- Senior internals: RSC payload, caching layers, and production failure modessenior
- The three-way handshakejunior
- Sequence numbers and connection statemiddle
- DNS: what it does and why it existsjunior
- The resolver walk: referrals, record types, and gluemiddle
- TTL, caching, and DNS propagationmiddle
- The 1-RTT handshake: key shares and ECDHEmiddle
- Session resumption and 0-RTTmiddle
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket frame format: opcodes, masking, fragmentationmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Health checks, connection draining, and slow startmiddle
- Session affinity, consistent hashing, and the right fixmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- Connection IDs and network migrationmiddle
- 0-RTT resumption and packet encryptionsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- PII redaction and log injectionsenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- SLI, SLO, and the error budget: reliability by the numbersjunior
- Error budget policy, latency SLOs, and composite journeysmiddle
- Production SLO failures, self-observability, security, and the big picturesenior
- The incident loop: from pager to postmortem to preventionmiddle
- Cache lines, struct layout, and false sharingmiddle
- SIMD, SoA vs AoS, and memory bandwidthmiddle
- Cache-oblivious algorithms, PGO, and production failuressenior
- GC in production: observability, security, edge cases, and fleet governancesenior
- Batching: amortize fixed cost per operationjunior
- The batching window: size and wait timemiddle
- Batching in Kafka and Postgresmiddle
- io_uring and observability of batchingmiddle
- From Nagle to io_uring: evolution of batchingmiddle
- Backpressure, failure isolation, and batch security in productionsenior
- CI enforcement and RUM: making budgets stickmiddle
- V8 JIT pipeline, HTTP priorities, and bundle securitysenior
- The performance loop: discipline, not a projectjunior
- Classify and fix: matching bottleneck families to remediesmiddle
- Observability stack and CI gates: catching regressions before they shipmiddle
- Incident to enforcement: SLO burn to verified fix in 35 minutesmiddle
- Culture, economics, and org-scale performancesenior
- At-most-once, at-least-once, exactly-once: the three delivery contractsjunior
- The three failure legs — where duplicates and losses actually happenmiddle
- Consumer-side dedup: the cheapest path to exactly-once processingmiddle
- Kafka exactly-once semantics: idempotent producer and transactionsmiddle
- SQS visibility timeout, DLQ, and the outbox patternmiddle
- Exactly-once in production: impossibility proof, hybrid patterns, and real incidentssenior
- What OAuth is and why passwords are not the answerjunior
- Authorization code flow with PKCEmiddle
- ID token validation and JWKS cache managementmiddle
- Refresh token rotation and scope-based least privilegemiddle
- Sender-constrained tokens: DPoP and mTLSsenior
- OAuth in production: audience attacks, observability, and real failuressenior