Databases DB · 03 · 03

Join algorithms and the row-estimate cascade

Postgres picks Nested Loop, Hash Join, or Merge Join based on row estimates. When the outer side is underestimated by 1000×, Nested Loop blows up — the algorithm is the symptom, the bad row estimate is the cause.

DB Middle ◷ 16 min

Level

FoundationsJuniorMiddleSenior

A query returns 50 rows in 50ms on staging. In production with the same schema, identical indexes, same query — 4.2 seconds. EXPLAIN ANALYZE reveals Nested Loop ... loops=520000. The inner index scan ran half a million times. The planner thought the outer side had 50 rows. It had 520,000. One bad row estimate. One wrong join choice. 80× slowdown.

The three join algorithms

Algorithm	Cost shape	Wins when	Danger
Nested Loop	outer_rows × inner_cost	Outer is small (10s of rows), inner has an index	Blows up if outer is misestimated as small
Hash Join	build_cost + probe_cost	Medium-to-large equi-joins; neither side has a sort index	Spills to disk if hash table exceeds work_mem
Merge Join	sort_cost + merge_cost	Both sides already sorted (matching index); large equi-joins	Requires sorted input; sort can spill if work_mem is low

Nested Loop

For each row in the outer relation, look up matching rows in the inner relation — typically via index. Cost is outer_rows × inner_cost_per_lookup. It wins when the outer side is small (tens of rows) because the inner index lookup cost is paid only that many times. It catastrophically fails when the outer side is underestimated: if the planner thinks there are 10 outer rows but there are 10,000, the inner lookup runs 10,000 times instead of 10 — 1000× more work.

Diagnostic: loops count on the inner node. In a healthy Nested Loop: loops=50. In a blown-up one: loops=520000.

Hash Join

Builds a hash table from the smaller (build) side, then probes it with rows from the larger (probe) side. Cost is build + probe. Wins for medium-to-large equi-joins where neither side has an index aligned with the join key. The critical parameter is work_mem: the hash table must fit in memory. When it does not, Hash Batches exceeds 1 and the table spills to disk — look for Batches: 64 or similar in the plan output. Fix: SET work_mem = '64MB' for the session (not globally without accounting for max_connections).

Merge Join

Sorts both sides on the join key (or uses indexes that already provide sort order), then merges in parallel. Wins when both sides arrive sorted — for example, joining two tables where both have ORDER BY id and matching indexes. Adding no-extra-sort cost to the join. Useful for range joins and when sort order is also needed for the final result.

The row-estimate cascade

This is the most important concept in this lesson — and the one most engineers miss when they try to fix join performance by tweaking algorithms instead of statistics. A bad row estimate at one plan node cascades into every node above it:

Planner thinks the filter WHERE country='US' AND region='CA' AND status='shipped' returns 50 rows (independent probabilities: 50% × 5% × 20% = 0.5%)
Planner picks Nested Loop — cheap when outer is small
Reality: the columns are correlated (all CA orders are in US), actual selectivity is 5% × 20% = 1% — not 0.5%, but the planner also missed the index cardinality; actual rows = 520,000
The inner index scan runs 520,000 times instead of 50 — 10,400× more work

These four steps together mean that one bad probability multiplication at step 1 corrupts every decision above it; without step 1 being right, steps 2–4 are guaranteed wrong. The wrong join algorithm (Nested Loop instead of Hash Join) is the symptom. The wrong row estimate is the cause. Forcing the algorithm (e.g., SET enable_nestloop = off) masks the symptom but leaves the root cause. Fix the estimate — the algorithm choice follows.

The planner predicted 50 outer rows; reality was 520,000. That single estimate gap — not the algorithm — is what drives the 50ms-to-4.2s blowup.

Non-sargable predicates

A predicate is “sargable” (Search ARGument-ABLE) if the planner can use an index for it. Non-sargable predicates force Seq Scans and corrupt row estimates.

Common offenders:

WHERE LOWER(email) = 'alice@x.com' — function on indexed column → use expression index CREATE INDEX ON users (LOWER(email))
WHERE created_at::date = '2026-01-01' — type coercion → rewrite as WHERE created_at >= '2026-01-01' AND created_at < '2026-01-02'
WHERE EXTRACT(year FROM created_at) = 2026 — function call → same range rewrite
WHERE id::text = '42' — implicit cast → WHERE id = 42 (fix the application type)
WHERE name LIKE '%foo' — leading wildcard → pg_trgm GIN index for fuzzy match

EXPLAIN reveals these immediately: Seq Scan + Filter where you expected Index Scan. The Filter line shows what was applied after the scan rather than before — meaning the index could not help.

Diagnose and fix a row-estimate blowup

1/3

Quiz

An EXPLAIN ANALYZE shows `Nested Loop (cost=0..50 rows=10) ... -> Index Scan ... (loops=10000)`. What is the most likely cause?

Quiz

A query plan shows `Hash Join ... Hash Batches: 64`. What does this indicate and what is the fix?

Quiz

Which predicate is NON-sargable and will force a sequential scan even if an index exists on `created_at`?

The wrong algorithm is the symptom; the bad outer-row estimate is the cause. Fix the estimate (ANALYZE, extended statistics) and the planner stops choosing Nested Loop.

Recall before you leave

01
Explain the row-estimate cascade: why does a bad estimate at one node break the entire plan above it?
02
When is Hash Join the right choice, and what makes it spill to disk?
03
What is a non-sargable predicate, why does it matter for performance, and how do you fix the most common cases?

Recap

Postgres picks among three join algorithms: Nested Loop (outer_rows × inner_cost, wins for small outers with an inner index), Hash Join (build hash table + probe, wins for medium-to-large equi-joins, spills to disk when hash table exceeds work_mem), and Merge Join (sort both sides, merge in parallel, wins when both sides arrive sorted). The algorithm choice is driven by row estimates at each node — a 1000× underestimate on the outer side of a Nested Loop turns the inner lookup into 1000× more work than planned. The wrong algorithm is always a symptom; the wrong row estimate is always the root cause. Fix estimates (ANALYZE, extended statistics) and the algorithm choice corrects itself. Non-sargable predicates (functions on indexed columns, implicit casts) prevent index use and corrupt estimates — rewrite them as range predicates or add expression indexes. Now when you see loops= in the thousands on an inner node, your next step is not to disable Nested Loop — it is to find what made the outer estimate so wrong.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 8 done

Connected lessons

builds on

Scan types: Seq, Index, Bitmap, Index-Onlymiddle

unlocks

deepens into

appears again in177

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Query plan visualizerPaste an EXPLAIN (ANALYZE, FORMAT JSON) and render the plan tree with per-node timing and row-estimate error, so a bad join jumps out visually.