awesome-everything RU
↑ Back to the climb

Databases

Join algorithms and the row-estimate cascade

Crux Postgres picks Nested Loop, Hash Join, or Merge Join based on row estimates. When the outer side is underestimated by 1000×, Nested Loop blows up — the algorithm is the symptom, the bad row estimate is the cause.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 16 min

A query returns 50 rows in 50ms on staging. In production with the same schema, identical indexes, same query — 4.2 seconds. EXPLAIN ANALYZE reveals Nested Loop ... loops=520000. The inner index scan ran half a million times. The planner thought the outer side had 50 rows. It had 520,000. One bad row estimate. One wrong join choice. 80× slowdown.

The three join algorithms

AlgorithmCost shapeWins whenDanger
Nested Loopouter_rows × inner_costOuter is small (10s of rows), inner has an indexBlows up if outer is misestimated as small
Hash Joinbuild_cost + probe_costMedium-to-large equi-joins; neither side has a sort indexSpills to disk if hash table exceeds work_mem
Merge Joinsort_cost + merge_costBoth sides already sorted (matching index); large equi-joinsRequires sorted input; sort can spill if work_mem is low

Nested Loop

For each row in the outer relation, look up matching rows in the inner relation — typically via index. Cost is outer_rows × inner_cost_per_lookup. It wins when the outer side is small (tens of rows) because the inner index lookup cost is paid only that many times. It catastrophically fails when the outer side is underestimated: if the planner thinks there are 10 outer rows but there are 10,000, the inner lookup runs 10,000 times instead of 10 — 1000× more work.

Diagnostic: loops count on the inner node. In a healthy Nested Loop: loops=50. In a blown-up one: loops=520000.

Hash Join

Builds a hash table from the smaller (build) side, then probes it with rows from the larger (probe) side. Cost is build + probe. Wins for medium-to-large equi-joins where neither side has an index aligned with the join key. The critical parameter is work_mem: the hash table must fit in memory. When it does not, Hash Batches exceeds 1 and the table spills to disk — look for Batches: 64 or similar in the plan output. Fix: SET work_mem = '64MB' for the session (not globally without accounting for max_connections).

Merge Join

Sorts both sides on the join key (or uses indexes that already provide sort order), then merges in parallel. Wins when both sides arrive sorted — for example, joining two tables where both have ORDER BY id and matching indexes. Adding no-extra-sort cost to the join. Useful for range joins and when sort order is also needed for the final result.

The row-estimate cascade

This is the most important concept in this lesson. A bad row estimate at one plan node cascades into every node above it:

  1. Planner thinks the filter WHERE country='US' AND region='CA' AND status='shipped' returns 50 rows (independent probabilities: 50% × 5% × 20% = 0.5%)
  2. Planner picks Nested Loop — cheap when outer is small
  3. Reality: the columns are correlated (all CA orders are in US), actual selectivity is 5% × 20% = 1% — not 0.5%, but the planner also missed the index cardinality; actual rows = 520,000
  4. The inner index scan runs 520,000 times instead of 50 — 10,400× more work

The wrong join algorithm (Nested Loop instead of Hash Join) is the symptom. The wrong row estimate is the cause. Forcing the algorithm (e.g., SET enable_nestloop = off) masks the symptom but leaves the root cause. Fix the estimate — the algorithm choice follows.

Non-sargable predicates

A predicate is “sargable” (Search ARGument-ABLE) if the planner can use an index for it. Non-sargable predicates force Seq Scans and corrupt row estimates.

Common offenders:

  • WHERE LOWER(email) = 'alice@x.com' — function on indexed column → use expression index CREATE INDEX ON users (LOWER(email))
  • WHERE created_at::date = '2026-01-01' — type coercion → rewrite as WHERE created_at >= '2026-01-01' AND created_at < '2026-01-02'
  • WHERE EXTRACT(year FROM created_at) = 2026 — function call → same range rewrite
  • WHERE id::text = '42' — implicit cast → WHERE id = 42 (fix the application type)
  • WHERE name LIKE '%foo' — leading wildcard → pg_trgm GIN index for fuzzy match

EXPLAIN reveals these immediately: Seq Scan + Filter where you expected Index Scan. The Filter line shows what was applied after the scan rather than before — meaning the index could not help.

Diagnose and fix a row-estimate blowup

1/3
Quiz

An EXPLAIN ANALYZE shows `Nested Loop (cost=0..50 rows=10) ... -> Index Scan ... (loops=10000)`. What is the most likely cause?

Quiz

A query plan shows `Hash Join ... Hash Batches: 64`. What does this indicate and what is the fix?

Quiz

Which predicate is NON-sargable and will force a sequential scan even if an index exists on `created_at`?

Recall before you leave
  1. 01
    Explain the row-estimate cascade: why does a bad estimate at one node break the entire plan above it?
  2. 02
    When is Hash Join the right choice, and what makes it spill to disk?
  3. 03
    What is a non-sargable predicate, why does it matter for performance, and how do you fix the most common cases?
Recap

Postgres picks among three join algorithms: Nested Loop (outer_rows × inner_cost, wins for small outers with an inner index), Hash Join (build hash table + probe, wins for medium-to-large equi-joins, spills to disk when hash table exceeds work_mem), and Merge Join (sort both sides, merge in parallel, wins when both sides arrive sorted). The algorithm choice is driven by row estimates at each node — a 1000× underestimate on the outer side of a Nested Loop turns the inner lookup into 1000× more work than planned. The wrong algorithm is always a symptom; the wrong row estimate is always the root cause. Fix estimates (ANALYZE, extended statistics) and the algorithm choice corrects itself. Non-sargable predicates (functions on indexed columns, implicit casts) prevent index use and corrupt estimates — rewrite them as range predicates or add expression indexes.

Practice

Do these to turn recognition into skill.

Connected lessons
appears again in174
Continue the climb ↑pg_statistic, ANALYZE, and production observability
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.