Performance PERF · 05 · 06

N+1 at scale: pool exhaustion, plan changes, and denormalisation

How N+1 cascades into connection pool exhaustion, why fix strategies change query plans in surprising ways, how N+1 paths become DoS vectors, and when denormalisation is the right answer.

PERF Senior ◷ 16 min

Level

FoundationsJuniorMiddleSenior

A service with 50 queries per request and a connection pool of 25 starts returning 503s under load. The individual queries are fast. The database is healthy. The pool is saturated — each request holds a connection across 50 serial queries, and the pool cannot rotate fast enough to serve the traffic.

Connection pool exhaustion

An N+1-bound endpoint holds a database connection through many serialised queries. With a connection pool size of 25 and 50 queries per request at ~2 ms each, each request holds a connection for ~100 ms.

Throughput per connection: ~10 requests/s. Total pool throughput: 250 requests/s. Once load exceeds 250 requests/s, new requests queue waiting for a connection. The queue adds latency; latency tails spiral; eventual 503s.

Diagnostic signature: connection pool saturation metric high, individual query latency stable, request queue growing. The root cause is not the database — it is the application holding connections too long.

# pgbouncer metrics — a leading indicator
cl_waiting > 0 for more than a few seconds → N+1 likely
server_active / max_server_connections > 80% sustained → pool pressure

The fix is the N+1 fix: reduce queries per request from 50 to 2–3. Each request now holds the connection for ~10 ms instead of 100 ms. Pool throughput goes from 250 to 2 500 requests/s — a 10× relief with the same pool size.

The same shape applies to HTTP client pools, gRPC connection pools, Redis pools: any pooled resource where hold-time drives exhaustion.

Query plan changes after N+1 fixes

N+1 fixes change the shape of queries, which changes query plans. A query that was WHERE id = 1 is now WHERE id IN (1, 2, ..., 1000). Postgres may switch from index scan to bitmap scan to sequential scan depending on selectivity estimates and list size.

-- Before: fast index scan
EXPLAIN SELECT * FROM tasks WHERE project_id = 1;
-- Index Scan using tasks_project_id_idx (cost=0.43..8.45 rows=10 width=80)

-- After fix: large IN list, plan may change
EXPLAIN SELECT * FROM tasks WHERE project_id IN (1, 2, ..., 500);
-- Bitmap Heap Scan on tasks (cost=12.55..350.00 rows=5000 width=80)

Senior production discipline: after fixing N+1, run EXPLAIN ANALYZE on the new query under representative production data sizes. Verify the plan does not degrade for edge cases (very small parent sets, very large parent sets, special parameter values). A fix that works at 50 rows may flip the plan at 5 000.

# Postgres tooling for plan monitoring
pg_stat_statements   — track query execution counts and durations
auto_explain         — log plans for queries exceeding a threshold

Hibernate subselect loading strategy generates SELECT * FROM child WHERE parent_id IN (SELECT id FROM parent WHERE ...) — a correlated subquery that can produce very different plans from the parent’s main query. After applying Hibernate batch-size or subselect loading, verify plan stability.

Security: query amplification as DoS

Before you ship a public endpoint backed by ORM queries, ask yourself: what happens if a user passes limit=1000? An N+1 code path can be an attack vector. If a public endpoint allows the user to request N items and the handler fires N+1 queries per item, the attacker can saturate the database with one large request.

Real incidents: Strapi 2021 (CVE-2021-32820) — a deeply nested filter could amplify queries 1 000×. Various GraphQL providers — cyclic queries cause exponential resolver expansion. Early Hasura versions — permission filters not bounded.

Mitigations:

// GraphQL — depth and complexity limits
import depthLimit from 'graphql-depth-limit';
import { createComplexityRule } from 'graphql-query-complexity';

const server = new ApolloServer({
  validationRules: [
    depthLimit(5),
    createComplexityRule({ maxComplexity: 200 }),
  ],
});

Per-request query budget enforced at framework level: Rails strict_loading + max-queries middleware. Hasura automatic complexity analysis. Any user-controlled parameter that multiplies query count is a DoS vector. Senior code review for ORM-using code includes: “could this allow query amplification under hostile input?”

Prepared statements reduce overhead even when N+1 persists

When N+1 is unavoidable (legacy code, exploratory queries), prepared statements reduce the per-query overhead. Each parameterised query WHERE id = ? is parsed and planned once; subsequent executions skip parse and plan.

-- Postgres explicit:
PREPARE get_task AS SELECT * FROM tasks WHERE id = $1;
EXECUTE get_task(42);

-- ORMs use prepared statements by default for parameterised queries.
-- The ORM's ? or $1 parameter syntax IS the prepared statement protocol.

Savings: ~30–50% per repeated query on Postgres. For a service running thousands of repeat queries per second, prepared statements alone can cut DB CPU 10–30%.

Tradeoff: connection-pool affinity. Prepared statements are connection-scoped in some setups. pgbouncer transaction mode disables server-side prepared statements (use PgBouncer statement-level pooling or the pgbouncer_params prepared statement passthrough if you need both).

Metric	N+1 baseline	After fix
Queries per request	51	2
Connection hold time	~100 ms	~4 ms
Pool throughput (pool=25)	~250 req/s	~6 000 req/s
p99 latency (intra-DC)	~800 ms	~40 ms

When denormalisation is the right answer

For paths with extreme latency SLOs — HFT, real-time bidding, sub-50 ms p99 on heavy joins — even 2–3 queries are too many.

Counter cache: stores derived data on the parent. Instead of count(*) per request, one SELECT of a field.

# Rails counter_cache
belongs_to :project, counter_cache: true
# project.tasks_count is maintained automatically on create/destroy

Materialised view (Postgres): precomputes a complex join + aggregation. The page fetches one query against a fully-ready view.

CREATE MATERIALIZED VIEW product_summary AS
SELECT p.id, p.name, c.name AS category,
       COUNT(r.id) AS review_count,
       AVG(r.score) AS avg_score
FROM products p
JOIN categories c ON p.category_id = c.id
LEFT JOIN reviews r ON r.product_id = p.id
GROUP BY p.id, p.name, c.name;

REFRESH MATERIALIZED VIEW CONCURRENTLY product_summary;

CQRS read model: the write side updates a normalised schema; an event handler writes to a read-optimised table with pre-joined data. The read path fetches one query.

Tradeoffs: staleness (read view lags writes by milliseconds to seconds), storage duplication, maintenance cost (derivation logic must stay correct as schema evolves).

When to apply: read p99 SLO < 50 ms on heavy joins AND read load is 1 000× or more the write load. For balanced workloads, ORM eager loading is simpler and sufficient. Denormalisation is the right answer only when measurement shows eager loading cannot meet the SLO.

Denormalisation is a narrow corner of the decision space: only a sub-50 ms SLO on heavy joins combined with 1000×-read-heavy load justifies its staleness and maintenance cost — the other three corners are eager loading.

▸Why this works

Production failures linked to N+1: Shopify 2016 — checkout page regressed to N+1 on shipping calculations; outage. Slack 2018 — message search hit a deeply nested join that exploded under search-result fan-out. Discord 2020 — GraphQL N+1 caused per-channel resolvers to fan out to the permissions service; DataLoader retrofit cut DB load 40%. GitHub 2021 — repo settings page loaded 200+ queries per render after a refactor introduced lazy loading. LinkedIn 2023 — feed aggregator ran 8-way serial RPC fan-out; parallelised, p99 dropped 6×. Every large engineering org has an N+1 retrospective. The runbook entry exists in every mature team: “page slow → check query count first.”

Quiz

A team applies .includes to fix N+1. Query count drops from 50 to 2, but p99 worsens by 30%. What is the most likely explanation?

Quiz

A service is seeing connection pool saturation: active connections near 100%, request queue growing, but individual query latency is stable. What is the most likely root cause and fix?

Detect at PR time, apply the right batch fix, verify the new query plan, guard public endpoints against amplification, and denormalise only when measurement proves eager loading misses the SLO.

Recall before you leave

01
Describe how an N+1 problem cascades into a connection pool exhaustion incident, and what observability signals to alert on.
02
Explain when denormalisation (counter cache, materialised view, CQRS read model) is the right fix for N+1, and when it is premature.

Recap

At production scale, N+1 has three second-order effects beyond slow page loads. First: connection pool exhaustion — each request holding a connection through 50 serial queries saturates the pool far faster than 2-query requests do. Fix the N+1, multiply pool throughput by 10×. Second: query plan instability — IN lists and batch queries can flip from index scan to bitmap scan to sequential scan; always EXPLAIN ANALYZE the new query after fixing. Third: DoS amplification — any user-controlled parameter that multiplies query count is an attack vector; add depth limits, complexity analysis, and per-request query budgets for public endpoints. Now when you ship an N+1 fix, you know the checklist: verify the query plan with EXPLAIN ANALYZE, guard public endpoints with complexity limits, and reach for denormalisation only when measurement proves eager loading cannot close the gap to your SLO.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

appears again in162

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.