N+1: diagnose, batch, and gate

Hands-on project — diagnose and eliminate N+1 across an ORM list page, a GraphQL resolver tree, and a service fan-out, then lock it in with a CI query-count gate.

PERF Senior ◷ 240 min


Reading about N+1 is not the same as pulling 300 queries out of one page load. Build a small service that exhibits N+1 in three forms — an ORM list page, a GraphQL resolver tree, and a service fan-out — then drive the query count down and gate it so it can never come back.

Goal

Turn the unit’s model into a reproducible loop: count round-trips per request, pick the fix family by cardinality and lookup origin, prove from the query log that the count dropped, then add a CI gate and a DoS guard so the regression cannot ship.
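To make "count round-trips per request" concrete, here is a minimal sketch using only Python's standard library. It assumes an in-memory SQLite database as a stand-in for the real datastore. The table names, the `count_queries` helper, and both handlers are hypothetical and not part of any starter code. It counts statements through `sqlite3`'s trace callback and compares a naive list handler with a batched one.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def count_queries(conn):
    """Record every SELECT that sqlite3 executes inside the block."""
    seen = []

    def record(statement):
        if statement.lstrip().upper().startswith("SELECT"):
            seen.append(statement)

    conn.set_trace_callback(record)
    try:
        yield seen
    finally:
        conn.set_trace_callback(None)


def make_db(rows=50):
    """Hypothetical fixture: `rows` authors with one post each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts(id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        """
    )
    ids = range(1, rows + 1)
    conn.executemany("INSERT INTO authors VALUES (?, ?)", [(i, f"author-{i}") for i in ids])
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [(i, i, f"post-{i}") for i in ids])
    conn.commit()
    return conn


def list_posts_n_plus_one(conn):
    """One query for the list, then one more per row: the N+1 shape."""
    posts = conn.execute("SELECT id, author_id, title FROM posts ORDER BY id").fetchall()
    result = []
    for _post_id, author_id, title in posts:
        row = conn.execute("SELECT name FROM authors WHERE id = ?", (author_id,)).fetchone()
        result.append((title, row[0]))
    return result


def list_posts_batched(conn):
    """One query for the list, one IN-list query for all parents."""
    posts = conn.execute("SELECT id, author_id, title FROM posts ORDER BY id").fetchall()
    if not posts:
        return []
    author_ids = sorted({author_id for _, author_id, _ in posts})
    placeholders = ",".join("?" * len(author_ids))
    rows = conn.execute(
        f"SELECT id, name FROM authors WHERE id IN ({placeholders})", author_ids
    ).fetchall()
    names = dict(rows)
    return [(title, names[author_id]) for _, author_id, title in posts]


if __name__ == "__main__":
    conn = make_db()
    with count_queries(conn) as before:
        list_posts_n_plus_one(conn)
    with count_queries(conn) as after:
        list_posts_batched(conn)
    print(f"queries per request: before={len(before)} after={len(after)}")
```

On the 50-row fixture the naive handler issues 1 + N = 51 SELECTs and the batched handler issues 2. In a real service the same counter would hook into the ORM's or driver's query log rather than SQLite's trace callback.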

Project


Objective

Take a small service (your own or the starter shape below) with three deliberate N+1 sites — an ORM list endpoint, a nested GraphQL query, and a serial service/cache fan-out — and bring each one's round-trip count to its structural minimum, proving every step with the query log and before/after numbers.
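For the third site, the serial fan-out, the following sketch shows the shape of the problem and of one possible fix. It assumes an async service in which `fetch_profile` is a hypothetical downstream call, simulated here with a short sleep. The 20 ms delay and the concurrency limit of 10 are illustrative values, not recommendations.

```python
import asyncio
import time


async def fetch_profile(user_id, delay=0.02):
    """Hypothetical downstream call; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return {"id": user_id, "name": f"user-{user_id}"}


async def fan_out_serial(ids):
    """N sequential round-trips: wall-clock grows linearly with len(ids)."""
    return [await fetch_profile(i) for i in ids]


async def fan_out_parallel(ids, limit=10):
    """Dispatch all calls concurrently, bounded so the downstream is not flooded."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(user_id):
        async with semaphore:
            return await fetch_profile(user_id)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(i) for i in ids))


async def timed(fan_out, ids):
    """Return (result, elapsed_seconds) for one fan-out strategy."""
    start = time.perf_counter()
    result = await fan_out(ids)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = list(range(20))
    _, serial_s = asyncio.run(timed(fan_out_serial, ids))
    _, parallel_s = asyncio.run(timed(fan_out_parallel, ids))
    print(f"serial={serial_s:.3f}s parallel={parallel_s:.3f}s")
```

Parallel dispatch reduces wall-clock time but not the number of downstream calls. If the downstream service offers a batch endpoint, a single batched request is usually the stronger fix, and the semaphore matters most when it does not.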

Requirements

Acceptance criteria

A before/after table per site: queries per request, p99 latency, and (for the fan-out) wall-clock — measured from the query log and a load test, not estimated.
The query log after each fix shows the round-trip count at its structural minimum: 2–4 queries for the ORM page, one batched query per entity type for the DataLoader tree, and a single batched or parallel dispatch for the fan-out.
The DataLoader batch function is verified to return results in input-id order with nulls for missing ids, and its cache is request-scoped (no cross-request leakage).
The CI gate is demonstrated failing a PR that reintroduces an N+1, then passing once the fix is restored.
A one-paragraph write-up naming which fix family was used at each site and why it beat the alternatives for that cardinality and lookup origin.
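The DataLoader criterion above (input-id order, nulls for missing ids, request-scoped cache) can be checked against a small model before wiring in a real library. The sketch below is a simplified asyncio loader written for illustration. It is not the API of any particular DataLoader package, and `USERS`, `BATCH_CALLS`, `batch_load_users`, and `handle_request` are hypothetical names.

```python
import asyncio


class DataLoader:
    """Minimal loader; create one instance per request so the cache cannot leak."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn
        self._cache = {}    # key -> Future, lives only as long as this instance
        self._queue = []    # keys collected during the current event-loop tick
        self._tasks = set() # keep references so running batches are not collected

    def load(self, key):
        if key in self._cache:
            return self._cache[key]
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._cache[key] = future
        if not self._queue:
            # First key of this tick: schedule one dispatch for the whole batch.
            loop.call_soon(self._dispatch)
        self._queue.append(key)
        return future

    def _dispatch(self):
        keys, self._queue = self._queue, []
        task = asyncio.get_running_loop().create_task(self._run_batch(keys))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _run_batch(self, keys):
        try:
            values = await self._batch_fn(keys)
            if len(values) != len(keys):
                raise ValueError("batch function must return one value per key")
            for key, value in zip(keys, values):
                self._cache[key].set_result(value)
        except Exception as exc:
            for key in keys:
                if not self._cache[key].done():
                    self._cache[key].set_exception(exc)


USERS = {1: "ada", 2: "grace", 4: "edsger"}  # hypothetical table; id 3 is missing
BATCH_CALLS = []                             # records each simulated round-trip


async def batch_load_users(ids):
    """Stand-in for SELECT ... WHERE id IN (...): one round-trip per batch."""
    BATCH_CALLS.append(list(ids))
    found = {i: USERS[i] for i in ids if i in USERS}
    # Re-align with the input: same order, None where the id does not exist.
    return [found.get(i) for i in ids]


async def handle_request(ids):
    loader = DataLoader(batch_load_users)  # fresh loader, fresh cache
    return await asyncio.gather(*(loader.load(i) for i in ids))


if __name__ == "__main__":
    print(asyncio.run(handle_request([2, 3, 1, 2])))
    print(BATCH_CALLS)
```

The re-alignment step in `batch_load_users` is the part worth verifying in the project. An `IN (...)` query returns rows in whatever order the database chooses and silently omits missing ids, so returning its rows directly would hand values to the wrong callers.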

Senior stretch

Add a DoS-amplification guard to the GraphQL endpoint — depth limit, query-complexity ceiling, and a per-request query budget — and show a crafted deeply-nested query is rejected before it hits the database.
Reproduce a connection-pool exhaustion incident: set a small pool, load-test the N+1 version until it starts returning 503s, then show the fixed version sustaining ~10× the throughput on the same pool, with pool-saturation metrics before and after.
Run EXPLAIN ANALYZE on the new IN-list query at 10, 500, and 5000 parent ids and report whether the plan flips (index → bitmap → seq scan); document any size where the fix degrades.
Add an APM trace (Tempo/Datadog/Honeycomb or OpenTelemetry) and capture the waterfall before and after — the tall column of short DB spans collapsing into a few — as visual evidence alongside the numbers.
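As a starting point for the depth-limit part of the DoS guard, here is a deliberately naive sketch that measures selection-set nesting by scanning braces and rejects the query before the executor is called. It is an assumption-laden illustration. It does not parse GraphQL, it miscounts input-object literals in arguments, and it does not expand fragments, so a real endpoint should use the validation rules of its GraphQL library instead. `QueryTooDeep`, `selection_depth`, and `execute_guarded` are hypothetical names.

```python
class QueryTooDeep(ValueError):
    """Raised when a query nests deeper than the configured limit."""


def selection_depth(query):
    """Return the maximum brace nesting, ignoring braces inside string literals."""
    depth = 0
    max_depth = 0
    in_string = False
    escaped = False
    for ch in query:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth


def execute_guarded(query, execute, max_depth=5):
    """Check depth first; `execute` (the real resolver entry point) runs only if it passes."""
    depth = selection_depth(query)
    if depth > max_depth:
        raise QueryTooDeep(f"depth {depth} exceeds limit {max_depth}")
    return execute(query)


if __name__ == "__main__":
    deep = "{ user { posts { author { posts { author { name } } } } } }"
    try:
        execute_guarded(deep, execute=lambda q: "executed")
    except QueryTooDeep as exc:
        print("rejected:", exc)
```

Because the check runs before `execute`, a rejected query costs no database round-trips, which is the property the stretch goal asks you to demonstrate. The complexity ceiling and the per-request query budget need separate mechanisms.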

Recap

This is the loop you will run in every real N+1 incident: instrument the query count first, match the fix family to the cardinality and lookup origin (batch the ORM page, DataLoader the resolver tree, parallelise the fan-out), prove from the log that the round-trip count dropped, then gate it in CI so the regression cannot ship again. Doing it once across all three protocol shapes makes the production version muscle memory, and the CI gate is what keeps the win from eroding over the following quarters.
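The gating step could look like the sketch below: a test helper that fails when a handler exceeds a fixed query budget. It assumes an in-memory SQLite fixture and a pytest-style test function. `max_queries`, `QueryBudgetExceeded`, the `orders`/`customers` schema, and the budget of 3 are all hypothetical choices for illustration.

```python
import sqlite3
from contextlib import contextmanager


class QueryBudgetExceeded(AssertionError):
    """Raised when a block issues more SELECTs than its budget allows."""


@contextmanager
def max_queries(conn, budget):
    """Fail the enclosing test if the block runs more than `budget` SELECTs."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        yield
    finally:
        conn.set_trace_callback(None)
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    if len(selects) > budget:
        raise QueryBudgetExceeded(f"{len(selects)} queries issued, budget is {budget}")


def make_fixture(rows=10):
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INTEGER);
        """
    )
    ids = range(1, rows + 1)
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(i, f"c{i}") for i in ids])
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i) for i in ids])
    conn.commit()
    return conn


def order_page_regressed(conn):
    """The N+1 version a careless PR might reintroduce."""
    orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()
    return [
        (oid, conn.execute("SELECT name FROM customers WHERE id = ?", (cid,)).fetchone()[0])
        for oid, cid in orders
    ]


def order_page_fixed(conn):
    """The joined version: one round-trip regardless of page size."""
    return conn.execute(
        "SELECT o.id, c.name FROM orders o JOIN customers c ON c.id = o.customer_id"
    ).fetchall()


def test_order_page_stays_within_budget():
    conn = make_fixture()
    with max_queries(conn, budget=3):
        order_page_fixed(conn)
```

Pointing the test at `order_page_regressed` instead makes it raise `QueryBudgetExceeded`, which is the failing-PR demonstration the acceptance criteria ask for. A budget slightly above the structural minimum leaves room for harmless changes while still catching anything that scales with the row count.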
