GraphQL N+1: batch and harden an API
Reading about N+1 is not the same as watching a database log record 51 queries for one HTTP request and then driving that count back to 2. Build a small GraphQL API on deliberately nested data, measure the query storm, batch it with DataLoader, then harden it against the query shapes DataLoader cannot touch, with evidence at every step.
Turn the unit’s mental model into a reproducible loop: reproduce N+1 and prove it with SQL/resolver counts, fix it with a correct per-request DataLoader, defend the API against depth, complexity, and alias attacks, and verify the whole thing with before/after numbers.
Build (or take) a GraphQL API over a relational schema with nested lists, reproduce its N+1 storm under load, collapse it with DataLoader, and add query-shape defences — proving each step with measured SQL counts, resolver counts, and latency.
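The arithmetic behind "51 down to 2" can be sketched with a counting stand-in for the database. Everything here is hypothetical (the `CountingDb` class, its method names, the 50-post and 10-author sizes), and it is synchronous and in-memory for brevity. A real harness would count statements from the SQL driver or the database log instead.

```typescript
// Hypothetical in-memory stand-in for a SQL database; every method call counts as one query.
type Post = { id: number; authorId: number };
type Author = { id: number; name: string };

class CountingDb {
  queries = 0;
  private posts: Post[];
  private authors: Author[];

  constructor(postCount: number, authorCount: number) {
    this.authors = Array.from({ length: authorCount }, (_, i) => ({ id: i + 1, name: `author-${i + 1}` }));
    this.posts = Array.from({ length: postCount }, (_, i) => ({ id: i + 1, authorId: (i % authorCount) + 1 }));
  }

  // Stands in for: SELECT * FROM posts LIMIT $1
  selectPosts(limit: number): Post[] {
    this.queries++;
    return this.posts.slice(0, limit);
  }

  // Stands in for: SELECT * FROM authors WHERE id = $1
  selectAuthorById(id: number): Author | undefined {
    this.queries++;
    return this.authors.find((a) => a.id === id);
  }

  // Stands in for: SELECT * FROM authors WHERE id = ANY($1)
  selectAuthorsByIds(ids: number[]): Author[] {
    this.queries++;
    const wanted = new Set(ids);
    return this.authors.filter((a) => wanted.has(a.id));
  }
}

// Naive resolution: one query for the list, then one more per post for its author.
function resolveNaive(db: CountingDb, limit: number): number {
  const posts = db.selectPosts(limit);
  for (const post of posts) db.selectAuthorById(post.authorId);
  return db.queries;
}

// Batched resolution: one query for the list, one for all distinct authors.
function resolveBatched(db: CountingDb, limit: number): number {
  const posts = db.selectPosts(limit);
  db.selectAuthorsByIds(Array.from(new Set(posts.map((p) => p.authorId))));
  return db.queries;
}

const naiveQueries = resolveNaive(new CountingDb(50, 10), 50);
const batchedQueries = resolveBatched(new CountingDb(50, 10), 50);
console.log({ naiveQueries, batchedQueries }); // { naiveQueries: 51, batchedQueries: 2 }
```

The same counter, reset per request, is what feeds the "total SQL queries" column of a before/after table.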
- A before/after table: total SQL queries, resolver call counts, and p99 latency for the same 50-post query under the same load, measured from your harness rather than estimated.
- DataLoaders are instantiated per request in the context factory (demonstrated), and a test proves a module-scope instance would leak across two simulated tenants or serve a stale row.
- The batch-contract tests pass: dedup returns one SQL trip for a doubly-referenced author, and the out-of-order-rows test still maps every author correctly.
- Three crafted attack queries (a deep recursive query, a 5-level first:100 query, and a 1000-alias document) are each rejected at validation, with the defence that caught each one named.
- Split the schema into two Apollo Federation subgraphs (posts and users), confirm the router batches the cross-subgraph refs into one _entities call, and show __resolveReference still needs its own DataLoader to avoid intra-subgraph N+1.
- Implement resolver lookahead for one deep path (posts { author { profile } }) by reading the info AST and issuing a single JOIN; measure it against three stacked DataLoader trips and note the isolation tradeoff.
- Add a multi-tenant column and prove tenant isolation belongs in the SQL filter inside the batch function, not just in per-request scope — write a test that fails with scope-only isolation and passes with the tenant_id filter.
- Wire a CI gate that runs the 50-post query against a canary and fails the build if any type.field resolver call count grows beyond a baseline (catching an N+1 regression before it ships).
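The batch-contract and tenant-filter items above can be sketched as a single batch function. This is a minimal synchronous illustration, not the `dataloader` package itself: the real library expects the batch function to return a Promise of an array with the same length and order as the keys. The names `AUTHOR_ROWS`, `selectAuthors`, and `batchAuthors`, and the sample rows, are assumptions made for the example.

```typescript
type Author = { id: number; tenantId: string; name: string };

// Hypothetical table contents; rows are deliberately not in id order.
const AUTHOR_ROWS: Author[] = [
  { id: 3, tenantId: "t1", name: "Cy" },
  { id: 1, tenantId: "t1", name: "Ada" },
  { id: 2, tenantId: "t2", name: "Bo" },
];

let sqlTrips = 0;

// Stands in for: SELECT * FROM authors WHERE tenant_id = $1 AND id = ANY($2)
// The tenant filter lives in the query itself, not only in the loader's scope.
function selectAuthors(tenantId: string, ids: number[]): Author[] {
  sqlTrips++;
  return AUTHOR_ROWS.filter((r) => r.tenantId === tenantId && ids.indexOf(r.id) !== -1);
}

// Batch function: the result has the same length and order as `keys`,
// with an Error in the slot of any key that has no visible row.
function batchAuthors(tenantId: string, keys: readonly number[]): (Author | Error)[] {
  const uniqueIds = Array.from(new Set(keys)); // dedup before hitting the database
  const byId = new Map<number, Author>();
  for (const row of selectAuthors(tenantId, uniqueIds)) byId.set(row.id, row);
  // Map by key, never by row position: the database may return rows in any order.
  return keys.map((id) => byId.get(id) ?? new Error(`author ${id} not found`));
}

// Author 1 is referenced twice; author 2 belongs to another tenant.
const results = batchAuthors("t1", [1, 3, 1, 2]);
console.log(sqlTrips, results.map((r) => (r instanceof Error ? r.message : r.name)));
```

Under these assumptions the call makes one SQL trip, returns Ada in slots 0 and 2, Cy in slot 1, and an Error for author 2 because the `tenant_id` filter hides that row from tenant `t1`.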
This is the loop you run on every real GraphQL performance incident: build the evidence harness first (SQL and resolver counts), reproduce and measure the N+1, fix it with a per-request DataLoader written to the order-and-shape contract, prove correctness with dedup and out-of-order tests, then layer on the query-shape defences DataLoader cannot replace: depth limits, multiplicative complexity limits, and alias caps. Verify with before/after numbers under identical load. Doing it once on a toy API turns production diagnosis into muscle memory.
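The three query-shape defences can be illustrated on a simplified selection tree. The sketch assumes a hypothetical `Sel` type, a particular cost formula (each list multiplies the cost of everything beneath it), and arbitrary limits. A real server would walk the parsed GraphQL AST during validation, and complexity estimators differ in how they weight fields.

```typescript
// Hypothetical simplified selection tree; `first` is the requested page size (1 for non-list fields).
type Sel = { field: string; alias: string | null; first: number; children: Sel[] };

// Nesting depth: a leaf field counts as 1.
function depth(sel: Sel): number {
  return 1 + sel.children.reduce((max, c) => Math.max(max, depth(c)), 0);
}

// Multiplicative complexity: a list of `first` items pays for itself and its subtree `first` times.
function cost(sel: Sel): number {
  const perItem = 1 + sel.children.reduce((sum, c) => sum + cost(c), 0);
  return sel.first * perItem;
}

// Number of aliased fields anywhere in the tree.
function aliasCount(sel: Sel): number {
  return (sel.alias ? 1 : 0) + sel.children.reduce((sum, c) => sum + aliasCount(c), 0);
}

type Limits = { maxDepth: number; maxCost: number; maxAliases: number };

// Returns the name of the first defence the query trips, or null if it passes all three.
function check(root: Sel, limits: Limits): "depth" | "complexity" | "aliases" | null {
  if (depth(root) > limits.maxDepth) return "depth";
  if (cost(root) > limits.maxCost) return "complexity";
  if (aliasCount(root) > limits.maxAliases) return "aliases";
  return null;
}

// Builds a chain of `levels` nested list fields, each requesting `first` items.
function nest(levels: number, first: number): Sel {
  return { field: "items", alias: null, first, children: levels > 1 ? [nest(levels - 1, first)] : [] };
}

const limits: Limits = { maxDepth: 10, maxCost: 10000, maxAliases: 50 }; // illustrative values only

const deepQuery = nest(20, 1); // deep recursive query
const wideQuery = nest(5, 100); // 5 levels of first:100
const aliasQuery: Sel = {
  field: "query",
  alias: null,
  first: 1,
  children: Array.from({ length: 1000 }, (_, i) => ({ field: "me", alias: `a${i}`, first: 1, children: [] })),
};

console.log(check(deepQuery, limits), check(wideQuery, limits), check(aliasQuery, limits));
// depth complexity aliases
```

With this formula the 5-level first:100 query costs 10,101,010,100 units at a depth of only 5, which is why a depth limit alone does not stop it. The 1000-alias document is shallow and cheap by the same measure, so only the alias cap catches it.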