
Senior GraphQL API: scheduling contract, tenant isolation, observability

DataLoader’s exact scheduling semantics, tenant-safe per-request scoping, APQ vs trusted documents, multiplicative complexity scoring, alias-bomb anatomy, and the minimum-viable observability dashboard.


Production incident: database call rate 6× normal, HTTP request rate unchanged. Persisted queries are enforced — no unusual hashes. A recent refactor moved one DataLoader out of the context factory into module scope. One line, zero tests catching it, and the cache is now shared across tenants.

One misplaced DataLoader, one missing SQL filter, one naive field-count formula — this lesson covers the three senior-tier failure modes that look correct until they surface in a production incident.

DataLoader’s exact scheduling contract

The library schedules the batch dispatch through an internal helper, enqueuePostPromiseJob. In Node.js the helper calls process.nextTick from inside a resolved Promise’s .then callback; where process.nextTick is unavailable, it falls back to setImmediate or setTimeout. In Node.js the dispatch therefore runs:

After the current synchronous frame
After any microtasks queued during that frame

This means .load() calls inside resolver bodies, inside .then() chains following synchronous resolver work, and inside nextTick-deferred logic all land in the same batch. In Node.js the window is bounded by the event loop’s microtask and nextTick queues, not a timer: there is no setTimeout, no setImmediate, no fixed millisecond window.
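A minimal sketch of that batching window, assuming the Node.js scheduling is a resolved promise followed by process.nextTick. makeTinyLoader is a hypothetical hand-rolled stand-in written for illustration; it is not the dataloader package, and it omits caching and per-key error handling.

```javascript
// Illustration only: a tiny batching loader, NOT the dataloader library.
function makeTinyLoader(batchFn) {
  let queue = [];

  function dispatch() {
    const batch = queue;
    queue = [];
    batchFn(batch.map((item) => item.key)).then(
      (values) => batch.forEach((item, i) => item.resolve(values[i])),
      (err) => batch.forEach((item) => item.reject(err))
    );
  }

  return {
    load(key) {
      return new Promise((resolve, reject) => {
        queue.push({ key, resolve, reject });
        if (queue.length === 1) {
          // First key of a new batch: let already-queued promise jobs run,
          // then dispatch on the next tick.
          Promise.resolve().then(() => process.nextTick(dispatch));
        }
      });
    },
  };
}

const batches = [];
const loader = makeTinyLoader(async (keys) => {
  batches.push([...keys]); // record what each batch call received
  return keys.map((k) => `user:${k}`);
});

async function demo() {
  const first = loader.load(1); // called in the synchronous frame
  const second = Promise.resolve().then(() => loader.load(2)); // called from a .then() chain
  const results = await Promise.all([first, second]);
  return { results, batches };
}
```

Calling demo() should show both keys arriving in a single batch call, even though the second .load() happens one microtask later than the first.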

The batchScheduleFn option overrides this if you need different semantics (e.g. debouncing at the cluster level via a cross-process broker). Most production servers leave the default in place.

Tenant isolation in the cache key

The context factory pattern from lesson 02 is necessary but not sufficient for multi-tenant systems. A new DataLoader instance per request prevents cross-request cache hits, but if the batch function queries data without filtering by tenant, two requests from different tenants within the same Node process can receive each other’s data via the database query (not the cache).

The safe pattern:

context: async ({ req }) => ({
  loaders: makeLoaders(req.auth.tenantId),
})

function makeLoaders(tenantId) {
  return {
    user: new DataLoader(async (ids) => {
      const rows = await db.query(
        'SELECT * FROM users WHERE id = ANY($1) AND tenant_id = $2',
        [ids, tenantId]   // tenant scoped at the SQL level
      );
      const map = new Map(rows.map(r => [r.id, r]));
      return ids.map(id => map.get(id) ?? null);
    }),
  };
}
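The two failure modes can be reproduced without a database. The sketch below is a simplified, synchronous stand-in: USERS, loadUserUnsafe, and makeRequestScope are hypothetical names, and plain Map caches take the place of DataLoader instances.

```javascript
// Hypothetical in-memory rows standing in for the users table.
const USERS = [
  { id: 1, tenantId: 'acme', email: 'ann@acme.test' },
  { id: 2, tenantId: 'globex', email: 'bob@globex.test' },
];

// Unsafe: the cache lives at module scope and the lookup ignores the tenant.
const sharedCache = new Map();
function loadUserUnsafe(id) {
  if (!sharedCache.has(id)) {
    sharedCache.set(id, USERS.find((u) => u.id === id) ?? null);
  }
  return sharedCache.get(id);
}

// Safer: one cache per request, and the lookup filters by tenant.
function makeRequestScope(tenantId) {
  const cache = new Map();
  return {
    loadUser(id) {
      if (!cache.has(id)) {
        const row = USERS.find((u) => u.id === id && u.tenantId === tenantId);
        cache.set(id, row ?? null);
      }
      return cache.get(id);
    },
  };
}

const globexRequest = makeRequestScope('globex');
const leaked = loadUserUnsafe(1); // acme's row, handed to whoever asks
const blocked = globexRequest.loadUser(1); // null: id 1 does not belong to globex
```

The scoped version needs both halves: dropping the per-request cache reintroduces cross-request hits, and dropping the tenant filter reintroduces the leak through the query itself.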

SonarSource’s 2024 audit of OSS GraphQL servers found tenant-leak bugs in 6 of 12 audited codebases — all traceable to module-scope DataLoader or missing tenant filter in the batch query.

APQ vs trusted documents

Two different things called “persisted queries”:

Automatic Persisted Queries (APQ): The client computes the SHA-256 hash of the query string and sends only the hash. If the server has not seen the hash before, it returns PersistedQueryNotFound and the client resends with the full document. APQ saves bytes on a warm cache (a 5 KB query collapses to a 64-character hex hash). It does not constrain query shape: any client can still send any document.
Trusted documents: Only pre-registered hashes execute. Unknown hashes are rejected. Registration happens at client build time. This is a security boundary, not a performance optimisation.

Production teams that need both deploy trusted documents and use APQ-style hashing within the registered set.
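The difference shows up in what happens to an unknown hash. The following is a rough sketch, not any server's actual implementation: handleApq, handleTrusted, and the HashMismatch and UnknownDocument error names are hypothetical, while PersistedQueryNotFound is the error named above.

```javascript
const { createHash } = require('node:crypto');

const sha256 = (text) => createHash('sha256').update(text).digest('hex');

// APQ: an unknown hash is learned as soon as a client sends the full document.
function handleApq(store, { query, sha256Hash }) {
  if (query !== undefined) {
    if (sha256(query) !== sha256Hash) return { error: 'HashMismatch' }; // hypothetical name
    store.set(sha256Hash, query);
    return { execute: query };
  }
  const known = store.get(sha256Hash);
  return known ? { execute: known } : { error: 'PersistedQueryNotFound' };
}

// Trusted documents: the allowlist is fixed at build time and never grows at runtime.
function handleTrusted(allowlist, { sha256Hash }) {
  const known = allowlist.get(sha256Hash);
  return known ? { execute: known } : { error: 'UnknownDocument' }; // hypothetical name
}

const registered = '{ viewer { id } }';
const arbitrary = '{ users(first: 100) { email } }';
const hash = sha256(arbitrary);

const apqStore = new Map();
const apqFirst = handleApq(apqStore, { sha256Hash: hash }); // miss on a cold cache
const apqRetry = handleApq(apqStore, { query: arbitrary, sha256Hash: hash }); // accepted

const allowlist = new Map([[sha256(registered), registered]]);
const trusted = handleTrusted(allowlist, { sha256Hash: hash }); // rejected
```

Under APQ the arbitrary document executes on the retry; under trusted documents the same hash has no path to execution.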

Why field-count complexity scoring is an under-counting bug

The naive rule “cost = 1 per field, summed” reports cost 50 for a 10-deep query with 5 fields per level. The real cost lies in the row expansion at each list level. A query that requests first: 100 at each of 5 nested list levels can read up to 100^5 = 10^10 rows, yet the field-count rule scores it exactly like the same query with first: 1.

The correct formula multiplies by list-argument size:

cost(field) = field_weight + sum(child.first_arg × cost(child))

With first: 100 at every level, the calculator returns a cost on the order of 100^5: astronomically over budget, so the query is rejected during validation, after parsing and before any resolver fires. GitHub’s public complexity formula is a documented version of this multiplicative rule.
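Both rules fit in a few lines when applied to a simplified tree. In this sketch the node shape ({ name, first, children }) and the nest helper are hypothetical; a real server would walk the GraphQL AST and read the first/last arguments from it.

```javascript
// Naive rule: one point per field, summed.
function fieldCountCost(node) {
  const children = node.children ?? [];
  return 1 + children.reduce((sum, child) => sum + fieldCountCost(child), 0);
}

// Multiplicative rule: cost(field) = field_weight + sum(child.first_arg × cost(child)).
// A child without a `first` argument counts as a single row.
function multiplicativeCost(node, fieldWeight = 1) {
  const children = node.children ?? [];
  return (
    fieldWeight +
    children.reduce(
      (sum, child) => sum + (child.first ?? 1) * multiplicativeCost(child, fieldWeight),
      0
    )
  );
}

// Builds `levels` nested list fields, each with first: 100, ending in a scalar leaf.
function nest(levels) {
  if (levels === 0) return { name: 'id' };
  return { name: 'items', first: 100, children: [nest(levels - 1)] };
}

const query = { name: 'query', children: [nest(5)] };
const BUDGET = 1000;

const naive = fieldCountCost(query);
const multiplied = multiplicativeCost(query);
const naiveAccepts = naive <= BUDGET; // true: the naive rule lets the query through
const multipliedAccepts = multiplied <= BUDGET; // false: far over budget
```

The same tree scores in the single digits under field counting and above 10^10 under the multiplicative rule, which is the gap the budget check depends on.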

Field-count scoring sums one point per field and stays far under the 1000-cost budget while the query reads up to 10^10 rows; the multiplicative rule multiplies by each list argument, scores on the order of 100^5, and rejects the query before a single resolver fires.

Alias bomb anatomy

A single document with 1000 root aliases:

q1: user(id: 1) { email }
q2: user(id: 2) { email }
...
q1000: user(id: 1000) { email }

This is one valid document, parsed once. Execution runs 1000 resolver calls. DataLoader collapses the database trips to one batch query — but resolver-execution count is still the attacker’s leverage: 1000 resolvers call your permission-checking logic, your logging, your context lookups. A 5 MB document can produce six-figure resolver counts per HTTP request, under any naive per-request rate limit.
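A cap on root aliases can be checked before execution. The sketch below assumes hand-built objects that mirror the graphql-js operation shape (selectionSet.selections, each Field carrying an optional alias); buildAliasBomb, countRootAliases, and checkAliasCap are hypothetical helpers, and the check ignores nested selections and fragments.

```javascript
// Builds an operation with `count` aliased root fields: q1: user, q2: user, ...
function buildAliasBomb(count) {
  const selections = [];
  for (let i = 1; i <= count; i++) {
    selections.push({
      kind: 'Field',
      alias: { kind: 'Name', value: `q${i}` },
      name: { kind: 'Name', value: 'user' },
    });
  }
  return {
    kind: 'OperationDefinition',
    operation: 'query',
    selectionSet: { kind: 'SelectionSet', selections },
  };
}

const MAX_ROOT_ALIASES = 20; // assumed cap; tune per schema

// Counts only root-level fields that carry an alias.
function countRootAliases(operation) {
  return operation.selectionSet.selections.filter(
    (sel) => sel.kind === 'Field' && sel.alias != null
  ).length;
}

function checkAliasCap(operation, cap = MAX_ROOT_ALIASES) {
  const aliases = countRootAliases(operation);
  return aliases > cap
    ? { ok: false, reason: `${aliases} root aliases exceeds cap of ${cap}` }
    : { ok: true };
}

const bomb = buildAliasBomb(1000);
const verdict = checkAliasCap(bomb);
```

Because the check reads only the parsed document, it costs one pass over the root selections and runs before any resolver or permission check is invoked.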

Escape.tech’s 2024 audit: 64% of production GraphQL endpoints had no alias caps. Imperva’s 2023 report attributed 18% of GraphQL production incidents to alias-batch DoS.

Minimum-viable observability dashboard

Metric	Alert condition
`graphql_request_total{operation,outcome}`	Error rate above SLO
`graphql_request_duration_p99`	Above latency budget
`graphql_resolver_call_count` per request	Above N (N+1 regression)
`graphql_query_cost` histogram	Long-tail above budget
`graphql_persisted_query_hit_ratio`	Below 90%
`graphql_introspection_request_total`	Non-zero in prod (if introspection off)

Per-resolver tracing via OpenTelemetry GraphQL instrumentation emits a span per type.field call. Aggregating spans by operation gives resolver call counts. Without this instrumentation, an N+1 regression is invisible until the database CPU pages someone.
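The per-request resolver count can be approximated without any tracing library. This sketch is a plain-counter stand-in, not the OpenTelemetry instrumentation: instrumentResolvers, totalCalls, and the resolverCalls context field are hypothetical names, and the request is simulated by calling the resolvers directly.

```javascript
// Wraps every resolver in a { Type: { field: fn } } map so that each call
// increments a per-request counter keyed by "Type.field".
function instrumentResolvers(resolvers) {
  const wrapped = {};
  for (const [typeName, fields] of Object.entries(resolvers)) {
    wrapped[typeName] = {};
    for (const [fieldName, resolve] of Object.entries(fields)) {
      const key = `${typeName}.${fieldName}`;
      wrapped[typeName][fieldName] = (parent, args, context, info) => {
        context.resolverCalls.set(key, (context.resolverCalls.get(key) ?? 0) + 1);
        return resolve(parent, args, context, info);
      };
    }
  }
  return wrapped;
}

// Sum of all resolver calls recorded on one request context.
const totalCalls = (context) =>
  [...context.resolverCalls.values()].reduce((sum, n) => sum + n, 0);

// Simulated request: one list resolver, then one child resolver per row.
const resolvers = instrumentResolvers({
  Query: { users: () => [{ id: 1 }, { id: 2 }, { id: 3 }] },
  User: { email: (user) => `user${user.id}@example.test` },
});
const context = { resolverCalls: new Map() }; // created fresh per request
const users = resolvers.Query.users(null, {}, context, null);
const emails = users.map((u) => resolvers.User.email(u, {}, context, null));
```

Exporting totalCalls(context) at the end of each request would feed the graphql_resolver_call_count metric from the table, and a count that grows with list size is the N+1 signature the alert looks for.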

Senior-tier GraphQL safety numbers

GitHub GraphQL points/hour cap: 5000
GitHub per-query cost cap: 1000
Shopify Storefront per-query cap: 1000 cost units
Shopify Storefront throttle: 1000 cost/sec/IP
Default depth limit (Apollo Router): 10
List-depth recommendation: 3–4
Alias-bomb cap (typical): ≤20 root aliases
Operation-batch cap (typical): ≤5 operations

Quiz

A federated supergraph applies depth limit 7 at the router. Why must each subgraph also apply its own complexity and depth limits?

A complexity rule assigns 'cost = 1 per field, summed across the AST'. A 10-deep recursive query with 5 fields per level reports cost 50 and passes the 1000-cost budget. What is wrong?

DataLoader is instantiated per-request but the batch function queries the database without a tenant filter. What is the failure mode?

Shape defences (trusted docs → depth → complexity → alias) run before execution; per-request tenant-scoped DataLoader batches the DB; OTel spans catch N+1 regressions.

Recall before you leave

01 Why does Apollo Federation's _entities batching not eliminate the need for DataLoader inside subgraphs?
02 What is the difference between APQ and trusted documents?

Recap

DataLoader’s batch window is bounded by the JavaScript event loop, not a timer. All .load() calls from resolver bodies and their Promise chains land in the same batch. Per-request instantiation prevents cross-request cache pollution; a tenant ID in the SQL filter prevents cross-tenant data leaks. APQ saves wire bytes but does not constrain query shape; trusted documents do. Field-count complexity scoring misses list-expansion cost: multiply by first/last arguments to catch 100^5-row queries. Alias bombs bypass rate limits via resolver-count amplification; cap at ≤20 aliases. Wire resolver call counts to OpenTelemetry spans: without per-resolver tracing, N+1 regressions are invisible until the database pages someone.

When you audit a production GraphQL server, check three things first: is the DataLoader in the context factory, does the batch function include a tenant filter in its SQL, and does the complexity formula multiply by list-argument size? If any answer is no, you’ve found the next incident before it finds you.
