Senior GraphQL API: scheduling contract, tenant isolation, observability

Crux: DataLoader’s exact scheduling semantics, tenant-safe per-request scoping, APQ vs trusted documents, multiplicative complexity scoring, alias-bomb anatomy, and the minimum-viable observability dashboard.

Production incident: database call rate 6× normal, HTTP request rate unchanged. Persisted queries are enforced — no unusual hashes. A recent refactor moved one DataLoader out of the context factory into module scope. One line, zero tests catching it, and the cache is now shared across tenants.

DataLoader’s exact scheduling contract

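The library schedules the batch dispatch with enqueuePostPromiseJob. In Node.js this calls process.nextTick from inside a resolved promise's .then() callback, so the dispatch runs only after the pending promise jobs have drained; in environments without process.nextTick it falls back to setImmediate or setTimeout. In Node.js the dispatch runs: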

  • After the current synchronous frame
  • After any microtasks queued during that frame

This means .load() calls inside resolver bodies, inside .then() chains following synchronous resolver work, and inside nextTick-deferred logic all land in the same batch. In Node.js the window is bounded by the event loop's microtask queue, not a timer: there is no setTimeout, no setImmediate, no fixed millisecond window.

The batchScheduleFn option overrides this if you need different semantics (e.g. debouncing at the cluster level via a cross-process broker). Most production servers leave the default in place.
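The scheduling contract can be observed without the library. The sketch below is a minimal stand-in, not DataLoader's source: TinyLoader and demo are hypothetical names, and the scheduling line reflects an assumption about how the default scheduler behaves on Node.js (process.nextTick called from a resolved promise's callback). It needs Node.js to run.

```javascript
// Minimal sketch of a batching loader. NOT the dataloader library itself.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn;
    this.queue = null; // pending { key, resolve, reject } entries of the open batch
  }

  load(key) {
    return new Promise((resolve, reject) => {
      if (this.queue === null) {
        // First load of a new batch: schedule exactly one dispatch.
        // Assumed Node.js pattern: nextTick from inside a promise job.
        this.queue = [];
        Promise.resolve().then(() => process.nextTick(() => this.dispatch()));
      }
      this.queue.push({ key, resolve, reject });
    });
  }

  dispatch() {
    const batch = this.queue;
    this.queue = null; // any later load() opens a new batch
    this.batchFn(batch.map((entry) => entry.key)).then(
      (values) => batch.forEach((entry, i) => entry.resolve(values[i])),
      (err) => batch.forEach((entry) => entry.reject(err)),
    );
  }
}

async function demo() {
  const batches = [];
  const loader = new TinyLoader(async (keys) => {
    batches.push([...keys]); // record what each dispatch received
    return keys.map((k) => k * 10);
  });

  const a = loader.load(1); // synchronous frame
  const b = Promise.resolve().then(() => loader.load(2)); // microtask from that frame
  const c = new Promise((resolve) =>
    setTimeout(() => resolve(loader.load(3)), 0)); // timer: a later macrotask

  const results = await Promise.all([a, b, c]);
  return { results, batches };
}

demo().then(({ results, batches }) => console.log(results, batches));
```

The first two loads share one batch because the microtask runs before the deferred dispatch; the load issued from the timer callback arrives after that dispatch and opens a second batch.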

Tenant isolation in the cache key

The context factory pattern from lesson 02 is necessary but not sufficient for multi-tenant systems. A new DataLoader instance per request prevents cross-request cache hits, but if the batch function queries data without filtering by tenant, two requests from different tenants within the same Node process can receive each other’s data via the database query (not the cache).

The safe pattern:

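import DataLoader from 'dataloader';

// Server config fragment: build a fresh set of loaders for every request.
context: async ({ req }) => ({
  loaders: makeLoaders(req.auth.tenantId),
})

function makeLoaders(tenantId) {
  return {
    user: new DataLoader(async (ids) => {
      // Assumes db.query resolves to an array of rows; with node-postgres, read result.rows instead.
      const rows = await db.query(
        'SELECT * FROM users WHERE id = ANY($1) AND tenant_id = $2',
        [ids, tenantId]   // tenant scoped at the SQL level
      );
      // DataLoader expects one result per key, in key order; missing rows become null.
      const map = new Map(rows.map(r => [r.id, r]));
      return ids.map(id => map.get(id) ?? null);
    }),
  };
}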

SonarSource’s 2024 audit of OSS GraphQL servers found tenant-leak bugs in 6 of 12 audited codebases — all traceable to module-scope DataLoader or missing tenant filter in the batch query.
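The module-scope failure mode from the opening incident can be reproduced in a few lines. This is a simplified, synchronous sketch: makeCachingLoader is a hypothetical stand-in for a loader with a per-instance cache keyed by id only, and USERS, findUser, and both handler functions are invented for illustration. Note that the shared loader leaks even though the lookup itself filters by tenant.

```javascript
// Hypothetical in-memory table: the same user id exists in two tenants.
const USERS = [
  { id: 1, tenantId: 'acme', email: 'ann@acme.example' },
  { id: 1, tenantId: 'globex', email: 'gil@globex.example' },
];

const findUser = (id, tenantId) =>
  USERS.find((u) => u.id === id && u.tenantId === tenantId) ?? null;

// Synchronous stand-in for a loader whose cache is keyed by id only.
function makeCachingLoader(fetchRow) {
  const cache = new Map();
  return {
    load(id) {
      if (!cache.has(id)) cache.set(id, fetchRow(id));
      return cache.get(id);
    },
  };
}

// BUG: created once at module scope, so the cache outlives every request.
let currentTenant = null;
const sharedLoader = makeCachingLoader((id) => findUser(id, currentTenant));

function handleWithSharedLoader(tenantId, userId) {
  currentTenant = tenantId;
  return sharedLoader.load(userId);
}

// Safe: a fresh loader, and therefore a fresh cache, per request.
function handleWithScopedLoader(tenantId, userId) {
  const loader = makeCachingLoader((id) => findUser(id, tenantId));
  return loader.load(userId);
}

console.log(handleWithSharedLoader('acme', 1).email);   // fills the shared cache
console.log(handleWithSharedLoader('globex', 1).email); // served from acme's cache entry
console.log(handleWithScopedLoader('globex', 1).email); // tenant-correct row
```

The second call never reaches the tenant filter at all, which is why the SQL filter and per-request scoping are both required rather than interchangeable.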

APQ vs trusted documents

Two different things called “persisted queries”:

  • Automatic Persisted Queries (APQ): The client computes the SHA-256 of the query string and sends only the hash. If the server has not seen the hash before, it returns PersistedQueryNotFound and the client resends with the full document. APQ saves bytes on a warm cache (a 5 KB query collapses to a 64-character hex hash). It does not constrain query shape: any client can still send any document.

  • Trusted documents: Only pre-registered hashes execute. Unknown hashes are rejected. Registration happens at client build time. This is a security boundary, not a performance optimisation.

Production teams that need both deploy trusted documents and use APQ-style hashing within the registered set.
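The difference between the two modes comes down to who is allowed to write to the hash store. The sketch below assumes Node.js and uses hypothetical handler names (handleApq, handleTrusted) and hypothetical error strings other than PersistedQueryNotFound; it is an illustration of the two policies, not any server's actual implementation.

```javascript
const { createHash } = require('node:crypto');

// Hex-encoded SHA-256 digest: always 64 characters.
const sha256 = (text) => createHash('sha256').update(text).digest('hex');

// APQ: the store starts empty and learns any document a client sends.
function handleApq(store, { hash, query }) {
  if (query === undefined) {
    return store.has(hash)
      ? { ok: true, query: store.get(hash) }
      : { ok: false, error: 'PersistedQueryNotFound' };
  }
  // Hypothetical error name for a hash that does not match the document.
  if (sha256(query) !== hash) return { ok: false, error: 'HashMismatch' };
  store.set(hash, query); // runtime registration: no constraint on shape
  return { ok: true, query };
}

// Trusted documents: the manifest is filled at client build time
// and never grows at runtime. Unknown hashes are rejected.
function handleTrusted(manifest, { hash }) {
  return manifest.has(hash)
    ? { ok: true, query: manifest.get(hash) }
    : { ok: false, error: 'UnknownDocument' }; // hypothetical error name
}

const adHoc = '{ users(first: 100) { email } }';
const apqStore = new Map();
console.log(handleApq(apqStore, { hash: sha256(adHoc) }));               // miss
console.log(handleApq(apqStore, { hash: sha256(adHoc), query: adHoc })); // registered
console.log(handleApq(apqStore, { hash: sha256(adHoc) }));               // hit

const manifest = new Map([[sha256('{ viewer { id } }'), '{ viewer { id } }']]);
console.log(handleTrusted(manifest, { hash: sha256(adHoc) }));           // rejected
```

Under APQ the ad hoc document executes after one round trip; under trusted documents the same hash is refused because it was never registered.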

Why field-count complexity scoring is an under-counting bug

The naive rule “cost = 1 per field, summed” reports cost 50 for a 10-deep query with 5 fields per level. The real cost is in the row expansion at each list level. A query that requests first: 100 at each of 5 nested list levels conceptually reads 100^5 = 10^10 rows, yet the field-count rule ignores the first arguments entirely and still reports a cost in the tens.

The correct formula multiplies by list-argument size:

cost(field) = field_weight + sum(child.first_arg × cost(child))

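Children without a list-size argument use a multiplier of 1. With first: 100 at every one of 5 levels, the calculator returns a cost on the order of 100^5 = 10^10: astronomically over budget, so the query is rejected during static analysis of the parsed AST, before any resolver fires. GitHub’s public complexity formula is a documented version of this multiplicative rule.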
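To make the gap between the two rules concrete, here is a small sketch of both calculators over a simplified query tree. The node shape ({ name, first, children }) and the helper nestedLists are assumptions made for illustration; a real implementation would walk the parsed GraphQL AST and read the actual first/last arguments.

```javascript
// Naive rule: 1 per field, summed. List sizes are ignored.
function fieldCountCost(node) {
  const children = node.children ?? [];
  return 1 + children.reduce((sum, child) => sum + fieldCountCost(child), 0);
}

// Multiplicative rule from the formula above:
// cost(field) = field_weight + sum(child.first_arg * cost(child)),
// where a child without a `first` argument multiplies by 1.
function multiplicativeCost(node, weight = 1) {
  const children = node.children ?? [];
  return weight + children.reduce(
    (sum, child) => sum + (child.first ?? 1) * multiplicativeCost(child, weight),
    0,
  );
}

// Hypothetical helper: a chain of `levels` nested list fields under a root.
function nestedLists(levels, first) {
  let node = { name: `items${levels}`, first };
  for (let i = levels - 1; i >= 1; i--) {
    node = { name: `items${i}`, first, children: [node] };
  }
  return { name: 'query', children: [node] };
}

const query = nestedLists(5, 100);
const BUDGET = 1000;

console.log(fieldCountCost(query));              // 6
console.log(multiplicativeCost(query));          // 10101010101
console.log(multiplicativeCost(query) > BUDGET); // true
```

The multiplicative result is slightly above 100^5 because each level also contributes its own field weight; the point is the order of magnitude, which the field-count rule cannot see.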

Alias bomb anatomy

A single document with 1000 root aliases:

q1: user(id: 1) { email }
q2: user(id: 2) { email }
...
q1000: user(id: 1000) { email }

This is one valid document, parsed once. Execution runs 1000 resolver calls. DataLoader collapses the database trips to one batch query — but resolver-execution count is still the attacker’s leverage: 1000 resolvers call your permission-checking logic, your logging, your context lookups. A 5 MB document can produce six-figure resolver counts per HTTP request, under any naive per-request rate limit.

Escape.tech’s 2024 audit: 64% of production GraphQL endpoints had no alias caps. Imperva’s 2023 report attributed 18% of GraphQL production incidents to alias-batch DoS.
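An alias cap is a static check that runs before execution. The sketch below uses a simplified, hypothetical representation of the root selection set (one object per root field, with alias left undefined when the field is not aliased) and hypothetical function names; a real server would perform the same count inside a validation rule over the parsed document. The default of 20 is taken from the typical cap quoted in this lesson.

```javascript
// Build the alias bomb from the example above as a simplified selection list.
function buildAliasBomb(count) {
  return Array.from({ length: count }, (_, i) => ({
    alias: `q${i + 1}`,
    name: 'user',
    args: { id: i + 1 },
  }));
}

// Reject the document when it carries more aliased root fields than the cap.
function checkRootAliases(rootSelections, maxAliases = 20) {
  const aliased = rootSelections.filter((sel) => sel.alias !== undefined).length;
  return aliased <= maxAliases
    ? { ok: true, aliased }
    : {
        ok: false,
        aliased,
        error: `Too many root aliases: ${aliased} > ${maxAliases}`,
      };
}

console.log(checkRootAliases(buildAliasBomb(1000))); // rejected before any resolver runs
console.log(checkRootAliases(buildAliasBomb(20)));   // within the cap
```

Because the check counts selections rather than HTTP requests, it closes the gap that a per-request rate limit leaves open.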

Minimum-viable observability dashboard

Metric | Alert condition
--- | ---
graphql_request_total{operation,outcome} | Error rate above SLO
graphql_request_duration_p99 | Above latency budget
graphql_resolver_call_count (per request) | Above N (N+1 regression)
graphql_query_cost (histogram) | Long tail above budget
graphql_persisted_query_hit_ratio | Below 90%
graphql_introspection_request_total | Non-zero in prod (if introspection is off)

Per-resolver tracing via OpenTelemetry GraphQL instrumentation emits a span per type.field call. Aggregating spans by operation gives resolver call counts. Without this instrumentation, an N+1 regression is invisible until the database CPU pages someone.
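The aggregation step can be sketched as a plain reduction over span records. The span shape used here ({ requestId, operation, field }) is a hypothetical flattening chosen for illustration; real OpenTelemetry spans carry trace and span IDs plus attributes, and the grouping would normally happen in the tracing backend. The function names and the threshold are likewise assumptions.

```javascript
// Count resolver spans per request: requestId -> { operation, calls }.
function resolverCallsPerRequest(spans) {
  const counts = new Map();
  for (const span of spans) {
    const entry = counts.get(span.requestId) ?? { operation: span.operation, calls: 0 };
    entry.calls += 1;
    counts.set(span.requestId, entry);
  }
  return counts;
}

// Flag requests whose resolver call count exceeds the alert threshold.
function findRegressions(spans, maxCallsPerRequest) {
  return [...resolverCallsPerRequest(spans)]
    .filter(([, entry]) => entry.calls > maxCallsPerRequest)
    .map(([requestId, entry]) => ({ requestId, ...entry }));
}

// One list query that fans out into 50 per-row resolver calls, plus a cheap request.
const spans = [
  { requestId: 'r1', operation: 'ListOrders', field: 'Query.orders' },
  ...Array.from({ length: 50 }, () => ({
    requestId: 'r1',
    operation: 'ListOrders',
    field: 'Order.customer',
  })),
  { requestId: 'r2', operation: 'GetViewer', field: 'Query.viewer' },
];

console.log(findRegressions(spans, 25));
// [ { requestId: 'r1', operation: 'ListOrders', calls: 51 } ]
```

Grouping the flagged entries by operation name then points at the query whose resolver fan-out changed.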

Senior-tier GraphQL safety numbers
GitHub GraphQL points/hour cap: 5000
GitHub per-query cost cap: 1000
Shopify Storefront per-query cap: 1000 cost units
Shopify Storefront throttle: 1000 cost/sec/IP
Default depth limit (Apollo Router): 10
List-depth recommendation: 3–4
Alias-bomb cap (typical): ≤20 root aliases
Operation-batch cap (typical): ≤5 operations
Quiz

A federated supergraph applies depth limit 7 at the router. Why must each subgraph also apply its own complexity and depth limits?

Quiz

A complexity rule assigns 'cost = 1 per field, summed across the AST'. A 10-deep recursive query with 5 fields per level reports cost 50 and passes the 1000-cost budget. What is wrong?

Quiz

DataLoader is instantiated per-request but the batch function queries the database without a tenant filter. What is the failure mode?

Recall before you leave
  1. Why does Apollo Federation's _entities batching not eliminate the need for DataLoader inside subgraphs?
  2. What is the difference between APQ and trusted documents?
Recap

In Node.js, DataLoader’s batch window is bounded by the event loop's microtask queue, not a timer. All .load() calls from resolver bodies and their Promise chains land in the same batch. Per-request instantiation prevents cross-request cache pollution; tenant ID in the SQL filter prevents cross-tenant data leaks. APQ saves wire bytes but does not constrain query shape; trusted documents do. Field-count complexity scoring misses list-expansion cost: multiply by first/last arguments to catch 100^5-row queries. Alias bombs slip under per-request rate limits via resolver-count amplification; cap at ≤20 root aliases. Wire resolver call counts to OpenTelemetry spans: without per-resolver tracing, N+1 regressions are invisible until the database CPU pages someone.


Trademarks belong to their respective owners. Editorial reference only.