APIs
Designing and reviewing a public API end to end
The outage starts with a noun. Someone modeled POST /chargeCard instead of a payments resource, so “did this charge already happen?” had no answer the API could give. The endpoint returned 200 on a duplicate. Mobile clients on a flaky network retried after timeouts, and because the POST was not idempotent, every retry charged the card again. There was no OpenAPI contract, so QA never noticed the response shape drift. There was no rate limit, so a retry storm at 6pm hit the database with 40k duplicate writes a second. Six bad decisions, one cascade. Every one of them was reviewable a month earlier.
The cascade: how one bad decision becomes an outage
The whole APIs track was teaching one thing from different angles: an API is a contract, and contracts fail at the seams between decisions, not inside any one of them. A senior engineer reviewing a public API does not check items in isolation. They trace the cascade, because each weak link puts load on the next.
Start with resource modeling. Model nouns, not verbs: POST /payments, not POST /chargeCard. The verb is the HTTP method; the URL is the thing. An action that is not a CRUD verb (publish, refund, retry) becomes a sub-resource — POST /payments/{id}/refunds — not a ?action=refund query param and not a leaked database column. The moment your URLs mirror your DB schema (/user_accounts_v2), you have welded clients to your internal storage and every migration becomes a breaking change.
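As a rough illustration of the modeling rule above, here is a minimal route-table sketch. The handler names, the `resolve` helper, and the `pay_123` identifier are hypothetical; only the URL shapes come from the text.

```python
import re

# Hypothetical route table: the HTTP method is the verb, the path is the noun.
ROUTES = [
    ("POST", r"^/payments$", "create_payment"),
    ("GET", r"^/payments/(?P<payment_id>[^/]+)$", "get_payment"),
    # A non-CRUD action (refund) modeled as a sub-resource collection.
    ("POST", r"^/payments/(?P<payment_id>[^/]+)/refunds$", "create_refund"),
    ("GET", r"^/payments/(?P<payment_id>[^/]+)/refunds$", "list_refunds"),
]


def resolve(method, path):
    """Return (handler_name, path_params), or None if no route matches."""
    for route_method, pattern, handler in ROUTES:
        match = re.match(pattern, path)
        if route_method == method and match:
            return handler, match.groupdict()
    return None
```

Because a refund has its own addressable collection, it can be created, listed, and retried like any other resource, and an RPC-style path such as /chargeCard simply has no entry.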
That modeling decision determines whether your status codes can even be honest. A real payments resource lets you return 201 Created with a Location on first write and 200/409 on a retried idempotency key. A chargeCard RPC has nothing to point at, so it returns 200 for everything — and 200 tells the client “done, do not retry,” which is a lie when the network ate the response.
Status codes are a retry contract, not decoration
Clients do not read your docs at 6pm during an incident — they read your status code and your headers, and they act on them automatically. That makes the status code a machine-readable contract:
- 2xx → done, do not retry.
- 4xx → your request is wrong; retrying unchanged is pointless (400, 404, 422).
- 429/503 → retryable, but back off; read Retry-After.
- 5xx → maybe retryable, but only if the operation is idempotent.
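The retry contract above could be encoded on the client roughly as follows. This is a sketch under assumptions: the function name, the one-second default delay, and the choice to handle only the delta-seconds form of Retry-After (not the HTTP-date form) are all illustrative.

```python
def retry_decision(status, idempotent, retry_after=None):
    """Map a response to (should_retry, delay_seconds).

    `retry_after` is the raw Retry-After header value, assumed here to be
    delta-seconds. The 1.0 second fallback delay is an arbitrary placeholder.
    """
    if 200 <= status < 300:
        return (False, 0.0)  # done, do not retry
    if status in (429, 503):
        delay = float(retry_after) if retry_after is not None else 1.0
        return (True, delay)  # retryable, but back off
    if 400 <= status < 500:
        return (False, 0.0)  # the request itself is wrong
    if 500 <= status < 600:
        # Only safe to retry when the operation is idempotent.
        return (idempotent, 1.0 if idempotent else 0.0)
    return (False, 0.0)
```

Note that 429 and 503 are checked before the generic 4xx and 5xx branches, since they overlap those ranges but carry a different instruction.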
The dangerous case is a non-idempotent POST that fails after the write but before the response. The fix is an Idempotency-Key: the client sends a UUID, the server records it, and a retry with the same key returns the original result instead of charging again. This is why Stripe-style payment APIs support the header: it converts an unsafe retry into a safe one. Without it, no status code can honestly express “I do not know if it worked,” so the client is left to guess, and it will sometimes guess wrong.
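A minimal in-memory sketch of the server side of that idea follows. The class, field names, and `pay_` prefix are hypothetical, and the choice to answer 409 when a key is reused with a different payload is one common convention rather than something the text prescribes. A real service would persist the key and the payment in the same transaction.

```python
import uuid


class PaymentService:
    """Toy payment store that deduplicates writes by idempotency key."""

    def __init__(self):
        self._payments = {}  # payment_id -> payment body
        self._by_key = {}    # idempotency key -> (fingerprint, payment_id)

    def create_payment(self, idempotency_key, amount_cents, currency):
        """Return (status, headers, body) for POST /payments."""
        fingerprint = (amount_cents, currency)
        seen = self._by_key.get(idempotency_key)
        if seen is not None:
            stored_fingerprint, payment_id = seen
            if stored_fingerprint != fingerprint:
                # Same key, different request: refuse rather than guess.
                return 409, {}, {"error": "idempotency_key_reused"}
            # Retry of the original request: replay the stored result.
            return 200, {}, self._payments[payment_id]
        payment_id = "pay_" + uuid.uuid4().hex
        payment = {"id": payment_id, "amount_cents": amount_cents, "currency": currency}
        self._payments[payment_id] = payment
        self._by_key[idempotency_key] = (fingerprint, payment_id)
        return 201, {"Location": f"/payments/{payment_id}"}, payment
```

A client that times out can resend the same key and receive the original payment back, so the card is charged once no matter how many retries arrive.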
| Layer | Get it right | Get it wrong → next failure |
|---|---|---|
| Resource modeling | Nouns; non-CRUD actions as sub-resources; no leaked DB schema | RPC verbs with nothing to address → status codes can’t be honest |
| Status codes | 201/409/429 with Idempotency-Key + Retry-After | 200 for everything → clients retry unsafe writes |
| Pagination | Cursor (keyset) for large/live data | Deep OFFSET → full scans, drift on insert |
| Contract (OpenAPI) | Spec-first; breaking-change diff in CI | No spec → response drift ships silently |
| Versioning | Additive by default; version only on breaks; Sunset header | Mutate /v1 in place → existing clients break overnight |
| Rate limiting | 429 + Retry-After; documented quota | No limit → one retry storm becomes an outage |
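The rate-limiting row in the table can be sketched with a token bucket, which is one common algorithm and an assumption here, since the text does not name one. The class name, parameters, and injectable clock are illustrative.

```python
import math
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._now = now  # injectable clock, which makes the limiter testable
        self._tokens = float(capacity)
        self._last = now()

    def check(self):
        """Return (status, headers) for one incoming request."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill for the time elapsed, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return 200, {}
        # Tell the client how long until one full token is available.
        wait = math.ceil((1.0 - self._tokens) / self.refill_per_second)
        return 429, {"Retry-After": str(wait)}
```

The rejection happens before any write is attempted, which is what makes 429 safe to retry even for a non-idempotent request.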
Pagination and the contract that catches drift
OFFSET 100000 LIMIT 20 makes the database walk and discard 100k rows on every deep page, so each page gets slower the further you go, and pages silently skip or duplicate rows when items are inserted mid-pagination. Cursor (keyset) pagination, WHERE id > :last_seen ORDER BY id LIMIT 20, stays O(page size) at any depth as long as the sort key is indexed, and is stable under inserts, at the cost of no random “jump to page 500.” For large or live datasets you ship cursors; offset is fine only for small, stable, admin-grade lists.
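The keyset query above can be tried end to end with SQLite. This is a sketch, assuming a hypothetical payments table with an integer primary key. It exposes the raw id as the cursor for clarity, whereas a public API would normally wrap it in an opaque token.

```python
import sqlite3


def demo_connection(row_count):
    """Build an in-memory table with ids 1..row_count (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
    conn.executemany(
        "INSERT INTO payments (id, amount_cents) VALUES (?, ?)",
        [(i, i * 100) for i in range(1, row_count + 1)],
    )
    return conn


def fetch_page(conn, after_id=0, limit=20):
    """Return (rows, next_cursor) for the page that follows `after_id`."""
    rows = conn.execute(
        "SELECT id, amount_cents FROM payments WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    # A short page means the end was reached. A full page may still be the
    # last one, in which case the following call returns no rows.
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor
```

Because each page is defined by "ids greater than the last one seen", a row inserted while a client is paging cannot shift the boundaries of pages the client has yet to fetch.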
None of this stays correct without a contract. Spec-first OpenAPI means the schema is the source of truth, and generated clients and server stubs both derive from it. The part that actually saves you is a CI job that diffs the new spec against the old one and fails the build on a breaking change: a removed field, a tightened enum, a type change, a renamed error envelope. Without that diff, the drift that got past QA in the opening outage ships in a normal deploy with green tests, because the tests were written against the old shape and nobody updated them.
Why this works
“Backward compatible” has a precise meaning here: an old client running unmodified against the new API still works. Adding an optional field, a new endpoint, or a new enum value the client can ignore is additive and safe. Removing a field, renaming one, tightening validation, or changing a default is breaking — even if it “feels small” — because some client somewhere depends on the exact old behavior. The CI diff exists because humans consistently misjudge which bucket a change is in.
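To make the two buckets concrete, here is a toy breaking-change check. It is not an OpenAPI parser: the schema format (field name mapped to a dict with "type" and optional "enum") and the function name are assumptions made for illustration, and it covers only the removed-field, type-change, and tightened-enum cases.

```python
def breaking_changes(old, new):
    """List changes in `new` that could break a client built against `old`."""
    problems = []
    for name, old_spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
            continue
        new_spec = new[name]
        if new_spec["type"] != old_spec["type"]:
            problems.append(
                f"type changed: {name} {old_spec['type']} -> {new_spec['type']}"
            )
        old_enum, new_enum = old_spec.get("enum"), new_spec.get("enum")
        if old_enum is not None and new_enum is not None:
            dropped = sorted(set(old_enum) - set(new_enum))
            if dropped:
                problems.append(f"enum tightened: {name} lost {dropped}")
        elif old_enum is None and new_enum is not None:
            problems.append(f"enum tightened: {name} is now restricted")
    # Fields present only in `new` are treated as additive and ignored.
    return problems
```

A CI job built on this idea would fail the build whenever the returned list is non-empty, which takes the "which bucket is this change in" judgment away from whoever happens to be reviewing.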
Versioning is the connective tissue; protocol is a separate axis
Versioning is what lets the whole system evolve without the cascade. The senior default is additive change with no version bump: new optional fields and endpoints go straight into /v1. You cut a new major version (/v2, or an Accept: application/vnd.api.v2+json media type) only for genuine breaks, and you keep the old one alive. URL versioning (/v1/...) wins for public APIs because it is visible, cache-friendly, and trivial to route, while header/media-type versioning keeps URLs clean for internal consumers. Either way, deprecation is a published policy, not a surprise: send the Deprecation (RFC 9745) and Sunset (RFC 8594) headers, give clients 6–12 months, and watch usage metrics before you actually remove the old version.
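As a sketch of what those two headers might look like on a /v1 response, the helper below formats Deprecation as a structured-field date (an `@` followed by Unix seconds, per RFC 9745) and Sunset as an HTTP-date (per RFC 8594). The function name and the dates in the usage note are illustrative.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def deprecation_headers(deprecated_at, sunset_at):
    """Build Deprecation and Sunset headers from timezone-aware datetimes."""
    return {
        # Structured-field date: "@" followed by seconds since the epoch.
        "Deprecation": f"@{int(deprecated_at.timestamp())}",
        # HTTP-date in GMT, e.g. "Wed, 31 Dec 2025 23:59:59 GMT".
        "Sunset": format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True),
    }
```

For example, deprecating on 1 January 2025 with a sunset at the end of that year gives clients the full twelve months of runway, and any client that logs response headers can see the deadline without reading a changelog.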
The protocol choice, REST vs gRPC, is an orthogonal axis, not part of the versioning ladder, and it is where seniors most often over-engineer. gRPC is often genuinely faster: synthetic benchmarks report roughly 77% lower latency on small payloads, ~10x smaller serialized messages (protobuf vs JSON), and ~50k req/s vs ~20k for REST, with streaming as a first-class feature. But those numbers are the answer to “internal service-to-service at high volume,” not “should my public API be gRPC.” Public APIs live or die on discoverability, browser/curl reachability, and external-developer ergonomics, which are exactly REST + OpenAPI’s strengths. The senior move is REST at the edge, gRPC between your own services if the volume justifies it, and never letting a benchmark choose your public contract.
Two of your services exchange ~40k requests/sec internally with tiny payloads and a streaming feed; you also expose a public API to third-party developers. Pick the protocol split.
A mobile client's POST to create a payment times out. The server may or may not have committed the write. What's the senior design that makes the client's retry safe?
Your API returns a list that's now 2M rows and growing, browsed live. The endpoint uses OFFSET/LIMIT and is getting slower on deep pages. Best fix?
Order a senior's end-to-end public-API review checklist (each step de-risks the next):
- 1 Resources are nouns; non-CRUD actions are sub-resources; no DB schema leaked into URLs
- 2 Status codes are honest: 201/409/429, Idempotency-Key on unsafe writes, Retry-After on throttles
- 3 Lists use cursor pagination for large/live data, not deep OFFSET
- 4 Spec-first OpenAPI is the source of truth, with a breaking-change diff in CI
- 5 Changes are additive by default; real breaks bump a version with a published Sunset policy
- 6 Rate limits return 429 + Retry-After with a documented quota, so load can't become an outage
- 01 Walk through how a single resource-modeling mistake cascades into a production outage, naming each link in the chain.
- 02 A teammate wants to make the public API gRPC because benchmarks show it's far faster. How do you reason about whether that's the right call?
The APIs track was always one lesson seen from many sides: an API is a connected contract, and it fails at the seams between decisions. Model nouns and turn non-CRUD actions into sub-resources, because that’s what lets status codes be honest. Make status codes a machine-readable retry contract — 201/409/429, Idempotency-Key on unsafe writes, Retry-After on throttles — because clients act on codes, not docs, during an incident. Paginate large or live lists by cursor, not deep OFFSET, so cost stays flat and pages stay stable. Lock the whole shape in spec-first OpenAPI with a CI breaking-change diff, so drift can’t ship silently. Evolve additively and version only on real breaks, with a published Deprecation/Sunset policy and 6–12 months of runway. Gate everything with documented rate limits returning 429 + Retry-After, so a retry storm can’t turn into an outage. Keep protocol on its own axis: REST + OpenAPI at the public edge, gRPC between internal services when the volume genuinely justifies it. Review it as a cascade, not a checklist.