Observability, production failures, and global-scale design

Minimum viable idempotency dashboard, production failure stories from Stripe/Knight Capital/AWS S3/GitHub, cross-protocol patterns, and a global-anycast design exercise.

Your payment API starts returning 422 on 10% of POST /charge requests. The symptom appeared after an SDK release. No double charges yet — but the signal is a client bug that, if left unfixed, will cause them. The dashboard needs to surface this before it becomes a production incident.

Minimum viable observability dashboard

A 422 spike that you catch in your dashboard at 10% is a client SDK bug. Left unmonitored, it is a double-charge incident waiting to happen. These six metrics are the difference.

Every production idempotency + retry system needs six metrics:

1. idempotency_key_total sliced by outcome:

new — first-time keys (healthy baseline)
replay — successful retries (expected under transient failures)
in_flight (409) — concurrent racing requests
mismatch (422) — client reusing key with different intent (bug signal)

A rate of mismatch > 0.1% of total POSTs pages on-call. It means clients have a key-generation bug.

2. retry_attempt_total sliced by attempt number (1, 2, 3, 4+): A spike in attempt-3 retries often precedes an outage by minutes. The most persistent clients are already paying the cost of an underlying degradation before it becomes user-visible.

3. retry_delay_seconds_bucket: histogram of actual wait times. A bimodal distribution (retries piling up at 0 s and at 30 s) suggests jitter is not applied, because jittered delays would spread across the range rather than cluster at fixed values.

4. outbox_lag_seconds — how stale is the outbox relay? Alert at lag > 30s. A growing lag means the relay or downstream broker is failing.

5. dead_letter_queue_depth — alert on growth. A growing DLQ is a permanently failing pipeline — poison messages are accumulating.

6. http_request_outcome sliced by status class — 2xx / 4xx / 5xx / timeout. The 4xx / 5xx split distinguishes client bugs from server failures.

Metric	Alert threshold	What it signals
`idempotency_key_total{outcome="mismatch"}`	> 0.1% of POSTs	Client SDK key-reuse bug
`retry_attempt_total{attempt="3"}`	Spike above baseline	Early signal of downstream degradation
`outbox_lag_seconds`	> 30 s	Relay or broker failure
`dead_letter_queue_depth`	Any growth	Permanent pipeline failure
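
As a rough illustration of the first alert in the table, the sketch below computes the mismatch share from per-outcome counts and compares it to the 0.1% threshold. The function names and sample counts are hypothetical, and it assumes every POST carries an idempotency key, so the outcome counts sum to total POSTs.

```python
from collections import Counter

# 0.1% of POSTs, the paging threshold from the table above.
MISMATCH_PAGE_THRESHOLD = 0.001


def mismatch_rate(outcomes: Counter) -> float:
    """Share of requests whose idempotency outcome was 'mismatch' (422)."""
    total = sum(outcomes.values())
    return outcomes["mismatch"] / total if total else 0.0


def should_page(outcomes: Counter, threshold: float = MISMATCH_PAGE_THRESHOLD) -> bool:
    """True when the mismatch share exceeds the paging threshold."""
    return mismatch_rate(outcomes) > threshold


# Hypothetical one-minute window: 1% of requests are mismatches.
bad_window = Counter({"new": 9_000, "replay": 850, "in_flight": 50, "mismatch": 100})
# Hypothetical healthy window: about 0.05% mismatches.
healthy_window = Counter({"new": 10_000, "mismatch": 5})
```

In a real deployment the same ratio would normally be expressed as an alert rule in the metrics backend rather than in application code; the Python form only makes the arithmetic explicit.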

Production failure stories

Stripe 2017: switching the idempotency cache from one Redis cluster to another briefly lost in-flight keys. A handful of double charges resulted. Stripe added Postgres as the permanent authoritative backing store — Redis is now only the hot-path cache.

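Knight Capital 2012: a deployment missed one of eight servers, and a repurposed feature flag activated obsolete order-routing code on it. That code kept sending orders without registering that earlier ones had already been filled, producing roughly $440M in losses in 45 minutes. In this lesson's terms: repeated sends with no idempotency check on the consuming side.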

AWS S3 February 2017: an internal subsystem returned 500s during an outage. Aggressive retry policies without jitter in clients amplified the failure across the region. AWS's published guidance on exponential backoff with jitter targets exactly this kind of amplification.

GitHub 2018: a database failover lost the lease on an outbox-relay process. Events accumulated in the outbox for 2 minutes, then all flushed at once — a self-inflicted thundering herd on the consumer side. Mitigation: rate-limit the relay’s publish rate during catch-up (e.g., 500 events/s) to flatten the spike.
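
The catch-up mitigation can be sketched as a paced publish loop. This is a minimal fixed-window limiter, not the relay from the incident: `paced_publish` and its parameters are hypothetical names, and the `sleep` and `clock` arguments exist only so the pacing can be tested without real waiting.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def paced_publish(
    events: Iterable[T],
    publish: Callable[[T], None],
    max_per_second: int = 500,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Publish events in order, at most max_per_second per fixed one-second window."""
    sent = 0
    window_start = clock()
    in_window = 0
    for event in events:
        if in_window >= max_per_second:
            # Window budget is spent: wait out the rest of the second.
            elapsed = clock() - window_start
            if elapsed < 1.0:
                sleep(1.0 - elapsed)
            window_start = clock()
            in_window = 0
        publish(event)
        in_window += 1
        sent += 1
    return sent
```

At 500 events/s, a backlog of 18,000 outbox rows drains in roughly 36 seconds instead of arriving at consumers as a single burst. A token bucket would smooth the flow further; the fixed window is used here only because it is the shortest correct form.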

Every story has the same root: at-least-once delivery without idempotent consumers, OR retries without jitter.

Four famous incidents collapse onto one three-part fix: idempotent consumers, jittered retries, and a rate-limited relay catch-up.
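
For the jittered-retries part of that fix, a minimal sketch of capped exponential backoff with full jitter follows. The base delay, the 30 s cap, and the function name are assumptions chosen for illustration; the point is only that each client draws its wait uniformly from the whole backoff interval, so retries do not line up.

```python
import random


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 30.0, rng=random) -> float:
    """Delay in seconds before retry number `attempt` (1-based).

    The exponential ceiling is capped, then the actual delay is drawn
    uniformly from [0, ceiling] so that clients spread out.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return rng.uniform(0.0, ceiling)


# Example: delays for the first five retries of one client.
example_delays = [full_jitter_delay(n) for n in range(1, 6)]
```

Without the `uniform` draw, every client that failed at the same moment would retry at exactly 0.1 s, 0.2 s, 0.4 s and so on, which is the synchronized pattern a retry-delay histogram exposes.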

Cross-protocol: HTTP, gRPC, Kafka

The idempotency token concept is universal across protocols:

Protocol	Token	Where
HTTP	`Idempotency-Key` header	RFC draft `draft-ietf-httpapi-idempotency-key-header`
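gRPC	No built-in token; custom metadata key or a `request_id` message field	Retries configured via service-config `retryPolicy` JSON (`grpc-retry-pushback-ms` is a server pushback trailer, not a dedup token)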
Kafka	`enable.idempotence=true` producer	Per-partition sequence numbers
AWS SQS FIFO	`MessageDeduplicationId`	Queue-level dedup window (5 min)

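Kafka idempotent producer: enable.idempotence=true gives each producer a Producer ID and stamps each batch with a per-partition sequence number. The broker discards a batch whose sequence number it has already accepted, so a retried send is not written twice within a producer session. For atomic writes across partitions, use Kafka transactions.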
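
The broker-side check can be pictured with a deliberately simplified model. This is not Kafka's implementation: `PartitionLog` is a hypothetical class, it treats each send as one record rather than a batch, and it ignores producer epochs. It only shows how a per-producer sequence number lets the log drop a retried send.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PartitionLog:
    """Toy model of one partition that deduplicates by (producer_id, sequence)."""

    records: List[str] = field(default_factory=list)
    # producer_id -> last sequence number accepted on this partition
    last_seq: Dict[int, int] = field(default_factory=dict)

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            # Already written: a retry of an acknowledged or in-doubt send.
            return "duplicate"
        if seq != last + 1:
            # A gap means an earlier send is missing; refuse to reorder.
            return "out_of_order"
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


log = PartitionLog()
results = [
    log.append(7, 0, "charge-a"),
    log.append(7, 0, "charge-a"),  # retry of the same send
    log.append(7, 1, "charge-b"),
    log.append(7, 3, "charge-d"),  # sequence 2 never arrived
]
```

Because the state is keyed by Producer ID, a producer that restarts and receives a new ID starts a fresh sequence, which is why the guarantee holds within a session.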

Deployment: middleware, not hand-rolled

Hand-rolling idempotency in every endpoint produces inconsistent fingerprint formulas and incompatible TTLs. Middleware libraries do it once:

express-idempotency (Node.js)
django-idempotency-key (Python/Django)
fastapi-idempotency (Python/FastAPI)
AspNetCore.Idempotency (.NET)
Stripe’s idempotent-requests (open-sourced)

At the infrastructure layer: AWS API Gateway has built-in idempotency since 2023 (TTL up to 10 minutes per stage). Envoy/Istio expose retry policy per route with retry budgets shared across the cluster.
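
To make concrete what such middleware centralizes, here is a minimal in-memory sketch of one fingerprint formula applied by a decorator. It is not the API of any library listed above: `idempotent`, `fingerprint`, and the handler signature are hypothetical, the store is a plain dict, and the check-then-insert step is not atomic, which a real backing store would have to guarantee.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple

Response = Tuple[int, Dict[str, Any]]
IN_FLIGHT = object()  # sentinel: handler started, response not yet stored


def fingerprint(method: str, path: str, body: Dict[str, Any]) -> str:
    """One canonical formula for every endpoint: method, path, sorted JSON body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{method} {path} {canonical}".encode()).hexdigest()


def idempotent(store: Dict[str, Tuple[str, Any]]):
    def wrap(handler: Callable[[Dict[str, Any]], Response]):
        def wrapped(key: str, method: str, path: str, body: Dict[str, Any]) -> Response:
            fp = fingerprint(method, path, body)
            entry = store.get(key)
            if entry is not None:
                saved_fp, saved = entry
                if saved_fp != fp:
                    return 422, {"error": "key reused with a different request"}  # mismatch
                if saved is IN_FLIGHT:
                    return 409, {"error": "request with this key is in flight"}  # in_flight
                return saved  # replay
            store[key] = (fp, IN_FLIGHT)  # new
            try:
                response = handler(body)
            except Exception:
                store.pop(key, None)  # let the client retry after a failure
                raise
            store[key] = (fp, response)
            return response
        return wrapped
    return wrap


# Usage with a hypothetical charge handler.
charges = []
store: Dict[str, Tuple[str, Any]] = {}


@idempotent(store)
def create_charge(body: Dict[str, Any]) -> Response:
    charges.append(body["amount"])
    return 201, {"charge_id": len(charges)}
```

The four branches map directly onto the `new`, `replay`, `in_flight`, and `mismatch` outcomes of `idempotency_key_total`, so the same wrapper is a natural place to increment that counter.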

Why this works

Why did IETF standardization of the Idempotency-Key header come so long after the pattern was in use? Stripe published the pattern in 2014; adoption was widespread before standardization. The IETF process (draft-ietf-httpapi-idempotency-key-header, through multiple revisions) exists to specify the header semantics interoperably, especially TTL negotiation and fingerprint behavior, so that API gateways and client SDKs can implement it consistently without reading Stripe’s blog post.

Quiz

Your payment service's `idempotency_key_total{outcome='mismatch'}` rises to 10% of all POST /charge requests. What is the most likely root cause?

Quiz

The outbox relay has 18,000 unpublished rows after a Kafka broker failure. The broker recovers. What risk does the relay face when it resumes publishing at full speed?

Quiz

A global anycast payment API needs idempotency keys to survive a single-region failure. Which architecture satisfies this requirement?

A 422 spike or attempt-3 surge precedes the user-visible outage by minutes — instrument these before launch.

Recall before you leave

01 What six metrics form the minimum viable idempotency + retry dashboard, and what does each signal?
02 Explain the design of a global-anycast idempotency cache with double-charge probability ≤ 10⁻⁹ and 30-day key retention.
03 What is the root cause shared by all four production failure stories (Stripe 2017, Knight Capital 2012, AWS S3 2017, GitHub 2018)?

Recap

A production idempotency + retry system needs six observability metrics: key outcome distribution, retry attempt distribution, retry delay histogram, outbox lag, DLQ depth, and HTTP status class breakdown. A 422 mismatch spike above 0.1% signals a client SDK key-reuse bug. A retry attempt-3 surge is an early downstream degradation signal. All four major production failures (Stripe 2017, Knight Capital 2012, AWS S3 2017, GitHub 2018) share the same root: at-least-once delivery without idempotent consumers, or retries without jitter. At global scale, the cache needs active-active cross-region replication so a region failure does not lose keys and create double charges. Use middleware libraries rather than hand-rolling per endpoint; inconsistent fingerprint formulas are how bugs hide. When you instrument a new payment endpoint, add the six metrics before the first production request lands, because a 422 spike is only useful if you are already watching.
