
Backend Architecture

Observability, production failures, and global-scale design

Crux: Minimum viable idempotency dashboard, production failure stories from Stripe/Knight Capital/AWS S3/GitHub, cross-protocol patterns, and a global-anycast design exercise.

Your payment API starts returning 422 on 10% of POST /charge requests. The symptom appeared after an SDK release. No double charges yet — but the signal is a client bug that, if left unfixed, will cause them. The dashboard needs to surface this before it becomes a production incident.

Minimum viable observability dashboard

Every production idempotency + retry system needs six metrics:

1. idempotency_key_total sliced by outcome:

  • new — first-time keys (healthy baseline)
  • replay — successful retries (expected under transient failures)
  • in_flight (409) — concurrent racing requests
  • mismatch (422) — client reusing key with different intent (bug signal)

A rate of mismatch > 0.1% of total POSTs pages on-call. It means clients have a key-generation bug.

2. retry_attempt_total sliced by attempt number (1, 2, 3, 4+): A spike in attempt-3 retries precedes an outage by minutes — the most persistent clients are already paying the cost of an underlying degradation before it becomes user-visible.

3. retry_delay_seconds_bucket — histogram of actual wait times. A bimodal distribution (lots of retries at 0s and at 30s) suggests jitter is not applied.

4. outbox_lag_seconds — how stale is the outbox relay? Alert at lag > 30s. A growing lag means the relay or downstream broker is failing.

5. dead_letter_queue_depth — alert on growth. A growing DLQ is a permanently failing pipeline — poison messages are accumulating.

6. http_request_outcome sliced by status class — 2xx / 4xx / 5xx / timeout. The 4xx / 5xx split distinguishes client bugs from server failures.
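As a rough illustration of how the counters above might be fed, here is a minimal in-process sketch. The `Metrics` class, `status_class`, and `outbox_lag_seconds` are hypothetical names, and defining lag as the age of the oldest unpublished row is an assumption rather than something the lesson specifies.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def status_class(status):
    """Bucket an HTTP status into 2xx/4xx/5xx; None means the call timed out."""
    if status is None:
        return "timeout"
    return f"{status // 100}xx"


def outbox_lag_seconds(unpublished_created_at, now):
    """Assumed definition: age of the oldest row the relay has not published."""
    if not unpublished_created_at:
        return 0.0
    return max(0.0, (now - min(unpublished_created_at)).total_seconds())


class Metrics:
    """Hypothetical in-memory stand-in for a real metrics client."""

    def __init__(self):
        self.idempotency_key_total = Counter()
        self.retry_attempt_total = Counter()
        self.http_request_outcome = Counter()

    def record(self, outcome, attempt, status):
        # outcome is one of: new, replay, in_flight, mismatch
        self.idempotency_key_total[outcome] += 1
        # attempts 4 and above share one bucket, matching the 1, 2, 3, 4+ slicing
        self.retry_attempt_total["4+" if attempt >= 4 else str(attempt)] += 1
        self.http_request_outcome[status_class(status)] += 1


if __name__ == "__main__":
    metrics = Metrics()
    metrics.record("new", attempt=1, status=200)
    metrics.record("mismatch", attempt=3, status=422)
    now = datetime.now(timezone.utc)
    pending = [now - timedelta(seconds=45), now - timedelta(seconds=5)]
    print(metrics.idempotency_key_total, outbox_lag_seconds(pending, now))
```

In a real service these counters would go through whatever metrics library is already in use; the point is only that each request emits one outcome label, one attempt bucket, and one status class.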

| Metric | Alert threshold | What it signals |
| --- | --- | --- |
| idempotency_key_total{outcome="mismatch"} | > 0.1% of POSTs | Client SDK key-reuse bug |
| retry_attempt_total{attempt="3"} | Spike above baseline | Early signal of downstream degradation |
| outbox_lag_seconds | > 30 s | Relay or broker failure |
| dead_letter_queue_depth | Any growth | Permanent pipeline failure |
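The thresholds in the table can be expressed as a small evaluation function. This is a sketch under stated assumptions: `firing_alerts` and its parameters are hypothetical, and "spike above baseline" is interpreted here as exceeding three times the median of recent samples, which the lesson does not prescribe.

```python
from statistics import median


def firing_alerts(mismatch, total_posts, attempt3_now, attempt3_history,
                  outbox_lag_s, dlq_depth_now, dlq_depth_prev,
                  spike_factor=3.0):
    """Return the names of the alert rules that currently fire."""
    alerts = []
    # mismatch > 0.1% of POSTs
    if total_posts and mismatch / total_posts > 0.001:
        alerts.append("mismatch_rate")
    # attempt-3 retries well above the recent baseline (assumed: 3x median)
    if attempt3_history and attempt3_now > spike_factor * median(attempt3_history):
        alerts.append("retry_attempt_3_spike")
    # outbox relay more than 30 s behind
    if outbox_lag_s > 30:
        alerts.append("outbox_lag")
    # any growth in dead-letter queue depth
    if dlq_depth_now > dlq_depth_prev:
        alerts.append("dlq_growth")
    return alerts


if __name__ == "__main__":
    print(firing_alerts(mismatch=20, total_posts=10_000,
                        attempt3_now=40, attempt3_history=[4, 5, 6],
                        outbox_lag_s=12, dlq_depth_now=3, dlq_depth_prev=3))
```

A production alerting stack would normally encode these rules in its own rule language and evaluate them over time windows; the function only makes the comparisons explicit.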

Production failure stories

Stripe 2017: switching the idempotency cache from one Redis cluster to another briefly lost in-flight keys. A handful of double charges resulted. Stripe added Postgres as the permanent authoritative backing store — Redis is now only the hot-path cache.

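Knight Capital 2012: a repurposed feature flag reactivated dormant order-routing code on a server that had missed the deployment. That code kept sending orders without tracking which had already been filled, turning a deployment glitch into a loss of roughly $440M in about 45 minutes. In this lesson's terms: repeated sends with nothing on the receiving path to recognize work already done.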

AWS S3 February 2017: an internal subsystem returned 500s during an outage in us-east-1. Aggressive client retry policies without jitter amplify this kind of failure across a region, which is why AWS guidance recommends exponential backoff with jitter for SDK retries.

GitHub 2018: a database failover lost the lease on an outbox-relay process. Events accumulated in the outbox for 2 minutes, then all flushed at once — a self-inflicted thundering herd on the consumer side. Mitigation: rate-limit the relay’s publish rate during catch-up (e.g., 500 events/s) to flatten the spike.
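The catch-up mitigation can be sketched as a paced drain loop. `drain_outbox`, `publish`, and the batch size are hypothetical; only the 500 events/s figure comes from the text. The pause between batches ignores the time spent publishing, so actual throughput stays at or below the configured rate.

```python
import time


def drain_outbox(rows, publish, max_per_second=500, batch_size=100,
                 sleep=time.sleep):
    """Publish backlog rows in batches, pausing between batches so the
    consumer side never sees more than max_per_second events per second."""
    interval = batch_size / max_per_second  # seconds to wait after each batch
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            publish(row)
        # no pause needed after the final batch
        if start + batch_size < len(rows):
            sleep(interval)


if __name__ == "__main__":
    sent = []
    drain_outbox(list(range(250)), sent.append, sleep=lambda s: None)
    print(len(sent))
```

Injecting `sleep` keeps the loop testable. A real relay would also need to persist its position between batches so that a crash during catch-up does not restart the flush from the beginning.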

Every story has the same root: at-least-once delivery without idempotent consumers, OR retries without jitter.
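For the jitter half of that root cause, a minimal sketch of exponential backoff with full jitter follows. The function names and the `base` and `cap` defaults are illustrative assumptions; `fixed_delay` is included only to show the un-jittered schedule that makes clients retry in lockstep.

```python
import random


def fixed_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff without jitter: every client computes the same
    delay, so retries arrive in synchronized waves."""
    return min(cap, base * (2 ** attempt))


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: pick a uniform delay between 0 and the exponential
    ceiling, which spreads retries across the whole interval."""
    return rng.uniform(0.0, fixed_delay(attempt, base, cap))


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, fixed_delay(attempt), round(backoff_delay(attempt), 3))
```

With jitter applied, a histogram of `retry_delay_seconds` should look spread out rather than clustered at a few fixed values.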

Cross-protocol: HTTP, gRPC, Kafka

The idempotency token concept is universal across protocols:

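| Protocol | Token | Where |
| --- | --- | --- |
| HTTP | Idempotency-Key header | IETF draft draft-ietf-httpapi-idempotency-key-header |
| gRPC | No built-in token; carry a request ID in the message or metadata | Retries are configured through the service-config retryPolicy JSON; grpc-retry-pushback-ms is server pushback metadata, not a dedup token |
| Kafka | Producer ID + sequence number (enable.idempotence=true on the producer) | Per-partition sequence numbers |
| AWS SQS FIFO | MessageDeduplicationId | Queue-level dedup window (5 min) |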

Kafka idempotent producer: with enable.idempotence=true the producer obtains a Producer ID and stamps its writes with per-partition sequence numbers. The broker uses them to reject duplicates within a producer session. For atomic writes across partitions, use Kafka transactions.
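The sequence-number check can be pictured with a toy model of a single partition. This is a deliberate simplification for illustration, not Kafka's actual broker logic, and `PartitionLog` is a hypothetical name.

```python
class PartitionLog:
    """Toy model of one partition that deduplicates by (producer_id, sequence)."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id, seq, value):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            # a retry of something already written: do not append again
            return "duplicate"
        if seq != last + 1:
            # a gap means an earlier write is missing
            return "out_of_order"
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


if __name__ == "__main__":
    log = PartitionLog()
    print(log.append(7, 0, "a"), log.append(7, 0, "a"), log.records)
```

The same message retried after a lost acknowledgement carries the same sequence number, so the log ends up with one copy no matter how many times the producer resends it.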

Deployment: middleware, not hand-rolled

Hand-rolling idempotency in every endpoint produces inconsistent fingerprint formulas and incompatible TTLs. Middleware libraries do it once:

  • express-idempotency (Node.js)
  • django-idempotency-key (Python/Django)
  • fastapi-idempotency (Python/FastAPI)
  • AspNetCore.Idempotency (.NET)
  • Stripe’s idempotent-requests (open-sourced)
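To make concrete what such middleware centralizes, here is a minimal in-memory sketch with one fingerprint formula and one TTL. It does not reproduce the API of any library listed above; `IdempotencyMiddleware`, `fingerprint`, the SHA-256 formula, and the 24-hour default TTL are all assumptions, and the dictionary store is single-process only.

```python
import hashlib
import json
import time


def fingerprint(method, path, body):
    """One shared formula: SHA-256 over method, path, and canonical JSON body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{method}\n{path}\n{canonical}".encode()).hexdigest()


class IdempotencyMiddleware:
    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> {"fp", "response", "expires_at"}

    def handle(self, key, method, path, body, handler):
        now = self.clock()
        fp = fingerprint(method, path, body)
        entry = self.entries.get(key)
        if entry is not None and entry["expires_at"] <= now:
            entry = None  # expired keys are treated as unseen
        if entry is None:
            # outcome "new": reserve the key, then run the real handler
            self.entries[key] = {"fp": fp, "response": None,
                                 "expires_at": now + self.ttl_seconds}
            try:
                response = handler(body)
            except Exception:
                del self.entries[key]  # let the client retry after a failure
                raise
            self.entries[key]["response"] = response
            return 200, response
        if entry["fp"] != fp:
            # outcome "mismatch": same key, different request
            return 422, {"error": "idempotency key reused with a different request"}
        if entry["response"] is None:
            # outcome "in_flight": the first request has not finished
            return 409, {"error": "request with this key is still in flight"}
        # outcome "replay": return the saved response without re-executing
        return 200, entry["response"]
```

Because the fingerprint uses sorted keys, two bodies that differ only in field order count as the same request. A shared store with an atomic insert-if-absent would be needed before this pattern is safe across multiple processes.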

At the infrastructure layer: Envoy/Istio expose retry policy per route, and Envoy supports retry budgets configured per upstream cluster. On AWS, Powertools for AWS Lambda provides an idempotency utility; verify what your API gateway itself offers before relying on it for deduplication.

Why this works

Why did IETF standardization of the Idempotency-Key header lag so far behind adoption? Stripe published the pattern in 2014; adoption was widespread before standardization work began. The IETF process (draft-ietf-httpapi-idempotency-key-header, through multiple revisions) exists to specify the header semantics interoperably, especially key expiry and fingerprint behavior, so that API gateways and client SDKs can implement it consistently without reading Stripe’s blog post.

Quiz

Your payment service's `idempotency_key_total{outcome='mismatch'}` rises to 10% of all POST /charge requests. What is the most likely root cause?

Quiz

The outbox relay has 18,000 unpublished rows after a Kafka broker failure. The broker recovers. What risk does the relay face when it resumes publishing at full speed?

Quiz

A global anycast payment API needs idempotency keys to survive a single-region failure. Which architecture satisfies this requirement?

Recall before you leave
  1. What six metrics form the minimum viable idempotency + retry dashboard, and what does each signal?
  2. Explain the design of a global-anycast idempotency cache with double-charge probability ≤ 10⁻⁹ and 30-day key retention.
  3. What is the root cause shared by all four production failure stories (Stripe 2017, Knight Capital 2012, AWS S3 2017, GitHub 2018)?
Recap

A production idempotency + retry system needs six observability metrics: key outcome distribution, retry attempt distribution, retry delay distribution, outbox lag, DLQ depth, and HTTP status class breakdown. A 422 mismatch rate above 0.1% signals a client SDK key-reuse bug. A surge in attempt-3 retries is an early downstream degradation signal. All four major production failures (Stripe 2017, Knight Capital 2012, AWS S3 2017, GitHub 2018) share the same root: at-least-once delivery without idempotent consumers, or retries without jitter. At global scale, the cache needs active-active cross-region replication so that a region failure does not lose keys and create double charges. Use middleware libraries rather than hand-rolling idempotency per endpoint, because inconsistent fingerprint formulas are how bugs hide.
