Distributed Systems
Putting it together: an order pipeline where the seams leak
A customer gets refunded twice for one cancelled order. Finance flags it; the on-call engineer pulls the trace. The payment service is idempotent, the saga orchestrator is textbook correct, the retry policy has backoff and jitter — every layer passes its own unit tests. The bug lives in none of them. The orchestrator fired a refund compensation, the call timed out at 30s while the refund actually succeeded, the retry budget had room, so it retried — and the refund step carried no idempotency key. Two correct layers, composed, produced a wrong number on a real customer’s statement.
The system: one order pipeline, six primitives
Everything in this track was a primitive in isolation. The capstone is one realistic system that uses all of them at once: an order/payment pipeline spanning four services — Order, Payment, Inventory, Shipping — coordinated by a saga. Walk the request and every prior lesson reappears as a layer:
- Quorum replication (
R + W > N): the Order service writes to a replicated store. WithN=3, W=2, R=2, a write that lands on 2 of 3 replicas survives one node loss and any later read overlaps it. That is what makes “order created” durable. - Leader election + fencing tokens: the saga orchestrator runs as a single leader so two coordinators don’t drive the same order. When the leader pauses (GC, network partition) and a new one is elected, the old one can wake up and still act — so every downstream write carries a monotonic fencing token, and resources reject any token lower than the highest they’ve seen.
- Clocks / ordering: you cannot trust wall-clock timestamps to order events across services. Logical clocks (or Spanner’s TrueTime with bounded uncertainty) order the saga’s steps; “Payment happened before Shipping” must be a causal fact, not a
Date.now()comparison. - Sagas + compensations: there is no distributed ACID transaction across four services. The saga runs forward steps and, on failure, runs compensations — semantic reversals (refund, restock), not rollbacks.
- Retry discipline: every cross-service call retries with exponential backoff plus jitter, under a retry budget, behind a circuit breaker.
Each layer is correct. The interesting failures are not inside any layer — they are in how the layers compose.
Idempotency is the load-bearing primitive
At-least-once delivery and retries are the default reality of distributed systems: a call that times out may have succeeded, and you cannot tell from the caller’s side. The only safe response to “did it happen?” being unknowable is to make “do it again” harmless. That is idempotency, and it is what makes every other layer in the pipeline safe.
The key word is key. Idempotency is not “the operation is naturally repeatable” — decrement stock by 1 is not idempotent. It is “the operation carries a stable business key, and the receiver records that key so a second arrival of the same key is a no-op that returns the first result.” The trap a senior watches for: keying on the wrong thing. Key on the HTTP request id and a retry generated by a different layer (the saga vs the SDK vs the client) gets a fresh id and slips through. The key must be derived from the business intent — refund:order-8842:attempt-1 — and threaded through every layer that might retry.
Why this works
“At-least-once” and “exactly-once” are not two delivery modes you pick between. Exactly-once delivery over a network is impossible; what people call “exactly-once” is really at-least-once delivery plus idempotent processing keyed on a dedup id. The network gives you duplicates; the key gives you exactly-once effects. Conflating the two is how teams ship a saga believing the broker guarantees no duplicates.
The composition failure: compensation races retry
Here is the bug from the Hook, mechanically. The saga decides to cancel and emits the refund compensation:
| t | Saga / retry layer | Payment service | Effect |
|---|---|---|---|
| 0s | emit refund compensation | receives request, starts refund | — |
| 28s | waiting on response | refund succeeds, response delayed | $50 returned (1st time) |
| 30s | timeout; budget OK → retry | receives 2nd request, no dedup key | $50 returned (2nd time) |
| 31s | retry returns 200 | — | double refund |
Trace each layer in isolation and it is blameless. The timeout was reasonable. The retry budget had room, so retrying was the correct policy decision. The compensation logic was right — a cancelled order should refund. The payment service was idempotent for the charge path. The defect is purely in the seam: a retry (one correct layer) re-invoked a compensation (another correct layer) across a call that lacked a shared idempotency key. Fix it by threading one key — refund:order-8842 — from the saga through the retry layer into the payment service, where the second arrival of that key returns the first refund’s result instead of issuing a new one. Nothing about the individual layers changes; the seam gains a shared key.
Observability is how you actually run it
The pipeline does not announce its own decay. Composition failures hide because each component’s own dashboard is green. You run the system on the signals that live between layers:
- Consumer lag — how far behind the saga’s event consumers are. Rising lag means the pipeline is falling behind producers; sagas stall mid-flight, and in-flight compensations pile up.
- Quorum write/read latency — p99 of the
W=2write. A slow third replica drags tail latency even when no node is technically “down”. - Leader churn — elections per minute. Frequent re-elections mean fencing tokens are bumping constantly and you are one pause away from a split-brain write attempt.
- Retry-budget exhaustion — the percentage of the budget consumed. When budget is near 100%, retries are about to be dropped, which surfaces as user-visible failures even though every downstream service is healthy. Budgets are typically set to allow retries to add at most ~10% on top of normal request rate, precisely so a retry storm cannot amplify a small fault into an outage.
The senior instinct: alert on the seams, not just the nodes. A pipeline of healthy services can still be failing customers.
Your order saga's `refund` compensation can be retried by the retry layer after a timeout. Where do you put the safeguard so a double refund is impossible?
Every service in the order pipeline passes its own tests, yet customers see occasional double charges. Where is the bug most likely to be?
The saga orchestrator runs as an elected leader. After a long GC pause it wakes up and tries to write a saga step. What stops it from corrupting state a new leader already advanced?
Order the steps to diagnose a composition failure (intermittent double refund) where every service is green:
- 1 Reproduce from a trace, not a single service log — composition bugs span layers
- 2 Find the seam: which two correct layers interacted (here: retry layer re-invoking a compensation)
- 3 Check whether a stable idempotency key crosses that seam — usually it does not
- 4 Thread one business-derived key through every layer that can retry the operation
- 5 Dedup at the receiver on that key, and add a seam-level alert (e.g. duplicate-effect counter)
- 01Walk a teammate through how a refund compensation and a retry combine into a double refund, even though both layers are individually correct, and how one change fixes it.
- 02Why is idempotency called the load-bearing primitive of the whole pipeline, and what makes a key correct versus a key that silently fails?
The capstone is not a new primitive — it is the realization that every primitive in this track is correct in isolation and the real failures live in the seams where they compose. One order/payment pipeline uses all of them: quorum replication (R + W > N) for durable order writes, an elected leader with fencing tokens so a paused orchestrator cannot corrupt state a new leader advanced, logical clocks or TrueTime to order steps causally, sagas with compensations because there is no distributed ACID transaction, and retries with backoff, jitter, and a budget capped near 10% over normal traffic. Idempotency keyed on business intent is the load-bearing primitive: at-least-once delivery makes duplicates inevitable, so the only safe response to an unknowable “did it happen?” is to make “do it again” harmless. The canonical composition failure is a refund compensation racing a retry across a call with no shared key, producing a double refund while every layer’s own tests stay green — fixed by threading one key through the seam, not by changing any layer. You run the whole thing on seam signals — consumer lag, quorum latency, leader churn, retry-budget exhaustion — because a pipeline of healthy services can still be failing real customers.