Distributed capstone: design a fault-tolerant pipeline

Capstone build — design and harden a fault-tolerant order/payment pipeline that composes quorum, leader election, clocks, sagas, and retries, then prove it survives the seam failures.

Track: DIST · Level: Senior · Time: 240 min

Reading about composition failures is not the same as building a pipeline that survives them. Design a small fault-tolerant order/payment service that wires every primitive from this track together, then inject the failures — a paused leader, a lost response, a retry storm — and prove the seams hold.

Goal

Turn the whole track into one running system: pick a consistency model, make writes durable with quorum, coordinate with an elected leader behind fencing tokens, reverse with sagas, bound retries with a budget, and make every cross-service effect idempotent — then verify it under injected failure, not on paper.
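The fencing-token part of this goal can be sketched in a few lines. This is a minimal in-memory illustration, not a prescribed design: `FencedStore`, `StaleLeaderError`, and the order key are hypothetical names, and the sketch assumes leader election hands out monotonically increasing integer tokens.

```python
from dataclasses import dataclass, field


class StaleLeaderError(Exception):
    """Raised when a write carries a token older than one already accepted."""


@dataclass
class FencedStore:
    # Highest fencing token accepted so far. Tokens are assumed to be issued
    # in increasing order by whatever component performs leader election.
    highest_token: int = 0
    data: dict = field(default_factory=dict)
    rejected: list = field(default_factory=list)  # log of rejected tokens

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            # A newer leader has already written: fence out the stale one.
            self.rejected.append(token)
            raise StaleLeaderError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.write(1, "order-42", "PENDING")        # leader holding token 1
store.write(2, "order-42", "PAID")           # new leader elected with token 2
try:
    store.write(1, "order-42", "CANCELLED")  # old leader resumes after a pause
except StaleLeaderError as exc:
    print("rejected:", exc)
```

The check lives in the store rather than in the leader, because a paused leader cannot know its lease has expired. The `rejected` list stands in for the log of rejected tokens that the paused-leader scenario asks for.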

Project

Objective

Design and build a small but realistic order/payment pipeline spanning at least three services (Order, Payment, plus Inventory or Shipping) coordinated by a saga, and demonstrate — with injected failures — that it survives a paused stale leader, a lost response, and a retry storm without producing a double effect or a lost effect.
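As a starting point for the saga in this objective, here is a minimal in-process orchestrator sketch. The step names, the `run_saga` helper, and the failing shipping step are all hypothetical. A real build would place each step behind a network call and would need to retry compensations with an idempotency key.

```python
from typing import Callable, Dict, List, Tuple

# Each step is (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for prev_name, _, prev_compensate in reversed(done):
                prev_compensate()
                log.append(f"compensated:{prev_name}")
            return False
        log.append(f"done:{name}")
        done.append((name, action, compensate))
    return True


state: Dict[str, int] = {"reserved": 0, "charged": 0}


def reserve() -> None:
    state["reserved"] += 1


def release() -> None:
    state["reserved"] -= 1


def charge() -> None:
    state["charged"] += 100


def refund() -> None:
    state["charged"] -= 100


def ship() -> None:
    raise RuntimeError("shipping unavailable")  # injected failure


def noop() -> None:
    pass


log: List[str] = []
ok = run_saga(
    [("reserve", reserve, release), ("charge", charge, refund), ("ship", ship, noop)],
    log,
)
print(ok, log, state)
```

Compensating in reverse order undoes the payment before releasing the inventory, mirroring the order in which the effects were applied.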

Requirements

Acceptance criteria

Under the paused-leader scenario, the stale leader's write is rejected by the fencing check and the state the new leader advanced is intact — shown by a log of the rejected token.
Under the lost-response scenario, the retried compensation produces exactly one effect: the receiver returns the first result on the duplicate key, proven by a single refund record despite two physical calls.
Under the retry storm, the retry budget sheds excess retries and the downstream service recovers — shown by a load graph where retry traffic stays bounded instead of amplifying.
A seam-signal dashboard (or logged equivalent) covers consumer lag, quorum write/read p99, leader churn, and retry-budget consumption, and a short note states which signal would have caught each injected failure first.
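The retry-storm criterion depends on a budget that sheds retries once they exceed a fraction of normal traffic. One common shape is a credit counter: each first attempt earns a little credit, and each retry spends a fixed amount. The sketch below assumes a 10% budget and uses integer credit to avoid floating-point drift. `RetryBudget` and its parameters are hypothetical names for illustration.

```python
class RetryBudget:
    """Allow retries only up to a percentage of first-attempt traffic."""

    COST = 100  # credit units spent per retry

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 10) -> None:
        self.retry_percent = retry_percent          # units earned per request
        self.cap = max_banked_retries * self.COST   # limit on saved-up credit
        self.credit = 0

    def record_request(self) -> None:
        # Every first attempt earns a small amount of retry credit.
        self.credit = min(self.cap, self.credit + self.retry_percent)

    def try_retry(self) -> bool:
        # A retry is allowed only if a full retry's worth of credit is banked.
        if self.credit >= self.COST:
            self.credit -= self.COST
            return True
        return False


# Simulated storm: 100 first attempts all fail and each one wants a retry.
budget = RetryBudget(retry_percent=10)
allowed = 0
shed = 0
for _ in range(100):
    budget.record_request()
    if budget.try_retry():
        allowed += 1
    else:
        shed += 1

print(f"allowed={allowed} shed={shed}")
```

With every call failing, an unbudgeted client would double the load on the downstream service. Here retry traffic is capped at roughly 10% of first attempts, the bounded shape the load graph should show.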

Senior stretch

Add a chaos test that randomly combines two failures at once (e.g. leader pause during a retry storm) and confirm no double or lost effect across a sustained run.
Add an on-call runbook: how to triage each seam signal, the most likely composition failure behind it, and the verification step that confirms the fix.
Swap the consistency model on one piece of state (e.g. move inventory from eventual to linearizable) and document the latency and availability cost you paid for the stronger guarantee.
Add a duplicate-effect counter (a metric that increments whenever a dedup key is hit a second time) and alert on it — turning a silent composition failure into a visible signal.
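The duplicate-effect counter from the last stretch item fits naturally into the dedup path of an idempotent receiver. The following is a minimal in-memory sketch. `RefundService`, the key format, and the field names are hypothetical, and a real service would persist the key table and export the counter through its metrics system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RefundService:
    results: Dict[str, str] = field(default_factory=dict)  # idempotency key -> first result
    refunds: List[str] = field(default_factory=list)       # refund records actually written
    duplicate_hits: int = 0                                # metric to alert on

    def refund(self, idempotency_key: str, order_id: str) -> str:
        if idempotency_key in self.results:
            # Duplicate delivery: count it and return the first result
            # without performing the refund again.
            self.duplicate_hits += 1
            return self.results[idempotency_key]
        record = f"refund:{order_id}"
        self.refunds.append(record)
        self.results[idempotency_key] = record
        return record


svc = RefundService()
first = svc.refund("saga-7:refund", "order-42")   # the response to this call is lost
second = svc.refund("saga-7:refund", "order-42")  # the caller retries with the same key
print(first, second, len(svc.refunds), svc.duplicate_hits)
```

Two physical calls produce one refund record, and the counter records that the dedup path fired. In this sketch the counter increments on every repeated key, so a sustained rise points at a caller that is retrying more than expected.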

Recap

This is the system you will actually be asked to design and defend: a pipeline where each primitive is correct alone and the engineering is in the seams. You chose a consistency model per state, made writes durable with R + W > N, kept coordination single-writer-safe with an elected leader and fencing tokens, ordered steps causally, reversed with idempotent compensations, and bounded retries with a budget. The proof is not the diagram — it is the injected paused leader whose write is fenced out, the retried compensation that refunds exactly once, and the retry storm the budget contains. Build it once on a toy pipeline and the production version becomes muscle memory. Now when you sit down to design a distributed pipeline for real, the first checklist item is not “which services do I need” — it is “which seams carry a shared idempotency key, and how do I verify that under injected failure.”
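The R + W > N condition in this recap can be checked by brute force on small clusters. The sketch below enumerates every possible write set and read set and confirms that they always share a replica exactly when R + W > N. The function name `quorums_overlap` is a hypothetical helper, and the check says nothing about failure handling or ordering.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w shares a replica with every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


print(quorums_overlap(3, 2, 2))  # R + W = 4 > 3
print(quorums_overlap(3, 1, 2))  # R + W = 3, not greater than 3
```

The overlap means a read quorum always contacts at least one replica that acknowledged the latest write. Picking the newest value among the replies still requires versioning, which is where the clock choices from earlier in the track come in.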
