
Distributed Systems

Distributed capstone: design a fault-tolerant pipeline

Crux: a capstone build. Design and harden a fault-tolerant order/payment pipeline that composes quorum, leader election, clocks, sagas, and retries, then prove it survives the seam failures.
Altitude: senior
◷ 240 min

Reading about composition failures is not the same as building a pipeline that survives them. Design a small fault-tolerant order/payment service that wires every primitive from this track together, then inject the failures — a paused leader, a lost response, a retry storm — and prove the seams hold.

Goal

Turn the whole track into one running system: pick a consistency model, make writes durable with quorum, coordinate with an elected leader behind fencing tokens, reverse with sagas, bound retries with a budget, and make every cross-service effect idempotent — then verify it under injected failure, not on paper.
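One way to "bound retries with a budget" is a token-style budget in which first attempts earn retry credit and each retry spends it. The sketch below is a minimal single-process illustration, not a prescribed design: the class name `RetryBudget`, the 10-retries-per-100-requests ratio, and the cap are all assumptions chosen for the example. Integer units are used so the arithmetic stays exact.

```python
class RetryBudget:
    """Retries may consume at most a fixed fraction of first-attempt traffic.

    Hypothetical sketch: names and default numbers are illustrative only.
    """

    COST = 100  # one retry costs 100 units

    def __init__(self, retries_per_100_requests: int = 10, max_banked_retries: int = 10):
        # Each first-attempt request deposits this many units.
        self.deposit = retries_per_100_requests
        # The balance is capped so an idle period cannot bank an unbounded burst.
        self.cap = max_banked_retries * self.COST
        self.balance = 0

    def record_request(self) -> None:
        """Call once per first attempt (not per retry)."""
        self.balance = min(self.cap, self.balance + self.deposit)

    def try_retry(self) -> bool:
        """Return True if a retry may be sent; False means shed it."""
        if self.balance >= self.COST:
            self.balance -= self.COST
            return True
        return False
```

With these defaults, 100 first attempts earn 10 retries; an eleventh retry is shed until more first-attempt traffic arrives, which is what keeps retry load proportional to real load instead of amplifying it.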

Project
Objective

Design and build a small but realistic order/payment pipeline spanning at least three services (Order, Payment, plus Inventory or Shipping) coordinated by a saga, and demonstrate — with injected failures — that it survives a paused stale leader, a lost response, and a retry storm without producing a double effect or a lost effect.
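The saga coordination in the objective can be sketched as an orchestrator that runs steps in order and, on the first failure, runs the compensations of the completed steps in reverse. This is a minimal in-memory illustration under stated assumptions: `run_saga`, the step names, and the `effects` list are hypothetical, and a real orchestrator would persist its progress and retry compensations (which is why they must be idempotent).

```python
from typing import Callable, List, Tuple

# (name, action, compensation); all three are illustrative placeholders.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}:failed")
            # Undo only what actually completed, newest first.
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"{prev_name}:compensated")
            return log
        done.append(step)
        log.append(f"{name}:done")
    return log


def out_of_stock() -> None:
    raise RuntimeError("inventory: out of stock")


effects: List[str] = []
example_log = run_saga([
    ("order", lambda: effects.append("order created"), lambda: effects.append("order cancelled")),
    ("payment", lambda: effects.append("card charged"), lambda: effects.append("card refunded")),
    ("inventory", out_of_stock, lambda: None),
])
```

Here the Inventory step fails, so the payment is refunded and then the order is cancelled; the failed step itself is not compensated because it never completed.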

Requirements
Acceptance criteria
  • Under the paused-leader scenario, the stale leader's write is rejected by the fencing check and the state the new leader advanced is intact — shown by a log of the rejected token.
  • Under the lost-response scenario, the retried compensation produces exactly one effect: the receiver returns the first result on the duplicate key, proven by a single refund record despite two physical calls.
  • Under the retry storm, the retry budget sheds excess retries and the downstream service recovers — shown by a load graph where retry traffic stays bounded instead of amplifying.
  • A seam-signal dashboard (or logged equivalent) covers consumer lag, quorum write/read p99, leader churn, and retry-budget consumption, and a short note states which signal would have caught each injected failure first.
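The paused-leader criterion above depends on the storage side checking a fencing token on every write. A minimal sketch, assuming tokens are integers that increase with each election; the `FencedStore` class and its `rejected` log are hypothetical names for illustration, and a real store would perform the compare-and-write atomically:

```python
class FencedStore:
    """Accepts a write only if its token is not older than the newest seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value = None
        self.rejected = []  # log of rejected (stale) tokens

    def write(self, token: int, value) -> bool:
        # A token lower than one already seen belongs to a deposed leader.
        if token < self.highest_token:
            self.rejected.append(token)
            return False
        # The current leader may write repeatedly with the same token.
        self.highest_token = token
        self.value = value
        return True
```

If leader 1 pauses, leader 2 is elected and writes with token 2, and leader 1 then resumes and writes with token 1, that last write returns `False` and token 1 lands in `rejected`, which is the log entry the criterion asks for.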
Senior stretch
  • Add a chaos test that randomly combines two failures at once (e.g. leader pause during a retry storm) and confirm no double or lost effect across a sustained run.
  • Add an on-call runbook: how to triage each seam signal, the most likely composition failure behind it, and the verification step that confirms the fix.
  • Swap the consistency model on one piece of state (e.g. move inventory from eventual to linearizable) and document the latency and availability cost you paid for the stronger guarantee.
  • Add a duplicate-effect counter (a metric that increments whenever a dedup key is hit a second time) and alert on it — turning a silent composition failure into a visible signal.
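The lost-response criterion and the duplicate-effect counter from the last stretch item can be sketched together as an idempotent receiver. This is an in-memory illustration only: `RefundReceiver`, its field names, and the record shape are assumptions, and a real receiver would store the key and the effect in one durable atomic operation.

```python
class RefundReceiver:
    """Applies a refund at most once per idempotency key."""

    def __init__(self) -> None:
        self.results = {}        # idempotency key -> first result returned
        self.refunds = []        # refund records actually created
        self.duplicate_hits = 0  # metric: dedup key seen a second time

    def refund(self, key: str, order_id: str, amount_cents: int) -> dict:
        if key in self.results:
            # Duplicate physical call: count it and replay the first result.
            self.duplicate_hits += 1
            return self.results[key]
        record = {
            "refund_id": len(self.refunds) + 1,
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self.refunds.append(record)
        self.results[key] = record
        return record
```

Calling `refund` twice with the same key yields one refund record and a `duplicate_hits` value of 1, so the retry is both harmless and visible.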
Recap

This is the system you will actually be asked to design and defend: a pipeline where each primitive is correct alone and the engineering is in the seams. You chose a consistency model per piece of state, made writes durable on W replicas and visible to every read quorum with R + W > N, kept coordination single-writer-safe with an elected leader and fencing tokens, ordered steps causally, reversed with idempotent compensations, and bounded retries with a budget. The proof is not the diagram: it is the injected paused leader whose write is fenced out, the retried compensation that refunds exactly once, and the retry storm the budget contains. Build it once on a toy pipeline and the production version becomes muscle memory.
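The R + W > N condition in the recap can be checked mechanically. The sketch below compares the arithmetic rule with a brute-force enumeration of every possible read and write replica set; the function names are hypothetical and the model ignores sloppy quorums and concurrent writes.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Arithmetic rule: any R replicas and any W replicas share at least one."""
    return r + w > n


def every_read_meets_every_write(n: int, r: int, w: int) -> bool:
    """Brute force: check all read sets of size r against all write sets of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """How many replicas may be down while reads and writes still reach quorum."""
    return {"reads": n - r, "writes": n - w}
```

For N = 3, R = 2, W = 2 the quorums overlap and one replica may be down for either operation; for N = 3, R = 1, W = 1 a read can miss the latest write entirely.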

