Distributed capstone: design a fault-tolerant pipeline
Reading about composition failures is not the same as building a pipeline that survives them. Design a small fault-tolerant order/payment service that wires every primitive from this track together, then inject the failures — a paused leader, a lost response, a retry storm — and prove the seams hold.
Turn the whole track into one running system: pick a consistency model for each piece of state, make writes durable with quorum writes, coordinate through an elected leader guarded by fencing tokens, undo partial work with saga compensations, bound retries with a budget, and make every cross-service effect idempotent. Then verify it under injected failure, not on paper.
Design and build a small but realistic order/payment pipeline spanning at least three services (Order, Payment, plus Inventory or Shipping) coordinated by a saga, and demonstrate — with injected failures — that it survives a paused stale leader, a lost response, and a retry storm without producing a double effect or a lost effect.
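As a starting point, the saga coordination described above can be sketched as a list of steps, each paired with a compensation that runs in reverse order when a later step fails. This is a minimal in-memory sketch, not a prescribed design: `Step`, `run_saga`, and the stub service functions are hypothetical names, and a real pipeline would persist saga state and call remote services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # forward effect on one service
    compensate: Callable[[dict], None]  # undo; assumed to be idempotent


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            done.append(step)
        except Exception:
            ctx["failed_at"] = step.name
            for prev in reversed(done):
                prev.compensate(ctx)
            return False
    return True


# Hypothetical stand-ins for the Inventory, Payment, and Shipping services.
log: List[str] = []

def reserve(ctx): log.append("reserve")
def release(ctx): log.append("release")
def charge(ctx): log.append("charge")
def refund(ctx): log.append("refund")
def ship(ctx): raise RuntimeError("shipping unavailable")  # injected failure
def cancel_shipment(ctx): log.append("cancel_shipment")

steps = [
    Step("inventory", reserve, release),
    Step("payment", charge, refund),
    Step("shipping", ship, cancel_shipment),
]
ok = run_saga(steps, {"order_id": "o-1"})
print(ok, log)
```

Because the shipping step fails before it completes, only the payment and inventory steps are compensated, newest first. The failed step's own compensation is not called in this sketch; whether it should be depends on whether that step can leave a partial effect behind.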
- Under the paused-leader scenario, the stale leader's write is rejected by the fencing check and the state the new leader advanced is intact — shown by a log of the rejected token.
- Under the lost-response scenario, the retried compensation produces exactly one effect: the receiver returns the first result on the duplicate key, proven by a single refund record despite two physical calls.
- Under the retry storm, the retry budget sheds excess retries and the downstream service recovers — shown by a load graph where retry traffic stays bounded instead of amplifying.
- A seam-signal dashboard (or logged equivalent) covering consumer lag, quorum write/read p99, leader churn, and retry-budget consumption, with a short note on which signal would have caught each injected failure first.
- Add a chaos test that randomly combines two failures at once (e.g. leader pause during a retry storm) and confirm no double or lost effect across a sustained run.
- Add an on-call runbook: how to triage each seam signal, the most likely composition failure behind it, and the verification step that confirms the fix.
- Swap the consistency model on one piece of state (e.g. move inventory from eventual consistency to linearizability) and document the latency and availability cost you paid for the stronger guarantee.
- Add a duplicate-effect counter (a metric that increments whenever a dedup key is hit a second time) and alert on it — turning a silent composition failure into a visible signal.
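The fencing check, the duplicate-key behavior, and the duplicate-effect counter from the items above might look like the following. It is a sketch under stated assumptions: `FencedStore` and `RefundReceiver` are hypothetical classes, tokens are assumed to be integers that increase with each leader election, and state lives in memory rather than in a durable store.

```python
class FencedStore:
    """Storage that rejects writes carrying a token older than the newest seen."""

    def __init__(self):
        self.highest_token = 0
        self.value = None
        self.rejected = []  # log of rejected (stale) tokens

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            self.rejected.append(token)  # stale leader: fence it out
            return False
        self.highest_token = token
        self.value = value
        return True


class RefundReceiver:
    """Returns the first result for a repeated idempotency key."""

    def __init__(self):
        self.results = {}        # idempotency key -> first result
        self.refunds = []        # refund records actually created
        self.duplicate_hits = 0  # duplicate-effect counter to alert on

    def refund(self, key: str, order_id: str, amount_cents: int) -> dict:
        if key in self.results:
            self.duplicate_hits += 1
            return self.results[key]
        record = {
            "refund_id": len(self.refunds) + 1,
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self.refunds.append(record)
        self.results[key] = record
        return record


# Paused leader: token 1 pauses, token 2 takes over, token 1 wakes up late.
store = FencedStore()
store.write(1, "order o-1: paid")
store.write(2, "order o-1: shipped")
stale_ok = store.write(1, "order o-1: cancelled")

# Lost response: the caller never sees the first reply and retries.
receiver = RefundReceiver()
first = receiver.refund("refund:o-1", "o-1", 2500)
second = receiver.refund("refund:o-1", "o-1", 2500)
print(stale_ok, store.rejected, len(receiver.refunds), receiver.duplicate_hits)
```

The `rejected` list plays the role of the rejected-token log, and the length of `refunds` is the single refund record despite two physical calls.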
This is the system you will actually be asked to design and defend: a pipeline where each primitive is correct alone and the engineering is in the seams. You chose a consistency model for each piece of state, made writes durable with quorum replication (R + W > N, so every read quorum overlaps the latest write quorum), kept coordination single-writer-safe with an elected leader and fencing tokens, ordered steps causally, reversed with idempotent compensations, and bounded retries with a budget. The proof is not the diagram. It is the injected paused leader whose write is fenced out, the retried compensation that refunds exactly once, and the retry storm the budget contains. Build it once on a toy pipeline and the production version becomes muscle memory.
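One way to picture the retry budget is as a bucket that first attempts fill and retries drain, so retry traffic stays a bounded fraction of real traffic. The sketch below assumes a 10% budget and a total outage in which every attempt fails; `RetryBudget` and `simulate_outage` are hypothetical names, and integer units are used to avoid floating-point drift.

```python
class RetryBudget:
    COST = 100  # one retry costs 100 units (hundredths of a token)

    def __init__(self, retry_percent: int = 10, max_retries_banked: int = 10):
        self.retry_percent = retry_percent
        self.cap = max_retries_banked * self.COST
        self.units = 0

    def record_request(self) -> None:
        """Each first attempt earns retry_percent hundredths of a retry."""
        self.units = min(self.cap, self.units + self.retry_percent)

    def try_retry(self) -> bool:
        """Spend one retry if the budget allows it; otherwise shed."""
        if self.units >= self.COST:
            self.units -= self.COST
            return True
        return False


def simulate_outage(requests: int, max_retries: int, budget: RetryBudget):
    """Every attempt fails; count the retries sent and the requests shed."""
    retries = shed = 0
    for _ in range(requests):
        budget.record_request()
        for _ in range(max_retries):
            if not budget.try_retry():
                shed += 1  # budget exhausted: give up instead of retrying
                break
            retries += 1
    return retries, shed


retries, shed = simulate_outage(100, 3, RetryBudget())
print(retries, shed)
```

Without the budget, 100 failing requests with three retries each would send 300 extra calls downstream. With it, this run sends 10, which is the bounded line the load graph should show.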