Distributed Systems
Distributed capstone: multiple-choice review
Six questions that cut across the whole track. Each one is a decision an on-call engineer makes when a pipeline of individually-correct services still produces a wrong number on a customer’s statement. None of them is a definition to recite.
Confirm you can connect quorum replication, leader election with fencing, logical clocks, saga compensations, and retry discipline into one system — and locate the seam where two correct layers compose into a bug.
An order pipeline of four services occasionally double-refunds a customer. Every service passes its own unit tests; the payment service is idempotent for charges; the saga is textbook; retries have backoff and jitter. Where does the bug live?
The saga orchestrator runs as an elected leader. It suffers a long GC pause, its lease expires, a new leader is elected — then the old one wakes and tries to write a saga step. What actually prevents it from corrupting state the new leader already advanced?
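The track's summary names fencing tokens as what makes leader-based coordination single-writer-safe. The block below is a minimal sketch of that idea, assuming the lease service hands out a monotonically increasing token with each lease and the storage layer checks it on every write. `FencedSagaStore`, the token values, and the step names are hypothetical, chosen only for illustration.

```typescript
// Hypothetical in-memory saga store that rejects writes carrying a stale fencing token.
class FencedSagaStore {
  private highestToken = 0;
  private readonly steps = new Map<string, string>();

  // Accept the write only if the caller's token is not older than the newest one seen.
  write(token: number, sagaId: string, step: string): boolean {
    if (token < this.highestToken) {
      return false; // stale leader: a newer lease has already written
    }
    this.highestToken = token;
    this.steps.set(sagaId, step);
    return true;
  }

  read(sagaId: string): string | undefined {
    return this.steps.get(sagaId);
  }
}

const fencedStore = new FencedSagaStore();
const oldLeaderToken = 33; // assumed: issued with the lease that later expired
const newLeaderToken = 34; // assumed: issued to the leader elected afterwards

// The new leader advances the saga first.
const newLeaderWrote = fencedStore.write(newLeaderToken, "saga-1", "PAYMENT_CAPTURED");

// The old leader wakes from its pause and tries to write with its stale token.
const oldLeaderWrote = fencedStore.write(oldLeaderToken, "saga-1", "PAYMENT_PENDING");
```

The point of the design is that the check lives in the storage layer, not in the paused process: the old leader cannot be trusted to notice that its own lease has expired.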
A teammate proposes ordering saga steps across services by comparing wall-clock timestamps (Date.now()) so that ‘Payment before Shipping’ is enforced. Why does a senior engineer reject this, and what is the correct mechanism?
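Wall clocks on different machines can drift or jump, so comparing their readings says nothing reliable about causality. The track's summary points to logical clocks instead. A minimal Lamport-clock sketch follows, assuming each service carries its counter on outgoing messages. The class and variable names are hypothetical.

```typescript
// Minimal Lamport clock: a counter that only moves forward.
class LamportClock {
  private time = 0;

  // Local event or message send: advance and return the new timestamp.
  tick(): number {
    this.time += 1;
    return this.time;
  }

  // Message receive: move past the sender's timestamp, then advance.
  receive(remoteTime: number): number {
    this.time = Math.max(this.time, remoteTime) + 1;
    return this.time;
  }
}

const paymentClock = new LamportClock();
const shippingClock = new LamportClock();

// Shipping has already done three unrelated local events.
shippingClock.tick();
shippingClock.tick();
shippingClock.tick();

// Payment captures the charge and stamps the outgoing message.
const paymentStamp = paymentClock.tick();

// Shipping receives that message before it starts the shipment.
const shippingStamp = shippingClock.receive(paymentStamp);
```

Because Shipping's step is stamped only after it receives Payment's message, `paymentStamp < shippingStamp` holds regardless of what either machine's wall clock says. The guarantee runs one way only: a smaller Lamport timestamp does not by itself prove that one event caused another.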
Why is idempotency called the load-bearing primitive of the pipeline, rather than just one safeguard among many?
A retry storm is amplifying a small downstream fault into a full outage: every layer retries the layer below it. What is the structural fix, and what number anchors it?
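The track's summary states that retries need a budget so a fault cannot amplify. One way to picture a budget is a cap on retries as a fraction of ordinary requests. The sketch below assumes a cap of 10%; that figure, the `RetryBudget` class, and its method names are illustrative assumptions, not values taken from the track.

```typescript
// Hypothetical retry budget: retries may not exceed a fixed percentage of requests.
class RetryBudget {
  private requests = 0;
  private retries = 0;
  private readonly percent: number;

  constructor(percent: number) {
    this.percent = percent;
  }

  // Call once for every first-attempt request.
  recordRequest(): void {
    this.requests += 1;
  }

  // Spend one retry if doing so keeps retries within percent% of requests.
  tryRetry(): boolean {
    // Integer arithmetic avoids floating-point rounding at the boundary.
    if ((this.retries + 1) * 100 > this.requests * this.percent) {
      return false; // budget exhausted: fail fast instead of piling on
    }
    this.retries += 1;
    return true;
  }
}

const retryBudget = new RetryBudget(10); // assumed 10% budget
for (let i = 0; i < 100; i++) {
  retryBudget.recordRequest();
}

// During a fault, every one of the 100 requests wants a retry.
let allowedRetries = 0;
for (let i = 0; i < 100; i++) {
  if (retryBudget.tryRetry()) {
    allowedRetries += 1;
  }
}
```

With this cap, 100 failing requests produce at most 10 retries, so the extra load on the struggling dependency is bounded at that layer instead of doubling at every hop.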
Every service dashboard is green, yet the pipeline is silently failing customers. Which set of signals does a senior engineer actually watch to catch composition failures?
The track’s through-line is one decision tree: quorum makes a write durable, a leader with fencing tokens makes coordination single-writer-safe, logical clocks order steps causally, sagas reverse with compensations because there is no distributed ACID, and retries need a budget so a fault cannot amplify. Idempotency keyed on business intent is the load-bearing primitive that makes at-least-once safe everywhere. The real failures live in the seams — a retry re-firing a compensation with no shared key — and you catch them on seam signals, not on green per-service dashboards.
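The seam described above, a retry re-firing a compensation with no shared key, can be sketched as follows. This is a minimal illustration, assuming the refund is the saga's compensation and that the key is derived from the order rather than from the attempt. `RefundService`, `refundKey`, and the key format are hypothetical.

```typescript
type RefundResult = { refundId: string; replayed: boolean };

// Hypothetical refund service that deduplicates on an idempotency key.
class RefundService {
  private readonly refundIdByKey = new Map<string, string>();
  private issuedCount = 0;
  private refundedCents = 0;

  refund(idempotencyKey: string, amountCents: number): RefundResult {
    const existing = this.refundIdByKey.get(idempotencyKey);
    if (existing !== undefined) {
      // Same business intent seen before: return the original result, move no money.
      return { refundId: existing, replayed: true };
    }
    this.issuedCount += 1;
    this.refundedCents += amountCents;
    const refundId = `refund-${this.issuedCount}`;
    this.refundIdByKey.set(idempotencyKey, refundId);
    return { refundId, replayed: false };
  }

  totalRefundedCents(): number {
    return this.refundedCents;
  }
}

// The key names the business intent ("refund this order"), not the attempt,
// so every retry of the compensation produces the same key.
function refundKey(orderId: string): string {
  return `order:${orderId}:refund`;
}

const refundService = new RefundService();
const firstAttempt = refundService.refund(refundKey("A-1001"), 2500);
const retriedAttempt = refundService.refund(refundKey("A-1001"), 2500);
```

If the key were generated per attempt (a fresh random ID on each retry, say), both calls would look new to the service and the customer would be refunded twice, even though each layer behaves correctly in isolation.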