Distributed Systems
Sagas: build a fault-tolerant order saga
Reading about compensations and idempotency is not the same as watching a saga survive a process kill at the worst possible instant. Build a small order saga across three services, then break it on purpose — fail steps, redeliver messages, crash the orchestrator mid-flow — and prove the system always lands in a consistent state.
Turn the unit’s model into a working saga: orchestrate three local-commit steps, write a real reversing compensation for each, make every step and compensation idempotent under at-least-once delivery, defend against a concurrent-saga anomaly, and verify the whole thing with injected faults rather than the happy path.
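One way to make a step and its compensation idempotent under at-least-once delivery is to key every effect on the saga ID and let the database reject the duplicate. The sketch below assumes SQLite as the payment service's local store. The `charges` table and the `handle_charge` and `handle_refund` names are hypothetical, chosen only for illustration.

```python
import sqlite3

def open_payment_db(path=":memory:"):
    # Hypothetical schema: one row per saga, so the primary key is the dedupe key.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS charges ("
        " saga_id TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL,"
        " refunded INTEGER NOT NULL DEFAULT 0)"
    )
    return db

def handle_charge(db, saga_id, amount_cents):
    """Apply the charge at most once per saga. Returns True only when newly applied."""
    with db:  # commits on success, rolls back on exception
        cur = db.execute(
            "INSERT OR IGNORE INTO charges (saga_id, amount_cents) VALUES (?, ?)",
            (saga_id, amount_cents),
        )
        # rowcount is 0 when the row already existed, i.e. a redelivery was skipped.
        return cur.rowcount == 1

def handle_refund(db, saga_id):
    """Compensation: mark the charge refunded at most once. Safe to retry."""
    with db:
        cur = db.execute(
            "UPDATE charges SET refunded = 1 WHERE saga_id = ? AND refunded = 0",
            (saga_id,),
        )
        return cur.rowcount == 1
```

Calling `handle_charge` twice with the same saga ID leaves exactly one row, which is the property the redelivery deliverable asks you to show. This sketch does not handle a refund that arrives before its charge; a real handler would need a rule for that ordering.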
Implement an orchestrated order saga across three services (order, payment, inventory) with durable step state, real compensations, idempotent handlers, and one isolation countermeasure — then demonstrate it reaches a consistent end state under every injected failure.
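A minimal orchestrator sketch may help fix the shape before you add real services. It makes several assumptions that are not part of the brief: the `SagaLog` class and `run_saga` function are hypothetical, durable state is a single JSON file per saga, the status names are illustrative, and an action that raises is assumed not to have committed locally.

```python
import json
import os

class SagaLog:
    """Durable saga state: a JSON file rewritten on every transition."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"status": "RUNNING", "done": []}
        with open(self.path) as f:
            return json.load(f)

    def save(self, state):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # rename is atomic on POSIX filesystems

def run_saga(log, steps):
    """steps: ordered list of (name, action, compensation), each callable with no args.

    Returns the terminal status, "COMPLETED" or "COMPENSATED".
    Calling it again with the same log resumes from the recorded state.
    """
    state = log.load()
    if state["status"] == "RUNNING":
        for name, action, _ in steps:
            if name in state["done"]:
                continue  # recorded before a crash, so skip it
            try:
                # A crash after action() but before save() repeats the action
                # on restart, which is why actions must be idempotent.
                action()
            except Exception:
                state["status"] = "COMPENSATING"
                log.save(state)
                break
            state["done"].append(name)
            log.save(state)
        else:
            state["status"] = "COMPLETED"
            log.save(state)
    if state["status"] == "COMPENSATING":
        compensations = {name: comp for name, _, comp in steps}
        for name in reversed(list(state["done"])):
            compensations[name]()  # idempotent, so a retry after a crash is safe
            state["done"].remove(name)
            log.save(state)
        state["status"] = "COMPENSATED"
        log.save(state)
    return state["status"]
```

Because progress is saved after every step, killing the process and calling `run_saga` again with the same log either finishes the remaining steps or finishes the compensations in reverse order.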
- A scenario table: for each injected fault (step failure, compensation retry, duplicate delivery, mid-saga crash) the recorded end state of all three services and a pass/fail on the consistency invariant.
- Logs or a trace showing a redelivered charge message is detected and skipped — the payment service records exactly one charge for the saga.
- Evidence the orchestrator resumes after a kill: the saga that crashed mid-flow either completes or fully compensates on restart, with no orphaned reservation or charge.
- A short write-up naming, for each step, its compensation and whether that compensation is a true undo or an approximation, plus the one isolation anomaly your countermeasure prevents and why.
- Add a timeout-and-retry policy with capped attempts to each step, then a dead-letter path in which a step that exhausts its retries triggers compensation of the whole saga. Show that it works when a downstream service is unreachable.
- Re-implement the orchestration as choreography (services react to events with no coordinator) and write a paragraph on what got harder: tracing a stuck saga, adding a step, and persisting state for a step that waits a long time.
- Add a manual-approval step that can pause the saga for an arbitrary time, and show the durable state survives a full restart of every service while the saga is parked.
- Add a reconciliation job that scans for sagas stuck in a non-terminal state past an SLA and either re-drives or compensates them, simulating the on-call tool you would actually need in production.
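The reconciliation job in the last stretch goal could start from a scan like the one sketched here. It assumes each saga record carries a `status` and an `updated_at` timestamp in seconds. Those field names, the status values, and the `redrive` and `compensate` callbacks are hypothetical placeholders for whatever your orchestrator exposes.

```python
import time

# Assumed terminal statuses; adjust to match your own state machine.
TERMINAL = {"COMPLETED", "COMPENSATED"}

def find_stuck(sagas, sla_seconds, now=None):
    """Return sorted ids of non-terminal sagas whose last update is older than the SLA."""
    now = time.time() if now is None else now
    return sorted(
        saga_id
        for saga_id, record in sagas.items()
        if record["status"] not in TERMINAL
        and now - record["updated_at"] > sla_seconds
    )

def reconcile(sagas, sla_seconds, redrive, compensate, now=None):
    """Finish stuck compensations and re-drive every other stuck saga.

    Returns a dict of saga id to the action taken, for logging or an on-call report.
    """
    actions = {}
    for saga_id in find_stuck(sagas, sla_seconds, now):
        if sagas[saga_id]["status"] == "COMPENSATING":
            compensate(saga_id)
            actions[saga_id] = "compensate"
        else:
            redrive(saga_id)
            actions[saga_id] = "redrive"
    return actions
```

Passing `now` explicitly keeps the scan testable without sleeping. Both callbacks should themselves be idempotent, since the job may run again before a re-driven saga reaches a terminal state.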
This is the saga you will actually build in production, in miniature: local commits with no global transaction, a real reversing compensation for every step, the irreversible step ordered last, durable progress so a crash resumes instead of orphaning work, idempotent handlers because delivery is at-least-once, and one application-level isolation countermeasure for concurrent sagas. Proving it with injected faults — not the happy path — is what turns ‘I read about sagas’ into ‘I have run one through failure and watched it land consistent’.