
Distributed Systems

Sagas: build a fault-tolerant order saga

Crux: a hands-on project. Build an orchestrated order saga across three services, write its compensations, make every step idempotent, and prove it survives injected failures and redeliveries.
◷ 240 min

Reading about compensations and idempotency is not the same as watching a saga survive a process kill at the worst possible instant. Build a small order saga across three services, then break it on purpose — fail steps, redeliver messages, crash the orchestrator mid-flow — and prove the system always lands in a consistent state.

Goal

Turn the unit’s model into a working saga: orchestrate three local-commit steps, write a real reversing compensation for each, make every step and compensation idempotent under at-least-once delivery, defend against a concurrent-saga anomaly, and verify the whole thing with injected faults rather than the happy path.
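As a starting point for the idempotency part of the goal, here is a minimal sketch of a handler that tolerates redelivery. It assumes the saga id doubles as the idempotency key; `PaymentService`, its fields, and the return strings are hypothetical names for illustration, and the in-memory dict stands in for a table with a unique constraint on the key.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical in-memory payment service keyed by saga id."""
    charges: dict = field(default_factory=dict)  # saga_id -> amount charged
    refunded: set = field(default_factory=set)   # saga_ids already refunded

    def charge(self, saga_id: str, amount_cents: int) -> str:
        # A redelivered message carries the same saga_id, so it is detected
        # here and skipped instead of charging a second time.
        if saga_id in self.charges:
            return "duplicate"
        self.charges[saga_id] = amount_cents
        return "charged"

    def refund(self, saga_id: str) -> str:
        # The compensation must be idempotent too: a retried refund is a no-op.
        if saga_id not in self.charges:
            return "nothing-to-refund"
        if saga_id in self.refunded:
            return "duplicate"
        self.refunded.add(saga_id)
        return "refunded"
```

In a real service the check and the write would need to happen atomically, for example in one local transaction, otherwise two concurrent deliveries could both pass the check.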

Project
Objective

Implement an orchestrated order saga across three services (order, payment, inventory) with durable step state, real compensations, idempotent handlers, and one isolation countermeasure — then demonstrate it reaches a consistent end state under every injected failure.
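One possible shape for the orchestrator is sketched below. It is not a reference implementation: `run_saga`, the step names, and the shared `ledger` are assumptions made for illustration, the `store` dict stands in for durable storage, and the step order (charge last) is one choice among several.

```python
from typing import Callable

# (step name, action, compensation); all names here are hypothetical.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(saga_id: str, steps: list[Step], store: dict) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse.

    `store` stands in for durable storage (a database row per saga in practice).
    """
    state = store.setdefault(saga_id, {"done": [], "status": "running"})
    try:
        for name, action, _ in steps:
            if name in state["done"]:
                continue  # already committed on an earlier run
            action()
            state["done"].append(name)  # record progress after each local commit
    except Exception:
        state["status"] = "compensating"
        for name, _, compensate in reversed(steps):
            if name in state["done"]:
                compensate()
                state["done"].remove(name)
        state["status"] = "compensated"
        return "compensated"
    state["status"] = "completed"
    return "completed"


# Toy stand-in for the three services: one shared ledger instead of three processes.
ledger = {"order": "none", "reserved": 0, "charged": 0}


def declined() -> None:
    raise RuntimeError("card declined")  # injected step failure


steps = [
    ("create_order",
     lambda: ledger.update(order="pending"),
     lambda: ledger.update(order="cancelled")),  # approximation, not a true undo
    ("reserve_inventory",
     lambda: ledger.update(reserved=1),
     lambda: ledger.update(reserved=0)),
    ("charge_payment", declined, lambda: ledger.update(charged=0)),
]

store: dict = {}
result = run_saga("saga-1", steps, store)
```

With the charge step failing, the two earlier steps are compensated in reverse order and the order ends as cancelled rather than deleted. Note that this sketch does not compensate the step that raised; if a step can fail after its side effect has landed, running its idempotent compensation as well is the safer choice.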

Requirements
Acceptance criteria
  • A scenario table: for each injected fault (step failure, compensation retry, duplicate delivery, mid-saga crash) the recorded end state of all three services and a pass/fail on the consistency invariant.
  • Logs or a trace showing a redelivered charge message is detected and skipped — the payment service records exactly one charge for the saga.
  • Evidence the orchestrator resumes after a kill: the saga that crashed mid-flow either completes or fully compensates on restart, with no orphaned reservation or charge.
  • A short write-up naming, for each step, its compensation and whether that compensation is a true undo or an approximation, plus the one isolation anomaly your countermeasure prevents and why.
Senior stretch
  • Add a timeout-and-retry policy with capped attempts to each step, then a dead-letter path: a step that exhausts retries triggers compensation of the whole saga, and show it works under an unreachable downstream service.
  • Re-implement the orchestration as choreography (services react to events with no coordinator) and write a paragraph on what got harder — tracing a stuck saga, adding a step, persisting the long-wait state.
  • Add a manual-approval step that can pause the saga for an arbitrary time, and show the durable state survives a full restart of every service while the saga is parked.
  • Add a reconciliation job that scans for sagas stuck in a non-terminal state past an SLA and either re-drives or compensates them, simulating the on-call tool you would actually need in production.
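For the first stretch item, a capped retry policy with a dead-letter path might look like the sketch below. The function names, the exponential backoff, and the choice of `ConnectionError` and `TimeoutError` as the retryable failures are assumptions; a real client would surface timeouts in its own way, and the dead-letter list stands in for a queue or table.

```python
import time
from typing import Callable


class RetriesExhausted(Exception):
    """Raised when a step fails on every allowed attempt."""


def call_with_retry(action: Callable[[], None], max_attempts: int = 3,
                    base_delay: float = 0.05,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `action`, retrying transient failures; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return attempt
        except (ConnectionError, TimeoutError) as exc:
            if attempt == max_attempts:
                raise RetriesExhausted(f"gave up after {attempt} attempts") from exc
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise ValueError("max_attempts must be at least 1")


def run_step(saga_id: str, action: Callable[[], None],
             compensate_saga: Callable[[], None],
             dead_letters: list, **retry_opts) -> str:
    """Run one step; on exhausted retries, dead-letter the saga and compensate it."""
    try:
        call_with_retry(action, **retry_opts)
        return "ok"
    except RetriesExhausted:
        dead_letters.append(saga_id)
        compensate_saga()
        return "dead-lettered"
```

Passing `sleep` as a parameter keeps the policy testable without real delays. To show it under an unreachable downstream service, pass an `action` that always raises `ConnectionError` and check that the saga id lands in `dead_letters` and the compensation runs exactly once.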
Recap

This is the saga you will actually build in production, in miniature: local commits with no global transaction, a real reversing compensation for every step, the irreversible step ordered last, durable progress so a crash resumes instead of orphaning work, idempotent handlers because delivery is at-least-once, and one application-level isolation countermeasure for concurrent sagas. Proving it with injected faults — not the happy path — is what turns ‘I read about sagas’ into ‘I have run one through failure and watched it land consistent’.

