Distributed Systems DIST · 06 · 10

Sagas: build a fault-tolerant order saga

Hands-on project — build an orchestrated order saga across three services, write its compensations, make every step idempotent, and prove it survives injected failures and redeliveries.

DIST Senior ◷ 240 min

Reading about compensations and idempotency is not the same as watching a saga survive a process kill at the worst possible instant. Build a small order saga across three services, then break it on purpose — fail steps, redeliver messages, crash the orchestrator mid-flow — and prove the system always lands in a consistent state.

Goal

Turn the unit’s model into a working saga: orchestrate three local-commit steps, write a real reversing compensation for each, make every step and compensation idempotent under at-least-once delivery, defend against a concurrent-saga anomaly, and verify the whole thing with injected faults rather than the happy path.
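One application-level isolation countermeasure that fits this goal is a semantic lock: the saga marks the order as owned while it is in flight, and a concurrent saga that touches the same order is refused instead of reading half-finished state. The sketch below is a minimal in-memory illustration under stated assumptions, not a prescribed design. `OrderStore`, `acquire`, `release`, and `OrderLocked` are hypothetical names, and in a real service the check-and-set would have to be one atomic conditional update in the database.

```python
class OrderLocked(Exception):
    """Raised when another saga currently owns the order."""


class OrderStore:
    """In-memory stand-in for the order service's table (an assumption)."""

    def __init__(self):
        # order_id -> {"state": str, "locked_by": saga_id or None}
        self.orders = {}

    def create(self, order_id):
        self.orders[order_id] = {"state": "APPROVED", "locked_by": None}

    def acquire(self, order_id, saga_id):
        """Take the semantic lock; idempotent if the same saga retries."""
        order = self.orders[order_id]
        if order["locked_by"] not in (None, saga_id):
            raise OrderLocked(f"{order_id} is owned by {order['locked_by']}")
        order["locked_by"] = saga_id

    def release(self, order_id, saga_id, new_state):
        """Commit the saga's outcome and drop the lock; only the owner may do so."""
        order = self.orders[order_id]
        if order["locked_by"] == saga_id:
            order["state"] = new_state
            order["locked_by"] = None
```

A second saga that calls `acquire` on a locked order gets `OrderLocked` and can retry later or fail fast, so it never acts on an order that the first saga may still compensate.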

Project

Objective

Implement an orchestrated order saga across three services (order, payment, inventory) with durable step state, real compensations, idempotent handlers, and one isolation countermeasure — then demonstrate it reaches a consistent end state under every injected failure.
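As a starting point, the orchestrator can be sketched as a loop that records progress after every step and walks the completed steps in reverse when one fails. This is a minimal sketch, assuming a dict in place of a durable store. `Step`, `SagaLog`, and `run_saga` are hypothetical names, and the status values are illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[str], None]      # takes the saga id, raises on failure
    compensate: Callable[[str], None]  # reverses the action for that saga id


class SagaLog:
    """Stand-in for durable step state; a database table in a real build."""

    def __init__(self):
        self.records = {}

    def load(self, saga_id):
        return self.records.setdefault(saga_id, {"status": "RUNNING", "done": []})

    def save(self, saga_id, record):
        self.records[saga_id] = record  # this is where the durable write goes


def run_saga(saga_id, steps, log):
    """Run or resume a saga; safe to call again after a crash."""
    record = log.load(saga_id)
    if record["status"] == "RUNNING":
        for step in steps:
            if step.name in record["done"]:
                continue  # finished before a crash, skip on resume
            try:
                # A crash after this call but before save() re-runs the
                # action on resume, which is why actions must be idempotent.
                step.action(saga_id)
            except Exception:
                record["status"] = "COMPENSATING"
                log.save(saga_id, record)
                break
            record["done"].append(step.name)
            log.save(saga_id, record)
        else:
            record["status"] = "COMPLETED"
            log.save(saga_id, record)
    if record["status"] == "COMPENSATING":
        by_name = {s.name: s for s in steps}
        for name in reversed(list(record["done"])):
            by_name[name].compensate(saga_id)  # must also be idempotent
            record["done"].remove(name)
            log.save(saga_id, record)
        record["status"] = "COMPENSATED"
        log.save(saga_id, record)
    return record["status"]
```

Because the status and the list of completed steps are saved after every transition, calling `run_saga` again with the same id after a kill either continues forward or finishes compensating, depending on what was last recorded.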

Requirements

Acceptance criteria

A scenario table listing, for each injected fault (step failure, compensation retry, duplicate delivery, mid-saga crash), the recorded end state of all three services and a pass/fail result for the consistency invariant.
Logs or a trace showing a redelivered charge message is detected and skipped — the payment service records exactly one charge for the saga.
Evidence the orchestrator resumes after a kill: the saga that crashed mid-flow either completes or fully compensates on restart, with no orphaned reservation or charge.
A short write-up naming, for each step, its compensation and whether that compensation is a true undo or an approximation, plus the one isolation anomaly your countermeasure prevents and why.
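The redelivery criterion can be met by keying every charge on the saga id and treating a repeat as a no-op that is logged. The following is a minimal sketch, assuming in-memory state in place of a table with a unique constraint. `PaymentService` and its log format are hypothetical, and it does not handle a refund that arrives before its charge.

```python
class PaymentService:
    """Idempotent charge and refund handlers keyed by saga id."""

    def __init__(self):
        self.charges = {}      # saga_id -> amount in cents; the key is the dedup record
        self.refunded = set()  # saga ids whose charge has been reversed
        self.log = []          # stand-in for real logs or a trace

    def charge(self, saga_id, amount_cents):
        if saga_id in self.charges:
            # Redelivered message: report it and return the earlier result.
            self.log.append(f"charge {saga_id}: duplicate, skipped")
            return self.charges[saga_id]
        self.charges[saga_id] = amount_cents
        self.log.append(f"charge {saga_id}: applied")
        return amount_cents

    def refund(self, saga_id):
        if saga_id not in self.charges or saga_id in self.refunded:
            # Retried compensation, or nothing was ever charged.
            self.log.append(f"refund {saga_id}: nothing to do")
            return
        self.refunded.add(saga_id)
        self.log.append(f"refund {saga_id}: applied")
```

Delivering the same charge message twice leaves one entry in `charges` and a "duplicate, skipped" line in the log, which is the kind of evidence the second criterion asks for.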

Senior stretch

Add a timeout-and-retry policy with capped attempts to each step, then a dead-letter path in which a step that exhausts its retries triggers compensation of the whole saga. Show that it works when a downstream service is unreachable.
Re-implement the orchestration as choreography (services react to events with no coordinator) and write a paragraph on what got harder — tracing a stuck saga, adding a step, persisting the long-wait state.
Add a manual-approval step that can pause the saga for an arbitrary time, and show the durable state survives a full restart of every service while the saga is parked.
Add a reconciliation job that scans for sagas stuck in a non-terminal state past an SLA and either re-drives or compensates them, simulating the on-call tool you would actually need in production.
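For the first stretch item, a capped retry wrapper might look like the sketch below. It assumes the per-attempt timeout is enforced by the call itself (for example a client-side timeout) and surfaces as an exception. `call_with_retry` and `RetriesExhausted` are hypothetical names, and the exponential backoff is one possible policy among several.

```python
import time


class RetriesExhausted(Exception):
    """Signals the dead-letter path: the orchestrator should compensate."""


def call_with_retry(action, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call action() up to max_attempts times with exponential backoff.

    `sleep` is injectable so tests can record delays instead of waiting.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:  # includes timeouts raised by the call
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

An orchestrator that wraps each step in `call_with_retry` can treat `RetriesExhausted` like any other step failure and begin compensating, which covers the unreachable-service scenario without a separate code path.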

Recap

This is the saga you will actually build in production, in miniature: local commits with no global transaction, a real reversing compensation for every step, the irreversible step ordered last, durable progress so a crash resumes instead of orphaning work, idempotent handlers because delivery is at-least-once, and one application-level isolation countermeasure for concurrent sagas. Proving it with injected faults — not the happy path — is what turns ‘I read about sagas’ into ‘I have run one through failure and watched it land consistent’.
