Sagas: long-lived transactions across services without 2PC

When a workflow spans flight, hotel, and car services, you cannot hold one ACID transaction across all of them. A saga is a chain of local commits, each with a compensating undo — buying availability at the cost of isolation.

A booking flow charges the card, reserves the flight, then tries the hotel — and the hotel service is down. In a monolith you would ROLLBACK and walk away clean. Here there is no rollback: the card is already charged in the payments database, the seat already held in the flight database. Three services, three databases, three commits that already happened. The only way back is forward — issue a refund, release the seat — and you have to write that undo by hand, for every step, for every failure point.

Why two-phase commit doesn’t survive microservices

The textbook answer to a cross-database transaction is two-phase commit (2PC): a coordinator asks every participant to prepare, and once all vote yes it tells them to commit. It is correct, and in a microservices topology it is a trap. During the prepare phase every participant holds its locks open, waiting for the coordinator’s verdict — across a network, across services owned by different teams. If the coordinator dies after prepare but before commit, participants are stuck in-doubt: locked, unable to commit or abort, until the coordinator recovers. The coordinator is a blocking single point of failure, and the locks it forces are held for the duration of a network round trip to the slowest service, not a local disk write.
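To make the in-doubt window concrete, here is a minimal in-memory sketch of the protocol described above. The names `Participant`, `State`, and `two_phase_commit` are hypothetical, there is no real networking or locking, and the coordinator crash is simulated with an exception.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    PREPARED = "prepared"    # voted yes, locks held, waiting for the verdict
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    """Hypothetical 2PC participant; a real one would hold database locks."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = State.IDLE

    def prepare(self) -> bool:
        # Voting yes is a promise to commit later, so locks stay held.
        self.state = State.PREPARED if self.can_commit else State.ABORTED
        return self.can_commit

    def commit(self) -> None:
        self.state = State.COMMITTED

    def abort(self) -> None:
        self.state = State.ABORTED


def two_phase_commit(participants, crash_after_prepare: bool = False) -> bool:
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    if crash_after_prepare:
        # Simulated coordinator crash: no participant ever hears a verdict.
        raise RuntimeError("coordinator crashed between prepare and commit")
    # Phase 2: broadcast the verdict.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Calling `two_phase_commit([flight, hotel], crash_after_prepare=True)` leaves both participants in `State.PREPARED`: they have promised to commit, cannot safely abort on their own, and must wait for the coordinator to come back.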

That cost is tolerable between two Postgres databases you operate side by side. It is fatal for a request path through six services where one is a third-party payment gateway you cannot enroll in your transaction at all. So you give up the global transaction and accept a different deal: each service commits locally, immediately, and you stitch the steps together with messages instead of locks.

Both aim to leave your data consistent in the end. 2PC pays with held locks and a blocking coordinator; a saga pays by giving up the I in ACID, leaving dirty reads and lost updates for you to fix.

The saga: local commits plus compensations

A saga is a sequence of local transactions. Each step updates one service’s database and emits a message that triggers the next step. There is no global commit — every T1, T2, T3 is durable the instant it runs. The price of that immediacy is that there is no automatic undo, so every forward step Ti ships with a compensating transaction Ci that semantically reverses it. If step T3 fails, the saga runs C2 then C1, in reverse order, to walk the world back to a consistent state.

The trip-booking example makes the shape concrete: T1 book flight, T2 book hotel, T3 rent car. If the car step fails, you compensate in reverse — C2 cancel hotel, C1 cancel flight. Compensations run in the opposite order to forward steps because later steps may depend on earlier ones.
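The trip-booking flow can be sketched as a small in-process runner. This is an illustration under assumptions, not a production design: `Step` and `run_saga` are hypothetical names, the "services" are lambdas that append to a log, and compensations are assumed never to fail (a real saga would retry them).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # forward transaction Ti
    compensation: Callable[[], None]  # compensating transaction Ci


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never committed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_car() -> None:
    raise RuntimeError("no cars available")


trip = [
    Step("flight", lambda: log.append("T1 book flight"), lambda: log.append("C1 cancel flight")),
    Step("hotel", lambda: log.append("T2 book hotel"), lambda: log.append("C2 cancel hotel")),
    Step("car", fail_car, lambda: log.append("C3 cancel car")),
]

ok = run_saga(trip)
```

After this runs, `ok` is `False` and `log` holds T1, T2, C2, C1 in that order. C3 never runs because T3 never committed.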

Forward step	Compensation	True undo?
`T1` book flight	`C1` cancel flight	No — may incur a fee
`T2` book hotel	`C2` cancel hotel	No — a new cancellation, not a delete
`T3` rent car	`C3` cancel car	No — the rental already happened

The “True undo?” column is the part juniors miss. A compensation is not a rollback; it is a new business action that approximates undoing the old one. Cancelling a flight does not erase the booking: it incurs a fee and leaves a record. Some compensations are only partial: a refund is not the inverse of a charge (the money moved twice, and the gateway may keep its fees). Some steps have no compensation at all: a sent confirmation email cannot be unsent. The senior design rule that falls out of this: order your steps so the irreversible ones go last. Charge the card and send the email only after every step that might fail has already succeeded.
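The ordering rule can be illustrated with a stable sort. This is only a sketch: `PlannedStep`, the `reversible` flag, and `irreversible_last` are hypothetical, and the code ignores data dependencies between steps, which a real plan would have to respect.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PlannedStep:
    name: str
    reversible: bool  # True if a usable compensation exists


def irreversible_last(steps: List[PlannedStep]) -> List[PlannedStep]:
    # sorted() is stable, so steps keep their relative order within each group.
    # Reversible steps sort first (key False), irreversible ones last (key True).
    return sorted(steps, key=lambda s: not s.reversible)


plan = [
    PlannedStep("charge card", reversible=False),
    PlannedStep("book flight", reversible=True),
    PlannedStep("send confirmation email", reversible=False),
    PlannedStep("book hotel", reversible=True),
]
ordered = [s.name for s in irreversible_last(plan)]
```

Here `ordered` becomes book flight, book hotel, charge card, send confirmation email: everything that can fail and be compensated runs before anything that cannot be taken back.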

Choreography vs orchestration

There are two ways to wire the steps together, and choosing wrong is a multi-quarter regret.

In choreography, there is no coordinator. Each service listens for events and reacts: payments hears FlightBooked, charges the card, emits CardCharged; hotel hears that and books a room. It is decentralized and has no single point of failure — but the saga’s logic exists nowhere as a whole. To answer “what happens after a card is charged?” you grep four codebases. Add a fifth step and you touch three services. Cyclic event dependencies sneak in. Tracing a stuck booking means reconstructing a distributed sequence from logs across services.
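A toy version of that event chain, assuming an in-process `EventBus` as a stand-in for a real message broker. The class and handler names are hypothetical, and `HotelBooked` is an invented event name added to complete the chain; `FlightBooked` and `CardCharged` come from the example above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Hypothetical in-process stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in self._handlers[event]:
            handler(payload)


bus = EventBus()


# Each "service" knows only the event it reacts to and the event it emits.
def payments_on_flight_booked(payload: Dict) -> None:
    bus.publish("CardCharged", payload)


def hotel_on_card_charged(payload: Dict) -> None:
    bus.publish("HotelBooked", payload)


bus.subscribe("FlightBooked", payments_on_flight_booked)
bus.subscribe("CardCharged", hotel_on_card_charged)

bus.publish("FlightBooked", {"booking_id": "b-1"})
```

Notice that no function in this sketch describes the whole flow. The sequence FlightBooked, CardCharged, HotelBooked only becomes visible in `bus.history` after the fact, which is the traceability problem in miniature.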

In orchestration, a central orchestrator owns the workflow as explicit code: it sends BookFlight, waits for the reply, sends ChargeCard, and on any failure drives the compensations. The whole saga is readable in one place and one trace; the cost is a new stateful component you must build, deploy, and keep available. This is the niche durable-execution engines like Temporal fill: they persist the orchestrator’s state so a workflow interrupted by a crash mid-saga resumes from its last recorded step instead of being lost.
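The resume-after-crash idea can be sketched with a checkpoint written after every command. This is not Temporal's API; `Orchestrator`, the dict-backed `store`, and the simulated crash are all hypothetical, and compensation handling is left out to keep the focus on recovery.

```python
import json
from typing import Callable, Dict, List, Tuple

Command = Tuple[str, Callable[[], None]]


class Orchestrator:
    """Hypothetical orchestrator that checkpoints progress after each command."""

    def __init__(self, commands: List[Command], store: Dict[str, str], saga_id: str):
        self.commands = commands
        self.store = store      # stand-in for a durable table or log
        self.saga_id = saga_id

    def _next_index(self) -> int:
        saved = self.store.get(self.saga_id)
        return json.loads(saved)["next"] if saved else 0

    def run(self) -> None:
        for i in range(self._next_index(), len(self.commands)):
            name, send = self.commands[i]
            send()
            # Record progress only after the command succeeded.
            self.store[self.saga_id] = json.dumps({"next": i + 1, "last": name})


calls: List[str] = []
crash = {"armed": True}


def book_flight() -> None:
    calls.append("BookFlight")


def charge_card() -> None:
    if crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("process died mid-saga")
    calls.append("ChargeCard")


store: Dict[str, str] = {}
commands = [("BookFlight", book_flight), ("ChargeCard", charge_card)]

try:
    Orchestrator(commands, store, "saga-1").run()
except RuntimeError:
    pass

# A fresh orchestrator instance picks up from the checkpoint.
Orchestrator(commands, store, "saga-1").run()
```

The second run skips BookFlight and goes straight to ChargeCard. One caveat the sketch glosses over: a crash between `send()` and the checkpoint write would re-send that command on resume, so commands need to be idempotent.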

The rough heuristic: choreography for a short, stable, linear flow (2–3 steps that rarely change); orchestration once the flow has branches, retries, timeouts, or more than ~4 steps — the point where “where is the logic?” stops having a one-file answer. Many teams start choreographed for simplicity and migrate to an orchestrator when the event web becomes untraceable.

The brutal part: sagas have no isolation

This is the line that gets skipped in tutorials and discovered in production. A saga is ACID minus the I: it gives you Atomicity (via compensation), Consistency, and Durability, but no Isolation. Because each local transaction commits immediately, its half-finished state is visible to everyone else before the saga as a whole has decided to succeed or fail. Garcia-Molina and Salem named the pattern in 1987 precisely as the relaxation of isolation for long-lived transactions.

Three concrete anomalies follow. A dirty read: saga B reads an order saga A has committed but will later compensate away, then acts on data that is about to vanish. A lost update: two sagas read the same balance, both write, one clobbers the other. A non-repeatable read: a saga reads a value at step 1 and a different value at step 3 because another saga changed it in between. None of these can happen inside a single serializable transaction; all of them can happen across a saga’s lifetime, which may be seconds or, for a workflow that waits on human approval, days.
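The lost update is the easiest of the three to reproduce. A minimal sketch, with a plain dict standing in for the shared balance and the two sagas interleaved by hand (all names are illustrative):

```python
balance = {"value": 100}

# Saga A and saga B each read the balance in one local transaction...
read_by_a = balance["value"]
read_by_b = balance["value"]

# ...and write back an absolute value in a later one.
balance["value"] = read_by_a - 30   # saga A debits 30
balance["value"] = read_by_b - 50   # saga B debits 50, overwriting A's write

lost_update_result = balance["value"]
expected_if_serial = 100 - 30 - 50
```

The balance ends at 50 instead of 20: saga A's debit has vanished, and no error was raised anywhere.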

Together, these three anomalies mean you are running without a safety net the database used to provide: every concurrent saga touching the same data is a potential race. Without explicit countermeasures, one of them silently wins and the other operates on state that no longer exists.

The countermeasures are application-level, not database-level. A semantic lock marks a record with a pending status such as PENDING_PAYMENT so other sagas know not to touch it until the owning saga clears it. Commutative updates (use balance += delta, never balance = newValue) make concurrent writes order-independent. A reread / version check verifies the value hasn’t changed before overwriting (optimistic concurrency). You are reimplementing a slice of what a database gave you for free, which is exactly why you only reach for a saga when a single transaction genuinely cannot span the work.
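The three countermeasures might look like this on a single in-memory record. This is a sketch under assumptions: `Account`, the `version` field, and all function names are hypothetical, and each check-then-write here would need to be one atomic statement in a real database (for example a conditional update on the version column).

```python
from dataclasses import dataclass


class StaleVersion(Exception):
    pass


class RecordLocked(Exception):
    pass


@dataclass
class Account:
    balance: int
    version: int = 0
    status: str = "ACTIVE"  # "PENDING_PAYMENT" acts as the semantic lock


def acquire_semantic_lock(acct: Account) -> None:
    if acct.status == "PENDING_PAYMENT":
        raise RecordLocked("another saga is mid-flight on this record")
    acct.status = "PENDING_PAYMENT"


def release_semantic_lock(acct: Account) -> None:
    acct.status = "ACTIVE"


def apply_delta(acct: Account, delta: int) -> None:
    # Commutative: the final balance is the same whatever order deltas arrive in.
    acct.balance += delta
    acct.version += 1


def write_if_unchanged(acct: Account, expected_version: int, new_balance: int) -> None:
    # Optimistic check: refuse to overwrite if someone else wrote since our read.
    if acct.version != expected_version:
        raise StaleVersion(f"expected v{expected_version}, found v{acct.version}")
    acct.balance = new_balance
    acct.version += 1
```

With these in place, a saga that read version 0, then tries `write_if_unchanged(acct, 0, 50)` after another saga has called `apply_delta(acct, -30)`, gets a `StaleVersion` error instead of silently clobbering the other write.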

Pick the best fit

A 6-step order workflow spans 5 services, has retries, timeouts, and a manual-approval step that can wait days. Pick the coordination approach.

Quiz

A saga's step T2 charges a card; step T3 fails. What does C2 (compensating T2) actually do?

Quiz

Saga B reads an order that saga A committed but will later compensate away, and acts on it. Which anomaly is this, and why is it possible?

Order the steps

A trip saga books flight (T1), hotel (T2), car (T3). The car step fails. Order what happens:

1 T1 books flight, commits locally, emits an event
2 T2 books hotel, commits locally, emits an event
3 T3 attempts to rent car and fails
4 C2 runs: cancel the hotel (compensate the most recent completed step first)
5 C1 runs: cancel the flight (compensate in reverse order)

No global rollback: each forward step commits immediately, so failure triggers compensations in reverse order — C2 before C1 — because later steps may depend on earlier ones.

Recall before you leave

01
Explain to a teammate why a compensating transaction is not the same as a database rollback, and how that changes how you order the steps.
02
What does it mean that a saga is 'ACID minus I', and what do you do about it in production?

Recap

Two-phase commit is correct but unusable across microservices: it makes the coordinator a blocking single point of failure and holds locks across services for the length of a network round trip, so a coordinator crash leaves participants in-doubt. A saga gives up the global commit. It is a sequence of local transactions, each committed immediately and each paired with a compensating transaction that semantically undoes it. Because compensations are new forward actions (a refund, not a delete) and some effects can’t be undone at all, you order the irreversible steps last. You wire the steps with choreography (services react to events; no single point of failure but the logic is scattered and hard to trace) or orchestration (a central, often durable, orchestrator owns the flow; one place to read and resume, at the cost of a new component). The defining tradeoff is that a saga is ACID minus Isolation: intermediate commits are visible, so dirty reads, lost updates, and non-repeatable reads become your problem, fixed at the application level with semantic locks, commutative updates, and rereads. Reach for a saga only when a single transaction truly cannot span the work, and then design the undo path before you ship the happy path. Now when you see a multi-service workflow in a design review, your first question is not “how do I commit all of this?” but “what does each compensation actually do, and what can another saga see between T1 committing and C1 running?”
