
The transactional outbox: ending the dual-write between DB and broker

You cannot atomically write your database and publish to a broker. The outbox writes the business row and an event row in one local transaction, then a relay ships the event — at-least-once, so consumers must be idempotent.


An order is placed. The handler does two things: INSERT the order row, then kafka.publish("OrderPlaced"). Both succeed in the happy path for months. Then one night the broker has a 400ms blip; the publish call hangs and the pod gets OOM-killed mid-call. The order row is committed and visible — the customer was charged — but OrderPlaced never landed. Fulfillment never hears about it. The package never ships. There is no error in any log, because nothing threw. Reconciliation finds it a week later, by hand.

Why the dual write has no safe order

You have two systems with no shared transaction: your database and your message broker. Any handler that must update one and notify the other is doing a dual write, and there is no ordering of those two calls that survives a crash in the gap between them.

Write the row, then publish: crash after the commit but before the publish, and you have lost an event. The state changed, but nobody downstream knows. Publish, then write the row: crash after the publish but before the commit, and you have a phantom event, with consumers reacting to an order that was rolled back and never existed. Wrap both in a try/catch and “compensate” on failure, and you have written a tiny, buggy distributed-transaction coordinator that can itself crash between steps. Two-phase commit across the database and Kafka would close the gap in theory, but Kafka does not offer the kind of XA participation you would want in production, and 2PC’s blocking coordinator is a worse problem than the one you started with.

The dual write is not a bug you can code around inside the handler. It is the absence of atomicity across two stores.
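The gap is easy to reproduce. Here is a minimal sketch, assuming an in-memory SQLite database as the order store and a plain Python list standing in for the broker. The names `place_order_dual_write`, `SimulatedCrash`, and the `crash_before_publish` flag are hypothetical and exist only for illustration.

```python
import sqlite3


class SimulatedCrash(Exception):
    """Stands in for the process dying mid-handler (for example an OOM kill)."""


def place_order_dual_write(db, broker, order_id, crash_before_publish=False):
    # Step 1: commit the business row.
    db.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
    db.commit()
    # The gap: the row is durable, the event does not exist anywhere yet.
    if crash_before_publish:
        raise SimulatedCrash()
    # Step 2: publish to the (fake) broker.
    broker.append({"type": "OrderPlaced", "order_id": order_id})


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
broker = []  # hypothetical stand-in for a Kafka topic

place_order_dual_write(db, broker, "o-1")
try:
    place_order_dual_write(db, broker, "o-2", crash_before_publish=True)
except SimulatedCrash:
    pass  # after a real crash, nothing in the handler gets to run

orders_committed = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
events_published = len(broker)
print(orders_committed, events_published)
```

Two orders are committed but only one event was published. Nothing in the database records that the second event is still owed.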

One transaction, one source of truth

The outbox pattern collapses the dual write into a single write. In the same local database transaction that changes the business state, you also INSERT a row into an outbox table describing the event. Because both writes are in one ACID transaction against one database, they commit together or roll back together — atomically, for free, with no coordinator.

BEGIN;
  INSERT INTO orders (id, status, ...) VALUES (...);
  INSERT INTO outbox (id, aggregate, type, payload, created_at)
    VALUES (uuid, 'order', 'OrderPlaced', '{...}', now());
COMMIT;

After commit, the truth that “an OrderPlaced event must be delivered” is now durable in the same database as the order itself. Publishing to the broker becomes a separate, retryable step performed by a relay (a.k.a. message relay / dispatcher) that reads unsent outbox rows and pushes them to Kafka. The handler’s job is done at COMMIT; it never talks to the broker at all. You have traded an impossible atomic dual-write for a trivially-atomic single write plus an asynchronous, crash-safe pump.
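The same write can be sketched from application code. This assumes SQLite via Python's standard `sqlite3` module in place of your real database. The `sent` column, the `place_order` helper, and the `fail_midway` flag are assumptions added for illustration, not part of the schema shown above.

```python
import json
import sqlite3
import uuid


def place_order(db, order_id, fail_midway=False):
    """Write the order row and its outbox row in one local transaction."""
    with db:  # commits if the block succeeds, rolls back if it raises
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
        if fail_midway:
            raise RuntimeError("simulated failure before the outbox insert")
        db.execute(
            "INSERT INTO outbox (id, aggregate, type, payload) "
            "VALUES (?, 'order', 'OrderPlaced', ?)",
            (str(uuid.uuid4()), json.dumps({"order_id": order_id})),
        )


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, aggregate TEXT, type TEXT, "
    "payload TEXT, created_at TEXT DEFAULT CURRENT_TIMESTAMP, sent INTEGER DEFAULT 0)"
)

place_order(db, "o-1")
try:
    place_order(db, "o-2", fail_midway=True)
except RuntimeError:
    pass  # the whole transaction, including the orders insert, was rolled back

order_ids = [row[0] for row in db.execute("SELECT id FROM orders ORDER BY id")]
outbox_count = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

The failed call leaves neither an order nor an outbox row behind, so the two tables cannot disagree about whether an event is owed.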

The orders row and the outbox row are written in ONE local transaction (atomic). A separate relay then reads unsent outbox rows, publishes to the broker, and marks them sent — publish-then-mark makes delivery at-least-once, so the consumer dedupes.

Two relay flavors: polling vs log-tailing

The relay reads the outbox in one of two ways, and choosing between them is the senior decision in this pattern.

Polling publisher — a worker runs SELECT ... WHERE sent = false ORDER BY created_at LIMIT N on an interval, publishes each row, then marks it sent. Dead simple, works on any database, no extra infrastructure. The costs: every poll is load on your primary OLTP database whether or not there’s work, and end-to-end latency is bounded by your interval. Intervals of 100ms–1s are typical; pushing to 25–50ms to chase freshness multiplies query load, and at 10–30 relay replicas that constant polling becomes a real burden on the primary.
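One polling pass might look like the sketch below. It assumes SQLite, an integer `sent` flag, and a caller-supplied `publish` callable in place of a real Kafka producer. `relay_once` and `batch_size` are hypothetical names.

```python
import sqlite3


def relay_once(db, publish, batch_size=100):
    """One polling pass: read unsent rows in order, publish each, then mark it sent."""
    rows = db.execute(
        "SELECT id, type, payload FROM outbox WHERE sent = 0 "
        "ORDER BY created_at, rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_id, event_type, payload)  # if this raises, the row stays unsent
        with db:  # mark only after the broker accepted the message
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (event_id,))
    return len(rows)


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, type TEXT, payload TEXT, "
    "created_at TEXT, sent INTEGER DEFAULT 0)"
)
with db:
    db.executemany(
        "INSERT INTO outbox (id, type, payload, created_at) VALUES (?, ?, ?, ?)",
        [
            ("e-1", "OrderPlaced", "{}", "2024-01-01T00:00:01"),
            ("e-2", "OrderPlaced", "{}", "2024-01-01T00:00:02"),
        ],
    )

published = []  # hypothetical stand-in for the broker


def fake_publish(event_id, event_type, payload):
    published.append(event_id)


first_pass = relay_once(db, fake_publish)
second_pass = relay_once(db, fake_publish)
```

A real relay would call `relay_once` in a loop and sleep for the polling interval between passes. The second pass here finds no work, which is the idle query load the paragraph above describes.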

Change Data Capture (CDC) — instead of querying the table, a tool like Debezium tails the database’s write-ahead log (the WAL in Postgres) and emits an event the moment the outbox INSERT is committed. Latency drops to single-digit milliseconds, there is zero polling load on the table, and you get the log’s natural commit ordering. The price is operational: you now run and monitor a CDC connector, manage replication slots, and reason about WAL retention. CDC is where the next unit picks up.

Dimension	Polling publisher	CDC / log-tailing
End-to-end latency	Bounded by interval (100ms–1s)	Single-digit ms
Load on primary DB	Constant query traffic, even when idle	Reads the WAL, near-zero table load
Ordering	By `ORDER BY`; lost with parallel relays	Natural commit order from the log
Operational cost	A cron-like worker; trivial	A connector, replication slots, WAL retention
Best when	Modest volume, latency tolerant, no new infra	High throughput, low latency, ordering matters

The guarantee is at-least-once — so consumers must be idempotent

The outbox closes the lost-event gap, but it does not give you exactly-once delivery, and believing it does is the failure mode that bites in production. The relay does two non-atomic things: publish to the broker, then mark the row sent. If it crashes after the broker accepted the message but before the UPDATE commits, the row still looks unsent — so on restart it publishes the same event again. You get at-least-once: every event is delivered one or more times.

That means the contract for everyone downstream is non-negotiable: consumers must be idempotent. Stamp each outbox row with a stable event id, carry it through the broker, and have consumers dedupe on it — typically an inbox table of processed ids (kept for a few hours to a couple of days) or a unique constraint on the side effect. Processing OrderPlaced twice must charge the card once. Without idempotency, the outbox doesn’t make your system reliable; it makes it reliably double-fire.
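A consumer-side inbox can be sketched as follows, assuming SQLite and a primary key on the event id as the dedupe mechanism. The `inbox` and `charges` tables and the `handle_order_placed` function are hypothetical, and the inbox retention window is left out.

```python
import sqlite3


def handle_order_placed(db, event_id, order_id, amount_cents):
    """Charge once per event id. Returns True if applied, False if a duplicate."""
    try:
        with db:  # the inbox row and the side effect commit or roll back together
            db.execute("INSERT INTO inbox (event_id) VALUES (?)", (event_id,))
            db.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key violation on inbox: this event id was already processed.
        return False


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")

# The relay redelivers the same event; the second delivery must be a no-op.
results = [handle_order_placed(db, "evt-1", "o-1", 1999) for _ in range(2)]
charge_count = db.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

Recording the event id in the same transaction as the side effect is what makes the check safe. A consumer that crashes halfway through leaves neither row behind, so the retry starts clean.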

Why this works

Couldn’t the relay mark the row sent first, then publish? No — that just moves the gap and turns a duplicate into a lost event: crash after the UPDATE and before the publish, and the row looks done but nothing was ever sent. The only orderings available are publish-then-mark (duplicates, recoverable) or mark-then-publish (loss, unrecoverable). At-least-once is the strictly safer choice, which is why idempotent consumers are the load-bearing half of the pattern, not an optional nicety.

The relay's two non-atomic steps have only two orderings, and their failure modes are asymmetric: publish-then-mark risks a recoverable duplicate, mark-then-publish risks an unrecoverable lost event. That asymmetry is why the outbox is deliberately at-least-once and why idempotent consumers are mandatory.
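The asymmetry can be checked with a toy simulation. It assumes one outbox row, a list standing in for the broker, and a crash placed exactly between the relay's two steps. All names here are hypothetical.

```python
def deliveries_after_crash(ordering):
    """Crash the relay between its two steps, restart it, and count deliveries."""
    row = {"sent": False}
    broker = []

    def publish():
        broker.append("OrderPlaced")

    def mark():
        row["sent"] = True

    steps = {
        "publish-then-mark": [publish, mark],
        "mark-then-publish": [mark, publish],
    }[ordering]

    # First run: the first step completes, then the process dies.
    steps[0]()

    # Restart: the relay only picks up rows that still look unsent.
    if not row["sent"]:
        for step in steps:
            step()
    return len(broker)


duplicates = deliveries_after_crash("publish-then-mark")
losses = deliveries_after_crash("mark-then-publish")
print(duplicates, losses)
```

Publish-then-mark delivers the event twice, which an idempotent consumer absorbs. Mark-then-publish delivers it zero times, and nothing in the outbox shows that it is missing.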

The operational tax: bloat, ordering, and competing relays

Three things will eventually page someone:

Table bloat. The outbox grows with every event. Sent rows must be reaped, or the table and its indexes balloon into the hundreds of thousands of rows, at which point polling queries degrade. Don’t DELETE huge ranges in one statement (a long-running delete competes with live inserts for locks and I/O); delete sent rows in small batches, or partition the outbox by day so cleanup is a metadata-only DROP PARTITION instead of index-thrashing row deletes.
Ordering. A single relay preserves created_at order. The moment you scale to multiple relays for throughput, two workers can publish concurrently and events arrive out of order. If a consumer needs per-aggregate order, use the aggregate id as the message key so all events for one order land on the same broker partition, and make sure a single relay worker handles each aggregate.
Competing relays. Run two relay replicas naively and both SELECT the same unsent rows and double-publish. The fix is SELECT ... FOR UPDATE SKIP LOCKED (Postgres 9.5+ / MySQL 8.0+): each worker grabs and locks a disjoint batch, skipping rows another worker already holds, so replicas scale out without stepping on each other. The row locks are released when the claiming transaction ends, so the worker must publish and mark the batch sent before it commits.
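For the ordering point, one way to keep each aggregate on a single worker is a stable hash of the aggregate id. This is a sketch under assumptions: `worker_for` and the four-worker setup are hypothetical, and a real broker applies its own partitioner to the message key rather than this function.

```python
import hashlib
from collections import defaultdict


def worker_for(aggregate_id, worker_count):
    """Map an aggregate id to a stable worker index.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to get the same answer on every replica.
    """
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count


events = [
    ("order-1", "OrderPlaced"),
    ("order-2", "OrderPlaced"),
    ("order-1", "OrderPaid"),
    ("order-2", "OrderCancelled"),
    ("order-1", "OrderShipped"),
]

# Each worker gets its own queue; events keep their arrival order within a queue.
queues = defaultdict(list)
for aggregate_id, event_type in events:
    queues[worker_for(aggregate_id, 4)].append((aggregate_id, event_type))

order_1_worker = worker_for("order-1", 4)
order_1_events = [t for a, t in queues[order_1_worker] if a == "order-1"]
```

Every event for `order-1` ends up in one queue in its original sequence, whichever worker that turns out to be. Events for different orders may still interleave, which is fine when only per-aggregate order matters.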

Together these three pressures share a root: the outbox is a shared mutable queue, and every shared queue eventually faces contention, ordering, and growth. Ignore any one of them and you trade a silent lost-event bug for a noisier but equally real operational failure.
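Batched reaping can be sketched like this, assuming SQLite and an integer `sent` flag. `reap_sent` and the batch size of 1000 are illustrative choices, not recommendations.

```python
import sqlite3


def reap_sent(db, batch_size=1000):
    """Delete sent outbox rows in small batches; return the total removed."""
    total = 0
    while True:
        with db:  # one short transaction per batch
            cur = db.execute(
                "DELETE FROM outbox WHERE id IN "
                "(SELECT id FROM outbox WHERE sent = 1 LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, sent INTEGER DEFAULT 0)")
with db:
    db.executemany(
        "INSERT INTO outbox (id, sent) VALUES (?, 1)",
        [(f"sent-{i}",) for i in range(2500)],
    )
    db.executemany(
        "INSERT INTO outbox (id, sent) VALUES (?, 0)",
        [(f"pending-{i}",) for i in range(3)],
    )

removed = reap_sent(db, batch_size=1000)
remaining = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

Each batch commits on its own, so no single transaction holds the table for long, and unsent rows are never touched.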

Pick the best fit

An order service must update the DB and notify fulfillment via Kafka. A crash in the gap loses the event. Pick the design.

Quiz

The relay publishes a message to the broker, then crashes before it can UPDATE the row to sent. What happens on restart?

Quiz

You scale the polling relay to three replicas for throughput. What must change so they don't double-publish the same rows?

Order the steps

Order the lifecycle of one event through the outbox pattern:

1 In one local transaction, INSERT the business row AND an outbox row, then COMMIT
2 The relay reads unsent outbox rows (polling on an interval, or tailing the WAL via CDC)
3 The relay publishes each event to the broker
4 The relay marks the row sent (the gap here makes delivery at-least-once)
5 The idempotent consumer dedupes on the event id and applies the effect once

Recall before you leave

01 Explain to a teammate why writing the DB and publishing to Kafka in the same handler is unsafe, and how the outbox fixes it.
02 Why is the outbox at-least-once rather than exactly-once, and what does that force on the rest of the system?

Recap

A handler that updates your database and publishes to a broker is doing a dual write across two systems with no shared transaction, and no ordering of those calls survives a crash in the gap — you either lose an event or emit a phantom one. The transactional outbox removes the dual write entirely: in the same local transaction that changes business state, you INSERT an outbox row, so both commit atomically against one database with no coordinator. A separate relay then ships outbox rows to the broker, either by polling on a 100ms–1s interval (simple, but adds latency and constant DB load) or by tailing the WAL with CDC like Debezium (single-digit-ms latency and natural ordering, at the cost of running a connector). Because the relay publishes then marks sent in two non-atomic steps, delivery is at-least-once, so consumers must dedupe on a stable event id. The operational tax is real — reap or partition the outbox to fight bloat, key by aggregate to preserve order, and use FOR UPDATE SKIP LOCKED so competing relays don’t double-publish — but in exchange you get an event pipeline that never silently loses a write. Now when you see a handler that touches a DB and a broker in the same function, you know exactly where to look for the gap — and you know how to close it.

Apply this

Put this lesson to work on a real build.

Job scheduler: a cron + backoff job runner with at-least-once delivery, idempotent handlers, and visibility timeouts, so no job is silently lost even when workers crash mid-execution.