Backend Architecture BE · 05 · 04

Outbox and inbox: effectively-once across the dual-write boundary

Direct dual-write to DB + Kafka breaks in either order. The outbox pattern writes the event inside the DB transaction; the inbox pattern deduplicates on the consumer side. Together they give effectively-once delivery.

BE Middle ◷ 16 min

Level

FoundationsJuniorMiddleSenior

An order service writes the order row to Postgres, then publishes an “order.created” event to Kafka. Between the commit and the publish, the pod crashes. The order exists in the database. The downstream inventory service never sees the event. The order is stuck.

Why dual-write always fails

If you have ever written “save to DB, then publish to Kafka” and thought “that should be fine,” this section is for you — because there is always a crash window between the two.

Any sequence of “write to database, then publish to message broker” has a crash window between the two:

Sequence	Crash point	Result
Write DB → publish Kafka	After DB, before Kafka	DB updated, Kafka silent — consumers miss the event
Publish Kafka → write DB	After Kafka, before DB	Phantom event in Kafka — consumers see a write that never happened

Both sequences are broken. There is no version of “two separate writes” that is atomic.

The outbox pattern: use the database as the broker

Key insight: write the event to the database, inside the same transaction as the business write. The database becomes the single source of truth.

BEGIN;
  -- Business write
  INSERT INTO orders (id, customer_id, total) VALUES ($1, $2, $3);

  -- Outbox write (same transaction)
  INSERT INTO outbox (id, event_type, payload, published)
  VALUES (gen_random_uuid(), 'order.created', $payload, false);
COMMIT;
-- Kafka publish happens AFTER the transaction commits

A separate outbox-relay process polls the outbox table (or tails the Write-Ahead Log via Debezium) and publishes unpublished rows to Kafka, marking them published on broker ACK.

Crash scenarios are now safe:

Crash before COMMIT → transaction rolls back. Neither the order nor the outbox row exists. No phantom event.
Crash after COMMIT, before relay publishes → order exists, outbox row published=false. Relay picks it up on next poll.
Kafka is down → relay retries. Outbox accumulates. Order is safe in DB.

Approach	Crash-safe?	Kafka required?	Pattern
DB then Kafka	No — silent event loss	Yes	Anti-pattern
Kafka then DB	No — phantom event	Yes	Anti-pattern
Outbox (DB only)	Yes	No (async relay)	Correct

Relay implementations

Poll-based relay: queries WHERE published = false ORDER BY id LIMIT 1000 every few seconds. Simple, adds latency equal to poll interval (typically 1–5 s).

Debezium (CDC relay): tails the Postgres Write-Ahead Log via logical replication. Publishes as soon as the row commits. Sub-second lag. No polling overhead. Production standard for high-throughput services.

AWS SAM variant: DynamoDB Streams → Lambda → EventBridge — managed CDC without running a relay process.

Both deliver at-least-once; CDC trades a WAL-tailing dependency for sub-second lag and zero polling load — the default once throughput is high.

The inbox pattern: deduplicate on the consumer

The relay delivers at-least-once (it may publish twice if it restarts after publishing but before marking the row). The consumer must be idempotent.

Inbox pattern: before applying the event’s business effect, write the event’s id into a processed_events table inside the same transaction as the business write.

BEGIN;
  -- Check if already processed
  INSERT INTO processed_events (event_id) VALUES ($1)
    ON CONFLICT (event_id) DO NOTHING;

  -- If 0 rows affected, this event was already processed — skip
  IF found THEN
    UPDATE inventory SET reserved = reserved - $qty WHERE sku = $sku;
  END IF;
COMMIT;

If the event arrives twice, the second attempt hits the UNIQUE constraint on event_id and the business write is skipped. Effectively-once behavior from the consumer’s perspective.

Inbox table maintenance: partition by event_timestamp and drop partitions older than the broker’s retention window (7–14 days). Without cleanup, the inbox grows forever.

Dead-letter queues: handling unrecoverable failures

When you see a consumer retry the same message indefinitely without ever succeeding, that message is a poison pill. Retrying it is not recovery — it is a loop. The DLQ is the escape hatch.

Not every failure is recoverable by retry. Malformed data, schema breaks, or violated business invariants are poison messages — no amount of retry will fix them.

After N processing attempts (typically 3–5), move the message to a dead-letter queue (DLQ). The main pipeline keeps flowing; humans review the DLQ.

Main queue → [N retry attempts] → DLQ → human review
                                       → manual replay (after fix)
                                       → formal rejection (business rule)

Production settings: N = 3–5 attempts, DLQ retention 7–30 days, alert on DLQ depth growth. A growing DLQ is a compliance liability — every entry is a record of an operation that may have partial effects.

▸Why this works

Why is the outbox-relay’s at-least-once delivery acceptable even for payment events? Because the consumer runs the inbox pattern — it deduplicates by event ID before applying any business effect. The composition is: Postgres atomicity (guarantees the outbox row exists if and only if the business write committed) + at-least-once relay + idempotent consumer = effectively-once. No single component needs to be exactly-once; the guarantee emerges from the composition.

Quiz

Why is the outbox pattern necessary if the application can publish to Kafka directly from the API handler?

Order the steps

Put the outbox publish flow in correct order:

1 Application opens a database transaction
2 Application performs the business write (e.g., insert order row)
3 Same transaction inserts an outbox row recording the event-to-be-published
4 Application commits — business write and outbox row visible atomically
5 Outbox-relay process scans for unpublished outbox rows and publishes to Kafka
6 Relay marks the outbox row as published after receiving broker ACK

Quiz

The outbox relay crashes after publishing to Kafka but before marking the outbox row as published. What happens when the relay restarts?

Crash before COMMIT: both rows vanish — no phantom event. Crash after COMMIT: relay retries; inbox dedup ensures the effect runs exactly once.

Recall before you leave

01
A team wants exactly-once delivery to Kafka from a Postgres write. They consider: (a) write Postgres then publish, (b) publish then write Postgres, (c) outbox. Why is (c) correct?
02
What is the inbox pattern and what does it protect against?
03
When should a message go to the dead-letter queue instead of being retried?

Recap

The dual-write problem — writing to two systems atomically — cannot be solved by sequencing the writes. The outbox pattern solves it by writing the event into the database as an outbox row inside the same transaction as the business write, making atomicity the database’s responsibility. A relay (poll-based or Debezium CDC) delivers events at-least-once to Kafka. The consumer uses the inbox pattern — deduplicating by event ID within its own transaction — to achieve effectively-once processing. Poison messages that cannot be processed go to a dead-letter queue after N attempts; the main pipeline continues. Now when you see “write DB then publish Kafka” in a code review, flag it: there is always a crash window, and the outbox is the fix.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

Why idempotency: making retries safejunior

unlocks

Observability, production failures, and global-scale designsenior

deepens into

Observability, production failures, and global-scale designsenior

appears again in205

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Presigned upload flowDirect-to-storage uploads via presigned URLs with size/content-type limits and a completion webhook that verifies the object actually arrived — so your API server never touches file bytes.