awesome-everything RU
↑ Back to the climb

Backend Architecture

Outbox and inbox: effectively-once across the dual-write boundary

Crux Direct dual-write to DB + Kafka breaks in either order. The outbox pattern writes the event inside the DB transaction; the inbox pattern deduplicates on the consumer side. Together they give effectively-once delivery.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 16 min

An order service writes the order row to Postgres, then publishes an “order.created” event to Kafka. Between the commit and the publish, the pod crashes. The order exists in the database. The downstream inventory service never sees the event. The order is stuck.

Why dual-write always fails

Any sequence of “write to database, then publish to message broker” has a crash window between the two:

SequenceCrash pointResult
Write DB → publish KafkaAfter DB, before KafkaDB updated, Kafka silent — consumers miss the event
Publish Kafka → write DBAfter Kafka, before DBPhantom event in Kafka — consumers see a write that never happened

Both sequences are broken. There is no version of “two separate writes” that is atomic.

The outbox pattern: use the database as the broker

Key insight: write the event to the database, inside the same transaction as the business write. The database becomes the single source of truth.

BEGIN;
  -- Business write
  INSERT INTO orders (id, customer_id, total) VALUES ($1, $2, $3);

  -- Outbox write (same transaction)
  INSERT INTO outbox (id, event_type, payload, published)
  VALUES (gen_random_uuid(), 'order.created', $payload, false);
COMMIT;
-- Kafka publish happens AFTER the transaction commits

A separate outbox-relay process polls the outbox table (or tails the Write-Ahead Log via Debezium) and publishes unpublished rows to Kafka, marking them published on broker ACK.

Crash scenarios are now safe:

  • Crash before COMMIT → transaction rolls back. Neither the order nor the outbox row exists. No phantom event.
  • Crash after COMMIT, before relay publishes → order exists, outbox row published=false. Relay picks it up on next poll.
  • Kafka is down → relay retries. Outbox accumulates. Order is safe in DB.
ApproachCrash-safe?Kafka required?Pattern
DB then KafkaNo — silent event lossYesAnti-pattern
Kafka then DBNo — phantom eventYesAnti-pattern
Outbox (DB only)YesNo (async relay)Correct

Relay implementations

Poll-based relay: queries WHERE published = false ORDER BY id LIMIT 1000 every few seconds. Simple, adds latency equal to poll interval (typically 1–5 s).

Debezium (CDC relay): tails the Postgres Write-Ahead Log via logical replication. Publishes as soon as the row commits. Sub-second lag. No polling overhead. Production standard for high-throughput services.

AWS SAM variant: DynamoDB Streams → Lambda → EventBridge — managed CDC without running a relay process.

The inbox pattern: deduplicate on the consumer

The relay delivers at-least-once (it may publish twice if it restarts after publishing but before marking the row). The consumer must be idempotent.

Inbox pattern: before applying the event’s business effect, write the event’s id into a processed_events table inside the same transaction as the business write.

BEGIN;
  -- Check if already processed
  INSERT INTO processed_events (event_id) VALUES ($1)
    ON CONFLICT (event_id) DO NOTHING;

  -- If 0 rows affected, this event was already processed — skip
  IF found THEN
    UPDATE inventory SET reserved = reserved - $qty WHERE sku = $sku;
  END IF;
COMMIT;

If the event arrives twice, the second attempt hits the UNIQUE constraint on event_id and the business write is skipped. Effectively-once behavior from the consumer’s perspective.

Inbox table maintenance: partition by event_timestamp and drop partitions older than the broker’s retention window (7–14 days). Without cleanup, the inbox grows forever.

Dead-letter queues: handling unrecoverable failures

Not every failure is recoverable by retry. Malformed data, schema breaks, or violated business invariants are poison messages — no amount of retry will fix them.

After N processing attempts (typically 3–5), move the message to a dead-letter queue (DLQ). The main pipeline keeps flowing; humans review the DLQ.

Main queue → [N retry attempts] → DLQ → human review
                                       → manual replay (after fix)
                                       → formal rejection (business rule)

Production settings: N = 3–5 attempts, DLQ retention 7–30 days, alert on DLQ depth growth. A growing DLQ is a compliance liability — every entry is a record of an operation that may have partial effects.

Why this works

Why is the outbox-relay’s at-least-once delivery acceptable even for payment events? Because the consumer runs the inbox pattern — it deduplicates by event ID before applying any business effect. The composition is: Postgres atomicity (guarantees the outbox row exists if and only if the business write committed) + at-least-once relay + idempotent consumer = effectively-once. No single component needs to be exactly-once; the guarantee emerges from the composition.

Quiz

Why is the outbox pattern necessary if the application can publish to Kafka directly from the API handler?

Order the steps

Put the outbox publish flow in correct order:

  1. 1 Application opens a database transaction
  2. 2 Application performs the business write (e.g., insert order row)
  3. 3 Same transaction inserts an outbox row recording the event-to-be-published
  4. 4 Application commits — business write and outbox row visible atomically
  5. 5 Outbox-relay process scans for unpublished outbox rows and publishes to Kafka
  6. 6 Relay marks the outbox row as published after receiving broker ACK
Quiz

The outbox relay crashes after publishing to Kafka but before marking the outbox row as published. What happens when the relay restarts?

Recall before you leave
  1. 01
    A team wants exactly-once delivery to Kafka from a Postgres write. They consider: (a) write Postgres then publish, (b) publish then write Postgres, (c) outbox. Why is (c) correct?
  2. 02
    What is the inbox pattern and what does it protect against?
  3. 03
    When should a message go to the dead-letter queue instead of being retried?
Recap

The dual-write problem — writing to two systems atomically — cannot be solved by sequencing the writes. The outbox pattern solves it by writing the event into the database as an outbox row inside the same transaction as the business write, making atomicity the database’s responsibility. A relay (poll-based or Debezium CDC) delivers events at-least-once to Kafka. The consumer uses the inbox pattern — deduplicating by event ID within its own transaction — to achieve effectively-once processing. Poison messages that cannot be processed go to a dead-letter queue after N attempts; the main pipeline continues.

Connected lessons
appears again in179
Continue the climb ↑Concurrency and cache architecture for idempotency at scale
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.