Queues, Streams, Eventing
Outbox pattern: build a crash-safe event pipeline
Reading about the dual-write gap is not the same as watching an event vanish and then closing the hole yourself. Build a small order service, prove that the naive dual write loses events, then add a transactional outbox plus an idempotent consumer and show, with crashes injected on purpose, that nothing is lost and nothing is applied twice.
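The gap takes only a few lines to reproduce. The sketch below is a minimal Python model. It assumes SQLite stands in for the order database and a plain list stands in for the broker. `SimulatedCrash`, `place_order_naive`, and `run_naive` are hypothetical names used only for illustration. A crash injected between the commit and the publish leaves a durable order with no event.

```python
import sqlite3


class SimulatedCrash(Exception):
    """Hypothetical stand-in for the process dying between two steps."""


def place_order_naive(conn, broker, order_id, crash_after_commit=False):
    # Write 1: commit the business row (sqlite3's connection context manager commits on success).
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
    # The dual-write gap: the order is durable, but no event has been sent yet.
    if crash_after_commit:
        raise SimulatedCrash(order_id)
    # Write 2: publish to the "broker" (an in-memory list here).
    broker.append({"type": "OrderPlaced", "order_id": order_id})


def run_naive(crash_schedule):
    """Place one order per schedule entry; True means crash right after the commit."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    broker = []
    for i, crash in enumerate(crash_schedule):
        try:
            place_order_naive(conn, broker, f"o{i}", crash_after_commit=crash)
        except SimulatedCrash:
            pass  # the process "restarts"; nothing records that an event is still owed
    committed = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    return committed, len(broker)


print(run_naive([False, True, False, False, True]))  # (5, 3)
```

With crashes at positions 1 and 4, five orders commit but only three events reach the broker, and nothing on restart can tell which two are missing.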
Turn the unit’s mental model into a working pipeline: write the business row and an outbox row in one transaction, ship rows with a crash-safe polling relay, scale the relay to several replicas without two of them publishing the same row concurrently, and verify that the at-least-once guarantee holds end to end under injected failures.
Build an order service that publishes an OrderPlaced event for every committed order, with zero lost events under crash injection, using a transactional outbox and a polling relay. Then prove the at-least-once guarantee end to end with an idempotent consumer that never applies the same event twice.
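One way to make the consumer idempotent is an inbox table keyed by event id and written in the same local transaction as the side effect. The sketch assumes SQLite and uses a hypothetical `emails` counter as the side effect. `handle_order_placed` and both table layouts are illustrative, not a prescribed schema.

```python
import sqlite3


def setup_consumer(conn):
    # Inbox: one row per processed event id; the PRIMARY KEY is what dedupes.
    conn.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY, processed_at REAL NOT NULL)")
    # Hypothetical side effect: how many confirmation emails were queued per order.
    conn.execute("CREATE TABLE emails (order_id TEXT PRIMARY KEY, sends INTEGER NOT NULL)")


def handle_order_placed(conn, event_id, order_id, now):
    """Apply the side effect at most once per event_id; return True if it was applied."""
    with conn:  # the dedupe record and the side effect commit (or roll back) together
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox (event_id, processed_at) VALUES (?, ?)",
            (event_id, now),
        )
        if cur.rowcount == 0:
            return False  # already seen: a redelivered duplicate is a no-op
        conn.execute(
            "INSERT INTO emails (order_id, sends) VALUES (?, 1) "
            "ON CONFLICT(order_id) DO UPDATE SET sends = sends + 1",
            (order_id,),
        )
        return True


conn = sqlite3.connect(":memory:")
setup_consumer(conn)
deliveries = [("e1", "o1"), ("e2", "o2"), ("e1", "o1")]  # e1 is redelivered
applied = [handle_order_placed(conn, e, o, now=0.0) for e, o in deliveries]
print(applied)  # [True, True, False]
print(conn.execute("SELECT order_id, sends FROM emails ORDER BY order_id").fetchall())
# [('o1', 1), ('o2', 1)]
```

Because the inbox insert and the side effect share one transaction, a crash between them rolls both back, and the redelivered event is applied cleanly on the next attempt.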
- A before/after comparison: the naive dual write loses at least one event under crash injection; the outbox version loses zero across the same crash schedule.
- Evidence (logs or a counter) that the relay delivered at least one event more than once under crash injection, AND that its side effect was still applied exactly once, which shows that the consumer is idempotent.
- With multiple relay replicas running concurrently, no event id is published twice in the same pass, because SELECT ... FOR UPDATE SKIP LOCKED hands each replica a disjoint batch of unsent rows.
- A short write-up: where the dual-write gap was, why one local transaction closes it, why delivery is still at-least-once, and how the consumer dedupes.
- Swap the polling relay for a CDC relay (Debezium tailing the WAL, or your DB's logical replication) and measure the end-to-end latency drop compared with polling at a 500 ms interval.
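- Add outbox cleanup that does not hurt the write path. Use either batched deletes of old sent rows or daily partitions reaped by dropping whole partitions (ALTER TABLE ... DROP PARTITION in MySQL; in PostgreSQL, DROP TABLE on the partition, optionally after DETACH PARTITION). Show that the unsent-row poll stays fast as the table grows.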
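- Preserve per-aggregate ordering: key events by aggregate id so that all events for one order land on one partition and one worker, and demonstrate that ordering holds even with multiple relays. SKIP LOCKED alone does not guarantee this. Two replicas can claim different events for the same order and publish them out of order, so the relays must also avoid publishing one aggregate's events concurrently.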
- Add an inbox-retention sweep (drop processed event ids older than N hours) and reason about the window: too short risks re-applying a very late duplicate, too long bloats the dedupe table.
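Both cleanup stretch goals come down to deleting old rows without holding long locks. The sketch below assumes SQLite and hypothetical `sent_at` / `processed_at` timestamp columns. `reap_sent_outbox` deletes in bounded batches, one short transaction each, and `sweep_inbox` applies the retention window. A partial index on unsent rows is one common way to keep the relay's poll cheap as sent rows accumulate. The subquery-with-LIMIT shape should carry over to PostgreSQL, which has no `DELETE ... LIMIT`, though that is worth checking against your schema.

```python
import sqlite3


def reap_sent_outbox(conn, cutoff, batch_size=1000):
    """Delete sent outbox rows older than cutoff in small batches; return rows deleted."""
    total = 0
    while True:
        with conn:  # one short transaction per batch keeps lock hold times small
            cur = conn.execute(
                "DELETE FROM outbox WHERE seq IN ("
                " SELECT seq FROM outbox WHERE sent = 1 AND sent_at < ?"
                " ORDER BY seq LIMIT ?)",
                (cutoff, batch_size),
            )
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total  # a short batch means nothing older than cutoff is left


def sweep_inbox(conn, now, retention_seconds):
    """Forget processed event ids older than the retention window; return rows deleted."""
    with conn:
        cur = conn.execute(
            "DELETE FROM inbox WHERE processed_at < ?", (now - retention_seconds,)
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY, sent INTEGER NOT NULL, sent_at REAL)")
# Partial index: the relay's "WHERE sent = 0" poll only touches unsent rows.
conn.execute("CREATE INDEX outbox_unsent ON outbox (seq) WHERE sent = 0")
conn.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY, processed_at REAL NOT NULL)")
with conn:
    conn.executemany(
        "INSERT INTO outbox (seq, sent, sent_at) VALUES (?, 1, ?)",
        [(i, float(i)) for i in range(2500)],
    )
    conn.execute("INSERT INTO outbox (seq, sent, sent_at) VALUES (9999, 0, NULL)")
    conn.executemany("INSERT INTO inbox VALUES (?, ?)", [("old", 0.0), ("fresh", 7000.0)])

print(reap_sent_outbox(conn, cutoff=2000.0, batch_size=500))  # 2000
print(sweep_inbox(conn, now=7200.0, retention_seconds=3600))  # 1
```

The retention window should comfortably exceed the longest plausible redelivery delay, because a duplicate that arrives after its inbox row is swept will be applied again.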
This is the pipeline you will reach for whenever a write must reliably trigger a notification: prove the naive dual write loses events, then write the business row and an outbox row in one local transaction so the intent to publish is durable, ship rows with a relay that claims disjoint batches via SKIP LOCKED, and make consumers idempotent on a stable event id so the unavoidable at-least-once duplicate becomes a no-op. Build it once with crashes injected on purpose, and the production version, which never silently drops a write, becomes muscle memory.
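The disjoint-batch property the relay depends on can be illustrated without a database. The sketch below is an in-memory model of `FOR UPDATE SKIP LOCKED` semantics that assumes each outbox row carries its own lock. `Row`, `claim_batch`, and `relay_replica` are hypothetical. It shows four replicas draining 1,000 rows with no event published twice. It does not model crashes, which would still cause redelivery.

```python
import threading
from dataclasses import dataclass, field

# In-memory model of SKIP LOCKED claims, for illustration only. A real PostgreSQL relay
# would run something like this inside one transaction per batch:
#   SELECT seq, event_id, payload FROM outbox
#    WHERE sent = false ORDER BY seq LIMIT %s
#    FOR UPDATE SKIP LOCKED;
# then publish the rows, UPDATE them to sent = true, and COMMIT (releasing the row locks).


@dataclass
class Row:
    seq: int
    event_id: str
    sent: bool = False
    lock: threading.Lock = field(default_factory=threading.Lock)


def claim_batch(table, batch_size):
    """Lock up to batch_size unsent rows, skipping rows another replica holds."""
    claimed = []
    for row in table:  # table is kept in seq order
        if len(claimed) == batch_size:
            break
        if row.sent or not row.lock.acquire(blocking=False):
            continue  # already shipped, or locked by another replica: skip, don't wait
        if row.sent:  # re-check under the lock, as the database re-evaluates WHERE
            row.lock.release()
            continue
        claimed.append(row)
    return claimed


def relay_replica(table, publish, batch_size):
    while True:
        batch = claim_batch(table, batch_size)
        if not batch:
            return  # everything left is sent or held by another replica
        for row in batch:
            publish(row.event_id)
            row.sent = True  # models "UPDATE outbox SET sent = true"
        for row in batch:
            row.lock.release()  # models COMMIT releasing the row locks


table = [Row(seq=i, event_id=f"e{i}") for i in range(1000)]
published = []
published_lock = threading.Lock()


def publish(event_id):
    with published_lock:
        published.append(event_id)


replicas = [threading.Thread(target=relay_replica, args=(table, publish, 25)) for _ in range(4)]
for t in replicas:
    t.start()
for t in replicas:
    t.join()
print(len(published), len(set(published)))  # 1000 1000
```

The re-check after acquiring the lock matters. Without it, a replica that scanned a row just before another replica committed it could publish that row a second time.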