End-to-end order pipeline: where each delivery guarantee lives

Wire one order write through outbox, CDC, partitioned Kafka topics, idempotent consumer groups, a dead-letter queue, and an eventually-consistent UI — and see why idempotency is the one invariant that holds the whole thing together.


A customer is charged twice for one order. The trace tells the story: the payment consumer processed the order.placed event, called the gateway, succeeded — and then the pod was killed mid-commit, before the Kafka offset moved. The group rebalanced, a new consumer picked up the same offset, and replayed the event. Two charges, one order. The gateway call had no idempotency key, so it had no way to know it was the same charge. The pipeline was correct everywhere except the one place that mattered.

One write, six hops, one guarantee that ties them

In the next ten minutes you will see exactly where that double charge happened, why no single “exactly-once” switch could have prevented it, and what one invariant actually closes the gap.

Follow a single order through an event-driven pipeline and you cross six boundaries, each with its own failure mode. The write happens once. Everything after is a relay across networks and processes that can each retry, crash, reorder, or stall. The capstone insight is that you do not get one global “exactly-once” knob. You choose a guarantee at each hop, and then you defend the seams with one invariant — idempotency — because every honest hop in this chain is at-least-once.

Only the write is exactly-once. Every hop after it is at-least-once, so an idempotent consumer is what makes replays safe.

The flow:

Hop	Mechanism	Guarantee it gives	What defends it
1. Order write	DB row + outbox row in one txn	Atomic: event exists iff order does	ACID transaction
2. Publish	CDC relay tails the WAL → Kafka	At-least-once to the topic	Resume from log offset
3. Route	Topic partitioned by order id	Ordering per key	Stable key-to-partition mapping (fixed partition count)
4. Process	Consumer group: payment / inventory / notify	At-least-once delivery	Idempotent handler + dedup
5. Quarantine	Retry with backoff → DLQ	Poison messages don’t block	Retry budget + DLQ alarm
6. Display	UI shows optimistic + pending	Eventual consistency, honestly	Pending state + reconcile

Hop 1–2: the write is atomic, the publish is not

The first decision is the famous dual-write problem: you must save the order and emit order.placed. If you write to the DB and publish to Kafka as two separate calls, a crash between them either loses the event (DB commit first: order exists, nobody knows) or emits a phantom (publish first: event sent, order rolled back). The transactional outbox removes the gap: in one local DB transaction you write the order row and an outbox row. Either both commit or neither does. The event now exists if and only if the order does.
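A minimal sketch of the outbox write, assuming SQLite as a stand-in for the real database. The table names (orders, outbox), the column layout, and the place_order helper are illustrative assumptions, not part of the lesson's system.

```python
import json
import sqlite3
import uuid


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one business table, one outbox table.
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq        INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id   TEXT NOT NULL UNIQUE,
            event_type TEXT NOT NULL,
            payload    TEXT NOT NULL
        );
        """
    )


def place_order(conn, order_id, total_cents, crash_mid_txn=False):
    event_id = str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    # "with conn" wraps both inserts in one transaction:
    # commit if the block finishes, rollback if it raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        if crash_mid_txn:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "order.placed", payload),
        )
    return event_id


conn = sqlite3.connect(":memory:")
init_schema(conn)
place_order(conn, "o-1", 4200)
try:
    place_order(conn, "o-2", 999, crash_mid_txn=True)
except RuntimeError:
    pass  # the order row for o-2 was rolled back with the transaction

orders = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
events = [
    json.loads(row[0])["order_id"]
    for row in conn.execute("SELECT payload FROM outbox ORDER BY seq")
]
```

After the simulated crash, both lists contain only o-1: there is no order without an event and no event without an order.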

Publishing is a separate concern, and it is honestly at-least-once. A CDC relay (Debezium tailing the Postgres WAL is the canonical setup) reads committed outbox rows from the log and pushes them to Kafka. Its superpower is recoverability: it tracks its log offset, so on restart it resumes from the exact position — zero polling, near-zero DB load because it reads the log, not the table. But “resume from offset” means it can re-emit a row it published just before crashing. That is fine, and it is the whole point: you are not chasing exactly-once on the wire; you are pushing the dedup responsibility to the consumer, where it belongs.

Why this works

A senior team starts with the polling publisher (a worker that SELECTs unsent outbox rows and marks them sent) before reaching for CDC. It is trivial to build, debug, and operate, and it is correct. You migrate to log-tailing CDC only when polling latency or the load of the marker write becomes a real problem — not by default. The pattern is the same; only the relay changes.
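A polling publisher can be sketched in a few lines. This is an illustration under assumptions: SQLite stands in for the database, the outbox columns and the FlakyBroker class are hypothetical, and the broker deliberately loses one acknowledgement to show why the relay is at-least-once.

```python
import sqlite3


def relay_once(conn, publish, batch_size=100):
    """Publish unsent outbox rows in order, marking each one only after publish returns."""
    rows = conn.execute(
        "SELECT seq, event_id, payload FROM outbox WHERE sent = 0 ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, payload)  # may raise; the row then stays unsent
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)


class FlakyBroker:
    """Hypothetical broker that stores the first record but loses its ack once."""

    def __init__(self):
        self.log = []
        self.fail_next_ack = True

    def publish(self, event_id, payload):
        self.log.append(event_id)
        if self.fail_next_ack:
            self.fail_next_ack = False
            raise ConnectionError("ack lost after the broker stored the record")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (seq INTEGER PRIMARY KEY, event_id TEXT, "
    "payload TEXT, sent INTEGER NOT NULL DEFAULT 0)"
)
with conn:
    conn.executemany(
        "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
        [("e-1", "{}"), ("e-2", "{}")],
    )

broker = FlakyBroker()
try:
    relay_once(conn, broker.publish)  # e-1 reaches the broker, but the relay sees an error
except ConnectionError:
    pass
relay_once(conn, broker.publish)  # retry: e-1 is emitted again, then e-2

unsent = conn.execute("SELECT COUNT(*) FROM outbox WHERE sent = 0").fetchone()[0]
```

The broker log ends up with e-1 twice. Nothing is lost, and the duplicate is left for the consumer to absorb, which is the same contract a log-tailing relay offers when it resumes from its offset.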

Hop 3: partition by the key that owns the order

Ask yourself: what goes wrong if payment.authorized arrives before order.placed for the same order? The answer is the reason partitioning by key is not a performance trick but a correctness guarantee.

Kafka guarantees ordering only within a partition, and for keyed records the partition is chosen from the record key. Key your order events by order_id (or customer_id when you need a customer’s events serialized) and every event for that order lands in the same partition, consumed in offset order by one consumer in the group. payment.authorized can never overtake order.placed for the same order. Cross-order ordering is neither guaranteed nor needed, and that is exactly what lets you scale across partitions.

The trap is operational: the default partitioner is hash(key) % partitionCount. The moment you add partitions to scale, the modulus changes, the same key maps to a different partition, and in-flight ordering for that key breaks at the seam. Senior teams over-provision partitions up front, or migrate keys deliberately, rather than bumping partition count on a hot Friday. Partition count is a one-way door for key ordering.
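The modulus effect is easy to see in a small sketch. One assumption to flag: zlib.crc32 is used here only as a stand-in hash (Kafka's Java client hashes keyed records with murmur2), and partition_for is a hypothetical helper, not a Kafka API.

```python
import zlib


def partition_for(key: str, partition_count: int) -> int:
    # Stand-in for hash(key) % partitionCount; crc32 replaces murmur2 for illustration.
    return zlib.crc32(key.encode("utf-8")) % partition_count


keys = [f"order-{i}" for i in range(1000)]

# Same key, same partition count: the mapping is stable.
stable = all(partition_for(k, 6) == partition_for(k, 6) for k in keys)

# Grow the topic from 6 to 8 partitions and see which keys change partition.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 8)]
moved_fraction = len(moved) / len(keys)
```

Any key in moved now has older events in one partition and newer events in another, so two different consumers may process them with no ordering between them.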

Hop 4–5: at-least-once is a promise to deliver duplicates

The consumer group is where the load-bearing rule lives. To get at-least-once you commit the offset only after the handler succeeds: process, then commit. The cost is duplicates: if the pod dies after processing but before the commit (exactly the double charge from the opening incident), the next consumer replays that offset. At-least-once is not a bug to be fixed; it is a contract that says “I will deliver your message, possibly more than once.” The only correct response is an idempotent handler.
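The replay can be reproduced with a toy model. This sketch does not use a Kafka client: Partition, consume, and PodKilled are hypothetical names, and the committed offset is just an integer on an in-memory log.

```python
class PodKilled(Exception):
    """Simulates the process dying between the side effect and the commit."""


class Partition:
    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # next offset the consumer group will read


def consume(partition, handler, crash_before_commit_at=None):
    offset = partition.committed
    while offset < len(partition.records):
        handler(partition.records[offset])   # 1. process
        if offset == crash_before_commit_at:
            raise PodKilled(f"died after processing offset {offset}")
        partition.committed = offset + 1     # 2. then commit
        offset += 1


charges = []  # a non-idempotent side effect: every call adds a charge
partition = Partition(["order-1", "order-2"])

try:
    consume(partition, charges.append, crash_before_commit_at=1)
except PodKilled:
    pass  # offset 1 was processed but never committed

committed_after_crash = partition.committed

# After the rebalance, a new consumer resumes from the last committed offset.
consume(partition, charges.append)
```

order-2 is charged twice because the handler has no way to recognise the replay. The commit order is correct; the handler is what needs to change.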

Idempotency means processing the same event twice has the same effect as once. The mechanism is a dedup key (the event id, or a business idempotency key) recorded in a dedup table with a unique constraint, inside the same transaction as the side effect. Re-delivery hits the constraint and is a no-op. For external calls, propagate the key downstream: Stripe’s Idempotency-Key header lets the API recognise a repeated request, and Stripe keeps keys for at least 24 hours, so passing your event’s key into the charge call means a retry returns the original charge instead of creating a second one. That single header is the difference between the opening incident being a near-miss and a double charge.
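Both halves of that mechanism fit in a short sketch. Assumptions: SQLite stands in for the consumer's database, the processed_events and payments tables are hypothetical, and FakeGateway is an in-memory stand-in that only imitates how a key-aware payment API behaves; it is not Stripe's client.

```python
import sqlite3


class FakeGateway:
    """Hypothetical payment API that honors an idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, amount_cents, idempotency_key):
        # A repeated key returns the original charge instead of creating a new one.
        if idempotency_key not in self.charges:
            charge_id = f"ch_{len(self.charges) + 1}"
            self.charges[idempotency_key] = {"id": charge_id, "amount_cents": amount_cents}
        return self.charges[idempotency_key]


def handle_order_placed(conn, gateway, event) -> bool:
    """Return True if the event was applied, False if it was a replay."""
    charge = gateway.charge(event["amount_cents"], idempotency_key=event["event_id"])
    try:
        with conn:  # dedup row and side effect commit or roll back together
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            conn.execute(
                "INSERT INTO payments (order_id, charge_id, amount_cents) VALUES (?, ?, ?)",
                (event["order_id"], charge["id"], event["amount_cents"]),
            )
    except sqlite3.IntegrityError:
        return False  # primary key on event_id rejected the replay
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE payments (order_id TEXT, charge_id TEXT, amount_cents INTEGER);
    """
)
gateway = FakeGateway()
event = {"event_id": "evt-123", "order_id": "o-1", "amount_cents": 4200}

first = handle_order_placed(conn, gateway, event)
replay = handle_order_placed(conn, gateway, event)
payment_rows = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

The replay reaches the gateway with the same key and gets the original charge back, then hits the unique constraint locally and changes nothing.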

Some failures are not transient: a malformed payload, a permanently rejected card. Retrying those forever blocks the partition (head-of-line blocking) and lets lag grow without bound. So you give each message a retry budget with exponential backoff for transient errors, and on exhaustion route it to a dead-letter queue. Uber’s reliable-reprocessing design uses tiered retry topics feeding a DLQ precisely so a poison message is quarantined instead of stalling the live stream. The DLQ is not a graveyard; it is an inbox you must drain.
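A retry budget with backoff and a DLQ can be sketched as below. Several things here are assumptions for illustration: the TransientError and PermanentError classes, the in-memory dlq list, and the choice to dead-letter permanent failures immediately rather than spend the budget on them. The sketch also retries in-process, which still holds the partition while it sleeps; a tiered retry-topic design would republish the message instead.

```python
import time


class TransientError(Exception):
    """Might succeed on retry (timeout, broker hiccup)."""


class PermanentError(Exception):
    """Will never succeed (malformed payload, rejected card)."""


def process_with_retries(message, handler, dlq, max_attempts=3,
                         base_delay=0.5, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return "ok"
        except PermanentError as exc:
            dlq.append((message, repr(exc)))  # retrying cannot help
            return "dead-lettered"
        except TransientError as exc:
            if attempt == max_attempts:
                dlq.append((message, repr(exc)))  # budget exhausted
                return "dead-lettered"
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1.0s, 2.0s, ...


delays, dlq = [], []
calls = {"n": 0}


def flaky(message):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")


def always_down(message):
    raise TransientError("timeout")


def poison(message):
    raise PermanentError("malformed payload")


# delays.append replaces time.sleep so the example runs instantly.
r1 = process_with_retries("m-1", flaky, dlq, sleep=delays.append)
r2 = process_with_retries("m-2", always_down, dlq, sleep=delays.append)
r3 = process_with_retries("m-3", poison, dlq, sleep=delays.append)
```

Only m-2 and m-3 land in the DLQ, and the consumer moves on to the next offset in both cases instead of stalling behind them.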

Pick the best fit

The payment consumer occasionally double-charges on retry after a crash. Where do you put the fix?

Hop 6: the UI must tell the truth about lag

Because every hop is asynchronous, the moment the user clicks “Place order” the answer is not yet known. Inventory and notification may settle hundreds of milliseconds later; under load they settle when consumer lag drains. A senior UI does not lie about this. It applies an optimistic update so the order appears instantly, marks it pending until a confirmation event arrives, and reconciles to confirmed or failed from the real state. The eventual-consistency window is a product surface, not a defect to hide.
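The pending-then-reconcile behaviour amounts to a small state machine. The sketch below models only the state logic, not a real front end, and the event type names order.confirmed and order.failed are assumptions rather than events defined in this lesson.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OrderView:
    order_id: str
    status: str  # "pending" | "confirmed" | "failed"


def place_order_optimistically(order_id: str) -> OrderView:
    # Show the order at once, but label it honestly as pending.
    return OrderView(order_id=order_id, status="pending")


def reconcile(view: OrderView, event: dict) -> OrderView:
    if event["order_id"] != view.order_id:
        return view  # event belongs to another order
    if view.status != "pending":
        return view  # already settled: duplicates and late events are ignored
    if event["type"] == "order.confirmed":
        return replace(view, status="confirmed")
    if event["type"] == "order.failed":
        return replace(view, status="failed")
    return view  # unrelated event type


view = place_order_optimistically("o-1")
initial_status = view.status
view = reconcile(view, {"order_id": "o-2", "type": "order.confirmed"})
status_after_other_order = view.status
view = reconcile(view, {"order_id": "o-1", "type": "order.confirmed"})
view = reconcile(view, {"order_id": "o-1", "type": "order.failed"})  # late duplicate
```

Because the reducer ignores events once the view has settled, a replayed or late event cannot flip a confirmed order back to failed; the UI is idempotent in the same way the consumers are.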

This is also where observability earns its keep. Three metrics catch the silent killers. Consumer lag is the gap between the latest offset and the committed offset; Burrow evaluates its trend over a sliding window and reports a status such as OK, WARN, or ERR rather than a raw number. DLQ depth matters because a rising DLQ means handlers are failing faster than you drain. End-to-end latency needs a correlation id stamped on the event at hop 1 and carried through every async hop, so one trace spans the whole journey. The double charge from the opening incident is invisible without that trace. Lag growing until the DLQ overflows is invisible without those two gauges and an alarm on each.
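Lag itself is a subtraction; the useful part is judging it by trend. The sketch below is a simplified rule set loosely inspired by trend-based evaluation. It is not Burrow's actual algorithm, and the thresholds and function names are assumptions.

```python
def consumer_lag(latest_offset: int, committed_offset: int) -> int:
    return max(latest_offset - committed_offset, 0)


def lag_status(samples):
    """samples: (latest_offset, committed_offset) pairs over a window, oldest first."""
    lags = [consumer_lag(latest, committed) for latest, committed in samples]
    committed = [c for _, c in samples]
    if any(lag == 0 for lag in lags):
        return "OK"    # the consumer caught up at least once in the window
    if committed[-1] == committed[0]:
        return "ERR"   # nonzero lag and no commits: the consumer is stuck
    if all(later > earlier for earlier, later in zip(lags, lags[1:])):
        return "WARN"  # committing, but falling further behind every sample
    return "OK"


caught_up = lag_status([(100, 100), (110, 105), (120, 120)])
falling_behind = lag_status([(100, 90), (120, 95), (150, 100)])
stuck = lag_status([(100, 90), (120, 90), (150, 90)])
steady = lag_status([(100, 90), (110, 102), (120, 111)])
```

A raw lag of 50 appears in both the falling-behind and the stuck window, yet the first consumer is slow and the second has stopped committing. That difference is why the trend is the signal to alarm on.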

Quiz

Your CDC relay restarts and re-emits a handful of outbox rows it had already published. Is the pipeline broken?

Quiz

Consumer lag has been climbing for an hour and the DLQ is filling. What does this combination most likely mean?

Order the steps

Order one order event's journey through the pipeline:

1 Write the order row and an outbox row in one DB transaction
2 CDC relay tails the WAL and publishes the committed row to Kafka
3 Topic is partitioned by order id, preserving per-order ordering
4 Consumer group processes it; idempotent handler dedups any replay
5 On exhausted retries, the poison message goes to the dead-letter queue
6 UI reconciles its optimistic pending row to confirmed or failed

The write is atomic and happens exactly once (outbox); every hop after it is at-least-once, so the idempotent consumer is what makes replays safe. Messages that exhaust their retry budget branch to the DLQ instead of blocking the partition.

Recall before you leave

01
Walk through why idempotency, not exactly-once delivery, is the invariant that holds an event-driven order pipeline together.
02
Two metrics together — rising consumer lag and a filling dead-letter queue — describe a specific failure. Explain it and where observability has to live.

Recap

An event-driven order pipeline is six hops, and you choose a guarantee at each: the order write is atomic because the row and its outbox event commit in one transaction; the CDC relay publishes at-least-once by resuming from its log offset; the topic preserves ordering per key by partitioning on order id, which is also why bumping partition count is a one-way door that can break key ordering; the consumer group delivers at-least-once by committing offsets only after processing; poison messages that exhaust a retry budget go to a dead-letter queue instead of blocking the partition; and the UI tells the truth about the eventual-consistency window with optimistic-then-pending state. The thread through all of it is idempotency — a dedup key under a unique constraint, plus an Idempotency-Key propagated to external calls — because every honest hop is at-least-once and duplicates are the contract, not the bug. Observability lives at the seams: a correlation id across every async hop, consumer lag watched by trend, and an alarm on DLQ depth. Now when you see a double charge in production, you know the exact hop to look at — and when you see lag climbing alongside a filling DLQ, you know the handler is failing, not the broker.
