
Queues, Streams, Eventing

End-to-end order pipeline: where each delivery guarantee lives

Crux: Wire one order write through outbox, CDC, partitioned Kafka topics, idempotent consumer groups, a dead-letter queue, and an eventually-consistent UI, and see why idempotency is the one invariant that holds the whole thing together.

A customer is charged twice for one order. The trace tells the story: the payment consumer processed the order.placed event, called the gateway, succeeded — and then the pod was killed mid-commit, before the Kafka offset moved. The group rebalanced, a new consumer picked up the same offset, and replayed the event. Two charges, one order. The gateway call had no idempotency key, so it had no way to know it was the same charge. The pipeline was correct everywhere except the one place that mattered.

One write, six hops, one guarantee that ties them

Follow a single order through an event-driven pipeline and you cross six boundaries, each with its own failure mode. The write happens once. Everything after is a relay across networks and processes that can each retry, crash, reorder, or stall. The capstone insight is that you do not get one global “exactly-once” knob. You choose a guarantee at each hop, and then you defend the seams with one invariant — idempotency — because every honest hop in this chain is at-least-once.

The flow:

| Hop | Mechanism | Guarantee it gives | What defends it |
|---|---|---|---|
| 1. Order write | DB row + outbox row in one txn | Atomic: event exists iff order does | ACID transaction |
| 2. Publish | CDC relay tails the WAL → Kafka | At-least-once to the topic | Resume from log offset |
| 3. Route | Topic partitioned by order id | Ordering per key | Stable partition assignment |
| 4. Process | Consumer group: payment / inventory / notify | At-least-once delivery | Idempotent handler + dedup |
| 5. Quarantine | Retry with backoff → DLQ | Poison messages don’t block | Retry budget + DLQ alarm |
| 6. Display | UI shows optimistic + pending | Eventual consistency, honestly | Pending state + reconcile |

Hop 1–2: the write is atomic, the publish is not

The first decision is the famous dual-write problem: you must save the order and emit order.placed. If you write to the DB and then publish to Kafka as two separate calls, a crash between them either loses the event (order exists, nobody knows) or emits a phantom (event sent, order rolled back). The transactional outbox removes the gap: in one local DB transaction you write the order row and an outbox row. Either both commit or neither does. The event now exists if and only if the order does.
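The outbox write described above can be sketched as follows. This is a minimal illustration using an in-memory SQLite database; the table names, column names, and the `place_order` function are assumptions made for the example, not part of any particular system.

```python
import json
import sqlite3
import uuid
from typing import Optional


def open_db() -> sqlite3.Connection:
    """Create an in-memory DB with a hypothetical orders table and outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        " event_id TEXT PRIMARY KEY,"
        " aggregate_id TEXT NOT NULL,"
        " event_type TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def place_order(
    conn: sqlite3.Connection,
    order_id: str,
    total_cents: int,
    event_id: Optional[str] = None,
) -> str:
    """Write the order row and its outbox row in one local transaction."""
    event_id = event_id or str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so either both rows exist afterwards or neither does.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, aggregate_id, event_type, payload)"
            " VALUES (?, ?, ?, ?)",
            (event_id, order_id, "order.placed", payload),
        )
    return event_id
```

If the outbox insert fails, the order insert is rolled back with it, which is the "event exists if and only if the order does" property.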

Publishing is a separate concern, and it is honestly at-least-once. A CDC relay (Debezium tailing the Postgres WAL is the canonical setup) reads committed outbox rows from the log and pushes them to Kafka. Its superpower is recoverability: it tracks its log offset, so on restart it resumes from the exact position — zero polling, near-zero DB load because it reads the log, not the table. But “resume from offset” means it can re-emit a row it published just before crashing. That is fine, and it is the whole point: you are not chasing exactly-once on the wire; you are pushing the dedup responsibility to the consumer, where it belongs.

Why this works

A senior team starts with the polling publisher (a worker that SELECTs unsent outbox rows and marks them sent) before reaching for CDC. It is trivial to build, debug, and operate, and it is correct. You migrate to log-tailing CDC only when polling latency or the load of the marker write becomes a real problem — not by default. The pattern is the same; only the relay changes.
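A polling publisher of the kind described above might look roughly like this. It is a sketch under assumptions: the `outbox` schema with a `sent` marker column, the `relay_once` name, and the `publish` callback (standing in for a real Kafka producer call) are all hypothetical.

```python
import sqlite3
from typing import Callable


def make_outbox() -> sqlite3.Connection:
    """In-memory outbox table with a 'sent' marker column (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_id TEXT UNIQUE NOT NULL,"
        " payload TEXT NOT NULL,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def relay_once(
    conn: sqlite3.Connection,
    publish: Callable[[str, str], None],
    batch_size: int = 100,
) -> int:
    """Publish one batch of unsent rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_id, payload FROM outbox"
        " WHERE sent = 0 ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    published = 0
    for seq, event_id, payload in rows:
        # If publish raises, the row stays unsent and the next pass retries it.
        publish(event_id, payload)
        # A crash exactly here re-emits this row on the next pass:
        # the relay is at-least-once, so consumers must deduplicate.
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
        published += 1
    return published
```

A worker would call `relay_once` in a loop with a short sleep between empty passes. The marker `UPDATE` is the write load that the text mentions as one reason to move to log-tailing CDC later.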

Hop 3: partition by the key that owns the order

Kafka guarantees ordering only within a partition, and the partition is chosen by the record key. Key your order events by order_id (or customer_id when you need a customer’s events serialized) and every event for that order lands in the same partition, processed in commit order by one consumer. payment.authorized can never overtake order.placed for the same order. Cross-order ordering is neither guaranteed nor needed — that is exactly what lets you scale by adding partitions.

The trap is operational: the default partitioner is hash(key) % partitionCount. The moment you add partitions to scale, the modulus changes, the same key maps to a different partition, and in-flight ordering for that key breaks at the seam. Senior teams over-provision partitions up front, or migrate keys deliberately, rather than bumping partition count on a hot Friday. Partition count is a one-way door for key ordering.
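To make the modulus problem concrete, here is a small sketch. It uses `zlib.crc32` as a stand-in hash purely for illustration (Kafka's own default partitioner hashes the serialized key with murmur2), and the `partition_for` and `moved_keys` helpers are hypothetical names.

```python
import zlib
from typing import Iterable, List


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition with hash(key) % partitionCount.

    crc32 is an illustrative stand-in, not the hash Kafka actually uses.
    """
    return zlib.crc32(key.encode("utf-8")) % partition_count


def moved_keys(keys: Iterable[str], old_count: int, new_count: int) -> List[str]:
    """Return the keys whose partition changes when the partition count changes."""
    return [
        key
        for key in keys
        if partition_for(key, old_count) != partition_for(key, new_count)
    ]


if __name__ == "__main__":
    order_ids = [f"order-{i}" for i in range(1000)]
    moved = moved_keys(order_ids, old_count=6, new_count=8)
    print(f"{len(moved)} of {len(order_ids)} keys change partition")
```

Any key in the moved set can have older events still sitting unprocessed in its old partition while newer events arrive in the new one, which is where per-key ordering breaks.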

Hop 4–5: at-least-once is a promise to deliver duplicates

The consumer group is where the load-bearing rule lives. To get at-least-once you commit the offset only after the handler succeeds: process, then commit. The cost is duplicates: if the pod dies after processing but before the commit (exactly the double-charge scenario that opened this piece), the next consumer replays that offset. At-least-once is not a bug to be fixed; it is a contract that says “I will deliver your message, possibly more than once.” The only correct response is an idempotent handler.
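The process-then-commit rule and its duplicate can be simulated without a broker. The sketch below uses an in-memory list as the partition log and a dict as the committed offset; `run_consumer`, `PodKilled`, and the `kill_before_commit_at` switch are illustrative assumptions, not a real client API.

```python
from typing import Callable, Dict, List, Optional


class PodKilled(Exception):
    """Simulates the consumer dying between the side effect and the commit."""


def run_consumer(
    log: List[str],
    state: Dict[str, int],
    handler: Callable[[str], None],
    kill_before_commit_at: Optional[int] = None,
) -> None:
    """Process-then-commit loop over an in-memory partition log."""
    while state["committed"] < len(log):
        offset = state["committed"]
        handler(log[offset])             # 1. process: the side effect happens
        if offset == kill_before_commit_at:
            raise PodKilled(offset)      # crash: the commit below never runs
        state["committed"] = offset + 1  # 2. commit only after success


if __name__ == "__main__":
    log = ["order-1", "order-2", "order-3"]
    state = {"committed": 0}
    charges: List[str] = []
    try:
        run_consumer(log, state, charges.append, kill_before_commit_at=1)
    except PodKilled:
        pass
    # A replacement consumer resumes from the last committed offset.
    run_consumer(log, state, charges.append)
    print(charges)  # order-2 appears twice: processed, then replayed
```

No message is lost, but the handler for `order-2` runs twice, so the handler itself has to make the second run harmless.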

Idempotency means processing the same event twice has the same effect as processing it once. The mechanism is a dedup key (the event id, or a business idempotency key) recorded in a dedup table with a unique constraint, inside the same transaction as the side effect. Re-delivery hits the constraint and is a no-op. For external calls, propagate the key downstream: Stripe’s Idempotency-Key header lets the API recognize a repeated request (keys are retained for at least 24 hours), so passing your event’s key into the charge call means a retry returns the original charge instead of creating a second one. That single header is the difference between the opening incident being a near-miss and a double charge.
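A dedup table under a unique constraint can be sketched like this, again with in-memory SQLite. The `processed_events` and `charges` tables and the `handle_order_placed` function are hypothetical names chosen for the example; the local `charges` insert stands in for whatever side effect the handler performs.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory DB with a dedup table and a table standing in for the side effect."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE charges (order_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def handle_order_placed(
    conn: sqlite3.Connection, event_id: str, order_id: str, amount_cents: int
) -> bool:
    """Apply the side effect once per event_id.

    Returns True if the event was applied, False if it was a duplicate.
    """
    try:
        # The dedup insert and the side effect share one transaction, so a
        # duplicate event_id rolls back before any second charge is recorded.
        with conn:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
    except sqlite3.IntegrityError:
        return False  # already processed: re-delivery is a no-op
    return True
```

When the side effect is an external call rather than a local row, the same `event_id` would be forwarded as the idempotency key so the remote side can deduplicate as well.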

Some failures are not transient — a malformed payload, a permanently-rejected card. Retrying those forever blocks the partition (head-of-line blocking) and lets lag grow without bound. So you give each message a retry budget with exponential backoff for transient errors, and on exhaustion route it to a dead-letter queue. Uber’s reliable-reprocessing design uses tiered retry topics feeding a DLQ precisely so a poison message is quarantined instead of stalling the live stream. The DLQ is not a graveyard; it is an inbox you must drain.
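A retry budget with exponential backoff and a dead-letter hand-off might be sketched as below. This is a single-process simplification, not Uber's tiered-topic design: the exception classes, `process_with_retry`, and the in-memory `dead_letters` list are assumptions made for illustration.

```python
import time
from typing import Callable, List, Tuple


class TransientError(Exception):
    """A failure worth retrying, such as a timeout."""


class PermanentError(Exception):
    """A failure that retrying cannot fix, such as a malformed payload."""


def process_with_retry(
    message: str,
    handler: Callable[[str], None],
    dead_letters: List[Tuple[str, str]],
    max_attempts: int = 4,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run handler within a retry budget; quarantine the message on exhaustion.

    Returns True if the handler succeeded, False if the message was dead-lettered.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except PermanentError as exc:
            # Poison message: do not spend the retry budget on it.
            dead_letters.append((message, repr(exc)))
            return False
        except TransientError as exc:
            if attempt == max_attempts:
                dead_letters.append((message, repr(exc)))
                return False
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    return False  # unreachable; kept so every path returns a bool
```

In a real pipeline the sleep would not block the partition; the message would usually be republished to a retry topic with a delay, which is the role the tiered retry topics play in the design the text mentions.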

Pick the best fit

The payment consumer occasionally double-charges on retry after a crash. Where do you put the fix?

Hop 6: the UI must tell the truth about lag

Because every hop is asynchronous, the moment the user clicks “Place order” the answer is not yet known. Inventory and notification may settle hundreds of milliseconds later; under load they settle when consumer lag drains. A senior UI does not lie about this. It applies an optimistic update so the order appears instantly, marks it pending until a confirmation event arrives, and reconciles to confirmed or failed from the real state. The eventual-consistency window is a product surface, not a defect to hide.
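The optimistic-then-pending behavior can be modeled as a small state holder. The sketch below is written in Python for consistency with the other examples rather than in a UI framework; the `OrderListView` class and the `order.confirmed` / `order.failed` event names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

TERMINAL = {"confirmed", "failed"}


@dataclass
class OrderListView:
    """Client-side view of orders: order_id -> "pending" | "confirmed" | "failed"."""

    rows: Dict[str, str] = field(default_factory=dict)

    def place(self, order_id: str) -> None:
        # Optimistic update: show the order immediately, but honestly as pending.
        self.rows.setdefault(order_id, "pending")

    def on_event(self, order_id: str, event_type: str) -> None:
        """Reconcile a row from a confirmation event delivered by the backend."""
        if self.rows.get(order_id) in TERMINAL:
            return  # replayed or late event: the row is already reconciled
        if event_type == "order.confirmed":
            self.rows[order_id] = "confirmed"
        elif event_type == "order.failed":
            self.rows[order_id] = "failed"
```

Treating terminal states as final keeps the view idempotent too: a replayed confirmation event leaves the row unchanged.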

This is also where observability earns its keep. Three metrics catch the silent killers. Consumer lag is the gap between the latest offset and the committed offset; Burrow watches its trend over a sliding window and flags OK / WARNING / ERR rather than reporting a raw number. DLQ depth matters because a rising DLQ means handlers are failing faster than you drain. End-to-end latency needs a correlation id stamped on the event at hop 1 and carried through every async hop, so one trace spans the whole journey. The double charge from the opening incident is invisible without that trace. Lag growing until the DLQ overflows is invisible without those two gauges and an alarm on each.
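The idea of judging lag by trend rather than by a raw number can be sketched as follows. The rule here is a deliberately simplified assumption and is not Burrow's actual evaluation algorithm; `consumer_lag` and `lag_status` are hypothetical helpers, and the status labels are borrowed from the text.

```python
from typing import Sequence


def consumer_lag(latest_offset: int, committed_offset: int) -> int:
    """Lag for one partition: how far the committed offset trails the log end."""
    return max(latest_offset - committed_offset, 0)


def lag_status(window: Sequence[int]) -> str:
    """Classify a sliding window of lag samples, oldest first.

    Simplified illustrative rule (an assumption, not Burrow's):
      OK      - the consumer caught up at some point, or lag did not grow
      ERR     - lag rose at every step across the window
      WARNING - lag grew overall but not monotonically
    """
    if len(window) < 2 or min(window) == 0 or window[-1] <= window[0]:
        return "OK"
    rising_every_step = all(b > a for a, b in zip(window, window[1:]))
    return "ERR" if rising_every_step else "WARNING"
```

A large but shrinking lag is healthy under this rule, while a small lag that rises on every sample is not, which is the point of watching the trend.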

Quiz

Your CDC relay restarts and re-emits a handful of outbox rows it had already published. Is the pipeline broken?

Quiz

Consumer lag has been climbing for an hour and the DLQ is filling. What does this combination most likely mean?

Order the steps

Order one order event's journey through the pipeline:

  1. Write the order row and an outbox row in one DB transaction
  2. CDC relay tails the WAL and publishes the committed row to Kafka
  3. Topic is partitioned by order id, preserving per-order ordering
  4. Consumer group processes it; idempotent handler dedups any replay
  5. On exhausted retries, the poison message goes to the dead-letter queue
  6. UI reconciles its optimistic pending row to confirmed or failed

Recall before you leave

  1. Walk through why idempotency, not exactly-once delivery, is the invariant that holds an event-driven order pipeline together.
  2. Two metrics together (rising consumer lag and a filling dead-letter queue) describe a specific failure. Explain it and where observability has to live.

Recap

An event-driven order pipeline is six hops, and you choose a guarantee at each: the order write is atomic because the row and its outbox event commit in one transaction; the CDC relay publishes at-least-once by resuming from its log offset; the topic preserves ordering per key by partitioning on order id, which is also why bumping partition count is a one-way door that can break key ordering; the consumer group delivers at-least-once by committing offsets only after processing; poison messages that exhaust a retry budget go to a dead-letter queue instead of blocking the partition; and the UI tells the truth about the eventual-consistency window with optimistic-then-pending state. The thread through all of it is idempotency — a dedup key under a unique constraint, plus an Idempotency-Key propagated to external calls — because every honest hop is at-least-once and duplicates are the contract, not the bug. Observability lives at the seams: a correlation id across every async hop, consumer lag watched by trend, and an alarm on DLQ depth, so the two silent killers — a missing idempotency key that double-charges, and lag that grows until the DLQ overflows — are caught before a customer is.


Trademarks belong to their respective owners. Editorial reference only.