Queues, Streams, Eventing QUE · 08 · 10

Queues capstone: build an order-processing pipeline

Capstone build — design and implement an end-to-end order-processing pipeline applying outbox, Kafka, CDC, dead-letter queues, idempotent consumers, and observability, then prove each guarantee with a failure drill.

QUE Senior ◷ 240 min

Reading about double charges, rehashed partitions, and disk-filling slots is not the same as building a pipeline that survives them. Stand up a real order-processing pipeline — outbox, Kafka, CDC, consumer groups, a DLQ, and an honest UI — then deliberately crash each hop and prove the system stays correct. This is the whole track, assembled and stress-tested.

Goal

Turn every lesson in the track into one running system and one failure drill: make the write atomic, ship events at-least-once, key for per-order ordering, dedupe at every consumer, quarantine poison messages, tell the UI the truth, and instrument the seams — then break each hop and show the invariant holds.
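The "make the write atomic" step can be sketched as follows. This is a minimal illustration, not the track's reference implementation: in-memory SQLite stands in for the order database, and the table layout, the `place_order` function, and the caller-supplied `event_id` parameter are all assumptions made for the example. The point it shows is that the order row and its outbox row commit or roll back together.

```python
import json
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for the order database (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (
            id TEXT PRIMARY KEY,
            customer_id TEXT NOT NULL,
            total_cents INTEGER NOT NULL,
            status TEXT NOT NULL
        );
        CREATE TABLE outbox (
            event_id TEXT PRIMARY KEY,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        """
    )
    return conn


def place_order(conn: sqlite3.Connection, customer_id: str,
                total_cents: int, event_id: str) -> str:
    """Insert the order and its outbox event in one transaction."""
    order_id = str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so a failure on the outbox insert also undoes the order insert.
    with conn:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?, 'pending')",
            (order_id, customer_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, 'OrderPlaced', ?)",
            (event_id, order_id, payload),
        )
    return order_id
```

Reusing an `event_id` is one cheap way to force the second insert to fail in a drill: the call raises `sqlite3.IntegrityError` and the order row from the same call is rolled back with it. The relay that later reads the outbox is not shown here.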

Project

Objective

Design and build an end-to-end order-processing pipeline that places an order, fans it out to payment / inventory / notification consumers, survives a crash at every hop without double-charging or losing a write, quarantines poison messages, and shows the customer an honest pending-then-confirmed UI — proving each guarantee with a deliberate failure drill, not just a happy-path demo.
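The "survives a crash without double-charging" guarantee can be rehearsed in memory before wiring up a real broker. The sketch below is a simulation under stated assumptions: `Gateway`, `PaymentConsumer`, and the `order-<id>` idempotency key format are hypothetical names, the list `log` stands in for one Kafka partition, and `committed_offset` stands in for the offset the broker stores for the consumer group.

```python
from dataclasses import dataclass, field


class Crash(Exception):
    """Simulates the consumer process dying mid-loop."""


@dataclass
class Gateway:
    # Hypothetical payment gateway that honours an idempotency key.
    request_log: list = field(default_factory=list)
    charges: dict = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount_cents: int) -> None:
        self.request_log.append(idempotency_key)
        # A repeated key is a no-op: the original charge stands.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents


@dataclass
class PaymentConsumer:
    log: list                  # stand-in for one partition
    gateway: Gateway
    committed_offset: int = 0  # stand-in for the broker-side committed offset

    def run(self, crash_before_commit_at=None) -> None:
        offset = self.committed_offset
        while offset < len(self.log):
            event = self.log[offset]
            self.gateway.charge(f"order-{event['order_id']}", event["total_cents"])
            if crash_before_commit_at == offset:
                raise Crash(f"died after charging, before committing offset {offset}")
            offset += 1
            self.committed_offset = offset  # commit only after processing


log = [
    {"order_id": "o1", "total_cents": 1000},
    {"order_id": "o2", "total_cents": 2000},
    {"order_id": "o3", "total_cents": 3000},
]
gateway = Gateway()
consumer = PaymentConsumer(log, gateway)
try:
    consumer.run(crash_before_commit_at=1)
except Crash:
    pass
consumer.run()  # restart: o2 is redelivered because its offset was never committed
print(gateway.request_log)   # four requests, o2 appears twice
print(len(gateway.charges))  # three charges, one per order
```

The request log is the evidence: the redelivered order reaches the gateway twice, yet only one charge exists for it. In the real pipeline the dedupe would live in the gateway or in a processed-events table, not in a Python dict.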

Requirements

Acceptance criteria

A diagram of the six hops (write, publish, route, process, quarantine, display) labelling the guarantee and the defence at each, matching what your code actually does.
A crash-before-commit drill on the payment consumer (kill it after the charge succeeds but before the offset is committed) produces exactly one charge for one order, demonstrated with the gateway's request log rather than asserted.
A poison message exhausts its retry budget and lands in the DLQ while the rest of the partition keeps flowing; consumer lag on the live topic stays bounded.
Stalling the CDC consumer triggers the slot-lag alarm before the primary's disk is endangered, and PostgreSQL's max_slot_wal_keep_size is set so a runaway replication slot is invalidated rather than filling the disk.
The UI shows an order as pending immediately, never a premature success, and reconciles to the real state; a refetch during the consistency window does not erase the user's own order.
A one-page write-up: where each delivery guarantee lives, why idempotency is the load-bearing invariant, and which failures your observability would catch in production.
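The poison-message criterion above can be sketched as a retry budget that feeds a dead-letter list. This is an illustrative simulation only: `consume`, `handle`, and `MAX_ATTEMPTS` are hypothetical names, the partition is a plain list, and retries happen in place with no backoff, which a real consumer would likely replace with delayed retries or a retry topic.

```python
MAX_ATTEMPTS = 3  # hypothetical retry budget


def handle(message: dict) -> None:
    # Hypothetical handler: a message without an order_id can never succeed.
    if "order_id" not in message:
        raise ValueError("malformed message: missing order_id")


def consume(partition: list, handler, max_attempts: int = MAX_ATTEMPTS):
    """Process a partition in order, quarantining messages that exhaust the budget."""
    processed, dlq = [], []
    for offset, message in enumerate(partition):
        last_error = ""
        for _attempt in range(max_attempts):
            try:
                handler(message)
                processed.append(offset)
                break
            except Exception as exc:  # broad on purpose: any failure counts
                last_error = repr(exc)
        else:
            # Budget exhausted: park the message with its context and move on,
            # so the offsets behind it are not blocked.
            dlq.append({
                "offset": offset,
                "message": message,
                "attempts": max_attempts,
                "error": last_error,
            })
    return processed, dlq


partition = [{"order_id": "o1"}, {"garbage": True}, {"order_id": "o3"}]
processed, dlq = consume(partition, handle)
print(processed)  # offsets 0 and 2 were handled
print(dlq)        # offset 1 is quarantined with its error and attempt count
```

Recording the offset, attempt count, and last error alongside the message is what makes the DLQ replayable later; a bare copy of the payload is much harder to triage.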

Senior stretch

Add a customer-facing reconciliation job that scans for orders stuck in pending past a deadline and resolves them from authoritative state — closing the infinite-spinner / lost-confirmation gap.
Add exactly-once-to-Kafka on the CDC side (Kafka transactions) and prove it still requires idempotent consumers end-to-end by replaying from the DLQ.
Introduce a hot-key skew (one whale customer) and show it pins one partition; mitigate at the key level (salt or split the hot entity) rather than adding partitions.
Add a second relay implementation and an A/B drill: run polling and CDC side by side, compare end-to-end latency and DB load under the same order rate, and record when each is the right choice.
Add a runbook: triage from the four gauges (lag, DLQ depth, e2e latency, slot lag), the common causes for each, the fix-priority order, and a verification checklist for after a fix.
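For the hot-key stretch item, the skew and a key-level mitigation can be explored offline. The sketch below assumes the topic is keyed by customer ID for the purpose of the drill. `partition_for` is a stand-in built on SHA-256, not Kafka's actual partitioner (the default Java producer hashes keys with murmur2), and `NUM_PARTITIONS`, `SALT_BUCKETS`, and the `customer#bucket` key format are hypothetical choices.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 6  # hypothetical topic size
SALT_BUCKETS = 4    # hypothetical number of sub-keys for a hot customer


def stable_hash(text: str) -> int:
    # Any hash that is stable across processes shows the effect.
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def partition_for(key: str) -> int:
    # Stand-in for the producer's partitioner (an assumption, see lead-in).
    return stable_hash(key) % NUM_PARTITIONS


def customer_key(order: dict) -> str:
    return order["customer_id"]


def salted_key(order: dict) -> str:
    # The salt is derived from the order ID, so every event for one order
    # gets the same key and per-order ordering is preserved.
    bucket = stable_hash(order["order_id"]) % SALT_BUCKETS
    return f"{order['customer_id']}#{bucket}"


def partition_load(orders: list, key_fn) -> Counter:
    return Counter(partition_for(key_fn(order)) for order in orders)


whale_orders = [{"customer_id": "whale", "order_id": f"order-{i}"} for i in range(200)]
plain = partition_load(whale_orders, customer_key)
salted = partition_load(whale_orders, salted_key)
print("keyed by customer:", dict(plain))
print("salted by order:  ", dict(salted))
```

Keyed by customer, every whale order lands on a single partition regardless of how many partitions exist, which is why adding partitions does not help. The salted key splits the whale across at most `SALT_BUCKETS` sub-keys; the trade-off is that ordering across different orders of the same customer is no longer guaranteed.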

Recap

This is the system you will actually design in a review: an atomic outbox write, an at-least-once relay (polling, then CDC) you alert on and cap, topics keyed for per-order ordering, idempotent consumers that commit after processing, a retry budget feeding a DLQ for poison messages, an honest pending-then-confirmed UI, and observability stamped across every seam. The failure drill is the point — anyone can demo the happy path, but a senior proves the invariant by crashing each hop and showing one order still produces one charge, one poison message is quarantined not amplified, and one stalled slot pages before it fills the disk. Build it once end to end and the production version becomes a checklist instead of an incident.
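The honest pending-then-confirmed UI comes down to a merge rule on refetch. A minimal sketch, assuming order IDs are generated on the client so a locally submitted order can be matched against the server's list; `merge_orders` and the dict shape are hypothetical:

```python
def merge_orders(server_orders: list, local_pending: list) -> list:
    """Merge a refetched order list with orders this client has submitted.

    Server state wins once the read model knows an order. Until then the
    client's own order stays visible as pending instead of vanishing.
    """
    by_id = {order["id"]: dict(order) for order in server_orders}
    for order in local_pending:
        if order["id"] not in by_id:
            by_id[order["id"]] = {**order, "status": "pending"}
    return list(by_id.values())


local_pending = [{"id": "o2", "total_cents": 2000}]

# Refetch during the consistency window: the read model has not caught up.
during = merge_orders([{"id": "o1", "status": "confirmed"}], local_pending)

# Refetch after the pipeline has processed the order.
after = merge_orders(
    [{"id": "o1", "status": "confirmed"}, {"id": "o2", "status": "confirmed"}],
    local_pending,
)
print(during)
print(after)
```

The client never promotes its own order to confirmed; only the server's state can do that. Orders that stay pending past a deadline are the case the reconciliation job in the stretch goals is meant to resolve.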
