Queues, Streams, Eventing
Delivery guarantees: build a crash-proof payment consumer
Reading about double-charges is not the same as stopping one. Build a small payment consumer on an at-least-once queue, drive it into every failure leg with injected crashes and timeout misconfigurations, and harden it until the duplicate count is exactly zero, with evidence at every step.
Turn the unit’s mental model into a reproducible engineering loop: reproduce duplicates and loss on purpose, add an INSERT-first transactional dedup plus an Idempotency-Key, close the producer-side dual-write with an outbox, and prove effectively-once under chaos with before/after numbers.
Build a payment-processing consumer on an at-least-once queue (SQS, RabbitMQ, or Kafka) that achieves effectively-once processing: zero double-charges and zero lost events under injected consumer crashes, visibility-timeout expiry, and producer-side failures. Prove it with measured duplicate and loss counts, not assertions.
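The INSERT-first transactional dedup can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes SQLite (via Python's standard `sqlite3` module) in place of Postgres, and the table names `processed_messages` and `charges` and the function `handle` are hypothetical. The point it shows is ordering: the dedup row is inserted first, inside the same transaction as the business write, so a redelivered message fails on the key constraint, the transaction rolls back, and the caller acks without charging again.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # BEGIN / COMMIT / ROLLBACK statements below are fully explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE charges (order_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    return conn


def handle(conn: sqlite3.Connection, message_id: str, order_id: str,
           amount_cents: int) -> str:
    """Process one delivery. Returns 'processed' or 'duplicate'.

    The caller should ack the message in both cases; an exception
    means "do not ack", which lets the broker redeliver.
    """
    conn.execute("BEGIN")
    try:
        # INSERT first: the PRIMARY KEY rejects a redelivered message_id
        # before any business row is written.
        conn.execute(
            "INSERT INTO processed_messages (message_id) VALUES (?)", (message_id,)
        )
        conn.execute(
            "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
            (order_id, amount_cents),
        )
        conn.execute("COMMIT")
        return "processed"
    except sqlite3.IntegrityError:
        # UNIQUE violation -> rollback -> ack: the redelivery is dropped.
        conn.execute("ROLLBACK")
        return "duplicate"
    except Exception:
        conn.execute("ROLLBACK")
        raise


if __name__ == "__main__":
    db = make_db()
    print(handle(db, "msg-1", "order-1", 500))  # first delivery
    print(handle(db, "msg-1", "order-1", 500))  # simulated redelivery
```

Counting the `'duplicate'` results is one simple way to feed a dedup_hit_rate metric. On Postgres the same shape applies, with the unique constraint on the message id doing the rejection.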
- A before/after table of double-charge count and lost-event count for the naive version vs the hardened version, both measured under an identical chaos run (kill the consumer mid-processing, expire the visibility timeout, drop a publish). The hardened version must show zero of both.
- A demonstration that killing the consumer AFTER the DB commit but BEFORE the ack causes a redelivery that is silently deduplicated (UNIQUE violation -> rollback -> ack), with the dedup_hit_rate metric registering the catch.
- A demonstration that a dropped/failed publish on the producer side does NOT lose the event, because the outbox row stays pending and the sender republishes it.
- A short write-up mapping each fix to the failure leg it closes (Leg 1 / Leg 3 / dual-write / timeout) and naming why consumer idempotency, not a broker setting, is the load-bearing guarantee.
- Add Kafka idempotent producer + transactions on the within-Kafka path and measure the throughput cost (~3% vs ~20-30%); then show the cross-system Postgres write STILL needs the consumer dedup, proving where the Kafka transaction boundary ends.
- Build a chaos harness that randomly kills the consumer at each step (before charge, after charge before commit, after commit before ack) on a loop for 1000 messages, and assert charges-per-order == 1 for every order at the end.
- Add a one-page on-call runbook: how to read a dedup_hit_rate spike, the SQS visibility-timeout rule, the DLQ redrive checklist (snapshot, sample-audit, rate-limit), and the duplicate-vs-loss decision tree.
- Swap the outbox poller for CDC (Debezium reading the WAL) and compare latency and operational load against the polling sender.
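For reference, the polling outbox sender that the deliverables and the CDC stretch goal build on might look like the sketch below. It is written under stated assumptions: SQLite stands in for Postgres, the `orders` and `outbox` tables and the functions `place_order` and `send_pending` are hypothetical names, and `publish` is any callable that raises on failure. The business row and the outbox row commit in one transaction, and a failed publish simply leaves the row pending for the next pass.

```python
import json
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.execute(
        "CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def place_order(conn: sqlite3.Connection, order_id: str, amount_cents: int) -> None:
    # One transaction covers the business row and the event row, so there
    # is no window where one exists without the other (no dual write).
    conn.execute("BEGIN")
    try:
        conn.execute(
            "INSERT INTO orders (order_id, amount_cents) VALUES (?, ?)",
            (order_id, amount_cents),
        )
        payload = json.dumps({"order_id": order_id, "amount_cents": amount_cents})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def send_pending(conn: sqlite3.Connection, publish: Callable[[str], None]) -> int:
    """Run one poller pass. Returns the number of rows marked 'sent'."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(payload)
        except Exception:
            continue  # publish failed: the row stays pending for the next pass
        # If the process dies between publish and this UPDATE, the row is
        # republished later. That duplicate is what the consumer dedup absorbs.
        conn.execute("UPDATE outbox SET status = 'sent' WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Note that the sender itself is at-least-once: it can publish a row twice but never zero times, which is why it pairs with consumer-side dedup rather than replacing it.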
This is the loop you will run in every real delivery-guarantee incident: reproduce the duplicate or loss on purpose, identify the failure leg, apply the structural fix (INSERT-first transactional dedup, stable Idempotency-Key, outbox for dual-write, timeout sized to processing), and verify with measured before/after counts under chaos, never with assertions. Build it once on a toy payment consumer and the production version becomes muscle memory: at-least-once delivery, effectively-once processing, correctness enforced in the consumer.
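The verify step of that loop can be sketched as a small in-process chaos harness. Everything here is an illustrative assumption: the broker is simulated by a redeliver-until-acked loop, `FakeGateway` is a hypothetical stand-in for a payment provider that honors an Idempotency-Key, SQLite replaces Postgres, and `Crash` models the consumer dying at one of the three injected points. The `stable_key` flag switches between a naive per-attempt key and a stable per-order key, which gives the before/after comparison.

```python
import random
import sqlite3
from collections import Counter


class Crash(Exception):
    """Stands in for the consumer process dying at an injected point."""


class FakeGateway:
    """Hypothetical payment provider that honors an Idempotency-Key."""

    def __init__(self):
        self.charged = {}  # idempotency_key -> order_id; one real charge per key

    def charge(self, idempotency_key, order_id):
        self.charged.setdefault(idempotency_key, order_id)


def maybe_crash(rng, p, point):
    if rng.random() < p:
        raise Crash(point)


def consume(conn, gateway, rng, p, msg, attempt, stable_key):
    message_id, order_id = msg
    conn.execute("BEGIN")
    try:
        conn.execute("INSERT INTO processed (message_id) VALUES (?)", (message_id,))
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # UNIQUE violation -> rollback -> ack
        return "duplicate"
    maybe_crash(rng, p, "before_charge")
    # A stable key depends only on the order; the naive key changes per attempt.
    key = f"charge:{order_id}" if stable_key else f"charge:{order_id}:{attempt}"
    gateway.charge(key, order_id)
    maybe_crash(rng, p, "after_charge_before_commit")
    conn.execute("INSERT INTO ledger (order_id) VALUES (?)", (order_id,))
    conn.execute("COMMIT")
    maybe_crash(rng, p, "after_commit_before_ack")
    return "processed"


def run_chaos(n=1000, p=0.2, seed=7, stable_key=True):
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE ledger (order_id TEXT NOT NULL)")
    gateway = FakeGateway()
    stats = {"crashes": 0, "dedup_hits": 0}
    for i in range(n):
        msg = (f"msg-{i}", f"order-{i}")
        attempt = 0
        while True:  # at-least-once: redeliver until the consumer acks
            attempt += 1
            try:
                if consume(conn, gateway, rng, p, msg, attempt, stable_key) == "duplicate":
                    stats["dedup_hits"] += 1
                break  # normal return counts as the ack
            except Crash:
                stats["crashes"] += 1
                if conn.in_transaction:
                    conn.execute("ROLLBACK")  # a dead process loses its open transaction
    charges_per_order = Counter(gateway.charged.values())
    ledger_rows = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
    return charges_per_order, ledger_rows, stats


if __name__ == "__main__":
    charges, ledger_rows, stats = run_chaos(stable_key=True)
    assert len(charges) == 1000, "lost events"
    assert all(count == 1 for count in charges.values()), "double charges"
    assert ledger_rows == 1000
    print(stats)
```

Running `run_chaos(stable_key=False)` with the same seed is one way to produce the naive column: the ledger still shows one row per order, but any crash after the charge and before the commit leaves extra charges at the gateway. A real harness would kill an actual consumer process against a real broker; this sketch only fixes the shape of the loop and the final assertion.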