
The three failure legs — where duplicates and losses actually happen

How producer-to-broker, broker durability, and broker-to-consumer failures each produce duplicates or loss, traced through a real double-charge incident.


An on-call engineer is staring at a graph showing 1,200 Stripe charges for 900 orders. The queue has run for six months without issue. Tonight it charged some customers twice. Nothing in the code changed. The failure mode was always there, waiting for a Lambda invocation slow enough to outlast the SQS visibility timeout.

The three failure legs

If you have ever wondered exactly where a duplicate slips through — not in theory, but at which network boundary — these are the three places to look. Knowing which leg produced a duplicate tells you which fix to apply.

A message travels across three distinct communication boundaries. Each boundary can fail in ways that produce either loss or duplicates.

Leg 1 — Producer to broker. The producer sends a message and waits for a broker ack. If the network drops the send, the message never arrives. If the network drops the ack, the producer never learns the message was stored — it retries, and the broker receives the message a second time. At-least-once producers retry on timeout; this is a feature, not a bug, but it means the broker can see the same logical message more than once.
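As a rough illustration of Leg 1, here is a minimal sketch of a producer retrying after a lost ack. The `Broker` class, its `drop_next_ack` switch, and `send_with_retry` are hypothetical names for a toy in-memory model, not any real broker's API. The second run assumes the broker deduplicates on a producer-supplied message id, which is one common way to absorb this leg.

```python
import uuid


class Broker:
    """Toy in-memory broker; `drop_next_ack` simulates an ack lost in transit."""

    def __init__(self, dedup=False):
        self.log = []           # every message the broker has stored
        self.seen_ids = set()   # consulted only when dedup is enabled
        self.dedup = dedup
        self.drop_next_ack = False

    def send(self, msg_id, body):
        # Store the message unless dedup is on and this id was already stored.
        if not (self.dedup and msg_id in self.seen_ids):
            self.log.append((msg_id, body))
            self.seen_ids.add(msg_id)
        if self.drop_next_ack:
            self.drop_next_ack = False
            raise TimeoutError("ack lost in transit")
        return "ack"


def send_with_retry(broker, body, attempts=3):
    msg_id = str(uuid.uuid4())  # the same id is reused across retries
    for _ in range(attempts):
        try:
            return broker.send(msg_id, body)
        except TimeoutError:
            continue  # the producer cannot tell a lost send from a lost ack
    raise RuntimeError("gave up after retries")


plain = Broker()
plain.drop_next_ack = True
send_with_retry(plain, "charge O-123")
print(len(plain.log))  # 2: the broker stored the same logical message twice

dedup = Broker(dedup=True)
dedup.drop_next_ack = True
send_with_retry(dedup, "charge O-123")
print(len(dedup.log))  # 1: the retry was recognised by its id
```

The key design choice is that the message id is generated once, outside the retry loop. An id generated per attempt would make every retry look like a new message.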

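Leg 2 — Broker durability. The broker acks the producer, but the message survives a broker crash only if it has been persisted and replicated first. If the leader crashes while followers are still lagging, the followers may not have the message, and it is lost even though the producer saw an ack. Proper configuration covers this leg (Kafka acks=all together with a suitable min.insync.replicas; SQS stores messages redundantly by default), but a misconfiguration silently drops messages here.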
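To make the Leg 2 failure concrete, the sketch below models a leader with one follower. It is a toy, not Kafka: the `Cluster` class and its methods are invented for illustration, and the `acks` values only mimic the spirit of the real producer setting (ack after the leader write versus ack after replication).

```python
class Cluster:
    """Toy leader plus one follower. Hypothetical model, not a Kafka client."""

    def __init__(self, acks):
        self.acks = acks      # "1" = ack after leader write, "all" = after replication
        self.leader = []
        self.follower = []

    def produce(self, msg):
        self.leader.append(msg)
        if self.acks == "all":
            self.follower.append(msg)  # replicate before acking the producer
        return "ack"

    def leader_crash(self):
        # The follower is promoted; anything it never received is gone.
        self.leader = list(self.follower)
        return self.leader


lagging = Cluster(acks="1")
lagging.produce("msg-7a3f")
print(lagging.leader_crash())  # []: the producer saw an ack, yet the message is lost

safe = Cluster(acks="all")
safe.produce("msg-7a3f")
print(safe.leader_crash())     # ['msg-7a3f']
```

In this model the loss is invisible to the producer, which matches the "silently drops" behaviour described above: the ack arrived, so nothing retries.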

Leg 3 — Broker to consumer. This is where most production duplicates happen. The broker delivers a message to the consumer. The consumer processes it (calls Stripe, updates the DB), then sends an ack. If the consumer crashes after processing but before acking, the broker has no record of success — it redelivers the message to the next consumer. The side effect runs twice.

Three failure legs

Producer → Broker: ack lost → producer retries → broker sees message twice

Broker durability: crash before replication → message lost (config failure)

Broker → Consumer: consumer processes, then crashes before ack → broker redelivers

The three legs fail differently. Leg 1 produces duplicates (or loss, if the producer does not retry) and Leg 2 produces loss, but both fail before any business logic executes, which makes them easier to absorb with producer and broker configuration. Leg 3 produces its duplicate after the side effect has already run, which is why it causes real-world double charges. The broker cannot tell a crashed consumer from one that finished but never acked, so only an idempotent consumer can catch a Leg 3 duplicate.

Legs 1 and 2 fail before business logic runs, so broker config absorbs them; Leg 3 fails after the side effect already ran, so only the consumer side can catch it.
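One way to picture the consumer-side catch is an idempotent handler backed by a dedup store. The sketch below is a minimal version under stated assumptions: `FakeStripe` is a stand-in that only counts charges (it is not the Stripe SDK), `handle` and the message fields are hypothetical, and a plain dict plays the role of the dedup store.

```python
class FakeStripe:
    """Stand-in payment API that records charges. Hypothetical, not the Stripe SDK."""

    def __init__(self):
        self.charges = []

    def charge(self, order_id, amount_cents):
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, order_id, amount_cents))
        return charge_id


def handle(message, stripe, dedup_store):
    """Run the side effect at most once per message id."""
    msg_id = message["id"]
    if msg_id in dedup_store:
        return dedup_store[msg_id]      # redelivery: reuse the earlier result
    charge_id = stripe.charge(message["order_id"], message["amount_cents"])
    dedup_store[msg_id] = charge_id     # record only after the charge succeeded
    return charge_id


stripe = FakeStripe()
store = {}
msg = {"id": "msg-7a3f", "order_id": "O-123", "amount_cents": 5000}

first = handle(msg, stripe, store)
second = handle(msg, stripe, store)     # simulated redelivery of the same message
print(first == second, len(stripe.charges))  # True 1
```

A dict is enough to show the idea, but it is not safe under concurrency or crashes. A production store would likely need an atomic insert (for example a unique constraint on the message id), and a crash between the charge and the store write still leaves a gap, which is the reason payment APIs such as Stripe accept an idempotency key on the request itself.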

Leg 3 is worth tracing one level deeper, because the order of process and ack decides whether a fault becomes a duplicate or a loss. The diagram below follows the consume→process→ack lifecycle and marks where each failure strikes.

Happy path runs left to right. Crash after processing but before acking, or a visibility timeout firing mid-process, both redeliver → duplicate. Acking before processing (at-most-once ordering) then crashing loses the work entirely.
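The effect of ordering can be checked with a small simulation. This is a sketch under simple assumptions: one message, at most one redelivery, and a crash that can only happen between the two steps of the first attempt. The function name `run` and its parameters are made up for illustration.

```python
def run(order, crash_between):
    """Simulate one delivery plus a possible broker redelivery.

    order: "process_then_ack" (at-least-once) or "ack_then_process" (at-most-once).
    crash_between: crash the consumer between the two steps on the first attempt.
    Returns how many times the side effect ran.
    """
    side_effects = 0
    acked = False
    attempt = 0
    while not acked and attempt < 2:    # the broker redelivers unacked messages
        attempt += 1
        crash = crash_between and attempt == 1
        if order == "process_then_ack":
            side_effects += 1
            if crash:
                continue                # died before ack: the message comes back
            acked = True
        else:
            acked = True
            if crash:
                continue                # died after ack: the work never happens
            side_effects += 1
    return side_effects


print(run("process_then_ack", crash_between=True))   # 2: duplicate
print(run("ack_then_process", crash_between=True))   # 0: loss
print(run("process_then_ack", crash_between=False))  # 1: happy path
```

The same crash produces a duplicate or a loss depending only on which step came first, which is the trade-off between the at-least-once and at-most-once contracts.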

Tracing the double-charge incident

Here is the SQS visibility-timeout scenario behind the incident, traced second by second.

T=0s: Lambda receives msg-7a3f (“charge $50 for order O-123”). SQS hides the message from other consumers for the visibility timeout (default 30s).
T=0.5s: Lambda calls Stripe. Stripe charges the card. charge_id=ch_abc123 returned.
T=29s: Lambda is still running (slow downstream DB write). It has not yet called DeleteMessage.
T=30s: SQS visibility timeout fires. msg-7a3f becomes visible to all consumers again; SQS has no way of knowing the first Lambda already charged the card.
T=30.1s: A second Lambda picks up msg-7a3f. Calls Stripe. Second charge: ch_xyz789.
T=31s: First Lambda finally calls DeleteMessage. Message gone. But the damage is done: customer charged $100 for a $50 order.

The failure is not a bug. It is the defined behaviour of at-least-once delivery when visibility timeout is shorter than processing time.
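The relationship between visibility timeout and processing time can be explored with a toy timeline model. It assumes, as in the trace above, that each consumer charges 0.5s after receiving the message, that the first consumer's delete removes the message, and that a new consumer picks the message up the instant it becomes visible. The function `simulate` and its parameters are hypothetical.

```python
def simulate(visibility_timeout, processing_time, charge_at=0.5):
    """Return the times (in seconds) at which a charge is made for one message.

    Each consumer charges `charge_at` seconds after receiving the message.
    The first consumer deletes the message `processing_time` seconds after
    receiving it. Until then, the message reappears every
    `visibility_timeout` seconds and is picked up by another consumer.
    """
    charges = []
    receive_time = 0.0
    first_delete = float(processing_time)
    while True:
        charges.append(receive_time + charge_at)
        next_visible = receive_time + visibility_timeout
        if next_visible >= first_delete:
            break                       # deleted before it could reappear
        receive_time = next_visible
    return charges


print(simulate(visibility_timeout=30, processing_time=31))   # [0.5, 30.5]
print(simulate(visibility_timeout=180, processing_time=31))  # [0.5]
print(simulate(visibility_timeout=10, processing_time=30))   # [0.5, 10.5, 20.5]
```

In this model the number of charges depends only on how many times the timeout fits inside the processing time, so a longer timeout removes the concurrent duplicate but does nothing for a consumer that crashes after charging. That case still needs the idempotent consumer.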

Quiz

A consumer reads a message, processes it successfully, but crashes before acking. What does at-least-once delivery do?

Quiz

An SQS consumer's visibility timeout is 10s but average processing time is 30s. What is the failure mode?

Order the steps

Order the events in the double-charge incident trace:

1 Lambda receives msg-7a3f; SQS sets visibility timeout 30s
2 Lambda calls Stripe successfully — charge ch_abc123 created
3 Visibility timeout expires; msg-7a3f becomes visible to all consumers
4 Second Lambda receives msg-7a3f; calls Stripe again — charge ch_xyz789 created
5 First Lambda finally calls DeleteMessage — message removed, but two charges exist

Recall before you leave

01
What happens at Leg 3 when a consumer crashes after processing but before acking?
02
What is the SQS visibility timeout rule of thumb and why?
03
Why does losing the ack at Leg 1 cause duplicates, not loss?

Recap

Every message queue has three failure legs: producer-to-broker (a lost ack causes a producer retry), broker durability (replication misconfiguration causes loss), and broker-to-consumer (processing succeeds but the ack never arrives, causing redelivery). The most common production duplicate is at Leg 3: a consumer processes a Stripe charge but has not yet deleted the SQS message, because it crashed or is still running, when the visibility timeout fires; a second consumer then picks up the same message and charges again. The structural fix is an idempotent consumer with a dedup store, plus a visibility timeout comfortably longer than the worst-case processing time (for Lambda consumers, AWS's guidance is at least six times the function timeout), which keeps a second consumer from picking up the message mid-flight. Now when you see a double charge or a duplicate event in your data, you can immediately ask: was the ack lost after a successful process, or did the visibility timeout fire mid-flight? That question points you to the right leg and the right fix.



