
Queues, Streams, Eventing

The three failure legs — where duplicates and losses actually happen

Crux: How producer-to-broker, broker durability, and broker-to-consumer failures each produce duplicates or loss, traced through a real double-charge incident.

An on-call engineer is staring at a graph showing 1,200 Stripe charges for 900 orders. The queue has been running for six months without issue. Tonight it charged 300 orders twice. Nothing in the code changed. The failure was always there, waiting for the right combination of a slow Lambda invocation and an SQS visibility timeout shorter than the processing time.

The three failure legs

A message travels across three distinct communication boundaries. Each boundary can fail in ways that produce either loss or duplicates.

Leg 1 — Producer to broker. The producer sends a message and waits for a broker ack. If the network drops the send, the message never arrives. If the network drops the ack, the producer never learns the message was stored — it retries, and the broker receives the message a second time. At-least-once producers retry on timeout; this is a feature, not a bug, but it means the broker can see the same logical message more than once.
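The lost-ack case can be sketched with a toy in-memory broker. `Broker`, `AckLost`, and `send_with_retry` are hypothetical names invented for this illustration, not a real client API. The first ack is dropped on purpose, so the producer retries and the broker stores the same logical message twice.

```python
class AckLost(Exception):
    """The broker stored the message, but the ack never reached the producer."""


class Broker:
    """Toy in-memory broker (hypothetical, for illustration only)."""

    def __init__(self, drop_first_n_acks=0):
        self.log = []                      # every message the broker has stored
        self._acks_to_drop = drop_first_n_acks

    def send(self, message):
        self.log.append(message)           # the write itself succeeds
        if self._acks_to_drop > 0:
            self._acks_to_drop -= 1
            raise AckLost(message)         # the producer sees only a timeout


def send_with_retry(broker, message, max_attempts=3):
    """At-least-once producer: keep sending until an ack arrives."""
    for attempt in range(1, max_attempts + 1):
        try:
            broker.send(message)
            return attempt                 # number of attempts it took
        except AckLost:
            continue                       # cannot tell a lost send from a lost ack
    raise RuntimeError("gave up after %d attempts" % max_attempts)


broker = Broker(drop_first_n_acks=1)
attempts = send_with_retry(broker, "charge order O-123")
print(attempts, broker.log)
```

From the producer's side a lost send and a lost ack look identical, which is why retrying is the only safe choice and why the duplicate has to be handled downstream.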

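Leg 2 — Broker durability. Once the broker acks the producer, the message must survive a broker crash. If the leader acks before its followers have replicated the message and then crashes, the followers never receive it, and the message is lost even though the producer was told it was stored. Proper configuration covers this leg (Kafka acks=all, ideally with min.insync.replicas set above 1; SQS stores messages redundantly by default), but a misconfigured broker silently drops messages here.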
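A minimal sketch of why the ack mode matters, using a toy leader/follower model. `Cluster` and `Replica` are hypothetical and greatly simplify what a real broker such as Kafka does; the only point is the ordering of "ack" versus "replicate".

```python
class Replica:
    def __init__(self):
        self.log = []


class Cluster:
    """Toy leader/follower cluster (hypothetical, not Kafka's real protocol)."""

    def __init__(self, acks):
        self.acks = acks                   # "leader" or "all"
        self.leader = Replica()
        self.followers = [Replica(), Replica()]

    def produce(self, message):
        self.leader.log.append(message)
        if self.acks == "all":
            self.replicate()               # wait for followers before acking
        return "ack"

    def replicate(self):
        for follower in self.followers:
            follower.log = list(self.leader.log)

    def crash_leader(self):
        """The leader dies; the first follower is promoted with whatever it has."""
        self.leader = self.followers.pop(0)


def surviving_messages(acks):
    cluster = Cluster(acks)
    cluster.produce("order O-123")         # the producer gets an ack either way
    cluster.crash_leader()                 # crash before background replication runs
    return cluster.leader.log


print(surviving_messages("leader"))
print(surviving_messages("all"))
```

In both runs the producer received an ack. Only in the "all" run does the acked message survive the crash.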

Leg 3 — Broker to consumer. This is where most production duplicates happen. The broker delivers a message to the consumer. The consumer processes it (calls Stripe, updates the DB), then sends an ack. If the consumer crashes after processing but before acking, the broker has no record of success — it redelivers the message to the next consumer. The side effect runs twice.
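The process-then-ack gap can be sketched as follows. `Queue`, `Crash`, and `consume_once` are hypothetical stand-ins, and the `charges` list plays the role of the external side effect (the Stripe call).

```python
from collections import deque


class Crash(Exception):
    """Simulates the consumer process dying."""


class Queue:
    """Toy at-least-once queue: a message stays until it is acked."""

    def __init__(self, messages):
        self.pending = deque(messages)

    def receive(self):
        return self.pending[0] if self.pending else None

    def ack(self, message):
        self.pending.remove(message)


charges = []                               # stands in for the external side effect


def consume_once(queue, crash_before_ack=False):
    message = queue.receive()
    if message is None:
        return False
    charges.append(message)                # side effect happens first
    if crash_before_ack:
        raise Crash("died after charging, before ack")
    queue.ack(message)                     # only now does the broker learn of success
    return True


queue = Queue(["charge O-123"])
try:
    consume_once(queue, crash_before_ack=True)   # first consumer crashes
except Crash:
    pass
consume_once(queue)                        # broker redelivers to the next consumer
print(charges)
```

The broker behaves correctly in both deliveries. The duplicate exists only because the side effect and the ack are two separate steps with a crash window between them.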

Three failure legs
  Producer → Broker: ack lost → producer retries → broker sees message twice
  Broker durability: crash before replication → message lost (configuration failure)
  Broker → Consumer: consumer processes, then crashes before ack → broker redelivers

Tracing the double-charge incident

Here is the exact SQS visibility-timeout scenario that trips engineers every six months.

  1. T=0s: Lambda receives msg-7a3f (“charge $50 for order O-123”). SQS hides the message from other consumers for the visibility timeout (default 30s).
  2. T=0.5s: Lambda calls Stripe. Stripe charges the card. charge_id=ch_abc123 returned.
  3. T=29s: Lambda is still running (slow downstream DB write). It has not yet called DeleteMessage.
  4. T=30s: SQS visibility timeout fires. msg-7a3f becomes visible to all consumers again. SQS has no way of knowing that the first Lambda already charged the card.
  5. T=30.1s: A second Lambda picks up msg-7a3f. Calls Stripe. Second charge: ch_xyz789.
  6. T=31s: First Lambda finally calls DeleteMessage. Message gone. But the damage is done: customer charged $100 for a $50 order.

The failure is not a bug. It is the defined behaviour of at-least-once delivery when visibility timeout is shorter than processing time.
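The timeline above can be replayed with a small discrete-time simulation. `VisibilityQueue` and `count_charges` are hypothetical and model only the visibility-timeout rule, not the real SQS API. Time advances in whole seconds, and the charge is assumed to happen at the start of processing.

```python
class VisibilityQueue:
    """Toy model of a visibility-timeout queue (hypothetical, not the SQS API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.visible_at = {}               # message id -> time it becomes visible

    def put(self, message_id):
        self.visible_at[message_id] = 0

    def receive(self, now):
        for message_id, visible_at in self.visible_at.items():
            if now >= visible_at:
                # Hide the message from other consumers for the timeout.
                self.visible_at[message_id] = now + self.visibility_timeout
                return message_id
        return None

    def delete(self, message_id):
        self.visible_at.pop(message_id, None)


def count_charges(visibility_timeout, processing_time, horizon=120):
    """Count how many times one message is charged within `horizon` seconds."""
    queue = VisibilityQueue(visibility_timeout)
    queue.put("msg-7a3f")
    charges = 0
    finish_times = []                      # when each in-flight worker will delete
    for now in range(horizon):
        # A worker that has finished processing deletes the message.
        if any(t <= now for t in finish_times):
            queue.delete("msg-7a3f")
        # An idle worker polls; the charge happens as processing starts.
        if queue.receive(now) is not None:
            charges += 1
            finish_times.append(now + processing_time)
    return charges


print(count_charges(visibility_timeout=30, processing_time=31))
print(count_charges(visibility_timeout=30, processing_time=5))
print(count_charges(visibility_timeout=10, processing_time=30))
```

Under these assumptions, the 30s timeout with 31s of processing reproduces the double charge, 5s of processing yields a single charge, and a 10s timeout with 30s of processing yields three charges before the first delete lands.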

Quiz

A consumer reads a message, processes it successfully, but crashes before acking. What does at-least-once delivery do?

Quiz

An SQS consumer's visibility timeout is 10s but average processing time is 30s. What is the failure mode?

Order the steps

Order the events in the double-charge incident trace:

  1. Lambda receives msg-7a3f; SQS sets visibility timeout 30s
  2. Lambda calls Stripe successfully; charge ch_abc123 created
  3. Visibility timeout expires; msg-7a3f becomes visible to all consumers
  4. Second Lambda receives msg-7a3f; calls Stripe again; charge ch_xyz789 created
  5. First Lambda finally calls DeleteMessage; message removed, but two charges exist

Recall before you leave

  1. What happens at Leg 3 when a consumer crashes after processing but before acking?
  2. What is the SQS visibility timeout rule of thumb and why?
  3. Why does losing the ack at Leg 1 cause duplicates, not loss?

Recap

Every message queue has three failure legs: producer-to-broker (a lost ack causes a producer retry), broker durability (replication misconfiguration causes loss), and broker-to-consumer (processing succeeds but the ack never arrives, so the broker redelivers). The most common production duplicate is at Leg 3: a consumer processes a Stripe charge but has not acked SQS when the visibility timeout fires, either because it crashed or because it is still running, and a second consumer picks up the same message and charges again. The structural fix is an idempotent consumer with a dedup store, plus a visibility timeout comfortably longer than worst-case processing time to prevent concurrent duplicate processing. For Lambda consumers, AWS recommends a visibility timeout of at least six times the function timeout.
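A minimal sketch of an idempotent consumer with a dedup store, under stated assumptions: `DedupStore`, `FakePaymentApi`, and `handle` are hypothetical names, and a dict stands in for what would need to be a durable table with a unique key on the message id.

```python
class DedupStore:
    """Hypothetical dedup store keyed by message id (a dict stands in for a table)."""

    def __init__(self):
        self._charge_by_message = {}

    def lookup(self, message_id):
        return self._charge_by_message.get(message_id)

    def record(self, message_id, charge_id):
        self._charge_by_message[message_id] = charge_id


class FakePaymentApi:
    """Stand-in for the payment provider; remembers every charge it made."""

    def __init__(self):
        self.charges = []

    def charge(self, order_id, amount_cents):
        charge_id = "ch_%03d" % (len(self.charges) + 1)
        self.charges.append((charge_id, order_id, amount_cents))
        return charge_id


def handle(message, store, payments):
    """Run the side effect at most once per message id."""
    already = store.lookup(message["id"])
    if already is not None:
        return already                     # redelivery: skip the side effect
    charge_id = payments.charge(message["order_id"], message["amount_cents"])
    store.record(message["id"], charge_id)
    return charge_id


store, payments = DedupStore(), FakePaymentApi()
message = {"id": "msg-7a3f", "order_id": "O-123", "amount_cents": 5000}
first = handle(message, store, payments)
second = handle(message, store, payments)  # simulated redelivery
print(first, second, len(payments.charges))
```

The lookup and the record are not atomic in this sketch, so two consumers handling the same message at the same moment could both pass the check. That is why the longer visibility timeout still matters, and why a real implementation would likely claim the key atomically or pass an idempotency key to the payment provider.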
