Queues, Streams, Eventing

SQS visibility timeout, DLQ, and the outbox pattern

Crux: How SQS visibility timeout works per-message, the heartbeat pattern for long-running consumers, DLQ redrive discipline, and the outbox pattern for reliable producer-side publishing.

A DLQ silently accumulated 50,000 messages over three days while the on-call team watched a different dashboard. All of them were one-off edge-case orders that needed human review. When they finally ran redrive-all without a rate limit, the source queue received 50,000 messages in one second and the downstream service went down.

SQS visibility timeout mechanics

SQS does not have an acknowledge/reject RPC. Instead, it uses a visibility timeout. When a consumer receives a message, SQS records:

  • A per-message visibility deadline: the receive time plus the visibility timeout
  • The consumer’s receipt handle (a unique token for this delivery)

For the duration of the visibility timeout, no other consumer can receive the message. If the consumer calls DeleteMessage with the matching receipt handle before the timeout, the message is gone. If the timeout fires first, the message becomes visible to all consumers again — at-least-once delivery.

Key pitfall: visibility timeout resets to the queue default on each redelivery, not to whatever value you set via ChangeMessageVisibility on the previous delivery. If your normal path extends to 90s but the queue default is 30s, every redelivered message starts with 30s and may expire again before your slow consumer finishes.

Rule of thumb: queue default visibility = 6x average processing time.
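The mechanics above can be sketched with a toy in-memory model. This is an illustration only, not the real service: `MiniQueue` and its methods are hypothetical names, time is passed in explicitly as `now`, and a stale receipt handle is simply rejected, which is a simplification of how SQS treats old handles.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class _Msg:
    body: str
    visible_at: float = 0.0               # hidden from receivers until this time
    receipt_handle: Optional[str] = None  # token for the current delivery only
    receive_count: int = 0


class MiniQueue:
    """Toy model of SQS visibility timeout (hypothetical, for illustration)."""

    def __init__(self, default_visibility: float):
        self.default_visibility = default_visibility
        self._msgs: List[_Msg] = []
        self._handles = itertools.count(1)

    def send(self, body: str) -> None:
        self._msgs.append(_Msg(body))

    def receive(self, now: float) -> Optional[Tuple[str, str]]:
        for m in self._msgs:
            if m.visible_at <= now:
                # Every delivery starts from the queue default, whatever
                # extension the previous delivery was given.
                m.visible_at = now + self.default_visibility
                m.receipt_handle = f"rh-{next(self._handles)}"
                m.receive_count += 1
                return m.body, m.receipt_handle
        return None

    def change_visibility(self, handle: str, timeout: float, now: float) -> None:
        # The new timeout counts from the time of the call.
        self._find(handle).visible_at = now + timeout

    def delete(self, handle: str) -> None:
        self._msgs.remove(self._find(handle))

    def _find(self, handle: str) -> _Msg:
        for m in self._msgs:
            if m.receipt_handle == handle:
                return m
        raise KeyError(handle)


q = MiniQueue(default_visibility=30)
q.send("order-1")
_, first = q.receive(now=0)               # hidden until t=30
q.change_visibility(first, 90, now=10)    # hidden until t=100
_, second = q.receive(now=100)            # consumer crashed; redelivered
still_hidden = q.receive(now=129)         # second delivery is hidden until t=130
```

The second delivery is hidden for the 30-second queue default, not the 90 seconds the first delivery had been extended to, which is the pitfall described above.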

The heartbeat pattern for variable processing times

For consumers with processing times that vary widely (1s to 5 minutes depending on payload), the heartbeat pattern decouples the queue timeout from worst-case processing:

  1. When the consumer receives a message, start a background heartbeat thread.
  2. Every visibility_timeout / 3 seconds, call ChangeMessageVisibility(receipt_handle, new_timeout). The new timeout counts from the moment of the call; it is not added to the time remaining.
  3. If the consumer crashes, heartbeats stop. The timeout expires naturally. The broker redelivers.
  4. If the consumer finishes, cancel the heartbeat and DeleteMessage.

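This means the queue default only needs to cover the time between ReceiveMessage and the first heartbeat, not the entire processing time. A heartbeat on its own thread does not detect a deadlocked consumer by itself: if the processing thread is stuck but the process is still alive, the heartbeat keeps extending the timeout. Cap the total extension time, or tie each heartbeat to a progress check, so that heartbeats eventually stop and a different worker recovers the message. SQS also enforces a hard ceiling: a message cannot stay invisible for more than 12 hours after it was received.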
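A minimal sketch of the heartbeat steps above, assuming `sqs` is a boto3-style SQS client (only `change_message_visibility` and `delete_message` are called). `VisibilityHeartbeat`, `handle_message`, and the `max_total_seconds` cap are hypothetical names chosen for illustration; the cap is one way to make a stuck worker eventually give the message up.

```python
import threading


class VisibilityHeartbeat:
    """Keeps one message invisible until stopped or a hard cap is reached."""

    def __init__(self, sqs, queue_url, receipt_handle,
                 visibility_timeout=30, interval=None, max_total_seconds=900):
        self._sqs = sqs
        self._queue_url = queue_url
        self._receipt_handle = receipt_handle
        self._timeout = visibility_timeout  # SQS expects whole seconds here
        self._interval = interval if interval is not None else visibility_timeout / 3
        self._max_beats = int(max_total_seconds // self._interval)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        for _ in range(self._max_beats):
            # wait() returns True as soon as the stop event is set.
            if self._stop.wait(self._interval):
                return
            self._sqs.change_message_visibility(
                QueueUrl=self._queue_url,
                ReceiptHandle=self._receipt_handle,
                VisibilityTimeout=self._timeout,
            )
        # Cap reached: stop extending so the timeout can expire naturally.

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False  # never swallow the processing error


def handle_message(sqs, queue_url, message, process, **heartbeat_kwargs):
    """Process one received message; delete it only if process() succeeds."""
    receipt = message["ReceiptHandle"]
    with VisibilityHeartbeat(sqs, queue_url, receipt, **heartbeat_kwargs):
        process(message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
```

If `process` raises, the heartbeat stops and the message is not deleted, so it becomes visible again once the last extension expires. If the whole process crashes, the daemon thread dies with it and the same thing happens.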

Dead-letter queues and redrive discipline

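Without a DLQ, a poison-pill message (a payload that consistently crashes the consumer due to a bug or malformed data) keeps coming back. SQS redelivers it every time the visibility timeout expires until the queue's retention period runs out, burning resources on every attempt. In a FIFO queue it also blocks every later message in the same message group.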

The DLQ is a separate queue where SQS moves messages after maxReceiveCount failed delivery attempts:

  • Recommended maxReceiveCount: 5–10. Setting it to 1 or 2 sends transient failures (flaky downstream, brief DB timeout) to DLQ immediately — almost everything looks like a poison pill. Setting it too high means thrashing on real poison pills for a long time.
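  • DLQ retention: up to 14 days (SQS maximum). Messages older than retention are lost permanently. On standard queues the retention clock runs from the original enqueue time, not from the move to the DLQ, so give the DLQ a longer retention period than the source queue.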
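As a sketch of how these two settings are expressed, the helpers below build the queue attribute maps that SQS accepts. `RedrivePolicy` (with `deadLetterTargetArn` and `maxReceiveCount`) and `MessageRetentionPeriod` are SQS attribute names; the function names and the ARN in the usage note are made up for illustration.

```python
import json

FOURTEEN_DAYS_SECONDS = 14 * 24 * 3600  # SQS maximum retention


def redrive_policy_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Attributes to set on the SOURCE queue so failures move to the DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS attribute values are strings, so the policy is passed as a JSON string.
    return {"RedrivePolicy": json.dumps(policy)}


def dlq_retention_attributes(seconds: int = FOURTEEN_DAYS_SECONDS) -> dict:
    """Attributes to set on the DLQ itself."""
    if not 60 <= seconds <= FOURTEEN_DAYS_SECONDS:
        raise ValueError("retention must be between 60 seconds and 14 days")
    return {"MessageRetentionPeriod": str(seconds)}
```

With a boto3 client these would be applied with something like `sqs.set_queue_attributes(QueueUrl=source_url, Attributes=redrive_policy_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq"))`, where the ARN is a placeholder.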

DLQ redrive discipline

  1. Snapshot the DLQ to S3 (manifest + messages) before any redrive.
  2. Audit 20 random samples against the fixed code on staging.
  3. Rate-limited redrive: 1–10 msgs/s via StartMessageMoveTask.

Warning: never bulk-redrive without rate limiting. It risks a stampede.
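A hedged sketch of the audit and redrive steps, assuming `sqs` is a boto3-style client exposing `start_message_move_task`. The helper names are hypothetical, and the 1–10 msgs/s guard simply enforces the range suggested above rather than any SQS limit.

```python
import random


def pick_audit_sample(messages, k=20, seed=None):
    """Choose up to k snapshot messages at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(list(messages), min(k, len(messages)))


def start_rate_limited_redrive(sqs, dlq_arn, source_arn, msgs_per_second=5):
    """Start moving messages from the DLQ back to the source queue, throttled."""
    if not 1 <= msgs_per_second <= 10:
        raise ValueError("keep the redrive rate between 1 and 10 msgs/s")
    response = sqs.start_message_move_task(
        SourceArn=dlq_arn,          # the queue being drained is the DLQ
        DestinationArn=source_arn,  # where the messages go back to
        MaxNumberOfMessagesPerSecond=msgs_per_second,
    )
    return response["TaskHandle"]
```

At 5 msgs/s, the 50,000 messages from the opening incident would take about 2 hours 47 minutes to drain (50,000 / 5 = 10,000 seconds), which gives the downstream service time to absorb them and leaves room to cancel the task if errors spike.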

The outbox pattern: producer-side reliability

Even publishing a message to the broker is at-least-once: the producer may retry on timeout and the broker receives two copies. Worse is the dual-write problem: the application updates the DB and publishes a message as one logical operation, but the two writes are not atomic. If the DB commits but the broker publish fails, the broker never sees the event: silent data loss. Publishing first does not help, because the DB transaction can then roll back after consumers have already seen the event.

The outbox pattern fixes this with a transactional outbox table:

  1. In the same DB transaction as the business update, INSERT a row into an outbox table: (id, payload, status='pending').
  2. COMMIT: both the business update and the outbox row succeed atomically, or both roll back.
  3. A separate Outbox Sender reads pending rows and publishes them to the broker. On success, marks the row status='sent'.
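  4. If the sender crashes after publish but before marking sent, the row stays pending. The next sender run re-publishes it (a duplicate), so consumers must deduplicate, typically on the outbox row id carried in the message. Broker-side deduplication covers only narrow cases: a Kafka idempotent producer deduplicates retries within one producer session, not a re-publish after a restart, and SQS FIFO deduplication applies only within a 5-minute window.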
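The four steps above can be sketched with the standard-library `sqlite3` module standing in for the application database. The table layout follows the `(id, payload, status)` shape described above; `open_db`, `mark_order_paid`, `send_pending`, and the `publish` callback are hypothetical names, and a real sender would also need locking if several instances run at once.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending'
        );
    """)
    return db


def mark_order_paid(db: sqlite3.Connection, order_id: int) -> None:
    # `with db` commits both statements together, or rolls both back on error.
    with db:
        db.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_paid", "id": order_id}),),
        )


def send_pending(db: sqlite3.Connection, publish) -> int:
    """Publish every pending outbox row; return how many were marked sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    sent = 0
    for outbox_id, payload in rows:
        # If publish() raises, or the process dies before the UPDATE below,
        # the row stays pending and is published again on the next run.
        publish(outbox_id, payload)
        with db:
            db.execute("UPDATE outbox SET status = 'sent' WHERE id = ?", (outbox_id,))
        sent += 1
    return sent
```

Passing the outbox row id to `publish` lets it travel with the message, so a consumer can use it as the deduplication key when the same row is published twice.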

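In the CDC-based variant, a connector such as Debezium reads committed outbox inserts from the DB transaction log instead of polling the table. This is a common modern form: no polling loop and lower latency. Delivery is still at-least-once, so consumer dedup remains necessary.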

Quiz

An SQS consumer calls ChangeMessageVisibility to extend to 90s during processing. The consumer then crashes. When does SQS redeliver the message?

Quiz

An application updates a Postgres row and publishes a Kafka message in two separate operations. The DB commit succeeds but Kafka publish fails. What data state exists?

Order the steps

Order the steps of the outbox pattern for reliable event publishing:

  1. Application receives request to update order status
  2. BEGIN DB transaction
  3. UPDATE orders SET status='paid' WHERE id=123
  4. INSERT INTO outbox (payload='{order_paid, id:123}', status='pending')
  5. COMMIT: both update and outbox row committed atomically
  6. Outbox Sender reads pending row and publishes to Kafka
  7. Outbox Sender marks row status='sent'

Recall before you leave

  1. What happens to the visibility timeout value on SQS message redelivery?
  2. What is the recommended maxReceiveCount for a production SQS DLQ and why not 1?
  3. What is the dual-write problem and how does the outbox pattern solve it?

Recap

SQS visibility timeout is a per-message timer that resets to the queue default on each redelivery — set it to 6x average processing time, and use ChangeMessageVisibility heartbeats (every timeout/3 seconds) for variable workloads. Dead-letter queues quarantine poison pills after maxReceiveCount failures (use 5–10, not 1); redrive with a rate limit of 1–10 msgs/s to avoid stampeding the source queue. The outbox pattern solves the producer-side dual-write problem by INSERTing an event row in the same DB transaction as the business update — the row is the durable intent; a separate sender publishes from it, and any publish failure is retried from the still-pending row without losing the event.
