SQS visibility timeout, DLQ, and the outbox pattern

How SQS visibility timeout works per-message, the heartbeat pattern for long-running consumers, DLQ redrive discipline, and the outbox pattern for reliable producer-side publishing.

A DLQ silently accumulated 50,000 messages over three days while the on-call team watched a different dashboard. All of them were one-off edge-case orders that needed human review. When they finally ran redrive-all without a rate limit, the source queue received 50,000 messages in one second and the downstream service went down.

SQS visibility timeout mechanics

To understand why that DLQ stampede happened — and how to prevent it next time — you need to see exactly how SQS decides when a message is safe to redeliver. The mechanism is simpler than most engineers expect, which is precisely why it bites them.

SQS does not have an acknowledge/reject RPC. Instead, it uses a visibility timeout. When a consumer receives a message, SQS records:

A per-message timestamp
The consumer’s receipt handle (a unique token for this delivery)

For the duration of the visibility timeout, no other consumer can receive the message. If the consumer calls DeleteMessage with the matching receipt handle before the timeout, the message is gone. If the timeout fires first, the message becomes visible to all consumers again — at-least-once delivery.
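The receive, process, delete cycle described above can be sketched as follows. This is a minimal illustration, assuming `sqs` is a boto3 SQS client (`boto3.client("sqs")`); the function name `process_one` and the `handler` callback are hypothetical names introduced here.

```python
from typing import Callable


def process_one(sqs, queue_url: str, handler: Callable[[str], None]) -> bool:
    """Receive at most one message, run the handler, delete on success.

    `sqs` is assumed to be a boto3 SQS client. Returns True only if a
    message was processed and deleted.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=10,  # long polling instead of busy-looping
    )
    messages = resp.get("Messages", [])
    if not messages:
        return False

    msg = messages[0]
    try:
        handler(msg["Body"])
    except Exception:
        # No DeleteMessage call: the visibility timeout expires and
        # SQS makes the message visible to consumers again.
        return False

    # The receipt handle identifies this specific delivery, not the message.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return True
```

Note that a failed handler does nothing explicit here. Simply not deleting the message is what triggers redelivery.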

Key pitfall: visibility timeout resets to the queue default on each redelivery, not to whatever value you set via ChangeMessageVisibility on the previous delivery. If your normal path extends to 90s but the queue default is 30s, every redelivered message starts with 30s and may expire again before your slow consumer finishes.

Rule of thumb: queue default visibility = 6x average processing time.

The heartbeat pattern for variable processing times

For consumers with processing times that vary widely (1s to 5 minutes depending on payload), the heartbeat pattern decouples the queue timeout from worst-case processing:

1. When the consumer receives a message, start a background heartbeat thread.
2. Every visibility_timeout / 3 seconds, call ChangeMessageVisibility(receipt_handle, new_timeout).
3. If the consumer crashes, heartbeats stop. The timeout expires naturally. The broker redelivers.
4. If the consumer finishes, cancel the heartbeat and DeleteMessage.

Together, steps 1–4 decouple the queue default timeout from the actual processing duration: without step 2, a slow payload redelivers mid-flight even when the consumer is healthy; without step 3, a genuinely crashed consumer holds the message until someone notices. The heartbeat is both a renewal mechanism and a liveness probe.

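This means the queue default only needs to cover the time between ReceiveMessage and the first heartbeat, not the entire processing time. One caveat: a heartbeat running on its own thread does not detect a deadlocked consumer. If the processing thread is stuck but the process is still alive, the heartbeat keeps extending the timeout, so cap the number of extensions or tie each heartbeat to real progress. SQS itself refuses extensions beyond 12 hours after the message was received, which is far too late to rely on as a safety net.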
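A minimal sketch of the heartbeat pattern, assuming `sqs` is a boto3 SQS client. The function name, the `interval` override, and the `max_extensions` cap are assumptions added for illustration. The cap bounds how long a stuck worker can keep holding a message.

```python
import threading
from typing import Callable, Optional


def process_with_heartbeat(
    sqs,
    queue_url: str,
    receipt_handle: str,
    work: Callable[[], None],
    visibility_timeout: int = 30,
    interval: Optional[float] = None,  # defaults to visibility_timeout / 3
    max_extensions: int = 20,          # hypothetical cap against stuck workers
) -> None:
    """Run work() while a background thread keeps extending visibility."""
    done = threading.Event()
    every = interval if interval is not None else visibility_timeout / 3

    def heartbeat() -> None:
        for _ in range(max_extensions):
            # wait() returns True as soon as `done` is set.
            if done.wait(every):
                return
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=visibility_timeout,
            )

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        work()
    finally:
        # Stop heartbeats whether work() succeeded or raised.
        done.set()
        thread.join()

    # Only reached when work() did not raise.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
```

If `work()` raises, the exception propagates and the message is never deleted, so the last extension simply runs out and SQS redelivers it.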

Dead-letter queues and redrive discipline

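Without a DLQ, a poison-pill message (a payload that consistently crashes the consumer due to a bug or malformed data) keeps cycling through the queue. SQS redelivers it until the message retention period expires (14 days at most), burning resources on every attempt. On a FIFO queue it also blocks the messages behind it in the same message group.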

The DLQ is a separate queue where SQS moves messages after maxReceiveCount failed delivery attempts:

Recommended maxReceiveCount: 5–10. Setting it to 1 or 2 sends transient failures (flaky downstream, brief DB timeout) to DLQ immediately — almost everything looks like a poison pill. Setting it too high means thrashing on real poison pills for a long time.
DLQ retention: up to 14 days (SQS maximum). Messages older than retention are lost permanently.

Both extremes of maxReceiveCount have a distinct failure mode — too low quarantines transient errors as false poison pills, too high thrashes real poison pills and clogs the queue. 5–10 is the window that distinguishes them.

On failure (or visibility timeout) the message becomes visible again and receiveCount increments — SQS redelivers it. Once receiveCount exceeds maxReceiveCount, SQS routes it to the DLQ instead of the source queue, isolating the poison message.
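As a sketch of how these settings look in practice, the helpers below build the queue attributes for a DLQ and its source queue. The helper names and the example ARN are hypothetical. `RedrivePolicy`, `deadLetterTargetArn`, `maxReceiveCount`, and `MessageRetentionPeriod` are the SQS attribute names.

```python
import json


def dlq_attributes() -> dict:
    """Attributes for the DLQ itself: keep messages for the 14-day maximum."""
    return {"MessageRetentionPeriod": str(14 * 24 * 60 * 60)}


def source_queue_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Attributes for the source queue, pointing failed messages at the DLQ."""
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    return {
        # SQS expects the redrive policy as a JSON string.
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": max_receive_count,
            }
        )
    }
```

With a boto3 client these would typically be applied through `sqs.set_queue_attributes(QueueUrl=..., Attributes=source_queue_attributes(dlq_arn))`.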

DLQ redrive discipline

1. Snapshot DLQ to S3 (manifest + messages) before any redrive
2. Audit 20 random samples against the fixed code on staging
3. Rate-limited redrive: 1–10 msgs/s via StartMessageMoveTask

Warning: never bulk-redrive without rate limiting. It risks a stampede on the source queue.
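The rate-limited redrive step could look roughly like this, assuming `sqs` is a boto3 SQS client. The wrapper name and the 1 to 10 guard are assumptions that mirror the guidance above. `start_message_move_task` and its `MaxNumberOfMessagesPerSecond` parameter belong to the SQS API.

```python
def start_rate_limited_redrive(
    sqs,
    dlq_arn: str,
    source_arn: str,
    msgs_per_second: int = 5,
) -> str:
    """Start a DLQ redrive, refusing any rate outside 1-10 msgs/s."""
    if not 1 <= msgs_per_second <= 10:
        raise ValueError("redrive rate must stay between 1 and 10 msgs/s")

    resp = sqs.start_message_move_task(
        SourceArn=dlq_arn,            # the DLQ being drained
        DestinationArn=source_arn,    # where the messages go back to
        MaxNumberOfMessagesPerSecond=msgs_per_second,
    )
    # The task handle can later be used to cancel the move task.
    return resp["TaskHandle"]
```

Making the unlimited case impossible to request by accident is the point of the guard. The snapshot and audit steps still have to happen before this call.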

The outbox pattern: producer-side reliability

Even publishing a message to the broker is at-least-once: the producer may retry on timeout and the broker receives two copies. Worse is the dual-write problem: the application updates the DB and publishes a message as one logical operation, but the two writes are not atomic. If the DB commits but the broker publish fails, the broker never sees the event, and the result is silent data loss.

The outbox pattern fixes this with a transactional outbox table:

In the same DB transaction as the business update, INSERT a row into an outbox table: (id, payload, status='pending').
COMMIT: both the business update and the outbox row succeed atomically, or both roll back.
A separate Outbox Sender reads pending rows and publishes them to the broker. On success, marks the row status='sent'.
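If the sender crashes after publish but before marking sent, the row stays pending. The next sender run re-publishes it (a duplicate), so consumers must deduplicate, for example on the outbox row id. A broker-side idempotent producer only covers retries within a single producer session, so it does not catch this case on its own.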

The CDC-based variant (Debezium reads the DB transaction log) is the modern form — no polling loop, lower latency, and works as long as the DB is online.

Quiz

An SQS consumer calls ChangeMessageVisibility to extend to 90s during processing. The consumer then crashes. When does SQS redeliver the message?

Quiz

An application updates a Postgres row and publishes a Kafka message in two separate operations. The DB commit succeeds but Kafka publish fails. What data state exists?

Order the steps

Order the steps of the outbox pattern for reliable event publishing:

1 Application receives request to update order status
2 BEGIN DB transaction
3 UPDATE orders SET status='paid' WHERE id=123
4 INSERT INTO outbox (payload, status) VALUES ('{"event":"order_paid","id":123}', 'pending')
5 COMMIT — both update and outbox row committed atomically
6 Outbox Sender reads pending row and publishes to Kafka
7 Outbox Sender marks row status='sent'

Recall before you leave

01 What happens to the visibility timeout value on SQS message redelivery?
02 What is the recommended maxReceiveCount for a production SQS DLQ and why not 1?
03 What is the dual-write problem and how does the outbox pattern solve it?

Recap

SQS visibility timeout is a per-message timer that resets to the queue default on each redelivery — set it to 6x average processing time, and use ChangeMessageVisibility heartbeats (every timeout/3 seconds) for variable workloads. Dead-letter queues quarantine poison pills after maxReceiveCount failures (use 5–10, not 1); redrive with a rate limit of 1–10 msgs/s to avoid stampeding the source queue. The outbox pattern solves the producer-side dual-write problem by INSERTing an event row in the same DB transaction as the business update — the row is the durable intent; a separate sender publishes from it, and any publish failure is retried from the still-pending row without losing the event. Now when you are handed a DLQ with thousands of messages to redrive, you will know to snapshot first, audit a sample on staging, and rate-limit to 1–10 msgs/s — not bulk-redrive and hope.

Connected lessons

builds on

The three failure legs: where duplicates and losses actually happen (middle)

deepens into

Exactly-once in production: impossibility proof, hybrid patterns, and real incidents (senior)

Apply this

Put this lesson to work on a real build.

At-least-once job queue: Build a durable job queue on Postgres with visibility timeouts and idempotent consumers, so a crashed worker never drops a job.
Job scheduler: A cron + backoff job runner with at-least-once delivery, idempotent handlers, and visibility timeouts, so no job is silently lost even when workers crash mid-execution.