
Exactly-once in production: impossibility proof, hybrid patterns, and real incidents

The Two Generals proof of why exactly-once delivery is impossible, the hybrid Kafka+DB pattern that achieves effectively-once at scale, observability metrics, and real production failures from Stripe, AWS, and Slack.



Stripe 2020: a bug in their internal queue consumer skipped the Idempotency-Key on retry, charging a small percentage of customers twice during a Q4 traffic spike. Refunded within 24 hours. The post-mortem: “exactly-once is harder than it looks; production correctness lives in consumer idempotency more than in broker guarantees.”

The Two Generals’ Problem: why exactly-once delivery is impossible

This lesson ties together everything from the unit — but it starts with a proof, not a recipe. If you have ever wondered why every senior engineer says “exactly-once is impossible” and still ships systems that behave correctly, this is where you will find the precise answer.

Two generals want to coordinate an attack at dawn. Their only communication is via messenger across enemy territory. Messengers may be captured. Can they agree?

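Proof by contradiction: suppose a protocol exists, and take the one that needs the fewest messages. Consider its last message. Because any messenger may be captured, the protocol must still reach agreement when that message is lost: the sender cannot know whether it arrived, so the sender's decision cannot depend on it, and the receiver must decide correctly without it. The last message is therefore unnecessary, and a protocol with one fewer message also works, which contradicts the choice of the shortest protocol. Repeating the argument strips the protocol down to zero messages, which is absurd, since the generals cannot coordinate without communicating.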

The generalisation: no finite protocol over an unreliable channel can give both parties certainty that they share the same agreed fact.

Applied to queues: exactly-once delivery requires both ends of a hop (producer and broker, or broker and consumer) to know the message was delivered exactly once. Over an unreliable network, this is the Two Generals’ Problem. Exactly-once processing only requires one party, the consumer, to maintain a dedup log: it records each message ID alongside the side effect and needs no confirmation from the other side. One-sided certainty is achievable. Two-sided is not.

This asymmetry is why “effectively-once” is the real production target: at-least-once delivery from the broker, exactly-once processing enforced by the consumer.
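The asymmetry can be seen in a small simulation. This is a minimal sketch, not a model of any real broker: the function name `run_simulation`, the ack-loss rate, and the in-memory `processed_ids` set are all illustrative assumptions. The broker redelivers until an ack gets through, and the consumer's dedup log keeps the side effect at one per message.

```python
import random


def run_simulation(num_messages: int, ack_loss_rate: float, seed: int = 7):
    """Simulate at-least-once delivery with a consumer-side dedup log.

    Returns (deliveries, side_effects).
    """
    rng = random.Random(seed)
    processed_ids = set()  # the consumer's dedup log (one-sided certainty)
    deliveries = 0         # times the broker handed a message to the consumer
    side_effects = 0       # times the business action actually ran

    for n in range(num_messages):
        msg_id = f"msg-{n}"
        acked = False
        while not acked:
            # The broker keeps redelivering until it sees an ack.
            deliveries += 1
            if msg_id not in processed_ids:
                processed_ids.add(msg_id)
                side_effects += 1
            # The ack crosses the unreliable channel and may be lost, so the
            # broker cannot know whether processing already happened.
            acked = rng.random() >= ack_loss_rate

    return deliveries, side_effects


deliveries, side_effects = run_simulation(num_messages=1000, ack_loss_rate=0.2)
print(f"deliveries={deliveries} side_effects={side_effects}")
```

The delivery count exceeds the message count because lost acks force redelivery, while the side-effect count stays equal to the number of distinct messages.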

The hybrid exactly-once pattern at scale

Pure Kafka exactly-once (within Kafka topics) does not extend to external systems. The production-grade pattern for a high-throughput payment processor:

Kafka with idempotent producer (enable.idempotence=true, acks=all). Eliminates duplicates caused by producer retries, at roughly 3% overhead.
Consumer at-least-once with isolation.level=read_committed. Hides records from aborted or still-open transactions, so the consumer sees only committed records.

Per-message dedup table in Postgres:

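BEGIN;
  -- msg_id needs a PRIMARY KEY or UNIQUE constraint for ON CONFLICT to fire
  INSERT INTO charges (msg_id, status)
  VALUES ('msg-7a3f', 'pending')
  ON CONFLICT (msg_id) DO NOTHING;
  -- 0 rows inserted: this message was seen before. Skip only if its row is
  -- already 'done'. A row still 'pending' means an earlier attempt crashed
  -- before finishing, so continue (repeating the Stripe call is safe).
COMMIT;
-- call Stripe with Idempotency-Key=msg-7a3f
UPDATE charges SET status = 'done', charge_id = 'ch_abc123'
WHERE msg_id = 'msg-7a3f';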

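Stripe Idempotency-Key derived from msg_id. Stripe stores the first response for a given key for at least 24 hours and returns that stored response to any retry within the window. Once the key has expired, a retry is treated as a new request, so the dedup table remains the long-term guard.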

Each layer is independently testable. The system survives crashes at any point with no duplicate charges. The dedup table also serves as an audit trail for financial reconciliation.
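The crash-safety claim can be exercised end to end in a few lines. The following is a sketch under stated assumptions: SQLite (via Python's `sqlite3`) stands in for Postgres, and `FakePaymentAPI`, `handle_message`, and the `crash_before_update` flag are hypothetical names invented for illustration, not Stripe's client or the lesson author's code.

```python
import sqlite3


class FakePaymentAPI:
    """Hypothetical stand-in for a payment API that honours idempotency keys."""

    def __init__(self):
        self._responses = {}      # idempotency key -> first response
        self.charges_created = 0  # how many real charges were made

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # replay stored response
        self.charges_created += 1
        charge_id = f"ch_{self.charges_created:06d}"
        self._responses[idempotency_key] = charge_id
        return charge_id


def handle_message(db, api, msg_id, amount_cents, crash_before_update=False):
    """Process one delivery; safe to call again for the same msg_id."""
    with db:  # one transaction: claim the message ID, then read its status
        db.execute(
            "INSERT INTO charges (msg_id, status) VALUES (?, 'pending') "
            "ON CONFLICT (msg_id) DO NOTHING",
            (msg_id,),
        )
        status = db.execute(
            "SELECT status FROM charges WHERE msg_id = ?", (msg_id,)
        ).fetchone()[0]
    if status == "done":
        return "skipped"  # a previous attempt already finished
    # Row is 'pending': first attempt, or a retry after a crash.
    charge_id = api.charge(idempotency_key=msg_id, amount_cents=amount_cents)
    if crash_before_update:
        raise RuntimeError("simulated crash after the API call")
    with db:
        db.execute(
            "UPDATE charges SET status = 'done', charge_id = ? WHERE msg_id = ?",
            (charge_id, msg_id),
        )
    return "charged"


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE charges (msg_id TEXT PRIMARY KEY, status TEXT NOT NULL, charge_id TEXT)"
)
api = FakePaymentAPI()

try:
    handle_message(db, api, "msg-7a3f", 1999, crash_before_update=True)
except RuntimeError:
    pass  # no ack reached the broker, so it redelivers

results = [handle_message(db, api, "msg-7a3f", 1999) for _ in range(2)]
print(results, api.charges_created)
```

The first redelivery finds a 'pending' row, repeats the API call with the same key, and completes the row. The second finds 'done' and skips. Only one charge is ever created.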

Hybrid exactly-once: defence in depth

Kafka idempotent producer: stops Leg 1 producer-retry duplicates

DB dedup + ON CONFLICT: stops Leg 3 consumer-retry duplicates atomically

Stripe Idempotency-Key: stops cross-system duplicates at the payment API boundary

The broker delivers msg-7a3f more than once (at-least-once). The dedup boundary keyed on msg_id collapses the duplicate via ON CONFLICT DO NOTHING, so the side effect — DB upsert or Stripe call — fires exactly once: effectively-once.
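The first layer is mostly client configuration. Below is a sketch of what those settings might look like as plain dictionaries, plus a small check that could run in CI. The setting names (`enable.idempotence`, `acks`, `isolation.level`, `enable.auto.commit`) are standard Kafka client settings. The broker address, the group name, the choice to disable auto-commit, and the helper `check_producer_config` are illustrative assumptions.

```python
from typing import Dict, List

producer_config: Dict[str, object] = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "enable.idempotence": True,             # broker drops retried duplicates
    "acks": "all",                          # wait for all in-sync replicas
}

consumer_config: Dict[str, object] = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "charge-consumer",          # hypothetical group name
    "isolation.level": "read_committed",    # skip aborted transactional records
    # Assumption: commit offsets manually, only after the dedup row is 'done'.
    "enable.auto.commit": False,
}


def check_producer_config(cfg: Dict[str, object]) -> List[str]:
    """Return a list of problems that would weaken the producer-side layer."""
    problems = []
    if cfg.get("enable.idempotence") not in (True, "true"):
        problems.append("enable.idempotence must be true")
    if str(cfg.get("acks")) not in ("all", "-1"):
        problems.append("acks must be 'all'")
    return problems


print(check_producer_config(producer_config))
```

A check like this is cheap insurance against a later edit that quietly lowers `acks` for throughput.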

Observability for delivery semantics

Without metrics, you find out about duplicates from angry customers, not dashboards.

Per-broker metrics:

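consumer_lag_messages: the latest produced (log-end) offset minus the consumer group's committed offset, summed across partitions. Sustained growth means consumers are falling behind.
dlq_depth by source queue: alert at any non-zero value in production.
dlq_age_seconds_p99: the 99th-percentile age of messages in the DLQ, a close proxy for the oldest message; alert when it exceeds 1 hour.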

Per-consumer metrics:

dedup_check_hit_rate: the fraction of dedup INSERTs blocked by the UNIQUE constraint. Should be near 0% in steady state. A spike indicates the broker is redelivering aggressively, caused by a consumer-group rebalance, a bug causing early acks, or a broker issue.
side_effect_duration_p99: if this exceeds your visibility timeout (or, on Kafka, max.poll.interval.ms), the message is redelivered while the first attempt is still running, and you will see duplicates.
retry_count_p99: sustained high values mean unhealthy consumers.
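Two of these metrics reduce to simple arithmetic, sketched below. The function names and the per-partition dictionaries are hypothetical. The 1% alert threshold outside deploy windows is the one this lesson uses in its recall questions, and the sample rates are the ones from the quiz.

```python
from typing import Dict


def consumer_lag_messages(
    log_end_offsets: Dict[int, int], committed_offsets: Dict[int, int]
) -> int:
    """Total lag: log-end offset minus committed offset, summed over partitions."""
    return sum(
        max(0, end - committed_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    )


def dedup_check_hit_rate(dedup_hits: int, dedup_checks: int) -> float:
    """Fraction of dedup INSERTs that were blocked by the UNIQUE constraint."""
    if dedup_checks == 0:
        return 0.0
    return dedup_hits / dedup_checks


def should_alert(hit_rate: float, in_deploy_window: bool, threshold: float = 0.01) -> bool:
    """Alert on a hit rate above the threshold, except during deploys."""
    return hit_rate > threshold and not in_deploy_window


lag = consumer_lag_messages({0: 1500, 1: 900}, {0: 1450, 1: 900})
steady = dedup_check_hit_rate(dedup_hits=5, dedup_checks=10_000)   # 0.05%
spike = dedup_check_hit_rate(dedup_hits=800, dedup_checks=10_000)  # 8%
print(lag, should_alert(steady, False), should_alert(spike, False))
```

In practice both functions would be fed from broker and consumer counters over a sliding window rather than from literal values.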

Distributed tracing: every message should carry a trace ID propagated from producer through consumer to any downstream service. Duplicate processing shows up as two spans with the same message trace ID — instantly distinguishable from legitimate distinct messages.
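As a rough illustration of that check, the sketch below scans exported spans and flags any message whose side-effect span ran more than once. The span shape (a tuple of trace ID, message ID, and span name), the span name `charge.create`, and the function `find_duplicate_processing` are assumptions made for the example, not a real tracing API.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical span shape: (trace_id, msg_id, span_name)
Span = Tuple[str, str, str]


def find_duplicate_processing(
    spans: Iterable[Span], side_effect_span: str = "charge.create"
) -> Dict[str, int]:
    """Return msg_id -> count for messages whose side effect ran more than once."""
    counts = Counter(
        msg_id for _trace_id, msg_id, name in spans if name == side_effect_span
    )
    return {msg_id: n for msg_id, n in counts.items() if n > 1}


spans = [
    ("t-1", "msg-7a3f", "consume"),
    ("t-1", "msg-7a3f", "charge.create"),
    ("t-1", "msg-7a3f", "consume"),        # redelivery
    ("t-1", "msg-7a3f", "charge.create"),  # side effect ran again: a real duplicate
    ("t-2", "msg-9b10", "consume"),
    ("t-2", "msg-9b10", "charge.create"),
    ("t-3", "msg-c001", "consume"),
    ("t-3", "msg-c001", "consume"),        # redelivery absorbed by dedup
    ("t-3", "msg-c001", "charge.create"),
]

duplicates = find_duplicate_processing(spans)
print(duplicates)
```

Counting side-effect spans rather than consume spans matters: a redelivery that the dedup layer absorbs produces a second consume span but no second side effect, and should not be flagged.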

Real production failures

Stripe 2020: Internal queue consumer bug skipped Idempotency-Key on retry during Q4 spike. Small percentage of customers charged twice. Refund in 24 hours. Post-mortem result: hardened a linter rule that flags HTTP POST calls in consumer code paths that lack an Idempotency-Key header.

AWS SQS 2019: A region had a visibility-timeout race condition: consumers calling ChangeMessageVisibility while DeleteMessage was in-flight occasionally caused redelivery despite the delete completing. Fixed in next SDK release.

Stripe internal 2023: A hot DLQ accumulated 1M messages during a downstream incident. An operator ran redrive-all without rate limiting. Source queue received 1M messages simultaneously, overwhelming the consumer pool and triggering cascading throttling on the downstream payment API.
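The operational fix for this class of incident is to pace the redrive. The following is a minimal sketch of one way to do that, not the tooling involved in the incident: `redrive`, its parameters, and the injectable `sleep` and `clock` hooks are hypothetical names chosen so the pacing can be tested without real waiting.

```python
import time
from typing import Callable, Iterable


def redrive(
    messages: Iterable[object],
    publish: Callable[[object], None],
    max_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Republish DLQ messages to the source queue at a bounded rate.

    Returns the number of messages republished.
    """
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    interval = 1.0 / max_per_second
    next_send = clock()
    sent = 0
    for message in messages:
        delay = next_send - clock()
        if delay > 0:
            sleep(delay)  # wait for the next send slot
        publish(message)
        sent += 1
        # Schedule the next slot; never schedule in the past after a slow publish.
        next_send = max(next_send, clock()) + interval
    return sent


requeued = []
sent = redrive(["m1", "m2", "m3"], requeued.append, max_per_second=100)
print(sent, requeued)
```

A real redrive would also want a pause switch and a rate tied to the downstream API's headroom, since the consumers are only as fast as what they call.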

Slack 2022: A Kafka exactly-once stream pipeline had a transaction-coordinator bug during a broker rolling restart. The system delivered duplicates for 6 minutes until the rolling restart completed. Root cause: zombie producer sessions that had not been properly fenced.

The pattern across these incidents: broker guarantees are necessary but insufficient. In the two Stripe incidents the correctness invariant broke in consumer code or in an operational procedure. In the SQS and Slack incidents the messaging layer itself redelivered, which is exactly the case consumer idempotency exists to absorb. End-to-end correctness therefore requires consumer idempotency and operational discipline. That is the whole reason effectively-once lives in consumer idempotency, not in broker guarantees.

Quiz

What is the actual reason Kafka exactly-once cannot extend natively to an external Postgres database?

Quiz

In steady state, a consumer's dedup_check_hit_rate suddenly spikes from 0.05% to 8% for 10 minutes outside a deploy window. What is the most likely cause?

Recall before you leave

01
State the Two Generals' Problem and explain the asymmetry that makes exactly-once processing achievable.
02
What three layers compose the hybrid exactly-once pattern for a payment processor?
03
What does a sustained dedup_check_hit_rate spike above 1% outside deploy windows indicate?

Recap

The Two Generals’ Problem proves that no finite protocol over an unreliable channel can give both sides certainty of agreement, which is why exactly-once delivery is impossible. Production systems achieve effectively-once through defence in depth: the Kafka idempotent producer (~3% cost) eliminates producer-retry duplicates at the broker; consumer-side DB dedup with ON CONFLICT DO NOTHING eliminates Leg 3 duplicates atomically; the Stripe Idempotency-Key eliminates cross-system duplicates at the payment API. The incidents in this lesson (Stripe 2020, SQS 2019, Slack 2022) point the same way: a broker guarantee alone did not protect the end-to-end result, and consumer idempotency plus operational discipline is what contains the damage. Monitor dedup_check_hit_rate near zero in steady state; a sustained spike outside deploys means consumer instability, not a dedup failure. Now when someone on your team says “we just need exactly-once delivery from the broker”, you will know to redirect the conversation: the broker cannot give it, but the consumer can make duplicates harmless.
