
Networking & Protocols

Reconnection: jittered backoff, thundering herd, message resumption

Crux How to reconnect without crashing your own server — jittered exponential backoff, message IDs, and the at-least-once delivery guarantee with deduplication.

Your server restarts for a deploy. All 10 million connected clients instantly get close code 1006. They all retry at exactly 1 second. Ten million SYN packets hit your load balancer simultaneously. It collapses. The server never recovers. WebSocket has no built-in reconnection — which means you are one naive retry loop away from killing your own service during a maintenance window.

What happens on disconnect

When a WebSocket connection drops — network glitch, server restart, proxy timeout — the client receives a close event with close code 1006 (abnormal closure — no close frame was received). The application must decide: reconnect, and if so, when and how.

WebSocket itself provides zero reconnection logic. The application owns this entirely.
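The reconnect decision can be sketched as a small policy function. The close-code values come from RFC 6455; the policy itself (which codes to retry, the `should_reconnect` name, and the `user_requested_close` flag) is a hypothetical illustration, not something the protocol prescribes.

```python
# Close codes defined in RFC 6455, section 7.4.1.
NORMAL_CLOSURE = 1000    # the peer closed the connection on purpose
GOING_AWAY = 1001        # server shutting down or browser navigating away
ABNORMAL_CLOSURE = 1006  # no close frame received; never sent on the wire
POLICY_VIOLATION = 1008  # the server rejected this client

# Hypothetical policy: codes after which retrying is unlikely to help.
DO_NOT_RETRY = {NORMAL_CLOSURE, POLICY_VIOLATION}


def should_reconnect(close_code: int, user_requested_close: bool = False) -> bool:
    """Return True when the client should schedule a reconnect attempt."""
    if user_requested_close:
        return False
    return close_code not in DO_NOT_RETRY


print(should_reconnect(ABNORMAL_CLOSURE))  # True
print(should_reconnect(NORMAL_CLOSURE))    # False
```

Deciding whether to reconnect is only half of the question; the timing of the attempt matters just as much.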

The thundering herd problem

If every client retries at the same fixed interval after a disconnect, the retries arrive at the server in synchronized waves:

  • T=0: service goes down, 10 million clients disconnect.
  • T=1s: all 10 million clients retry simultaneously.
  • Server gets 10 million SYN packets in ~1 second.
  • TCP SYN backlog fills. New connections are dropped.
  • T=2s: all 10 million retry again at the same interval.
  • Server never gets a quiet moment to recover.

Even exponential backoff without jitter fails: the synchronized doubling still creates synchronized bursts.
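A minimal simulation makes the synchronized bursts visible. It assumes every attempt fails (the server is still down), that all clients share the same base delay, and it uses 1,000 clients instead of millions; the function names are illustrative.

```python
from collections import Counter


def retry_times_no_jitter(base_ms: int, attempts: int) -> list[int]:
    """Cumulative times (ms after disconnect) of each retry with pure doubling."""
    times, elapsed = [], 0
    for attempt in range(attempts):
        elapsed += base_ms * 2 ** attempt
        times.append(elapsed)
    return times


def arrivals_per_instant(num_clients: int, base_ms: int, attempts: int) -> Counter:
    """Count how many connection attempts land on each millisecond."""
    arrivals = Counter()
    for _ in range(num_clients):
        arrivals.update(retry_times_no_jitter(base_ms, attempts))
    return arrivals


waves = arrivals_per_instant(num_clients=1000, base_ms=1000, attempts=4)
for t_ms, count in sorted(waves.items()):
    print(f"t={t_ms} ms: {count} attempts")
# t=1000 ms: 1000 attempts
# t=3000 ms: 1000 attempts
# t=7000 ms: 1000 attempts
# t=15000 ms: 1000 attempts
```

Doubling makes the waves less frequent, but each wave still carries the full client population in a single instant.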

Jittered exponential backoff

The fix is to add randomness (jitter) to each client’s retry timing so the 10 million retries spread over a window of seconds instead of arriving simultaneously.

Algorithm:

attempt = 0
base_ms = random(100, 500)      // different per client
max_ms  = 60_000

while not connected:
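  delay = min(base_ms * 2^attempt, max_ms)
  delay = delay * random(0.5, 1.5)   // ±50% jitter, applied after the cap: the real delay can reach 1.5 × max_ms, and capped clients stay desynchronized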
  sleep(delay)
  attempt += 1
  try_connect()
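The pseudocode above translates to Python roughly as follows. This is a sketch under stated assumptions: `try_connect` is a hypothetical callable that returns True on success, the `sleep` and `rng` parameters exist only to make the loop testable, and the exponent clamp is an addition to avoid float overflow after very many attempts.

```python
import itertools
import random
import time
from typing import Callable, Iterator, Optional


def backoff_delays(
    max_ms: float = 60_000,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield successive retry delays in milliseconds."""
    rng = rng or random.Random()
    base_ms = rng.uniform(100, 500)  # chosen once per client
    attempt = 0
    while True:
        # Clamp the exponent so 2 ** attempt cannot overflow a float.
        delay = min(base_ms * 2 ** min(attempt, 30), max_ms)
        yield delay * rng.uniform(0.5, 1.5)  # ±50% jitter
        attempt += 1


def reconnect(
    try_connect: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> int:
    """Retry until try_connect() succeeds; return the number of attempts made."""
    for attempts, delay_ms in enumerate(backoff_delays(rng=rng), start=1):
        sleep(delay_ms / 1000)
        if try_connect():
            return attempts


# First five delays for one simulated client (values differ on every run).
print([round(d) for d in itertools.islice(backoff_delays(), 5)])
```

Calling `backoff_delays()` afresh after each disconnect restarts the sequence from the short base delay, which matches the `attempt = 0` initialization in the pseudocode.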

Example spread for a 100k-client server restart:

  • Client A picks base=150 ms, jitter → retries at 186 ms
  • Client B picks base=340 ms, jitter → retries at 412 ms
  • Client C picks base=210 ms, jitter → retries at 288 ms

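The first retry from each client lands somewhere between 50 ms and 750 ms after the disconnect, and each later attempt doubles the window, so by the fourth attempt the 100k retries are spread over 10+ seconds. The server processes them in small batches and recovers within 1–2 retry windows instead of being crushed.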
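The size of the spread follows from the parameters, and the arithmetic can be checked with a short sketch. It assumes the 100–500 ms base range, ±50% jitter, and 60 s cap from the algorithm above, and that every attempt fails; `delay_bounds_ms` is an illustrative name.

```python
def delay_bounds_ms(attempt: int, max_ms: float = 60_000) -> tuple[float, float]:
    """Smallest and largest possible delay for a 0-based attempt number."""
    low = min(100 * 2 ** attempt, max_ms) * 0.5   # smallest base, smallest jitter
    high = min(500 * 2 ** attempt, max_ms) * 1.5  # largest base, largest jitter
    return low, high


elapsed_low = elapsed_high = 0.0
for attempt in range(5):
    low, high = delay_bounds_ms(attempt)
    elapsed_low += low
    elapsed_high += high
    print(f"attempt {attempt}: delay {low:.0f}-{high:.0f} ms, "
          f"arrives {elapsed_low:.0f}-{elapsed_high:.0f} ms after disconnect")
# attempt 0: delay 50-750 ms, arrives 50-750 ms after disconnect
# attempt 1: delay 100-1500 ms, arrives 150-2250 ms after disconnect
# attempt 2: delay 200-3000 ms, arrives 350-5250 ms after disconnect
# attempt 3: delay 400-6000 ms, arrives 750-11250 ms after disconnect
# attempt 4: delay 800-12000 ms, arrives 1550-23250 ms after disconnect
```

The arrival window for the fourth attempt (attempt 3) already exceeds 10 seconds, which is where the "10+ seconds" figure comes from under these parameters.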

Reconnection strategy parameters

  • Recommended base delay range: 100–500 ms (random per client)
  • Jitter factor: ±50% of computed delay
  • Max delay cap: 30–60 seconds
  • Reconnect spread for 10M clients with jitter: 10+ seconds after a few attempts (the first attempt alone spans roughly 50–750 ms)
  • Reconnect spread without jitter (fixed 1s interval): < 1 second (thundering herd)
  • Libraries shipping randomized backoff by default: grpc-go (connection backoff), Socket.IO client (randomizationFactor)

Message resumption and at-least-once delivery

Reconnection re-establishes the connection. It does not restore lost messages. Any messages the server sent while the client was disconnected — or that the client sent but the server had not yet ACK’d — are gone unless the application tracks them.

The pattern:

  1. Every message gets a client-generated ID. The sender keeps the message in a retry queue until it receives an ACK for that ID.
  2. The server ACKs each message ID. The sender removes ACK’d messages from its retry queue.
  3. On reconnect, the client resends unACK’d messages. The server checks its Redis stream or database: if the message ID already exists, it sends ACK to the client but does not republish (prevents duplicates for other subscribers).

This implements at-least-once delivery with idempotent deduplication: no message is lost, and no message is published more than once to other clients, provided the server retains each message ID long enough to recognize a resend.
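The three steps above can be sketched with in-memory stand-ins. The `Server` class below uses a Python set where a real deployment would consult a Redis stream or database, and all class, method, and parameter names (including the `ack_lost` switch that simulates a dropped connection) are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Server:
    """In-memory stand-in for a server backed by durable storage."""
    seen_ids: set = field(default_factory=set)
    published: list = field(default_factory=list)

    def handle(self, msg_id: str, body: str) -> str:
        """Publish a message at most once; always return its ID as the ACK."""
        if msg_id not in self.seen_ids:  # dedup check
            self.seen_ids.add(msg_id)
            self.published.append(body)  # fan out to subscribers once
        return msg_id


@dataclass
class Sender:
    retry_queue: dict = field(default_factory=dict)  # msg_id -> body

    def enqueue(self, body: str) -> str:
        msg_id = str(uuid.uuid4())  # client-generated ID
        self.retry_queue[msg_id] = body
        return msg_id

    def on_ack(self, msg_id: str) -> None:
        self.retry_queue.pop(msg_id, None)

    def flush(self, server: Server, ack_lost: bool = False) -> None:
        """Send every unACK'd message; ack_lost simulates a drop before the ACK."""
        for msg_id, body in list(self.retry_queue.items()):
            ack = server.handle(msg_id, body)
            if not ack_lost:
                self.on_ack(ack)


server, sender = Server(), Sender()
sender.enqueue("hello")
sender.flush(server, ack_lost=True)  # server publishes, but the ACK never arrives
sender.flush(server)                 # after reconnect: resend, server dedups and ACKs
print(server.published, len(sender.retry_queue))  # ['hello'] 0
```

The second `flush` is the interesting case: the server has already published the message, so it only returns the ACK, and the sender can finally clear its queue.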

For the server to deduplicate, it needs durable state that survives restarts. The typical store:

  • Redis stream (fast, persistent, ordered, with configurable retention).
  • PostgreSQL (durable, queryable, suited for message history with audit requirements).

The client reconnect flow:

1. Reconnect with backoff.
2. On connection established, send { type: "resume", lastSeenId: "msg-4291" }.
3. Server reads the stream from msg-4292 onward and delivers missed messages.
4. Client ACKs the replayed messages as it processes them.
5. Client updates its lastSeenId to the ID of the last message it processed.
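The server side of this flow can be sketched with an ordered in-memory log standing in for the Redis stream. `MessageLog`, `handle_resume`, and the frame shapes are assumptions made for illustration; the behavior on an unknown ID (raising so the caller can fall back to a full resync) is likewise one possible choice.

```python
from typing import Optional


class MessageLog:
    """Ordered in-memory log standing in for a Redis stream."""

    def __init__(self) -> None:
        self._ids: list[str] = []
        self._bodies: dict[str, str] = {}

    def append(self, msg_id: str, body: str) -> None:
        self._ids.append(msg_id)
        self._bodies[msg_id] = body

    def replay_after(self, last_seen_id: Optional[str]) -> list[tuple[str, str]]:
        """Return (id, body) pairs that come after last_seen_id, in order."""
        if last_seen_id is None:
            start = 0
        elif last_seen_id in self._bodies:
            start = self._ids.index(last_seen_id) + 1
        else:
            # The ID fell out of retention: the client needs a full resync.
            raise KeyError(f"unknown lastSeenId: {last_seen_id}")
        return [(i, self._bodies[i]) for i in self._ids[start:]]


def handle_resume(log: MessageLog, request: dict) -> list[dict]:
    """Build the frames the server sends in response to a resume request."""
    missed = log.replay_after(request.get("lastSeenId"))
    return [{"type": "message", "id": i, "body": b} for i, b in missed]


log = MessageLog()
for n in range(4290, 4295):
    log.append(f"msg-{n}", f"payload {n}")

frames = handle_resume(log, {"type": "resume", "lastSeenId": "msg-4291"})
last_seen_id = frames[-1]["id"]  # the client advances lastSeenId after processing
print([f["id"] for f in frames])  # ['msg-4292', 'msg-4293', 'msg-4294']
```

The lookup is by position in the log rather than by parsing the number out of the ID, so the sketch does not depend on IDs being sequential.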
Why this works

Why not transport-layer migration instead? Some protocols (QUIC, MPTCP) handle connection migration at the transport layer, so the application never sees a reconnect. WebSocket typically sits on TCP, which has no migration. Rolling your own session layer (token + Redis resume state) adds roughly 50–100 ms of latency on reconnect but works across any TCP/TLS WebSocket deployment. The choice is between the simplicity and universality of a custom session layer and the transport-level migration of WebTransport/QUIC (narrower browser and server support as of 2026).

Trace it

Trace a cascade failure on a WebSocket server after a brief network partition, with and without jitter.

  1. A 1 Gbps link hiccups for 5 seconds. All 100,000 WebSocket clients get close code 1006. What does the server see?
  2. Without jitter: all 100k clients retry at a fixed 1-second interval. What hits the server?
  3. The server is still starting up when the second synchronized wave (at T=2s) arrives. What happens?
  4. With jitter: clients pick base=100–500 ms, add ±50% jitter, double each attempt. Describe the first 10 seconds of reconnect traffic.
  5. On reconnect, a client sends a resume request with lastSeenId=msg-4291. The server's Redis stream has msg-4292 through msg-4300. What does the server do?

Quiz

You are designing a load balancer in front of a WebSocket cluster. A client sends a message to server A, then the network partitions and the client reconnects to server B. Why is sticky session routing (client always goes to the same server) safer than per-request load balancing, even though servers have pub/sub?

Quiz

A mobile WebSocket client loses connectivity, reconnects 45 seconds later, and sends a queued message with a message ID it had not yet received an ACK for. The server has a Redis stream retaining 24 hours of messages. What is the correct server action?

Recall before you leave
  1. Explain why exponential backoff without jitter still produces a thundering herd problem.
  2. What durable store does message resumption require on the server, and why must it survive server restarts?
  3. Describe the full reconnect flow with message resumption: what does the client send, what does the server return, and what dedup check prevents duplicate delivery?

Recap

WebSocket has no built-in reconnection mechanism: close code 1006 fires and the application must decide what to do. Without jitter, a mass disconnect causes a thundering herd: all clients retry at synchronized intervals, making SYN waves that prevent the server from recovering. Jittered exponential backoff (base 100–500 ms per client, ±50% random jitter, doubling each attempt, capped at 30–60 s) spreads 10 million retries over a window that widens with every attempt, an approach that several reconnecting client libraries ship by default. Message resumption pairs a Redis stream (or database) with client-generated message IDs: the sender keeps unACK’d messages in a local queue, resends on reconnect, and the server deduplicates by checking the stream before publishing, achieving at-least-once delivery without publishing duplicates to other subscribers. Sticky session routing helps preserve in-flight state on the server the client was previously connected to, though full durability requires externalized state in Redis or a database.
