Reconnection: jittered backoff, thundering herd, message resumption

How to reconnect without crashing your own server — jittered exponential backoff, message IDs, and the at-least-once delivery guarantee with deduplication.

Your server restarts for a deploy. All 10 million connected clients instantly get close code 1006. They all retry at exactly 1 second. Ten million SYN packets hit your load balancer simultaneously. It collapses. The server never recovers. WebSocket has no built-in reconnection — which means you are one naive retry loop away from killing your own service during a maintenance window.

What happens on disconnect

When a WebSocket connection drops — network glitch, server restart, proxy timeout — the client receives a close event with close code 1006 (abnormal closure — no close frame was received). The application must decide: reconnect, and if so, when and how.

WebSocket itself provides zero reconnection logic. The application owns this entirely.

The thundering herd problem

Ask yourself: if your server restarted right now, what would 10 million clients do in the next second? If the answer is “retry simultaneously,” you have already built the attack. Here is how synchronized retries destroy a recovering server.

If every client retries at the same fixed interval after a disconnect, the retries arrive at the server in synchronized waves:

T=0: service goes down, 10 million clients disconnect.
T=1s: all 10 million clients retry simultaneously.
Server gets 10 million SYN packets in ~1 second.
TCP SYN backlog fills. New connections are dropped.
T=2s: all 10 million retry again at the same interval.
Server never gets a quiet moment to recover.

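Even exponential backoff without jitter fails: every client computes the same delays (1 s, 2 s, 4 s, and so on), so the waves get farther apart but every client that is still disconnected retries at the same instant.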
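To make the synchronization concrete, here is a minimal sketch. It assumes every client disconnects at T=0 and uses the same 1-second base delay; `retry_instants` is a hypothetical helper written for this illustration, and 1,000 clients stand in for the 10 million in the text.

```python
from collections import Counter


def retry_instants(base_s: float, attempts: int) -> list[float]:
    """Cumulative retry times for pure exponential backoff (no jitter)."""
    instants, now = [], 0.0
    for attempt in range(attempts):
        now += base_s * 2 ** attempt  # wait 1 s, 2 s, 4 s, ... after the previous try
        instants.append(now)
    return instants


# Assumption: 1,000 clients stand in for the 10 million in the text.
clients = 1_000
arrivals = Counter()
for _ in range(clients):
    # Every client computes the identical schedule, so the waves never spread.
    arrivals.update(retry_instants(base_s=1.0, attempts=4))

for instant, count in sorted(arrivals.items()):
    print(f"T={instant:>4.0f}s  {count} simultaneous connection attempts")
```

Only four distinct instants appear (1 s, 3 s, 7 s, 15 s), and each one receives an attempt from every client. Doubling the delay changes when the waves arrive, not how large they are.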

Jittered exponential backoff

The fix is to add randomness (jitter) to each client’s retry timing so the 10 million retries spread over a window of seconds instead of arriving simultaneously.

Algorithm:

attempt = 0
base_ms = random(100, 500)      // different per client
max_ms  = 60_000

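while not connected:
  delay = min(base_ms * 2^attempt, max_ms)   // ^ is exponentiation: doubles each attempt
  delay = delay * random(0.5, 1.5)   // ±50% jitter, applied after the cap so capped clients stay spread out
  sleep(delay)                       // delay is in milliseconds
  attempt += 1
  connected = try_connect()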
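The algorithm above can be sketched as a runnable Python generator. This is one possible translation, not a reference implementation: the name `backoff_delays`, the injectable `rng` parameter, and the limit on the exponent are assumptions made for the example.

```python
import itertools
import random
from typing import Iterator, Optional


def backoff_delays(
    max_ms: float = 60_000,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield jittered exponential backoff delays in milliseconds, forever."""
    rng = rng or random.Random()
    base_ms = rng.uniform(100, 500)  # picked once, so it differs per client
    attempt = 0
    while True:
        capped = min(base_ms * 2 ** attempt, max_ms)
        # Jitter is applied after the cap, as in the algorithm above, so
        # clients sitting at the cap still spread out. The result can
        # therefore reach 1.5 * max_ms.
        yield capped * rng.uniform(0.5, 1.5)
        # Assumption: stop growing the exponent once the cap is certainly
        # reached, so very long outages cannot overflow the float multiply.
        attempt = min(attempt + 1, 32)


# Show the first five delays one client would use.
for attempt, delay_ms in enumerate(itertools.islice(backoff_delays(), 5)):
    print(f"attempt {attempt}: wait {delay_ms:.0f} ms")
```

A reconnect loop would sleep for `delay_ms / 1000` seconds before each connection attempt and create a fresh generator after a successful connect, so the next outage starts again from the small base delay.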

Example spread for a 100k-client server restart:

Client A picks base=150 ms, jitter → retries at 186 ms
Client B picks base=340 ms, jitter → retries at 412 ms
Client C picks base=210 ms, jitter → retries at 288 ms

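The first retries land anywhere from 50 ms to 750 ms after the disconnect rather than on a single instant, and clients that fail keep doubling their delay, so by the fourth attempt the 100k retries are spread over 10+ seconds. The server processes them in small batches and recovers within 1–2 retry windows instead of being crushed.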

Same disconnect, same client count: jitter stretches the retry window roughly 10× so the server gets small absorbable batches instead of one SYN spike.
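A small simulation can make the comparison tangible. The sketch below looks only at each client's first retry and counts how many attempts fall into the busiest 50 ms bucket. The bucket size, the seed, and the helper names are arbitrary choices for this illustration, and the simulation ignores network latency and server behavior.

```python
import random
from collections import Counter

CLIENTS = 100_000       # matches the 100k-client example above
BUCKET_MS = 50          # histogram resolution (an arbitrary choice)
rng = random.Random(7)  # fixed seed so the run is repeatable


def first_retry_ms(jitter: bool) -> float:
    """Time of a client's first retry after a disconnect at T=0."""
    if not jitter:
        return 1_000.0  # naive client: fixed 1 s delay
    base_ms = rng.uniform(100, 500)         # per-client base
    return base_ms * rng.uniform(0.5, 1.5)  # ±50% jitter


def peak_bucket(jitter: bool) -> int:
    """Largest number of retries landing in any single 50 ms bucket."""
    buckets = Counter(
        int(first_retry_ms(jitter) // BUCKET_MS) for _ in range(CLIENTS)
    )
    return max(buckets.values())


print("peak without jitter:", peak_bucket(jitter=False))
print("peak with jitter:   ", peak_bucket(jitter=True))
```

Without jitter the busiest bucket contains every client. With jitter the busiest bucket holds only a small fraction of them, and later attempts spread further because each client's delay keeps doubling from a different base.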

Reconnection strategy parameters

Recommended base delay range: 100–500 ms (random per client)
Jitter factor: ±50% of computed delay
Max delay cap: 30–60 seconds
Reconnect spread for 10M clients with jitter: 10+ seconds
Reconnect spread without jitter (fixed 1s interval): < 1 second (thundering herd)
Libraries shipping jittered backoff by default: grpc-go (connection backoff), Socket.IO client (randomizationFactor)

Message resumption and at-least-once delivery

Reconnection re-establishes the connection. It does not restore lost messages. Any messages the server sent while the client was disconnected — or that the client sent but the server had not yet ACK’d — are gone unless the application tracks them.

The pattern:

Every message gets a client-generated ID. The sender keeps the message in a retry queue until it receives an ACK for that ID.
The server ACKs each message ID. The sender removes ACK’d messages from its retry queue.
On reconnect, the client resends unACK’d messages. The server checks its Redis stream or database: if the message ID already exists, it sends ACK to the client but does not republish (prevents duplicates for other subscribers).

This implements at-least-once delivery with idempotent deduplication — no message is lost, no message is delivered more than once to other clients.
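The pattern above can be sketched in a few lines. This is an in-memory illustration only: the `seen` set stands in for the Redis stream or database, the `ack_lost` flag simulates a connection that drops before the ACK arrives, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Deduplicating receiver. `seen` stands in for the Redis stream or table."""
    seen: set[str] = field(default_factory=set)
    published: list[str] = field(default_factory=list)

    def receive(self, msg_id: str, payload: str) -> str:
        if msg_id not in self.seen:   # first time: record it, then fan out
            self.seen.add(msg_id)
            self.published.append(payload)
        return msg_id                 # ACK either way so the sender stops retrying


@dataclass
class Sender:
    """Keeps every message until the server ACKs its ID."""
    pending: dict[str, str] = field(default_factory=dict)

    def send(self, server: Server, msg_id: str, payload: str,
             ack_lost: bool = False) -> None:
        self.pending[msg_id] = payload
        ack = server.receive(msg_id, payload)
        if not ack_lost:              # ack_lost simulates a drop before the ACK arrives
            self.pending.pop(ack, None)

    def resend_pending(self, server: Server) -> None:
        """Called after reconnect: replay everything still unACK'd."""
        for msg_id, payload in list(self.pending.items()):
            self.pending.pop(server.receive(msg_id, payload), None)


server, sender = Server(), Sender()
sender.send(server, "c1-1", "hello")
sender.send(server, "c1-2", "world", ack_lost=True)  # delivered, but the ACK never arrives
sender.resend_pending(server)                        # reconnect: "c1-2" is sent again
print(server.published, sender.pending)
```

The second message reaches the server twice but is published once, and the sender's queue is empty after the resend. A real deployment would need the dedup check and the publish to be atomic against the durable store, which this sketch does not attempt.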

For the server to deduplicate, it needs durable state that survives restarts. The typical store:

Redis stream (fast, ordered, with configurable retention; durable across restarts only when AOF or RDB persistence is enabled).
PostgreSQL (durable, queryable, suited for message history with audit requirements).

The client reconnect flow:

1. Reconnect with backoff.
2. On connection established, send { type: "resume", lastSeenId: "msg-4291" }.
3. Server reads the stream from msg-4292 onward and delivers missed messages.
4. Server ACKs the delivery.
5. Client confirms receipt and updates its lastSeenId.
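The server side of steps 2 and 3 might look like the sketch below. An in-memory list stands in for the Redis stream, and the reply shapes (`"replay"`, `"resync_required"`) are assumptions for this example rather than part of any protocol. Only the `resume` request format comes from the flow above.

```python
from typing import Optional

# Assumption: an in-memory, ID-ordered log stands in for the Redis stream.
LOG: list[tuple[str, str]] = [
    (f"msg-{n}", f"payload {n}") for n in range(4290, 4301)
]


def replay_after(
    log: list[tuple[str, str]], last_seen_id: str
) -> Optional[list[tuple[str, str]]]:
    """Return the messages that follow last_seen_id, or None if it is unknown.

    None means the ID fell outside retention (or never existed), so the
    client needs a full resync rather than an incremental replay.
    """
    ids = [msg_id for msg_id, _ in log]
    if last_seen_id not in ids:
        return None
    return log[ids.index(last_seen_id) + 1:]


def handle_resume(request: dict) -> dict:
    """Build the server's reply to a {"type": "resume", ...} request."""
    missed = replay_after(LOG, request["lastSeenId"])
    if missed is None:
        return {"type": "resync_required"}  # hypothetical reply type
    return {
        "type": "replay",                   # hypothetical reply type
        "messages": [{"id": i, "body": b} for i, b in missed],
    }


reply = handle_resume({"type": "resume", "lastSeenId": "msg-4291"})
print([m["id"] for m in reply["messages"]])  # msg-4292 through msg-4300
```

The explicit "unknown ID" branch matters in practice: a client that was offline longer than the retention window cannot be caught up incrementally, and silently replaying from the oldest retained message would hide the gap.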

Why this works

Why not session-layer resumption instead? Some protocols (QUIC, MPTCP) handle connection migration at the transport layer — the application is unaware of reconnects. WebSocket sits on TCP which has no migration. Rolling your own session layer (token + Redis resume state) adds 50–100 ms of latency on reconnect but works across any TCP/TLS WebSocket deployment. The choice is: simplicity + universality of the custom session layer vs. the migration-at-the-transport approach of WebTransport/QUIC (narrower browser and server support as of 2026).

Trace it

Trace a cascade failure on a WebSocket server after a brief network partition, with and without jitter.

1. A 1 Gbps link hiccups for 5 seconds. All 100,000 WebSocket clients get close code 1006. What does the server see?
2. Without jitter: all 100k clients retry at 1 second fixed interval. What hits the server?
3. The server is still starting up when the second synchronized wave (at T=2s) arrives. What happens?
4. With jitter: clients pick base=100–500 ms, add ±50% jitter, double each attempt. Describe the first 10 seconds of reconnect traffic.
5. On reconnect, a client sends resume token with lastSeenId=msg-4291. The server's Redis stream has msg-4292 through msg-4300. What does the server do?

Quiz

You are designing a load balancer in front of a WebSocket cluster. A client sends a message to server A, then the network partitions and the client reconnects to server B. Why is sticky session routing (client always goes to the same server) safer than per-request load balancing, even though servers have pub/sub?

Quiz

A mobile WebSocket client loses connectivity, reconnects 45 seconds later, and sends a queued message with a message ID it had not yet received an ACK for. The server has a Redis stream retaining 24 hours of messages. What is the correct server action?

Jitter spreads millions of retries across seconds to avoid a thundering herd; the lastSeenId resume replays only missed messages, giving at-least-once delivery without duplicates.

Recall before you leave

1. Explain why exponential backoff without jitter still produces a thundering herd problem.
2. What durable store does message resumption require on the server, and why must it survive server restarts?
3. Describe the full reconnect flow with message resumption: what does the client send, what does the server return, and what dedup check prevents duplicate delivery?

Recap

WebSocket has no built-in reconnection mechanism: close code 1006 fires and the application must decide what to do. Without jitter, a mass disconnect causes a thundering herd: all clients retry at synchronized intervals, producing SYN waves that prevent the server from recovering. Jittered exponential backoff (base 100–500 ms per client, ±50% random jitter, doubling each attempt, capped at 30–60 s) spreads 10 million retries over seconds, and it is a common approach in client libraries that handle reconnection for you. Message resumption pairs a Redis stream (or database) with client-generated message IDs: the sender keeps unACK’d messages in a local queue, resends on reconnect, and the server deduplicates by checking the stream before publishing, achieving at-least-once delivery with no duplicates. Sticky session routing helps preserve in-flight state on the server the client was previously connected to, though full durability requires externalized state in Redis or a database. When you plan a deployment window for a service with millions of WebSocket clients, jittered backoff is not a nice-to-have; it is what stands between your maintenance window and a self-inflicted outage.
