Networking & Protocols
WebSocket: survive the broadcast storm
Reading about backpressure and thundering herds is not the same as pulling a real-time service out of one. Build a broadcast server, drive it into both failure modes on purpose, and apply the unit’s fixes until the numbers return to baseline, with evidence at every step.
Turn the unit’s mental model into a reproducible engineering loop: stand up a WebSocket broadcast server, reproduce the backpressure OOM and the reconnect storm, defend against each with a high-water mark and jittered backoff, then fan out across two servers with pub/sub — measuring before and after.
Build a WebSocket broadcast service (chat or live-feed), deliberately drive it into a backpressure OOM and a reconnect thundering herd, then harden it with a per-connection high-water mark, jittered exponential backoff, and a Redis pub/sub fan-out — proving each fix with before/after measurements under identical load.
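As a starting point for the per-connection high-water mark, here is a minimal sketch that models only the send queue, with no real sockets. The names `Connection`, `broadcast`, and `HIGH_WATER_MARK`, and the 64 KiB budget, are assumptions made for illustration. Evicting with close code 1013 follows the brief's own cheat sheet rather than any library default.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

HIGH_WATER_MARK = 64 * 1024    # bytes per connection; hypothetical budget
CLOSE_TRY_AGAIN_LATER = 1013   # close code this brief associates with eviction


@dataclass
class Connection:
    """Stand-in for one WebSocket client; only the send queue is modeled."""
    name: str
    queue: deque = field(default_factory=deque)
    queued_bytes: int = 0
    close_code: Optional[int] = None

    def enqueue(self, frame: bytes) -> bool:
        """Queue a frame, or evict the client if it would exceed the mark."""
        if self.close_code is not None:
            return False  # already evicted; drop silently
        if self.queued_bytes + len(frame) > HIGH_WATER_MARK:
            # Free the backlog immediately so memory stays bounded.
            self.queue.clear()
            self.queued_bytes = 0
            self.close_code = CLOSE_TRY_AGAIN_LATER
            return False
        self.queue.append(frame)
        self.queued_bytes += len(frame)
        return True

    def drain(self, max_bytes: int) -> int:
        """Simulate the socket accepting up to max_bytes; return bytes sent."""
        sent = 0
        while self.queue and sent + len(self.queue[0]) <= max_bytes:
            sent += len(self.queue.popleft())
        self.queued_bytes -= sent
        return sent


def broadcast(connections, frame: bytes) -> int:
    """Fan one frame out; return total bytes still queued across all clients."""
    for conn in connections:
        conn.enqueue(frame)
    return sum(c.queued_bytes for c in connections)


if __name__ == "__main__":
    fast, slow = Connection("fast"), Connection("slow")
    for _ in range(100):
        broadcast([fast, slow], b"x" * 1024)
        fast.drain(max_bytes=4096)  # the fast client keeps up; slow never drains
    print(fast.queued_bytes, slow.queued_bytes, slow.close_code)
```

With these numbers the slow client is evicted on the 65th frame, so its queue never grows past 64 KiB. Without the check in `enqueue`, the same loop would queue bytes for that client without bound, which is the OOM the brief asks you to reproduce first.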
- A before/after table for backpressure: total queued bytes, process RSS, slow-client count, and p99 broadcast latency under identical load — measured, not estimated — showing RSS flat after the high-water mark instead of climbing to OOM.
- A before/after comparison for reconnection: SYN/accept rate over time and time-to-full-recovery, showing the jittered version spreading retries across a window and recovering, versus the synchronized version stalling.
- A demonstration that the Origin check rejects an unauthorized origin (403) and that an oversized message is rejected with close code 1009.
- A short write-up naming, for each fix, which lever from the unit you used (high-water mark, jitter, pub/sub fan-out, sticky sessions) and why it was the highest-leverage choice.
- Add message IDs plus a Redis stream so a reconnecting client sends its last-seen ID and the server replays only the messages it missed. Implement at-least-once delivery, deduplicated by message ID so the client applies each message exactly once, and prove it by killing a connection mid-stream.
- Add an on-call runbook: triage from the five panels, the close-code cheat sheet (a spike in 1006, abnormal closure, = network event; 1013, Try Again Later, = eviction), and the fix-priority ladder.
- Compare transports: implement the same feed over SSE and measure per-client overhead and reconnection behaviour against the WebSocket version, documenting when SSE would have been the better default.
- Move the two servers behind a proxy that supports HTTP/2 extended CONNECT (RFC 8441) and measure the kernel-buffer and handshake-latency savings versus the HTTP/1.1 Upgrade handshake.
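The replay item in the list above (last-seen ID plus a stream) can be prototyped before Redis is involved. The sketch below uses an in-memory `MessageLog` as a stand-in for the stream. Both class names are hypothetical, and the deduplication rule assumes IDs increase monotonically and arrive in order on a single connection.

```python
from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """In-memory stand-in for the Redis stream: append-only, ordered by ID."""
    entries: list = field(default_factory=list)  # (msg_id, payload) pairs
    next_id: int = 1

    def append(self, payload: str) -> int:
        msg_id = self.next_id
        self.entries.append((msg_id, payload))
        self.next_id += 1
        return msg_id

    def replay_after(self, last_seen_id: int) -> list:
        """Return only messages with an ID greater than last_seen_id."""
        return [(i, p) for i, p in self.entries if i > last_seen_id]


@dataclass
class Client:
    last_seen_id: int = 0
    received: list = field(default_factory=list)

    def deliver(self, msg_id: int, payload: str) -> bool:
        """Apply a message once; ignore redeliveries of IDs already seen."""
        if msg_id <= self.last_seen_id:
            return False  # duplicate caused by an at-least-once redelivery
        self.received.append(payload)
        self.last_seen_id = msg_id
        return True


if __name__ == "__main__":
    log, client = MessageLog(), Client()
    for payload in ["a", "b", "c"]:
        log.append(payload)
    for msg_id, payload in log.entries[:2]:
        client.deliver(msg_id, payload)   # connection dies after message 2
    log.append("d")                       # published while the client is away
    # Replay with a one-message overlap to imitate a redelivery on reconnect.
    for msg_id, payload in log.replay_after(client.last_seen_id - 1):
        client.deliver(msg_id, payload)
    print(client.received)
```

The deliberate overlap in the replay is what the kill-the-connection test should exercise: the server may resend a message the client already applied, and the ID comparison in `deliver` is what keeps it from being applied twice.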
This is the loop you will run in every real-time incident: stand up the service, reproduce the failure on purpose, fix at the right lever (high-water mark for backpressure, jitter for the reconnect herd, pub/sub for cross-server fan-out, sticky sessions for state locality), and verify with before/after numbers under identical load. Doing it once on a broadcast toy makes the production version — Discord-scale chat, a trading feed, a collaborative editor — muscle memory.
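For the jitter lever in that loop, a small simulation helps build intuition before you measure real SYN rates. This is a sketch under assumptions: it uses the "full jitter" variant (a uniform delay between zero and the exponential ceiling), and `BASE_S`, `CAP_S`, and both function names are made up for illustration.

```python
import random
from collections import Counter

BASE_S = 1.0   # hypothetical delay ceiling for the first retry, in seconds
CAP_S = 30.0   # hypothetical upper bound on any retry delay, in seconds


def backoff_delay(attempt: int, rng: random.Random, jitter: bool = True) -> float:
    """Exponential backoff; with jitter, pick uniformly in [0, ceiling]."""
    ceiling = min(CAP_S, BASE_S * (2 ** attempt))
    return rng.uniform(0.0, ceiling) if jitter else ceiling


def peak_reconnects_per_second(clients: int, attempt: int,
                               jitter: bool, seed: int = 0) -> int:
    """Bucket every client's retry time by whole second; return the busiest bucket."""
    rng = random.Random(seed)
    buckets = Counter(int(backoff_delay(attempt, rng, jitter))
                      for _ in range(clients))
    return max(buckets.values())


if __name__ == "__main__":
    for jitter in (False, True):
        peak = peak_reconnects_per_second(10_000, attempt=3, jitter=jitter)
        print(f"jitter={jitter}: peak reconnects in one second = {peak}")
```

Without jitter, every client that disconnected at the same moment retries in the same second, so the peak equals the client count. With jitter, the same retries spread across the whole backoff window, which is the shape the SYN/accept-rate comparison should show.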