WebSocket in production: proxies, security, and distributed architecture

How to configure Nginx and ALB for long-lived upgrades, harden against DoS and cache-poisoning, and scale WebSocket clusters with pub/sub and sticky sessions.

NET Senior ◷ 14 min

You deploy your WebSocket server. Everything works in testing. In production, connections randomly drop after 60 seconds of silence, browsers report 1006s during idle periods, and your load balancer occasionally returns 502 on the upgrade. None of these are bugs in your application. They are proxy misconfigurations that are invisible until you know where to look.

Proxy and load-balancer misconfigurations

Proxies like Nginx, HAProxy, and AWS ALB were designed for HTTP — short-lived request-response conversations measured in milliseconds. A persistent WebSocket connection is alien to them. Common misconfigurations:

Problem 1 — Idle timeout closes quiet connections.

Nginx default: proxy_read_timeout 60s (closes if no data in 60 seconds).
AWS ALB default: idle_timeout.timeout_seconds = 60.
Fix: raise to at least 3,600 seconds (1 hour).

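location /ws {
  proxy_pass http://backend;
  # The Upgrade mechanism needs HTTP/1.1 toward the backend.
  proxy_http_version 1.1;
  # Upgrade and Connection are hop-by-hop headers, so forward them explicitly.
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "Upgrade";
  # Raise the 60 s idle defaults so quiet connections survive.
  proxy_read_timeout 3600s;
  proxy_send_timeout 3600s;
  # Do not buffer backend responses on this location.
  proxy_buffering off;
}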

The 60-second default idle timeout is the cliff: raise read/send to 3600s and ping every 25–30s so traffic always arrives under the limit.
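
The ping side of this advice can be sketched as a small asyncio loop. This is a minimal illustration, not tied to any particular WebSocket library: `send_ping` and `wait_for_pong` are hypothetical callables that your server framework would have to supply, and the 10-second pong timeout is an assumed value.

```python
import asyncio
from typing import Awaitable, Callable


async def heartbeat(
    send_ping: Callable[[], Awaitable[None]],
    wait_for_pong: Callable[[float], Awaitable[bool]],
    interval_s: float = 25.0,       # stays under the 60 s proxy idle timeout
    pong_timeout_s: float = 10.0,   # assumed value, tune for your clients
) -> str:
    """Ping on a fixed interval; return a reason once the peer stops answering."""
    while True:
        await asyncio.sleep(interval_s)
        await send_ping()  # any frame on the wire resets the proxy idle timer
        if not await wait_for_pong(pong_timeout_s):
            return "pong timeout"  # caller should close the connection
```

Run one such task per connection and close the socket when it returns. Keeping the interval well under the smallest idle timeout in the proxy chain is the point of the 25–30 s figure.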

Problem 2 — L7 buffering can delay the 101 response. Nginx buffers proxied HTTP responses by default. A buffering layer that waits for a response to be “complete” before forwarding it can hold back the 101 Switching Protocols response, because a WebSocket never completes its response. This can delay the upgrade by hundreds of milliseconds. Fix: proxy_buffering off on the WebSocket location.

Problem 3 — Proxy doesn’t understand the Upgrade header. Some proxies strip or ignore Connection: Upgrade headers. The backend sees the request without Upgrade headers and returns a regular HTTP response instead of 101. Fix: explicitly set both headers as shown above.
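
To see what the backend checks before it answers 101, here is a minimal sketch of the server side of the handshake. The GUID and the SHA-1/Base64 computation come from RFC 6455; the function names are hypothetical, and a real server would validate more (for example Sec-WebSocket-Version).

```python
import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # constant fixed by RFC 6455


def is_upgrade_request(headers: dict[str, str]) -> bool:
    """True only if the upgrade headers survived the proxy hop."""
    h = {k.lower(): v for k, v in headers.items()}
    connection_tokens = [t.strip().lower() for t in h.get("connection", "").split(",")]
    return (
        h.get("upgrade", "").lower() == "websocket"
        and "upgrade" in connection_tokens
        and "sec-websocket-key" in h
    )


def accept_key(sec_websocket_key: str) -> str:
    """Value the server returns in Sec-WebSocket-Accept with the 101 response."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key from RFC 6455, section 1.3
print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

If a proxy strips Upgrade or Connection, `is_upgrade_request` returns False and the backend falls through to a normal HTTP response, which is the symptom described above.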

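Problem 4 — HTTP/2 proxy doesn’t forward WebSocket upgrades. HTTP/2 has no Upgrade header mechanism; WebSocket over HTTP/2 is bootstrapped with extended CONNECT (RFC 8441) instead. If the client-facing side of the proxy speaks HTTP/2 but the backend speaks HTTP/1.1, the proxy may not translate between the two, and the handshake fails. Fix: test WebSocket through the proxy stack explicitly during deployment, not just HTTP traffic.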

Proxy configuration checklist

Nginx: raise read/send timeout: proxy_read_timeout 3600s; proxy_send_timeout 3600s
AWS ALB: raise idle timeout: idle_timeout.timeout_seconds = 3600
Nginx: disable response buffering: proxy_buffering off
Server ping interval to defeat proxy idle timers: every 25–30 s
ALB sticky session (for stateful WS servers): Enable target group stickiness
Origin header check (server-side): Mandatory — block unauthorized origins
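
The two ALB items in the checklist map to load balancer and target group attributes. The sketch below only builds the attribute payloads; the function names are hypothetical, the 3600-second cookie duration is an illustrative value, and the boto3 calls in the comments are one assumed way to apply them.

```python
def load_balancer_attributes(idle_timeout_s: int = 3600) -> list[dict[str, str]]:
    """Payload for the ALB idle timeout (ALB accepts 1 to 4000 seconds)."""
    if not 1 <= idle_timeout_s <= 4000:
        raise ValueError("ALB idle timeout must be between 1 and 4000 seconds")
    return [{"Key": "idle_timeout.timeout_seconds", "Value": str(idle_timeout_s)}]


def target_group_attributes(cookie_duration_s: int = 3600) -> list[dict[str, str]]:
    """Payload that enables load-balancer-cookie stickiness on a target group."""
    return [
        {"Key": "stickiness.enabled", "Value": "true"},
        {"Key": "stickiness.type", "Value": "lb_cookie"},
        {"Key": "stickiness.lb_cookie.duration_seconds", "Value": str(cookie_duration_s)},
    ]


# Assumed usage with boto3 (not executed here):
#   elbv2 = boto3.client("elbv2")
#   elbv2.modify_load_balancer_attributes(
#       LoadBalancerArn=lb_arn, Attributes=load_balancer_attributes())
#   elbv2.modify_target_group_attributes(
#       TargetGroupArn=tg_arn, Attributes=target_group_attributes())
```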

Security: Origin checks, DoS, and slow-reader attacks

Origin check. The browser always sends an Origin header on WebSocket upgrades. The server must validate it:

if request.headers.get("Origin") not in ALLOWED_ORIGINS:
    # Missing or unlisted Origin: refuse the upgrade before sending 101.
    respond 403 Forbidden
    return

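This check is the primary defense against cross-site WebSocket hijacking (CSWSH), where a malicious page on attacker.com opens a WebSocket to api.yoursite.com and the browser attaches the victim’s cookies to the handshake. WebSocket clients outside the browser do not send an Origin header by default, so if your server requires one, non-browser clients must add it. Because such clients can set any Origin value, the check protects browser users only and does not replace authentication.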

Rate limiting at the handshake. Botnets can exhaust the TCP SYN backlog with connection attempts. Rate-limit at the load balancer: maximum connection attempts per IP per second (typically 5–20). Apply TCP SYN cookies at the OS level (net.ipv4.tcp_syncookies = 1).
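
The text places this limit at the load balancer. For illustration only, the same per-IP logic can be sketched in-process as a token bucket. The class name, the default rate of 10 per second, and the injected clock are assumptions, and a production version would also evict idle buckets.

```python
import time
from typing import Callable


class HandshakeRateLimiter:
    """Per-IP token bucket: `rate_per_s` sustained attempts, `burst` at once."""

    def __init__(self, rate_per_s: float = 10.0, burst: int = 10,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock
        self._buckets: dict[str, tuple[float, float]] = {}  # ip -> (tokens, last_seen)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self._buckets[ip] = (tokens, now)
            return False  # reject the handshake, e.g. with HTTP 429
        self._buckets[ip] = (tokens - 1.0, now)
        return True
```

Call `allow(client_ip)` before doing any handshake work, so a rejected attempt costs as little as possible.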

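Slow-reader attack. A malicious client completes the WebSocket handshake but never reads from its socket, so the server’s send queue for that connection keeps growing. (Slowloris is the related attack in which the client sends its request as slowly as possible.) Mitigation: close any connection with no data activity for 30 seconds (no message sent or received, no pong reply to a ping), and cap the per-connection send queue so one stalled client cannot consume unbounded memory.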
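
A minimal sketch of the per-connection bookkeeping behind this mitigation. The 30-second inactivity limit is from the text; the queue cap of 100 messages and all names here are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OutboundState:
    """Tracks one connection's send queue and last activity time."""
    max_queued: int = 100        # assumed cap on queued messages
    idle_limit_s: float = 30.0   # inactivity limit from the text
    last_activity: float = 0.0
    queue: deque = field(default_factory=deque)

    def enqueue(self, payload: bytes) -> bool:
        """Queue a message; False means the client is not draining its socket."""
        if len(self.queue) >= self.max_queued:
            return False
        self.queue.append(payload)
        return True

    def mark_activity(self, now: float) -> None:
        """Call on every message received, message flushed, or pong received."""
        self.last_activity = now

    def should_close(self, now: float) -> bool:
        """True once the connection has been silent longer than the limit."""
        return now - self.last_activity > self.idle_limit_s
```

The server closes the connection when `enqueue` returns False or `should_close` returns True, whichever happens first.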

Per-message size limits. A client sending a 100 MB message forces the server to buffer 100 MB per subscriber. Enforce a maximum message size (typically 64 KB–1 MB) and close the connection with code 1009 (“message too big”) if exceeded.
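
The limit is cheapest to enforce from the frame header, before any payload is buffered. The sketch below reads the declared payload length as laid out in RFC 6455; the 1 MiB limit and the function names are assumptions, and the caller must already have read enough header bytes (2, 4, or 10).

```python
import struct
from typing import Optional

MAX_MESSAGE_BYTES = 1 << 20   # assumed limit: 1 MiB
CLOSE_MESSAGE_TOO_BIG = 1009  # close code named in the text


def declared_payload_length(header: bytes) -> int:
    """Payload length announced by a WebSocket frame header."""
    short = header[1] & 0x7F  # low 7 bits; the high bit is the mask flag
    if short < 126:
        return short
    if short == 126:
        return struct.unpack("!H", header[2:4])[0]   # 16-bit extended length
    return struct.unpack("!Q", header[2:10])[0]      # 64-bit extended length


def close_code_for(header: bytes, already_buffered: int = 0,
                   limit: int = MAX_MESSAGE_BYTES) -> Optional[int]:
    """Return 1009 if this frame would push the message past the limit."""
    total = already_buffered + declared_payload_length(header)
    return CLOSE_MESSAGE_TOO_BIG if total > limit else None
```

`already_buffered` covers fragmented messages: pass the bytes accumulated from earlier frames of the same message so that many small fragments cannot bypass the limit.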

Horizontal scaling: pub/sub and sticky sessions

A single well-tuned WebSocket server typically tops out somewhere between 500 k and 2 M connections, depending on memory per connection and message rate. Beyond that, you need multiple servers with a shared messaging backbone.

Pub/sub (Redis Streams or RabbitMQ). Each server subscribes to the relevant channel(s). When the application publishes a message, all subscribed servers broadcast it to their local connected clients:

User A (connected to Server 1) sends a chat message.
Server 1 publishes { roomId, message, msgId } to Redis stream "room:42".
Server 2 (also subscribed to "room:42") reads from the stream.
Server 2 broadcasts to its locally connected users in room 42.

This decouples senders from receivers and allows horizontal scale without requiring all clients to be on the same server.
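
The four steps above can be simulated in a few lines. `InMemoryBus` is a stand-in for Redis Streams or RabbitMQ, and every class and method name here is hypothetical; the sketch only shows the fan-out shape, not a real broker client.

```python
from collections import defaultdict
from typing import Callable


class InMemoryBus:
    """Stand-in for the shared broker: delivers each event to every subscriber."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[dict], None]) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, event: dict) -> None:
        for handler in list(self._subs[channel]):
            handler(event)


class WsServer:
    """One WebSocket server holding only its locally connected users."""

    def __init__(self, name: str, bus: InMemoryBus) -> None:
        self.name = name
        self.bus = bus
        self.rooms: dict[int, dict[str, list]] = defaultdict(dict)  # room -> user -> inbox

    def join(self, user: str, room_id: int) -> list:
        if room_id not in self.rooms:  # first local member: subscribe once
            self.bus.subscribe(f"room:{room_id}", self._deliver)
        inbox: list = []
        self.rooms[room_id][user] = inbox
        return inbox

    def send(self, room_id: int, message: str, msg_id: str) -> None:
        event = {"roomId": room_id, "message": message, "msgId": msg_id}
        self.bus.publish(f"room:{room_id}", event)

    def _deliver(self, event: dict) -> None:
        # Broadcast to local clients only; other servers do the same for theirs.
        for inbox in self.rooms[event["roomId"]].values():
            inbox.append(event)
```

In this sketch the sending server is also a subscriber, so User A receives the message through the same path as everyone else.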

Sticky sessions. When a load balancer routes the same client to the same server across reconnects, in-flight state (unACK’d messages, partial subscriptions) is preserved without requiring full Redis replication. AWS ALB implements this as target group stickiness, using a load-balancer cookie whose default lifetime is 1 day. The downside: when a server fails, all of its sticky clients reconnect to other servers at the same time, a mini thundering herd per failed instance.

Why this works

Why the hardest part is not WebSocket but state management. The WebSocket protocol is straightforward. The hard engineering problem is: what happens when a user’s connection migrates to a different server during a horizontal scale event, a deploy, or a server failure? All the in-flight state (subscriptions, partial uploads, game sync position) must either live in a shared external store (Redis, database) or be re-established from scratch via the reconnect + message-resumption protocol. Chat-scale services such as Discord and Slack are likely to spend more engineering time on state replication and consistency under failures than on the WebSocket plumbing.
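
One way to picture the reconnect + message-resumption protocol is a shared log with sequence numbers plus a client-side cursor. This is a sketch under stated assumptions: `RoomLog` stands in for the shared external store, delivery is in order, and all names are hypothetical.

```python
class RoomLog:
    """Shared, append-only message log; sequence numbers start at 1."""

    def __init__(self) -> None:
        self._entries: list[str] = []

    def append(self, message: str) -> int:
        self._entries.append(message)
        return len(self._entries)

    def replay_after(self, last_seq: int) -> list[tuple[int, str]]:
        """Everything the client has not acknowledged yet."""
        return [(i + 1, m) for i, m in enumerate(self._entries) if i + 1 > last_seq]


class ClientCursor:
    """Client-side state that survives a reconnect to a different server."""

    def __init__(self) -> None:
        self.last_seq = 0
        self.delivered: list[str] = []

    def receive(self, seq: int, message: str) -> bool:
        if seq <= self.last_seq:
            return False  # duplicate from a replay, drop it
        self.delivered.append(message)
        self.last_seq = seq
        return True
```

On reconnect the client sends `last_seq`, the new server calls `replay_after(last_seq)`, and the cursor discards anything it already saw. Replaying gives at-least-once delivery and the cursor removes the duplicates.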

Observability: metrics you must export

When a WebSocket service degrades, the classic symptoms — slow responses, random drops, 502s — look identical whether the cause is a backpressure cascade, a proxy misconfiguration, or a network event. The metrics below are the diagnostic layer that tells you which of the three you are actually fighting.

A WebSocket service that does not export these metrics can run out of memory with no warning:

Metric	Target or alert condition
Active connection count	Alert if grows without bound (connection leak)
Per-connection send-queue depth (p95/p99)	Target < 5 messages
Total queued bytes	Target < 5% of heap
Slow-client count	Target < 0.1% of connections
Message latency p99	Target < 100 ms (end-to-end)
Close-code distribution (1000/1006/1013)	Spike in 1006 = network event
Reconnection rate per minute	Spike = server restart or outage
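
A few of these numbers can be derived from per-connection samples as follows. This is a sketch, not a Prometheus exporter: the nearest-rank percentile, the function names, and the definition of a slow client as one with a queue depth of 5 or more are assumptions.

```python
import math
from collections import Counter


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty sample."""
    if not values:
        raise ValueError("percentile of an empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(queue_depths: list[int], close_codes: list[int],
              slow_depth: int = 5) -> dict:
    """Snapshot of the send-queue and close-code metrics from the table."""
    slow = sum(1 for depth in queue_depths if depth >= slow_depth)
    return {
        "queue_depth_p95": percentile(queue_depths, 95),
        "queue_depth_p99": percentile(queue_depths, 99),
        "slow_client_ratio": slow / len(queue_depths),
        "close_codes": dict(Counter(close_codes)),
    }
```

Percentiles matter here because a mean hides the handful of stalled clients that hold most of the queued bytes.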

Tools: netstat -an / ss -s for socket state counts, tcpdump for packet-level traces, Prometheus for application metrics, eBPF programs for socket buffer sizes and retransmit rates.

Quiz

A financial trading platform needs to push 1000 price updates per second to 50k browsers in different geographies (US, Europe, Asia; RTT 10–300 ms). Which architecture is correct?

Design challenge

Design a chat application for 1 million concurrent users across US, EU, and APAC. Requirements: message delivery guarantee (at-least-once, no duplicates), reconnection with message history sync, p99 latency < 200 ms for cross-region messages, graceful degradation if one region goes offline. Stack: Redis Streams, PostgreSQL, CDN with edge compute.

Latency p99 < 200 ms even for cross-region messages.
No message loss (at-least-once delivery).
No duplicates even when clients reconnect.
Support 1M concurrent connections.
Graceful degradation if a region goes offline.

Pub/sub decouples senders from receivers: a message from any server reaches clients on any other server, while sticky sessions keep each client's in-flight state on one server across reconnects.

Recall before you leave

01. Name three common Nginx misconfigurations that break long-lived WebSocket connections, and the fix for each.
02. What is cross-site WebSocket hijacking (CSWSH) and how does the Origin header check defend against it?
03. Why is pub/sub (e.g., Redis Streams) necessary for horizontal WebSocket scaling, and what is the role of sticky sessions alongside it?

Recap

Production WebSocket deployments fail most often not from the protocol but from proxy misconfigurations: idle timeouts (default 60 s) closing quiet connections, response buffering delaying the 101 response, and proxies stripping the Upgrade headers. Fix by raising timeouts to 3600 s, disabling buffering, and setting explicit Upgrade headers, and send a server-side ping every 25–30 seconds to reset proxy idle timers. Security requires: Origin header validation (defense against CSWSH), handshake-level rate limiting (botnet defense), a per-connection inactivity timeout (slow-reader defense), and per-message size limits. Horizontal scale beyond a single server requires a pub/sub backbone (Redis Streams, RabbitMQ) so messages from any server reach clients on any other server, plus sticky sessions to preserve per-connection in-flight state. The hardest engineering is state replication and consistency under server failures, not the WebSocket plumbing. Key observability targets: slow-client count below 0.1%, total queued bytes below 5% of heap, p99 message latency below 100 ms. When your on-call alert fires at 2 AM for a WebSocket service, you now know which three dashboards to open first, and what to do when you see 1006s spiking versus slow-client count climbing versus heap approaching capacity.
