
WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10M

The three ceilings that gate WebSocket scale — RAM per idle connection, file descriptors, and NIC bandwidth — and how HTTP/2 extended CONNECT and compression tuning shift those limits.

Phoenix demonstrated 2 million concurrent WebSocket connections on one 64 GB server. MigratoryData reports 10 million on commodity hardware. Getting there requires understanding three independent ceilings: RAM per idle connection, file descriptor limits, and NIC bandwidth — and the protocol evolution that shifts those ceilings.

Three ceilings in order

When you hit a scale limit, the fix depends entirely on which ceiling you have reached — RAM, file descriptors, or NIC. Treating the wrong one wastes weeks. Here they are in the order you encounter them.

Ceiling 1 — RAM per idle connection. An idle WebSocket connection consumes kernel socket buffers (send + receive), plus the framework’s application state. At default Linux settings:

Kernel socket buffers: ~4–8 KB (socket rmem/wmem default).
Framework connection state: varies by language/framework (Node.js: ~10–50 KB per connection due to V8 object overhead; Go: ~2–5 KB; Rust: ~1–2 KB).

With permessage-deflate at default window_bits=15: +256 KB per compressor + 44 KB per decompressor = +300 KB per connection.

At 10 M connections × 10 KB base = 100 GB RAM minimum. This is why Phoenix’s 2 M on 64 GB is impressive — they strip per-connection overhead to under 32 KB per connection.
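The RAM arithmetic above can be reproduced with a short sketch. The function names are illustrative, and decimal units (1 KB = 1,000 bytes) are an assumption chosen to match the lesson's round numbers.

```python
KB = 1_000            # decimal units, matching the lesson's round figures
GB = 1_000_000_000


def ram_needed_gb(connections: int, kb_per_connection: float) -> float:
    """Total RAM in GB for a given per-connection footprint."""
    return connections * kb_per_connection * KB / GB


def per_connection_budget_kb(ram_gb: float, connections: int) -> float:
    """Largest per-connection footprint (KB) that fits in the given RAM."""
    return ram_gb * GB / connections / KB


print(ram_needed_gb(10_000_000, 10))            # 10 M connections at 10 KB each
print(per_connection_budget_kb(64, 2_000_000))  # budget for 2 M connections on 64 GB
print(ram_needed_gb(100_000, 300))              # deflate state for 100k connections
```

Running the budget in both directions (connections to RAM, and RAM to per-connection allowance) makes it easier to see whether framework state or compression state dominates.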

Ceiling 2 — File descriptors. The Linux default soft limit is 1,024 open files per process. Production servers raise it with ulimit -n or LimitNOFILE in systemd. The per-process maximum is capped by fs.nr_open, which defaults to 1,048,576 (about 1 M) on modern Linux. Holding 10 M connections on one server therefore needs 10 M file descriptors, which requires kernel tuning (fs.nr_open, fs.file-max).
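A process can also inspect and raise its own soft limit at startup. The sketch below uses Python's standard resource module (Unix only); the helper name raise_nofile_limit is hypothetical, and it can only raise the soft limit up to the hard limit that ulimit, systemd, or the kernel already allow.

```python
import resource


def raise_nofile_limit(target: int) -> tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft, hard  # already at or above the requested value
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass  # the platform refused; report whatever limit is in effect
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_nofile_limit(1_000_000)
print(f"soft={soft} hard={hard}")
```

If the printed soft limit is still far below the target, the hard limit is the constraint, and the fix lives in the service configuration or sysctl settings rather than in application code.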

Ceiling 3 — NIC bandwidth. Keepalive pings every 25 s (2.4 pings per minute) with a 125-byte payload: 10 M connections × 125 bytes ÷ 25 s = ~50 MB/s, about 400 Mbit/s of ping payload alone on a 1 Gbps NIC, before TCP/IP header overhead and any application messages are counted. NIC saturation sets the practical ceiling for idle connections around 500 k–2 M depending on ping interval and message rate.
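The ping arithmetic is sensitive to the interval, so it helps to compute it explicitly. This is a payload-only sketch with illustrative function names: it ignores TCP/IP headers, pong replies, and application traffic. Note that 4 pings per minute corresponds to a 15 s interval, while a 25 s interval is 2.4 pings per minute.

```python
def ping_bandwidth_mb_per_s(connections: int, interval_s: float, payload_bytes: int) -> float:
    """Outbound ping payload in MB/s (decimal megabytes)."""
    return connections * payload_bytes / interval_s / 1_000_000


def nic_fraction(mb_per_s: float, nic_gbps: float = 1.0) -> float:
    """Share of NIC capacity consumed, as a value between 0 and 1."""
    return mb_per_s * 8 / (nic_gbps * 1_000)


every_25s = ping_bandwidth_mb_per_s(10_000_000, 25, 125)
every_15s = ping_bandwidth_mb_per_s(10_000_000, 15, 125)
print(f"25 s interval: {every_25s:.1f} MB/s, {nic_fraction(every_25s):.0%} of 1 Gbps")
print(f"15 s interval: {every_15s:.1f} MB/s, {nic_fraction(every_15s):.0%} of 1 Gbps")
```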

WebSocket at scale — key numbers

RAM per idle connection (framework-dependent): 2–50 KB
permessage-deflate overhead per connection (default): ~300 KB
Phoenix: concurrent connections on 64 GB, single server: 2 million
MigratoryData: concurrent on commodity hardware: 10 million
Linux fd default per process: 1,024 (raise to 1 M in production)
Practical single-server ceiling (RAM or NIC): 500 k–2 M idle connections

HTTP/2 extended CONNECT (RFC 8441)

Opening 100 WebSocket connections via HTTP/1.1 Upgrade means 100 TCP 3-way handshakes (~30 ms each at 10 ms RTT) plus 100 TLS sessions (~20 ms resuming). Each TCP connection consumes a file descriptor and ~8 KB of kernel buffers.

RFC 8441 defines extended CONNECT for WebSocket over HTTP/2. Instead of GET + Upgrade, the client sends:

HEADERS stream=5
  :method = CONNECT
  :protocol = websocket
  :scheme = https
  :path = /chat
  :authority = example.com
  sec-websocket-protocol = chat

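The server replies with :status 200 (not 101 Switching Protocols). HTTP/2 stream 5 then becomes the WebSocket tunnel, while other HTTP/2 streams on the same TCP connection carry different traffic simultaneously. The client may only send this request after the server has advertised SETTINGS_ENABLE_CONNECT_PROTOCOL = 1 in its SETTINGS frame.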
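The request shape can be sketched without any HTTP/2 library. The helper names below are hypothetical and no real client API is implied. The sec-websocket-version field and the settings identifier 0x8 follow the RFC 8441 text and example; everything else mirrors the header listing above.

```python
SETTINGS_ENABLE_CONNECT_PROTOCOL = 0x8  # settings identifier defined by RFC 8441


def server_allows_extended_connect(peer_settings: dict[int, int]) -> bool:
    """The client may use :protocol only if the server advertised the setting as 1."""
    return peer_settings.get(SETTINGS_ENABLE_CONNECT_PROTOCOL, 0) == 1


def extended_connect_headers(authority: str, path: str,
                             subprotocols: tuple[str, ...] = (),
                             scheme: str = "https") -> list[tuple[str, str]]:
    """Header list for a WebSocket bootstrap over HTTP/2 (no Sec-WebSocket-Key needed)."""
    headers = [
        (":method", "CONNECT"),
        (":protocol", "websocket"),
        (":scheme", scheme),
        (":path", path),
        (":authority", authority),
        ("sec-websocket-version", "13"),
    ]
    if subprotocols:
        headers.append(("sec-websocket-protocol", ", ".join(subprotocols)))
    return headers


def is_accepted(response_headers: list[tuple[str, str]]) -> bool:
    """Success is a 2xx :status on the stream, never 101."""
    status = dict(response_headers).get(":status", "")
    return len(status) == 3 and status.startswith("2")


for name, value in extended_connect_headers("example.com", "/chat", ("chat",)):
    print(f"{name} = {value}")
```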

Benefits:

100 WebSocket connections share one TCP connection — one 3-way handshake total.
Zero extra file descriptors for additional connections.
Congestion control tuning on the shared connection benefits all streams.
Cumulative handshake latency for 100 connections: ~30 ms instead of ~3,000 ms.
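The benefits list can be turned into a side-by-side cost comparison. This sketch takes the lesson's own figures as defaults (~30 ms per connection setup, ~8 KB of kernel buffers per TCP connection) and assumes connections are opened one after another. It does not model per-stream state held in user space by the HTTP/2 implementation.

```python
from dataclasses import dataclass


@dataclass
class SetupCost:
    tcp_connections: int
    file_descriptors: int
    kernel_buffer_kb: float
    sequential_handshake_ms: float


def http1_cost(websockets: int, handshake_ms: float = 30.0, buffer_kb: float = 8.0) -> SetupCost:
    """One TCP connection, one descriptor, and one handshake per WebSocket."""
    return SetupCost(websockets, websockets, websockets * buffer_kb, websockets * handshake_ms)


def http2_cost(websockets: int, handshake_ms: float = 30.0, buffer_kb: float = 8.0) -> SetupCost:
    """All WebSocket streams share a single TCP connection."""
    return SetupCost(1, 1, buffer_kb, handshake_ms)


print(http1_cost(100))
print(http2_cost(100))
```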

Trade-off: requires HTTP/2 end-to-end (client, server, all proxies). As of 2026, major cloud load balancers (AWS ALB, GCP Load Balancer) support it, but most corporate proxies and firewalls do not. Adoption is limited to hyperscale operators.

WebSocket over HTTP/3 (RFC 9220)

HTTP/3 runs over QUIC (UDP-based). RFC 9220 defines WebSocket over HTTP/3 using the same extended CONNECT with :protocol websocket.

Key difference from HTTP/2: QUIC streams are independent at the transport layer — a lost packet on stream 0 does not block stream 4 (no head-of-line blocking). HTTP/2 over TCP still suffers TCP-level HoL blocking: a lost TCP segment blocks all HTTP/2 streams on that connection.
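A toy model makes the difference concrete. It is a deliberate simplification: each packet is tagged with a global sequence number and a stream id, and real QUIC tracks per-stream offsets rather than a shared sequence. The function names are illustrative.

```python
Packet = tuple[int, int, str]  # (sequence number, stream id, payload)


def deliver_tcp(packets: list[Packet], lost_seq: int) -> list[Packet]:
    """One ordered byte stream: nothing after the gap reaches the application."""
    return [p for p in sorted(packets) if p[0] < lost_seq]


def deliver_quic(packets: list[Packet], lost_seq: int) -> list[Packet]:
    """Per-stream ordering: only the stream that owns the lost packet stalls."""
    stalled = {p[1] for p in packets if p[0] == lost_seq}
    delivered = []
    for p in sorted(packets):
        if p[0] == lost_seq:
            continue  # the lost packet itself
        if p[1] in stalled and p[0] > lost_seq:
            continue  # later data on the stalled stream waits for the retransmit
        delivered.append(p)
    return delivered


packets = [(0, 0, "a0"), (1, 4, "b0"), (2, 0, "a1"), (3, 4, "b1"), (4, 0, "a2")]
print(deliver_tcp(packets, lost_seq=2))   # stream 4's later packet is held back too
print(deliver_quic(packets, lost_seq=2))  # stream 4 keeps flowing
```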

As of early 2026, no major browser or server ships production WebSocket-over-HTTP/3 support. WebTransport (a QUIC-native protocol) is the preferred choice for applications needing QUIC semantics: it offers bidirectional streams plus unreliable datagrams and has ~75% browser support (Chrome 120+, Firefox 127+, Safari 17+).

permessage-deflate tuning

The permessage-deflate extension (RFC 7692) compresses each message with DEFLATE. Typical size reduction:

50–90% on text payloads (JSON, HTML).
Poor on small messages (< 64 bytes): the compressed output can be larger than the input.
~0% on already-compressed data (images, video, encrypted payloads).
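These three cases are easy to check with Python's zlib. The sketch compresses a single message the way RFC 7692 describes (raw DEFLATE, sync flush, trailing 00 00 ff ff removed). The sample payloads are invented for illustration, and seeded pseudo-random bytes stand in for already-compressed or encrypted data.

```python
import json
import random
import zlib


def permessage_deflate_size(message: bytes, window_bits: int = 15) -> int:
    """Compressed size of one message: raw DEFLATE + sync flush, minus the 4-byte tail."""
    comp = zlib.compressobj(wbits=-window_bits)  # negative wbits selects raw DEFLATE
    out = comp.compress(message) + comp.flush(zlib.Z_SYNC_FLUSH)
    return len(out) - 4  # RFC 7692 strips the trailing 00 00 ff ff


text = json.dumps([{"id": i, "status": "online", "room": "lobby"} for i in range(50)]).encode()
small = b'{"ok":1}'
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(4096))

for label, payload in (("json", text), ("small", small), ("noise", noise)):
    size = permessage_deflate_size(payload)
    print(f"{label}: {len(payload)} -> {size} bytes")
```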

Memory cost per connection at default window_bits=15:

Compressor: ~256 KB
Decompressor: ~44 KB

At window_bits=12: ~50 KB total per connection. The trade-off is a lower compression ratio, because the history window shrinks from 32 KB (window_bits=15) to 4 KB (window_bits=12). Compression ratio drops from ~70% to ~50% on typical JSON payloads.

The default deflate compressor costs ~300 KB per connection — at 100k connections that is 30 GB of pure compression state, which is why tuning window_bits down to 12 (~50 KB) or disabling deflate is the first RAM-ceiling lever.
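The per-connection figures can be estimated from the memory formulas documented in zlib's zconf.h: deflate needs roughly (1 << (windowBits + 2)) + (1 << (memLevel + 9)) bytes, and inflate needs (1 << windowBits) plus about 7 KB. These are approximations, and implementations add their own overhead. One inference from the formula, which the lesson does not state: window_bits=12 with zlib's default memLevel of 8 still leaves about 144 KB of compressor state, so a total near 50 KB also needs a lower memLevel. memLevel is a local zlib setting and is not negotiated by the extension.

```python
def deflate_state_bytes(window_bits: int = 15, mem_level: int = 8) -> int:
    """Approximate compressor memory per zlib's documented formula."""
    return (1 << (window_bits + 2)) + (1 << (mem_level + 9))


def inflate_state_bytes(window_bits: int = 15) -> int:
    """Approximate decompressor memory: the window plus about 7 KB of small objects."""
    return (1 << window_bits) + 7 * 1024


for wbits, mem in ((15, 8), (12, 8), (12, 5)):
    total = deflate_state_bytes(wbits, mem) + inflate_state_bytes(wbits)
    print(f"window_bits={wbits} memLevel={mem}: ~{total // 1024} KiB per connection")
```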

Production strategies:

Disable entirely for idle connections and connections sending small messages or binary data.
Set window_bits=10–12 for memory-constrained deployments.
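Negotiate server_no_context_takeover (and client_no_context_takeover) to reset compression state after every message: this trades compression ratio for lower per-connection memory, because the history window does not have to be kept between messages.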
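The ratio cost of resetting the context per message can be measured directly. In this sketch a fresh zlib compressor per message stands in for no-context-takeover mode, and a single long-lived compressor stands in for the default. The cursor-style messages are invented sample data.

```python
import json
import zlib


def total_compressed(messages: list[bytes], context_takeover: bool, window_bits: int = 15) -> int:
    """Sum of per-message compressed sizes, with or without a shared history window."""
    total = 0
    comp = zlib.compressobj(wbits=-window_bits)
    for msg in messages:
        if not context_takeover:
            comp = zlib.compressobj(wbits=-window_bits)  # history is discarded each time
        out = comp.compress(msg) + comp.flush(zlib.Z_SYNC_FLUSH)
        total += len(out) - 4  # drop the 00 00 ff ff tail, as RFC 7692 does
    return total


messages = [
    json.dumps({"type": "cursor", "user": f"u{i % 5}", "x": i, "y": 2 * i}).encode()
    for i in range(200)
]
raw = sum(len(m) for m in messages)
print("raw bytes:          ", raw)
print("with takeover:      ", total_compressed(messages, context_takeover=True))
print("without takeover:   ", total_compressed(messages, context_takeover=False))
```

Small, similar messages benefit most from a shared context, so this is the workload where disabling takeover costs the most ratio and where the memory saving needs to justify it.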

▸Why this works

Why head-of-line blocking matters for WebSocket specifically. A WebSocket connection carries all of an application's logical channels over one ordered TCP byte stream. If your application sends independent data channels (game entities A, B, C), a lost segment carrying entity A’s update blocks the entire connection until it is retransmitted: roughly one extra RTT with fast retransmit, and 200 ms or more when recovery falls back to the retransmission timeout (Linux's minimum RTO is 200 ms). Entity B and C updates stall even though their packets were delivered. Mitigation: use multiple WebSocket connections (one per independent channel), or move to WebTransport/QUIC, which has stream-level independence. Most production applications with interactive latency budgets (< 100 ms) tolerate this HoL risk without special handling.

Trace it

Trace the memory and handshake savings of moving 1,000 WebSocket connections from HTTP/1.1 to HTTP/2 extended CONNECT.

01
HTTP/1.1: 1,000 connections. Each idle connection uses 10 KB kernel socket buffers. What is the total kernel memory?
02
HTTP/1.1: opening those 1,000 connections one by one at 10 ms RTT. How much total handshake latency?
03
HTTP/2 extended CONNECT: 1,000 WebSocket connections share one TCP connection. How much kernel memory for socket buffers?
04
HTTP/2: how many TCP handshakes?
05
What is the deployment constraint that limits HTTP/2 extended CONNECT adoption?

Quiz

A server has 100k idle WebSocket connections with permessage-deflate enabled at default window_bits=15. Memory used by compression state alone is approximately?

1. RAM per idle connection: 2–50 KB, +300 KB with deflate → HTTP/2 CONNECT
2. File descriptors: default 1,024 → raise ulimit to ~1 M
3. NIC bandwidth: ping traffic saturates 1 Gbps → shard servers

You hit RAM first, then run out of file descriptors, then saturate the NIC — each ceiling has a different lever, so identify which one you are actually against.
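One way to make "which ceiling am I against" mechanical is to compare utilisation across the three resources. The function below is a hypothetical helper, not part of any monitoring tool: the caller supplies the readings (for example from memory graphs, the open descriptor count against ulimit -n, and NIC throughput).

```python
def nearest_ceiling(ram_used_gb: float, ram_total_gb: float,
                    fds_open: int, fd_limit: int,
                    nic_mbps: float, nic_capacity_mbps: float) -> tuple[str, float]:
    """Return the most heavily utilised resource and its utilisation (0..1)."""
    utilisation = {
        "RAM": ram_used_gb / ram_total_gb,
        "file descriptors": fds_open / fd_limit,
        "NIC": nic_mbps / nic_capacity_mbps,
    }
    name = max(utilisation, key=utilisation.get)
    return name, utilisation[name]


# Illustrative readings: 40 of 64 GB used, 900k of 1 M descriptors, 300 of 1,000 Mbit/s.
print(nearest_ceiling(40, 64, 900_000, 1_000_000, 300, 1_000))
```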

Recall before you leave

01
Explain why HTTP/2 extended CONNECT can support multiple concurrent WebSocket connections more efficiently than opening multiple separate TCP connections.
02
What is the effect of setting server_no_context_takeover in permessage-deflate negotiation, and when should you use it?
03
Name the three independent ceilings that limit WebSocket connections per single server, and state the practical limit each imposes.

Recap

WebSocket connections at scale hit three independent ceilings. RAM is first: each idle connection uses 2–50 KB of framework state and kernel buffers; permessage-deflate at default settings adds 300 KB more per connection, so 100k connections consume 30 GB in compression state alone, and most production deployments disable compression or tune window_bits down. File descriptors are second: Linux defaults to 1,024 per process; production servers raise this to 100k–1 M via ulimit -n. NIC bandwidth is third: keepalive pings plus message traffic put the practical ceiling on a 1 Gbps NIC at around 500 k–2 M connections. HTTP/2 extended CONNECT (RFC 8441) addresses the TCP and file descriptor ceilings by multiplexing 100 WebSocket connections over one TCP connection (a 100× reduction in kernel buffer memory and handshake latency), but requires an HTTP/2-aware proxy chain not yet common outside hyperscale deployments. WebSocket over HTTP/3 (RFC 9220) adds QUIC-level stream independence but has no production browser support as of 2026; WebTransport is the current QUIC-native alternative. Now when someone reports “we can’t add more connections,” your first question is which ceiling they are against: check memory graphs, then ulimit -n output, then NIC throughput. Each has a different lever.
