
WebSocket frame format: opcodes, masking, fragmentation

The 2-byte header that carries every WebSocket message — what FIN, opcode, MASK, and the three-tier length encoding mean, and why client frames must be masked.

After the WebSocket handshake the HTTP parser is gone. What flows on the wire is a compact binary format that carries every message — text, binary, keepalive pings, and graceful closes — in as few as 2 bytes of overhead. Understanding that format is what separates “it sometimes works” from “I know exactly what broke.”

The frame header anatomy

If you ever need to debug a WebSocket at the packet level — or implement a custom server — this is the map. Every byte decision in the frame header was made for a specific reason, and knowing those reasons is what lets you spot a malformed frame or a misconfigured proxy in a Wireshark capture.

A WebSocket frame starts with 2 mandatory bytes, followed by optional length extension and masking key fields, and then the payload:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| opcode|M| Payload len |    Extended payload length    |
|I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
|N|V|V|V|       |S|             |   (if payload len==126/127)   |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+-------------------------------+
|     Masking-key (if MASK set, 4 bytes)                        |
+---------------------------------------------------------------+
|                    Payload data                               |
+---------------------------------------------------------------+

Byte 1 breakdown:

FIN (bit 7) — 1 means this is the last (or only) fragment of a message.
RSV1-3 (bits 6-4) — reserved for extensions (e.g., permessage-deflate sets RSV1=1).
Opcode (bits 3-0) — what kind of data the frame carries:

Opcode	Meaning
`0x0`	Continuation frame
`0x1`	Text data (UTF-8)
`0x2`	Binary data
`0x8`	Close
`0x9`	Ping
`0xA`	Pong

Byte 2 breakdown:

MASK (bit 7) — 1 = payload is XOR-masked (client→server always; server→client never).
Payload length (bits 6-0):
- 0–125 — the actual length.
- 126 — next 2 bytes (uint16) hold the real length.
- 127 — next 8 bytes (uint64) hold the real length.

The length field escalates in three tiers — you only pay the extra header bytes once the payload outgrows the previous tier.
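The byte 1 and byte 2 breakdowns above can be sketched as a small header parser. This is a minimal illustration, not a full implementation: the function name `parse_header` and the dictionary keys are assumptions made for this example, and it expects the whole header to already be in memory.

```python
import struct


def parse_header(data: bytes) -> dict:
    """Parse one WebSocket frame header from the start of `data`."""
    if len(data) < 2:
        raise ValueError("a frame header needs at least 2 bytes")
    byte1, byte2 = data[0], data[1]

    fin = bool(byte1 & 0x80)        # bit 7 of byte 1
    rsv = (byte1 >> 4) & 0x07       # bits 6-4: RSV1, RSV2, RSV3
    opcode = byte1 & 0x0F           # bits 3-0
    masked = bool(byte2 & 0x80)     # bit 7 of byte 2
    length = byte2 & 0x7F           # bits 6-0
    offset = 2

    # Tiers 2 and 3: the real length lives in the following bytes,
    # in network byte order (big-endian).
    if length == 126:
        if len(data) < offset + 2:
            raise ValueError("truncated 16-bit length")
        (length,) = struct.unpack_from("!H", data, offset)
        offset += 2
    elif length == 127:
        if len(data) < offset + 8:
            raise ValueError("truncated 64-bit length")
        (length,) = struct.unpack_from("!Q", data, offset)
        offset += 8

    mask_key = None
    if masked:
        if len(data) < offset + 4:
            raise ValueError("truncated masking key")
        mask_key = data[offset:offset + 4]
        offset += 4

    return {
        "fin": fin, "rsv": rsv, "opcode": opcode, "masked": masked,
        "length": length, "mask_key": mask_key, "header_len": offset,
    }


header = parse_header(bytes.fromhex("81024f4b"))
print(header["opcode"], header["length"], header["header_len"])
```

The payload starts at `header_len`, so a caller would slice `data[header_len:header_len + length]` once enough bytes have arrived.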

Frame overhead totals:

Small server→client frame: 2 bytes header only.
Small client→server frame: 2 bytes header + 4 bytes masking key = 6 bytes.

WebSocket frame overhead at a glance

Minimum frame header (no mask, payload ≤125 bytes): 2 bytes
Client→server overhead (mask required): 6 bytes
Max payload in 7-bit length field: 125 bytes
Length extension for 126–65535 byte payloads: +2 bytes (uint16)
Length extension for larger payloads: +8 bytes (uint64)
Control frames (ping/pong/close) max payload: 125 bytes
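The overhead figures above follow mechanically from the payload size and the MASK flag. A short sketch of that arithmetic, using a hypothetical helper named `header_size`:

```python
def header_size(payload_len: int, masked: bool) -> int:
    """Return the number of header bytes for a frame with this payload."""
    if payload_len < 0:
        raise ValueError("payload length cannot be negative")
    size = 2                      # the two mandatory bytes
    if payload_len > 0xFFFF:
        size += 8                 # 127 marker, then a uint64 length
    elif payload_len > 125:
        size += 2                 # 126 marker, then a uint16 length
    if masked:
        size += 4                 # client→server masking key
    return size


for length in (125, 126, 65535, 65536):
    print(length, header_size(length, masked=False), header_size(length, masked=True))
```

The worst case is a large client→server frame: 2 + 8 + 4 = 14 bytes of header.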

Why client frames must be masked

Masking is not encryption — it is a cache-poisoning defense. Here is the attack it prevents:

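Malicious JavaScript served from site-a.com opens a WebSocket to a server the attacker controls, through a transparent intermediary proxy that does not understand the WebSocket upgrade. The script then sends bytes that look like an HTTP request for a popular resource on another host, and the attacker's server answers with bytes that look like an HTTP response. A naive proxy parses this exchange as ordinary HTTP and caches the attacker's response under the victim URL, serving the poisoned copy to other clients.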

With masking, the client XORs every payload byte with a 4-byte random key sent in the frame header:

masked_byte[i] = payload[i] XOR mask_key[i % 4]

The receiver XORs again with the same key to recover the original payload. Because the client picks a fresh, unpredictable key for every frame, the script cannot choose the bytes that actually appear on the wire: whatever payload it supplies is scrambled by a key it never sees in advance, so it cannot make the stream look like HTTP to an intermediary. The attack becomes infeasible.
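The XOR rule is short enough to show directly. The sketch below assumes a helper named `apply_mask` (an illustrative name); the fixed key and expected bytes are the masked "Hello" example from RFC 6455, while a real client would draw the key from a strong random source such as `os.urandom`.

```python
import os


def apply_mask(payload: bytes, mask_key: bytes) -> bytes:
    """XOR every payload byte with mask_key[i % 4]; the same call also unmasks."""
    if len(mask_key) != 4:
        raise ValueError("masking key must be exactly 4 bytes")
    return bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))


# Fixed key from the RFC 6455 example, used here only so the result is checkable.
rfc_key = bytes.fromhex("37fa213d")
masked = apply_mask(b"Hello", rfc_key)
print(masked.hex())                      # 7f9f4d5158
print(apply_mask(masked, rfc_key))       # b'Hello'

# What a client would actually do: a fresh random key for every frame.
fresh_key = os.urandom(4)
assert apply_mask(apply_mask(b"Hello", fresh_key), fresh_key) == b"Hello"
```

Because XOR is its own inverse, one function serves both sender and receiver.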

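Server frames are not masked because the threat is untrusted script choosing the bytes a browser puts on the wire. A server's output is controlled by whoever operates it, and an attacker who runs the server can send arbitrary bytes regardless, so masking in that direction would add cost without adding protection. RFC 6455 goes further and forbids it: a client must close the connection if it receives a masked frame.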

Fragmentation and continuation frames

A large message can be split across multiple frames. Rules:

First fragment: real opcode (0x1 or 0x2), FIN=0.
Middle fragments: opcode 0x0 (continuation), FIN=0.
Last fragment: opcode 0x0, FIN=1.

The receiver reassembles in order. Control frames (ping, pong, close) cannot be fragmented and are limited to 125 bytes; they can arrive interleaved between data fragments.
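The fragmentation rules can be sketched with frames reduced to `(fin, opcode, payload)` tuples. The names `fragment` and `reassemble` are assumptions for this example, and the reassembler simply skips interleaved control frames, which a real endpoint would process rather than ignore.

```python
from typing import List, Tuple

Frame = Tuple[bool, int, bytes]  # (fin, opcode, payload)


def fragment(payload: bytes, opcode: int, chunk_size: int) -> List[Frame]:
    """Split one data message into fragments following the rules above."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)] or [b""]
    frames = []
    for index, chunk in enumerate(chunks):
        fin = index == len(chunks) - 1            # only the last fragment sets FIN
        frame_opcode = opcode if index == 0 else 0x0  # continuation after the first
        frames.append((fin, frame_opcode, chunk))
    return frames


def reassemble(frames: List[Frame]) -> Tuple[int, bytes]:
    """Rebuild one message; control frames (opcode >= 0x8) are skipped here."""
    message_opcode = None
    parts = []
    for fin, opcode, chunk in frames:
        if opcode >= 0x8:
            continue
        if message_opcode is None:
            if opcode == 0x0:
                raise ValueError("continuation frame without a first fragment")
            message_opcode = opcode
        elif opcode != 0x0:
            raise ValueError("new data frame before the message finished")
        parts.append(chunk)
        if fin:
            return message_opcode, b"".join(parts)
    raise ValueError("message never finished (no FIN=1 frame)")


frames = fragment(b"hello world", 0x1, 4)
frames.insert(1, (True, 0x9, b""))   # a ping arriving between fragments
print(reassemble(frames))            # (1, b'hello world')
```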

Control frames: ping, pong, close

Ping (0x9): a keepalive probe. The receiver must reply with a pong carrying the same payload. Proxies often have idle timeouts (60 seconds is common); sending a ping every 25–30 seconds resets the proxy’s timer and keeps the connection alive.

Pong (0xA): the mandatory reply to a ping. Can also be sent unsolicited as a unilateral heartbeat.
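The ping and pong rules reduce to two small decisions: what to answer, and when to probe. The sketch below is illustrative only; `reply_to_control`, `ping_due`, and the 25-second default are assumptions chosen to match the interval suggested above, not part of any library.

```python
from typing import Optional, Tuple

PING, PONG = 0x9, 0xA
MAX_CONTROL_PAYLOAD = 125


def reply_to_control(opcode: int, payload: bytes) -> Optional[Tuple[int, bytes]]:
    """Return the (opcode, payload) to send back, or None if no reply is needed."""
    if len(payload) > MAX_CONTROL_PAYLOAD:
        raise ValueError("control frame payload exceeds 125 bytes")
    if opcode == PING:
        return PONG, payload      # the pong echoes the ping payload
    if opcode == PONG:
        return None               # solicited or unsolicited: nothing to answer
    raise ValueError("not a ping or pong frame")


def ping_due(now: float, last_ping_at: float, interval: float = 25.0) -> bool:
    """True once `interval` seconds have passed since the last ping was sent."""
    return now - last_ping_at >= interval


print(reply_to_control(PING, b"hb-1"))   # (10, b'hb-1')
print(ping_due(now=130.0, last_ping_at=100.0))
```

Keeping the interval below half of the shortest proxy idle timeout leaves room for one lost ping before the connection is dropped.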

Close (0x8): initiates the closing handshake. The body contains an optional 2-byte status code followed by UTF-8 reason text. Standard codes:

Code	Meaning
1000	Normal closure
1001	Going away (server shutdown, tab closed)
1002	Protocol error
1006	Abnormal closure (no close frame; generated by the implementation)
1008	Policy violation
1011	Unexpected condition
1013	Try again later

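After sending a close frame, an endpoint sends no further data frames and waits for the peer’s close frame. Once both sides have sent and received one, the TCP connection is closed, normally by the server first.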
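The close frame body described above (a 2-byte status code followed by UTF-8 reason text) can be built and parsed in a few lines. The function names are assumptions for this sketch; the rejected codes 1005, 1006, and 1015 are the ones RFC 6455 reserves for local reporting and forbids on the wire.

```python
import struct
from typing import Optional, Tuple

LOCAL_ONLY_CODES = {1005, 1006, 1015}  # never sent in a close frame


def build_close_payload(code: int = 1000, reason: str = "") -> bytes:
    """Encode a close frame body: big-endian uint16 code + UTF-8 reason."""
    if code in LOCAL_ONLY_CODES:
        raise ValueError(f"{code} is reserved and must not be sent on the wire")
    body = struct.pack("!H", code) + reason.encode("utf-8")
    if len(body) > 125:
        raise ValueError("close payload exceeds the 125-byte control frame limit")
    return body


def parse_close_payload(body: bytes) -> Tuple[Optional[int], str]:
    """Decode a close frame body; an empty body means no status code was given."""
    if not body:
        return None, ""
    if len(body) == 1:
        raise ValueError("a close payload cannot be exactly 1 byte")
    (code,) = struct.unpack_from("!H", body)
    return code, body[2:].decode("utf-8")


payload = build_close_payload(1000, "bye")
print(payload.hex())                  # 03e8627965
print(parse_close_payload(payload))   # (1000, 'bye')
```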

Why this works

Why RSV bits matter for extensions. The permessage-deflate extension (RFC 7692), negotiated during the handshake, sets RSV1=1 on the first frame of a message to signal that the payload is DEFLATE-compressed. An endpoint that did not negotiate an extension defining RSV1 and receives RSV1=1 must fail the connection; 1002 (protocol error) is the fitting close code. This strict checking ensures extensions cannot silently corrupt frames.
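A receiver's RSV check might look like the sketch below. It is deliberately simplified: `ProtocolError`, `check_rsv`, and the `deflate_negotiated` flag are hypothetical names for this example, and it ignores per-frame details such as control frames never being compressed.

```python
class ProtocolError(Exception):
    """Raised when a frame violates the protocol; carries the close code to use."""

    def __init__(self, message: str, close_code: int = 1002):
        super().__init__(message)
        self.close_code = close_code


def check_rsv(byte1: int, deflate_negotiated: bool) -> bool:
    """Validate the RSV bits of byte 1; return True if the frame is compressed."""
    rsv1 = bool(byte1 & 0x40)
    rsv2 = bool(byte1 & 0x20)
    rsv3 = bool(byte1 & 0x10)
    # This sketch knows no extension that defines RSV2 or RSV3.
    if rsv2 or rsv3:
        raise ProtocolError("RSV2/RSV3 set without a negotiated extension")
    if rsv1 and not deflate_negotiated:
        raise ProtocolError("RSV1 set but permessage-deflate was not negotiated")
    return rsv1


print(check_rsv(0xC1, deflate_negotiated=True))    # True: FIN=1, RSV1=1, text
print(check_rsv(0x81, deflate_negotiated=False))   # False: plain text frame
```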

Parsing a small WebSocket frame


# Server sends "OK" to client
# Opcode 0x1 = text, FIN=1, payload = "OK" (2 bytes), no mask

Frame bytes (hex): 81 02 4F 4B

Byte 1: 0x81 = 10000001
  FIN=1: complete message, no fragments
  RSV=000: no extensions active
  Opcode=0001: text data (UTF-8)

Byte 2: 0x02 = 00000010
  MASK=0: server never masks (correct)
  Payload length=2

Bytes 3-4: 0x4F 0x4B
  "O" (0x4F), "K" (0x4B) = "OK"

Wire: 2 bytes header + 2 bytes payload = 4 bytes total
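Going the other way, the same bytes can be produced by a small encoder. This is a sketch under stated assumptions: `build_frame` is a hypothetical helper, it handles a single frame at a time, and the masked example reuses the fixed key from the RFC 6455 "Hello" sample so the output can be checked by eye.

```python
import struct
from typing import Optional


def build_frame(opcode: int, payload: bytes, fin: bool = True,
                mask_key: Optional[bytes] = None) -> bytes:
    """Encode one frame; pass a 4-byte mask_key for client→server frames."""
    byte1 = (0x80 if fin else 0x00) | (opcode & 0x0F)
    mask_bit = 0x80 if mask_key is not None else 0x00
    length = len(payload)

    # Pick the smallest length tier that fits.
    if length <= 125:
        header = bytes([byte1, mask_bit | length])
    elif length <= 0xFFFF:
        header = bytes([byte1, mask_bit | 126]) + struct.pack("!H", length)
    else:
        header = bytes([byte1, mask_bit | 127]) + struct.pack("!Q", length)

    if mask_key is None:
        return header + payload
    if len(mask_key) != 4:
        raise ValueError("masking key must be exactly 4 bytes")
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return header + mask_key + masked


print(build_frame(0x1, b"OK").hex())   # 81024f4b
print(build_frame(0x1, b"Hello", mask_key=bytes.fromhex("37fa213d")).hex())
# 818537fa213d7f9f4d5158
```

The first line reproduces the 4-byte server frame from the walkthrough; the second shows the same structure with the MASK bit set and the key inserted before the payload.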

Quiz

Why does the client's Sec-WebSocket-Key get transformed into Sec-WebSocket-Accept by adding a fixed GUID, hashing it, and base64-encoding it?

Quiz

Why must client-to-server WebSocket frames be masked, but server-to-client frames must NOT be?

Order the steps

Order the steps of a WebSocket close handshake:

1 One side sends a close frame with status code 1000
2 The other side receives it and replies with a close frame
3 The sender of the second close frame closes the TCP connection
4 Both sides are now in the closed state

Byte 1: FIN + RSV1-3 + opcode (0x1 text · 0x2 binary · 0x8 close · 0x9 ping · 0xA pong)

Byte 2: MASK flag + payload length (0-125, or 126→uint16, or 127→uint64)

Masking key (client→server only): 4 bytes

Payload data: XOR-masked with the key on client frames

A minimal server→client frame is just the first 2 bytes; a client→server frame adds the 4-byte masking key — 6 bytes of header total.

Recall before you leave

01 Explain why masking defends against cache-poisoning even though the mask key is sent in plain text inside the frame.
02 A chat server is broadcasting a message to 10,000 connected clients. The broadcast completes in 100 ms. Network RTT is only 5 ms. Where does the other 95 ms come from?
03 What is the FIN bit for in a WebSocket frame, and how does it interact with the opcode?

Recap

Every WebSocket message rides in one or more frames. The 2-byte header encodes the FIN bit (last fragment flag), opcode (text, binary, ping, pong, close), MASK flag, and payload length. Client-to-server frames must XOR their payload with a random 4-byte masking key to prevent cache-poisoning attacks in which malicious JavaScript crafts bytes that a naive proxy would parse as HTTP; server-to-client frames are never masked because the threat is untrusted script in a browser, and an attacker who controls a server can send arbitrary bytes regardless. Large messages may be fragmented across frames: the first carries the real opcode, the rest use opcode 0x0 (continuation), and FIN=0 is set on all but the last. Control frames (ping, pong, close) carry at most 125 bytes and cannot be fragmented. Close frames carry a 2-byte status code; 1000 is normal, 1006 is generated locally when no close frame was received. So when you see a 1006 in production logs, you know immediately that something (a proxy, a network failure, a crashed peer) terminated the TCP connection without sending a close frame first.


