TCP options and common pathologies
A Redis client call takes 200 ms even though Redis runs in the same datacenter. There is no load, no timeout, and no error, just a fixed delay on every small request. The cause is two TCP defaults interacting in the worst possible way, and fixing it takes one socket option.
TCP header anatomy
Every TCP segment carries a 20-byte mandatory header plus optional fields (up to 40 bytes more):
| Field | Size | Purpose |
|---|---|---|
| Source port | 16 bits | Part of the four-tuple identifying the connection |
| Destination port | 16 bits | Part of the four-tuple |
| Sequence number | 32 bits | Position of the first data byte in the stream |
| Acknowledgement number | 32 bits | Next byte the sender expects from the peer |
| Data offset | 4 bits | Header length in 32-bit words (5 to 15, depending on options) |
| Reserved | 4 bits | Must be zero |
| Flags | 8 bits | URG, ACK, PSH, RST, SYN, FIN, CWR, ECE |
| Window | 16 bits | Bytes receiver is willing to accept |
| Checksum | 16 bits | Covers a pseudo-header from IP plus the TCP segment |
| Urgent pointer | 16 bits | Almost never used; legacy of old terminal protocols |
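The options field (up to 40 bytes) carries MSS, Window Scale, SACK, Timestamps, and the TFO cookie. Support for each option is negotiated only in the SYN/SYN-ACK exchange; once the connection is established, the set of permitted options is fixed, although Timestamps and SACK blocks are still carried on later segments.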
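The fixed 20-byte layout in the table can be decoded with a single `struct.unpack` call. This is a minimal sketch: `TcpHeader` and `parse_tcp_header` are illustrative names, and the function assumes it is handed the raw bytes of a TCP segment with the IP header already removed.

```python
import struct
from typing import NamedTuple

# Flag names indexed by bit position, from the least significant bit upward.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]

class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    header_len: int  # in bytes, derived from the 4-bit data offset
    flags: frozenset
    window: int
    checksum: int
    urgent: int

def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Decode the 20 mandatory header bytes of a TCP segment."""
    if len(segment) < 20:
        raise ValueError("a TCP header needs at least 20 bytes")
    (src, dst, seq, ack, offset_byte, flag_bits,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    header_len = (offset_byte >> 4) * 4  # data offset counts 32-bit words
    flags = frozenset(
        name for bit, name in enumerate(FLAG_NAMES) if flag_bits & (1 << bit)
    )
    return TcpHeader(src, dst, seq, ack, header_len, flags,
                     window, checksum, urgent)
```

Any bytes between offset 20 and `header_len` are the options field.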
TCP options field
Beyond MSS, Window Scale, and SACK:
Timestamps (RFC 7323): used for both RTT measurement and PAWS (Protection Against Wrapped Sequence numbers). PAWS prevents old delayed packets from being mistaken for new data on long-lived high-bandwidth connections where sequence numbers wrap around.
The TFO cookie also lives in the options field. See the SYN cookies and TFO lesson for details.
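Options are encoded as kind/length/value records, with two single-byte exceptions: End of Option List (kind 0) and No-Operation padding (kind 1). The sketch below walks that encoding; `parse_tcp_options` is an illustrative name, and the sample bytes are a hand-built SYN option block rather than a capture.

```python
import struct

# Option kind numbers as assigned by IANA for the options named above.
OPTION_NAMES = {
    2: "MSS",
    3: "WindowScale",
    4: "SACKPermitted",
    5: "SACK",
    8: "Timestamps",
    34: "TFOCookie",
}

def parse_tcp_options(options: bytes) -> list:
    """Return a list of (name, value_bytes) pairs from a raw options field."""
    parsed = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:  # End of Option List: stop parsing
            break
        if kind == 1:  # No-Operation: one byte of padding
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError("truncated option")
        length = options[i + 1]  # length covers kind + length + value
        if length < 2 or i + length > len(options):
            raise ValueError("bad option length")
        name = OPTION_NAMES.get(kind, f"kind-{kind}")
        parsed.append((name, options[i + 2:i + length]))
        i += length
    return parsed

# Hand-built example: MSS 1460, SACK permitted, Timestamps, NOP, Window Scale 7.
syn_options = (
    bytes([2, 4, 0x05, 0xB4, 4, 2, 8, 10])
    + struct.pack("!II", 12345, 0)
    + bytes([1, 3, 3, 7])
)
```

Calling `parse_tcp_options(syn_options)` yields the four options in order, with the MSS value available as `int.from_bytes(value, "big")`.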
- Mandatory header: 20 bytes
- Max options: 40 bytes (total header max 60 bytes)
- Flags byte: URG ACK PSH RST SYN FIN CWR ECE
- Window field (raw): 16 bits, max 65535 bytes
- With Window Scale: up to 1 GiB
- Checksum scope: IP pseudo-header + TCP segment
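The 1 GiB figure follows from the largest shift count that RFC 7323 permits, which is 14. A short worked calculation, with `effective_window` as an illustrative helper name:

```python
MAX_WINDOW_SHIFT = 14  # largest Window Scale shift count allowed by RFC 7323

def effective_window(raw_window: int, shift: int) -> int:
    """Scale the 16-bit window field by the negotiated shift count."""
    if not 0 <= raw_window <= 0xFFFF:
        raise ValueError("raw window must fit in 16 bits")
    if not 0 <= shift <= MAX_WINDOW_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return raw_window << shift

# 65535 << 14 = 1,073,725,440 bytes, just under 1 GiB (1,073,741,824 bytes).
largest_window = effective_window(0xFFFF, MAX_WINDOW_SHIFT)
```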
Nagle’s algorithm and delayed ACK — the 200 ms trap
By default a TCP sender batches small writes via Nagle’s algorithm: do not send a partial segment while previously sent data is still unacknowledged. Independently, the receiver delays ACKs (typically around 40 ms on Linux, and up to 200 ms on some other stacks) hoping to piggyback the ACK on outgoing application data.
Combined, these two interact badly when a client sends one request as two small writes, for example a header followed by a body:
- Client sends the first small write immediately, because nothing is in flight. It is smaller than the MSS.
- Client’s second small write is held by Nagle, which is waiting for the ACK of the first write.
- Server’s TCP stack holds that ACK, hoping to piggyback it on an outgoing response. The server’s application has only part of the request, so it has nothing to send.
- The delayed-ACK timer expires and the ACK goes out on its own. Only then does the client send the second write, and only then can the server respond.
Result: a 40–200 ms stall on every small interactive exchange.
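The sender-side rule can be written as a small predicate. This is a simplified model for illustration, not kernel code: real stacks add refinements, and `nagle_allows_send` and its parameters are hypothetical names.

```python
def nagle_allows_send(pending_bytes: int, mss: int, unacked_bytes: int,
                      nodelay: bool = False) -> bool:
    """Simplified Nagle decision for the data waiting in the send buffer."""
    if nodelay:
        return True  # TCP_NODELAY bypasses the algorithm entirely
    if pending_bytes >= mss:
        return True  # a full segment is always sent
    # A partial segment goes out only when nothing is awaiting an ACK.
    return unacked_bytes == 0

# First small write: nothing in flight, so it is sent at once.
first_write_sent = nagle_allows_send(pending_bytes=30, mss=1460, unacked_bytes=0)
# Second small write: the first 30 bytes are unacknowledged, so it is held.
second_write_sent = nagle_allows_send(pending_bytes=50, mss=1460, unacked_bytes=30)
```

In this model the second write stays queued until the delayed ACK arrives, which is the stall described above.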
The fix: TCP_NODELAY=1 on the socket disables Nagle. gRPC, Redis clients, HTTP/2 implementations, and most latency-sensitive RPC layers set it by default. On Linux, TCP_QUICKACK=1 asks the stack to ACK immediately instead of delaying. The setting is not permanent and the kernel can fall back to delayed ACKs, so re-apply it after each read() in a tight loop.
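In Python both options are set through `setsockopt`. The following is a sketch with hypothetical helper names; `TCP_QUICKACK` is Linux-specific, so the code checks that the constant exists before using it.

```python
import socket

def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

def request_quickack(sock: socket.socket) -> bool:
    """Ask for immediate ACKs where supported; returns False elsewhere.

    The setting is not sticky, so callers would invoke this again after
    each recv() on a latency-sensitive connection.
    """
    if not hasattr(socket, "TCP_QUICKACK"):
        return False  # constant is only defined on Linux builds
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return True
```

Setting TCP_NODELAY shifts the batching responsibility to the application: it is usually paired with building the whole request in one buffer and sending it with a single `sendall()` call.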
PSH flag
The PSH (push) flag tells the receiver’s TCP stack to deliver buffered data to the application immediately rather than waiting for more. Modern stacks pass data up to the application as soon as it arrives, so PSH is more a hint than a mandate. Senders typically set it on the last segment produced by a write(), which marks the point where the send buffer was drained. TCP has no notion of message boundaries, so when the kernel coalesces several application writes into one segment the receiver must still find record boundaries through its own framing, not through PSH.
Explicit Congestion Notification (ECN)
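Instead of dropping packets to signal congestion, ECN-capable routers mark them by setting the 2-bit ECN field in the IP header to CE (Congestion Experienced). The TCP header provides two flags for the feedback loop: ECE (ECN-Echo) and CWR (Congestion Window Reduced). ECN negotiation happens in the SYN/SYN-ACK exchange: the initiator sends a SYN with both ECE and CWR set, and an ECN-capable peer answers with a SYN-ACK that has ECE set and CWR clear. When CE-marked packets arrive, the receiver sets ECE on its ACKs to inform the sender; the sender reduces cwnd and sets CWR to confirm.
On Linux the default setting (net.ipv4.tcp_ecn=2) accepts ECN when the peer requests it but does not request it on outgoing connections; setting it to 1 requests ECN on outgoing connections as well. Apple platforms have requested ECN on outgoing connections with fallback heuristics. Some old middleboxes drop ECN-marked packets, so deployments use fallback detection and disable ECN toward those destinations.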
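The 2-bit mark sits in the two low-order bits of the byte that IPv4 calls TOS and IPv6 calls Traffic Class; the upper six bits are the DSCP. A minimal decoding sketch, with `ecn_codepoint` as an illustrative name and the codepoint values taken from RFC 3168:

```python
# ECN codepoints from RFC 3168, keyed by the two low-order bits.
ECN_CODEPOINTS = {
    0b00: "Not-ECT",  # sender did not declare ECN capability
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # Congestion Experienced, set by a router
}

def ecn_codepoint(tos_byte: int) -> str:
    """Extract the ECN codepoint from an IPv4 TOS / IPv6 Traffic Class byte."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("expected a single byte")
    return ECN_CODEPOINTS[tos_byte & 0b11]

def receiver_should_echo(tos_byte: int) -> bool:
    """A receiver starts setting ECE on its ACKs once it sees a CE mark."""
    return ecn_codepoint(tos_byte) == "CE"
```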
Keepalive
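By default a TCP connection sends no packets when idle, so a connection through a NAT or firewall may be silently dropped after 5–60 minutes. With SO_KEEPALIVE set, the kernel sends the first probe after tcp_keepalive_time seconds of idleness (Linux default: 7200 s = 2 h, far too long for service-mesh use), repeats it every tcp_keepalive_intvl seconds (default: 75 s), and declares the peer dead after tcp_keepalive_probes unanswered probes (default: 9). For long-lived RPC connections, tune:
- tcp_keepalive_time: 60–120 s
- tcp_keepalive_intvl: 10–30 s
- tcp_keepalive_probes: 3–5
With these values a dead peer is detected in roughly 1.5 to 4.5 minutes (idle time plus interval times probes) rather than more than two hours.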
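The same three values can be set per socket instead of system-wide. The sketch below uses the Linux option names `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT`; other platforms name them differently, so each is guarded with `hasattr`. `enable_keepalive` is a hypothetical helper, and its defaults are the low end of the ranges above.

```python
import socket

def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 3) -> int:
    """Turn on keepalive and return the worst-case detection time in seconds."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the tcp_keepalive_* sysctls (Linux names).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    # Idle period before the first probe, then every probe must go unanswered.
    return idle_s + interval_s * probes
```

With the defaults shown, the returned bound is 60 + 10 × 3 = 90 seconds.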
ss output — diagnose the CLOSE-WAIT accumulation
$ ss -tan state established | wc -l
12384
$ ss -tan state close-wait | wc -l
9821
$ ss -tan state time-wait | wc -l
1247
$ ss -s
Total: 12500
TCP: 23552 (estab 12384, closed 8920, orphaned 2, timewait 1247)
$ ps -p 1234 -o pid,stat,rss,vsz,cmd
PID STAT RSS VSZ CMD
1234 Ssl 8392000 12000000 /usr/bin/app-server
The service process has 12k ESTABLISHED + 9.8k CLOSE-WAIT sockets and RSS is climbing. What is the bug and what is the fix?
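A socket sits in CLOSE-WAIT when the peer has sent FIN but the local application has not yet called close(). One common way to leak such sockets is a handler that stops reading on end-of-stream, or exits on an exception, without closing. The sketch below shows the shape of a handler that cannot leak this way; `serve_connection` is a hypothetical echo handler, not code from the service in the output above.

```python
import socket

def serve_connection(conn: socket.socket) -> int:
    """Echo data back until the peer closes; return the number of bytes echoed."""
    total = 0
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                # Empty read: the peer sent FIN and this socket is in CLOSE-WAIT.
                break
            conn.sendall(data)
            total += len(data)
    finally:
        # Runs on normal exit and on exceptions, so the socket always
        # leaves CLOSE-WAIT and its file descriptor is released.
        conn.close()
    return total
```

To locate the leaking code path in a running system, adding `-p` to the `ss` command (for example `ss -tanp state close-wait`) shows the owning process for each socket, given sufficient privileges.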
Why this works
Why PAWS exists. On a high-bandwidth, long-lived connection, TCP’s 32-bit sequence number space (4 GiB) wraps quickly: in about 34 seconds at 1 Gbit/s and in under 4 seconds at 10 Gbit/s. Without PAWS, a delayed or retransmitted segment from an earlier pass through the sequence space could arrive and look like a valid new segment. RFC 7323 Timestamps enable PAWS: each segment carries a timestamp, and the receiver rejects any segment whose timestamp is older than the most recent one it accepted on that connection. This makes accidental wrap-around corruption effectively impossible.
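How quickly the sequence space wraps is simple arithmetic: 2^32 bytes divided by the byte rate of the link. A worked sketch, assuming the connection fills the link continuously (`seq_wrap_seconds` is an illustrative name):

```python
SEQUENCE_SPACE_BYTES = 2 ** 32  # one sequence number per byte of payload

def seq_wrap_seconds(bits_per_second: float) -> float:
    """Seconds until the 32-bit sequence space wraps at a sustained rate."""
    if bits_per_second <= 0:
        raise ValueError("rate must be positive")
    bytes_per_second = bits_per_second / 8
    return SEQUENCE_SPACE_BYTES / bytes_per_second

wrap_100_mbit = seq_wrap_seconds(100e6)  # about 343.6 s
wrap_1_gbit = seq_wrap_seconds(1e9)      # about 34.4 s
wrap_10_gbit = seq_wrap_seconds(10e9)    # about 3.4 s
```

At 10 Gbit/s the wrap time is far shorter than the time a delayed segment can plausibly survive in the network, which is why sequence numbers alone are not enough.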
1. Explain the Nagle + delayed-ACK stall: which side causes which delay and what is the standard fix?
2. What does PAWS protect against and what TCP option enables it?
3. What does a large CLOSE-WAIT count on a server indicate and how do you find the root cause?
The TCP header is 20 mandatory bytes plus up to 40 bytes of options. Key options (MSS, Window Scale, SACK, Timestamps, TFO cookie) are negotiated in SYN/SYN-ACK and then fixed. Nagle’s algorithm and delayed ACK are both default behaviours that interact catastrophically on small-write interactive traffic — the sender holds a write waiting for an ACK the receiver is delaying for 40 ms. TCP_NODELAY disables Nagle and is standard for any RPC client. CLOSE-WAIT accumulation is the canonical socket-leak symptom: the peer sent FIN but the application never called close(). ECN marks congested packets instead of dropping them; SO_KEEPALIVE probes idle connections to detect dead peers before NAT or firewall state expires silently.