awesome-everything

Networking & Protocols

TCP options and common pathologies

Crux: TCP header anatomy, Nagle + delayed-ACK stalls, ECN, keepalive, and the CLOSE-WAIT socket-leak trap every backend engineer eventually hits.

A Redis client call takes 200 ms even though Redis runs in the same datacenter. No load, no timeouts, no errors: just a 200 ms delay on every small request. The cause is two TCP defaults interacting in the worst possible way, and the fix is a single socket option.

TCP header anatomy

Every TCP segment carries a 20-byte mandatory header plus optional fields (up to 40 bytes more):

| Field | Size | Purpose |
| --- | --- | --- |
| Source port | 16 bits | Part of the four-tuple identifying the connection |
| Destination port | 16 bits | Part of the four-tuple |
| Sequence number | 32 bits | Position of the first data byte in the stream |
| Acknowledgement number | 32 bits | Next byte the sender expects from the peer |
| Data offset | 4 bits | Header length in 32-bit words (varies with options) |
| Reserved | 4 bits | Must be zero; set aside for future use |
| Flags | 8 bits | URG, ACK, PSH, RST, SYN, FIN, CWR, ECE |
| Window | 16 bits | Bytes the receiver is willing to accept |
| Checksum | 16 bits | Covers a pseudo-header from IP plus the TCP segment |
| Urgent pointer | 16 bits | Almost never used; legacy of old terminal protocols |

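The options field (up to 40 bytes) carries MSS, Window Scale, SACK, Timestamps, and the TFO cookie. Which options a connection may use is negotiated only in SYN/SYN-ACK. Once the connection is established, that set is fixed, although negotiated options such as Timestamps and SACK blocks still appear on later segments.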
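To make the layout concrete, here is a minimal Python sketch that unpacks the 20-byte mandatory header and slices out the raw options. The function name `parse_tcp_header` and the sample segment (a SYN from port 50000 to 6379 carrying an MSS option of 1460) are illustrative assumptions, not part of any library.

```python
import struct

# Flag names indexed by bit position in the flags byte (bit 0 = FIN ... bit 7 = CWR).
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]


def parse_tcp_header(segment: bytes) -> dict:
    """Parse the mandatory TCP header; options are returned as raw bytes."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    src, dst, seq, ack, off_byte, flags, window, checksum, urgent = struct.unpack(
        "!HHIIBBHHH", segment[:20]
    )
    header_len = (off_byte >> 4) * 4  # data offset counts 32-bit words
    if header_len < 20 or header_len > len(segment):
        raise ValueError("invalid data offset")
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "flags": [name for bit, name in enumerate(FLAG_NAMES) if flags & (1 << bit)],
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
        "options": segment[20:header_len],
    }


# Hypothetical SYN: data offset 6 words (24 bytes), one MSS option (kind 2, length 4).
sample_syn = struct.pack("!HHIIBBHHH", 50000, 6379, 1000, 0, 6 << 4, 0x02, 65535, 0, 0)
sample_syn += struct.pack("!BBH", 2, 4, 1460)
header = parse_tcp_header(sample_syn)
```

The data offset is what tells a parser where options end and payload begins, which is why it must be read before anything past byte 20.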

TCP options field

Beyond MSS, Window Scale, and SACK:

Timestamps (RFC 7323): used for both RTT measurement and PAWS (Protection Against Wrapped Sequence numbers). PAWS prevents old delayed packets from being mistaken for new data on long-lived high-bandwidth connections where sequence numbers wrap around.

The TFO cookie also lives in the options field. See the SYN cookies and TFO lesson for details.

TCP header at a glance

  • Mandatory header: 20 bytes
  • Max options: 40 bytes (total header max 60 bytes)
  • Flags byte: URG ACK PSH RST SYN FIN CWR ECE
  • Window field (raw): 16 bits, max 65535 bytes
  • With Window Scale: up to 1 GiB
  • Checksum scope: IP pseudo-header + TCP segment
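The "up to 1 GiB" figure follows from the Window Scale option, which left-shifts the raw 16-bit window by a negotiated count that RFC 7323 caps at 14. A small sketch of that arithmetic (the helper name `effective_window` is an assumption for illustration):

```python
MAX_RAW_WINDOW = 0xFFFF  # 16-bit window field
MAX_SHIFT = 14           # largest shift count allowed by RFC 7323


def effective_window(raw_window: int, shift: int) -> int:
    """Return the receive window in bytes after applying the scale shift."""
    if not 0 <= raw_window <= MAX_RAW_WINDOW:
        raise ValueError("raw window must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return raw_window << shift


largest = effective_window(MAX_RAW_WINDOW, MAX_SHIFT)  # just under 2**30 bytes
```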

Nagle’s algorithm and delayed ACK — the 200 ms trap

By default a TCP sender batches small writes via Nagle’s algorithm: it does not send a partial (sub-MSS) segment while previously sent data is still unacknowledged. Independently, the receiver delays ACKs, typically by about 40 ms on Linux and up to 200 ms on some stacks, hoping to piggyback the ACK on outgoing application data.

Combined, these two create a deadly interaction on small request-response patterns:

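  1. Client issues a request as two small writes (for example a header, then a body). The first write is less than MSS and goes out immediately because nothing is in flight.
  2. Client cannot send its second small write because Nagle is holding it, waiting for the ACK of the first write.
  3. Server’s TCP stack delays that ACK, hoping to piggyback it on an outgoing response. The server’s application has only part of a request, so it has nothing to send yet.
  4. The delayed-ACK timer eventually fires and the ACK goes out on its own. Only then does Nagle release the second write, and only then can the server respond.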

Result: a 40–200 ms stall on every small interactive exchange.

The fix: TCP_NODELAY=1 on the socket disables Nagle. gRPC, Redis clients, HTTP/2 implementations, and any latency-sensitive RPC layer set it by default. TCP_QUICKACK=1 tells Linux to ACK the next segment immediately (it auto-resets after one packet, so call it after each read() in a tight loop).
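As a minimal sketch of the fix, the Python standard library exposes both options through `setsockopt`. The helper names `make_low_latency` and `quick_ack` are assumptions for illustration; `TCP_QUICKACK` is Linux-specific, so the sketch guards for its presence.

```python
import socket


def make_low_latency(sock: socket.socket) -> None:
    """Disable Nagle so small writes are sent without waiting for ACKs."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def quick_ack(sock: socket.socket) -> None:
    """Ask Linux to ACK immediately; the flag is not permanent, so
    callers re-apply it after each read. No-op where unsupported."""
    if hasattr(socket, "TCP_QUICKACK"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)


client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
make_low_latency(client)
nodelay_enabled = client.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
client.close()
```

With Nagle disabled, each write() may become its own segment, so it still pays to assemble a whole request in one buffer before writing it.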

Trace it

Trace the Nagle + delayed-ACK 200 ms stall on a Redis client without TCP_NODELAY.

  1. Client sends a 10-byte PING command to Redis. What does Nagle do?
  2. Server's Redis receives the PING. What does the kernel do with the ACK?
  3. Redis processes the command in 0.1 ms and calls write('+PONG\r\n'). Does the ACK go now?
  4. Now the client sends a PIPELINE of two commands in quick succession (write(CMD1) then write(CMD2)). What happens with Nagle?
  5. Fix: set TCP_NODELAY=1. What changes?

PSH flag

The PSH (push) flag asks the receiver’s TCP stack to deliver buffered data to the application immediately rather than waiting for more. Modern stacks pass data up to the application as soon as it arrives, so PSH is more a hint than a mandate. Senders typically set it on the last segment produced by a write(), that is, the segment that empties the send buffer. When the kernel splits one large write across several segments, or coalesces several small writes into one, PSH still marks the point where the application stopped writing.

Explicit Congestion Notification (ECN)

Instead of dropping packets to signal congestion, ECN-capable routers mark them by setting the 2-bit ECN field in the IP header to CE (Congestion Experienced). The TCP header carries two flags for the feedback loop: ECE (ECN-Echo) and CWR (Congestion Window Reduced). ECN is negotiated in the SYN/SYN-ACK exchange: the initiator sends a SYN with both ECE and CWR set, and a capable responder replies with a SYN-ACK that has ECE set and CWR clear (RFC 3168). When CE-marked packets arrive, the receiver sets ECE on its ACKs to inform the sender; the sender reduces cwnd and sets CWR to confirm.

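Defaults differ by platform. Linux ships with net.ipv4.tcp_ecn=2: it accepts ECN when the peer requests it but does not request it on outgoing connections (set the sysctl to 1 to request it as well). macOS enables ECN for connections to known-good destinations. Some old middleboxes drop ECN-marked packets, so stacks that request ECN use fallback detection and disable it toward those destinations.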
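The ECN bits are easy to misread in a packet capture, so here is a small sketch that classifies handshake segments and decodes the IP-level codepoint. It follows the RFC 3168 definitions, under which an ECN-setup SYN carries ECE and CWR while an ECN-setup SYN-ACK carries ECE only. The function names are assumptions for illustration.

```python
# TCP flag bit masks within the flags byte.
SYN, ACK, ECE, CWR = 0x02, 0x10, 0x40, 0x80

# The two low-order bits of the IP TOS / Traffic Class byte (RFC 3168).
IP_ECN_CODEPOINTS = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def classify_handshake(flags: int) -> str:
    """Classify a SYN or SYN-ACK by its ECN bits."""
    if not flags & SYN:
        return "not a SYN"
    ece, cwr = bool(flags & ECE), bool(flags & CWR)
    if not flags & ACK:
        return "ECN-setup SYN" if ece and cwr else "non-ECN SYN"
    return "ECN-setup SYN-ACK" if ece and not cwr else "non-ECN SYN-ACK"


def ip_ecn_codepoint(tos_byte: int) -> str:
    """Decode the ECN field carried in the IP header."""
    return IP_ECN_CODEPOINTS[tos_byte & 0b11]
```

For example, `classify_handshake(SYN | ECE | CWR)` identifies an initiator that is offering ECN, and `ip_ecn_codepoint` returning "CE" is the router's congestion mark.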

Keepalive

By default a TCP connection sends no packets when idle, so a NAT or firewall on the path may silently drop its state after 5–60 minutes. With SO_KEEPALIVE set, the kernel sends the first probe after tcp_keepalive_time seconds of idleness (Linux default: 7200 s = 2 h, far too long for service-mesh use), then repeats it every tcp_keepalive_intvl seconds until tcp_keepalive_probes probes have gone unanswered. For long-lived RPC connections, tune:

  • tcp_keepalive_time: 60–120 s
  • tcp_keepalive_intvl: 10–30 s
  • tcp_keepalive_probes: 3–5

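With these values a dead peer is detected in roughly 1.5 to 4.5 minutes (tcp_keepalive_time + tcp_keepalive_intvl × tcp_keepalive_probes) rather than more than two hours.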
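The sysctls above set system-wide defaults; the same three knobs can also be set per socket. The sketch below assumes the Linux option names (`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`) and guards for platforms where Python does not expose them. The helper names and the chosen values are illustrative.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 3) -> None:
    """Turn on keepalive and, where supported, override the timers per socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    overrides = (("TCP_KEEPIDLE", idle_s),
                 ("TCP_KEEPINTVL", interval_s),
                 ("TCP_KEEPCNT", probes))
    for name, value in overrides:
        if hasattr(socket, name):  # names are platform-dependent
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)


def detection_time_s(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate time until an unresponsive peer is declared dead."""
    return idle_s + interval_s * probes


conn = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(conn)
keepalive_on = conn.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
conn.close()

tuned = detection_time_s(60, 10, 3)          # 90 seconds
linux_default = detection_time_s(7200, 75, 9)  # 7875 seconds with stock Linux values
```

Setting the options per socket keeps the change local to the RPC client rather than affecting every connection on the host.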

Debug this

ss output — diagnose the CLOSE-WAIT accumulation

log
$ ss -tan state established | wc -l
12384
$ ss -tan state close-wait | wc -l
9821
$ ss -tan state time-wait | wc -l
1247
$ ss -s
Total: 12500
TCP:   23552 (estab 12384, closed 8920, orphaned 2, timewait 1247)
$ ps -p 1234 -o pid,stat,rss,vsz,cmd
PID STAT  RSS    VSZ CMD
1234 Ssl 8392000 12000000 /usr/bin/app-server

The service process has 12k ESTABLISHED + 9.8k CLOSE-WAIT sockets and RSS is climbing. What is the bug and what is the fix?
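A socket sits in CLOSE-WAIT after the peer has sent FIN and until the local application calls close(), so a growing count usually points at a code path that sees end-of-stream and never closes. The following is a minimal sketch of the pattern that avoids the leak, not the fix for any particular server. The handler name `serve_connection` is hypothetical, and the usage lines run it over a `socket.socketpair()` purely to exercise the end-of-stream path without a network.

```python
import socket


def serve_connection(conn: socket.socket) -> int:
    """Echo data until the peer closes, then close our end. Returns bytes echoed."""
    total = 0
    with conn:  # guarantees close() even if recv() or sendall() raises
        while True:
            data = conn.recv(4096)
            if not data:
                # Empty read means the peer sent FIN. On a TCP socket we are
                # now in CLOSE-WAIT and stay there until we close.
                break
            conn.sendall(data)
            total += len(data)
    return total


# Local demonstration: one end writes, half-closes, and the handler cleans up.
peer, ours = socket.socketpair()
peer.sendall(b"ping")
peer.shutdown(socket.SHUT_WR)  # the stand-in for the peer's FIN
echoed = serve_connection(ours)
reply = peer.recv(16)
peer.close()
```

To find the leaking path in a real service, it helps to check which remote addresses the CLOSE-WAIT sockets point at, since that usually identifies the client or connection pool whose error or timeout branch skips the close.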

Quiz

Why does the combination of Nagle's algorithm and delayed ACK cause a ~200 ms stall on small RPC traffic?

Why this works

Why PAWS exists. TCP’s 32-bit sequence number space covers only 4 GiB, so on a fast, long-lived connection it wraps quickly: in about 34 s at 1 Gbit/s and in under 4 s at 10 Gbit/s. Without PAWS, a delayed segment from an earlier pass through the sequence space could arrive and look like a valid new segment. RFC 7323 Timestamps enable PAWS: each segment carries a timestamp, and the receiver rejects any segment whose timestamp is older than the most recent one it accepted on that connection. This makes corruption from wrapped sequence numbers effectively impossible.
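How soon the sequence space wraps depends only on throughput, since every payload byte consumes one sequence number. A short sketch of the arithmetic, assuming the link is kept fully busy (the function name `wrap_time_s` is illustrative):

```python
SEQ_SPACE_BYTES = 2 ** 32  # one sequence number per payload byte


def wrap_time_s(bits_per_second: float) -> float:
    """Seconds needed to send 2**32 bytes at the given sustained rate."""
    if bits_per_second <= 0:
        raise ValueError("rate must be positive")
    return SEQ_SPACE_BYTES * 8 / bits_per_second


at_100_mbit = wrap_time_s(100e6)  # about 344 s
at_1_gbit = wrap_time_s(1e9)      # about 34 s
at_10_gbit = wrap_time_s(10e9)    # about 3.4 s
```

At 10 Gbit/s the wrap time is shorter than many retransmission and queueing delays, which is the case PAWS is designed for.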

Recall before you leave
  1. Explain the Nagle + delayed-ACK stall: which side causes which delay and what is the standard fix?
  2. What does PAWS protect against and what TCP option enables it?
  3. What does a large CLOSE-WAIT count on a server indicate and how do you find the root cause?
Recap

The TCP header is 20 mandatory bytes plus up to 40 bytes of options. Key options (MSS, Window Scale, SACK, Timestamps, TFO cookie) are negotiated in SYN/SYN-ACK and then fixed. Nagle’s algorithm and delayed ACK are both default behaviours that interact catastrophically on small-write interactive traffic — the sender holds a write waiting for an ACK the receiver is delaying for 40 ms. TCP_NODELAY disables Nagle and is standard for any RPC client. CLOSE-WAIT accumulation is the canonical socket-leak symptom: the peer sent FIN but the application never called close(). ECN marks congested packets instead of dropping them; SO_KEEPALIVE probes idle connections to detect dead peers before NAT or firewall state expires silently.

Connected lessons
Continue the climb ↑ SYN cookies, TFO, and TIME-WAIT at scale

Trademarks belong to their respective owners. Editorial reference only.