Networking & Protocols
TCP handshake: measure and tame a connection-churn problem
Reading about handshakes and TIME-WAIT is not the same as watching a service drown in connection churn and pulling it out. Build a small client/server, drive it into a real per-request-connection problem, and apply the unit’s levers (pooling, tuning, congestion control) until latency and socket counts recover, with packet-level evidence at every step.
Turn the unit’s mental model into a reproducible engineering loop: capture the handshake on the wire, reproduce a connection-churn latency and port-exhaustion failure, fix it structurally, and verify with before/after metrics measured under identical load.
Take a client that opens a fresh TCP connection per request to a backend, reproduce the latency and TIME-WAIT/port-exhaustion failure it causes under load, then eliminate it with connection pooling plus the right kernel tuning — proving each step with packet captures, ss output, and measured latency, not estimates.
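The difference between the two client behaviours described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference solution: the echo server, the helper names (`CountingServer`, `run_per_request`, `run_reused`, `measure`), and the request count are all hypothetical, and the server counts accepted connections as a stand-in for handshakes.

```python
import socket
import socketserver
import threading
import time


class EchoHandler(socketserver.BaseRequestHandler):
    """Echo every chunk back until the client closes the connection."""

    def handle(self):
        # One accepted connection corresponds to one completed handshake.
        with self.server.lock:
            self.server.accepted += 1
        while True:
            data = self.request.recv(1024)
            if not data:
                break
            self.request.sendall(data)


class CountingServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, addr, handler):
        super().__init__(addr, handler)
        self.accepted = 0
        self.lock = threading.Lock()


def request(sock, payload=b"ping"):
    """Send one request and block until the full echo has arrived."""
    sock.sendall(payload)
    received = b""
    while len(received) < len(payload):
        chunk = sock.recv(1024)
        if not chunk:
            raise ConnectionError("server closed the connection early")
        received += chunk
    return received


def run_per_request(addr, n):
    """Churn pattern: connect, send one request, close. Repeated n times."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        with socket.create_connection(addr) as sock:
            request(sock)
        latencies.append(time.perf_counter() - start)
    return latencies


def run_reused(addr, n):
    """Reuse pattern: one connection carries all n requests."""
    latencies = []
    with socket.create_connection(addr) as sock:
        for _ in range(n):
            start = time.perf_counter()
            request(sock)
            latencies.append(time.perf_counter() - start)
    return latencies


def measure(n=200):
    """Run both patterns against a local echo server and count accepts."""
    server = CountingServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    addr = server.server_address
    try:
        churn = run_per_request(addr, n)
        churn_accepts = server.accepted
        reused = run_reused(addr, n)
        reuse_accepts = server.accepted - churn_accepts
    finally:
        server.shutdown()
        server.server_close()
    return churn_accepts, reuse_accepts, churn, reused


if __name__ == "__main__":
    churn_accepts, reuse_accepts, churn, reused = measure()
    print(f"per-request: {churn_accepts} connections, mean {sum(churn) / len(churn) * 1e6:.0f} us")
    print(f"reused:      {reuse_accepts} connection(s), mean {sum(reused) / len(reused) * 1e6:.0f} us")
```

On loopback the round-trip time is tiny, so the latency gap will look small here; the connection counts are the reliable signal. Running the same client across a path with real or emulated delay should make the extra handshake round-trip visible in the latency numbers as well.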
- A before/after table: p50/p99 request latency, handshakes per second, and TIME-WAIT socket count — all measured under the same load, not estimated.
- An annotated handshake capture (before) and a captured request reusing an established connection (after), showing the missing handshake round-trip.
- Confirmation that the TIME-WAIT count (from ss) and EADDRNOTAVAIL errors (from the client) are gone or sharply reduced after pooling, with a one-line justification for each kernel setting you changed.
- A short write-up naming the one structural lever (connection reuse) and explaining why tuning sysctls alone — including a shorter MSL — would not have been the correct fix.
- Add a SYN-flood demo against the listener with tcp_syncookies on, capture the SyncookiesSent and SyncookiesRecv counters from nstat (reported with a TcpExt prefix), and explain what legitimate long-RTT clients lose while cookies are active.
- On a netem path with ~1% random loss, measure throughput under CUBIC vs BBR with ss -tin (watch cwnd, ssthresh, retrans) and show BBR sustaining throughput where CUBIC collapses.
- Add an on-call runbook covering how to triage TIME-WAIT exhaustion and CLOSE-WAIT leaks from ss in under five minutes, the fix-priority ladder (pool before tcp_tw_reuse before a wider ip_local_port_range before MSL), and a verification checklist.
- Reproduce the Nagle plus delayed-ACK 40–200 ms stall on small pipelined writes, then show TCP_NODELAY removing it, with before/after p99 from the same trace.
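For the before/after table, two small helpers are usually enough: one to turn latency samples into p50/p99, and one to count sockets per TCP state. The sketch below assumes Linux and reads text in the `/proc/net/tcp` format, where the fourth column is a hex state code (`06` is TIME_WAIT, `08` is CLOSE_WAIT). The function names are hypothetical, and the percentile uses the simple nearest-rank method rather than interpolation.

```python
import math
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def count_states(proc_net_tcp_text):
    """Count sockets per state from text in /proc/net/tcp format."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        code = fields[3]
        counts[TCP_STATES.get(code, code)] += 1
    return counts


if __name__ == "__main__":
    # IPv4 sockets only; IPv6 sockets are listed separately in /proc/net/tcp6.
    with open("/proc/net/tcp") as f:
        print(count_states(f.read()))
```

Sampling `count_states` at the same point in each load run (for example, immediately after the last request) keeps the before and after rows of the table comparable.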
This is the loop you will run in every real connection-churn incident: capture the handshake to see the round-trip you are paying, reproduce the failure (latency from per-request connections, port exhaustion from TIME-WAIT), fix it at the structural level with a keep-alive pool, add only the kernel tuning the evidence justifies, and verify with before/after numbers under identical load. Doing it once on a toy service makes the production version muscle memory — and teaches you why pooling, not a shorter MSL, is the correct lever.
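The port-exhaustion side of that loop can be sanity-checked with back-of-envelope arithmetic before any load is generated. The sketch below assumes the common Linux defaults (an ephemeral port range of 32768 to 60999 and a 60-second TIME-WAIT), a client that closes first, and a single source IP talking to a single backend IP and port; the function name is hypothetical, and the real values should be checked on your own host.

```python
def max_sustained_connect_rate(port_low=32768, port_high=60999, time_wait_s=60):
    """Upper bound on new connections per second to ONE destination ip:port.

    Each closed connection parks its local port in TIME-WAIT for time_wait_s,
    so at steady state at most `ports` connections can be opened per interval.
    Defaults are assumptions based on typical Linux settings.
    """
    ports = port_high - port_low + 1
    return ports, ports / time_wait_s


if __name__ == "__main__":
    ports, rate = max_sustained_connect_rate()
    print(f"{ports} ephemeral ports, about {rate:.0f} new connections/s before exhaustion")
    # prints: 28232 ephemeral ports, about 471 new connections/s before exhaustion
```

Under these assumptions a client that opens one connection per request runs out of ports at a few hundred requests per second, while a pooled client holding a handful of long-lived connections never approaches the limit. Widening the port range roughly doubles the ceiling at best, which is why it sits below pooling on the fix ladder.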