From Nagle to io_uring: evolution of batching

One pattern runs from Nagle (1984) through Kafka linger.ms to io_uring: amortize a fixed per-op cost over N ops. The lever is the same; only the fixed cost (TCP header, round-trip, syscall) and the window-tuning move change.

In 1984, John Nagle watched a single telnet keystroke leave a host as a 41-byte packet: one byte of payload wrapped in a 40-byte TCP/IP header. On the congested ARPANET that was about 97.6% overhead (40 of every 41 bytes), and thousands of those tinygrams were melting links. His fix was three lines of TCP logic. Forty years later you tune the exact same lever every time you set linger.ms in Kafka or batch SQEs into one io_uring_enter. The fixed cost changed names; the move never did.

The batching insight is timeless: when a fixed cost per operation dominates the variable cost of the payload — a TCP header, a network round-trip, a syscall context switch, a DB connection acquire — group operations so you pay the fixed cost once instead of N times. Everything in this lesson is that one idea, applied at a different layer of the stack across four decades. The only real design knob is the window: how long to wait, or how full to get, before you flush.
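A minimal sketch of that window knob, flushing on whichever comes first: a full batch or an expired linger window. The names Batcher, max_items, linger_s, and poll are illustrative assumptions, not any library's API, and a production batcher would likely use a timer thread rather than explicit polling.

```python
import time
from typing import Any, Callable, List, Optional


class Batcher:
    """Buffers items; flushes when the batch is full or when the linger
    window, measured from the first buffered item, has elapsed."""

    def __init__(self, flush: Callable[[List[Any]], None], max_items: int,
                 linger_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_items = max_items
        self._linger_s = linger_s
        self._clock = clock
        self._items: List[Any] = []
        self._first_at: Optional[float] = None

    def add(self, item: Any) -> None:
        if not self._items:
            self._first_at = self._clock()   # window opens with the first item
        self._items.append(item)
        if len(self._items) >= self._max_items:
            self.flush_now()                 # size trigger

    def poll(self) -> None:
        """Call periodically; flushes if the linger window has expired."""
        if self._items and self._clock() - self._first_at >= self._linger_s:
            self.flush_now()                 # time trigger

    def flush_now(self) -> None:
        if self._items:
            batch, self._items = self._items, []
            self._first_at = None
            self._flush(batch)


# Demo with a hand-driven clock so the behaviour is deterministic.
fake_now = [0.0]
flushed: List[List[Any]] = []
batcher = Batcher(flushed.append, max_items=3, linger_s=0.005,
                  clock=lambda: fake_now[0])
batcher.add("a")
batcher.add("b")
batcher.poll()             # window not yet expired: nothing flushed
fake_now[0] = 0.006
batcher.poll()             # 6 ms elapsed, 5 ms linger: flushes ["a", "b"]
for item in (1, 2, 3):
    batcher.add(item)      # the third add fills the batch and flushes
print(flushed)             # [['a', 'b'], [1, 2, 3]]
```

Setting linger_s to 0 and max_items to 1 degenerates to no batching at all, which is the left edge of the tradeoff the rest of the lesson explores.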

Nagle’s algorithm: the original batching tradeoff

RFC 896 (1984) introduced what we now call Nagle’s algorithm. The rule is small: while there is unacknowledged data already in flight, hold back any new small segment and coalesce it with later writes; flush immediately only when an ACK clears the in-flight data, or when you have a full MSS worth of bytes to send. The motivation was brutal arithmetic — a 1-byte telnet keystroke became a 41-byte packet, so 40 of every 41 bytes on the wire were header. Nagle’s rule turned a burst of keystrokes into one packet per round-trip instead of one packet per key.
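The rule above can be sketched as a single send decision. This is a simplified model, not kernel code: the function name and parameters are hypothetical, and the 1460-byte MSS is an assumed typical Ethernet value.

```python
def nagle_should_send(pending_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Return True if buffered data may go out now under Nagle's rule."""
    if pending_bytes <= 0:
        return False               # nothing buffered
    if pending_bytes >= mss:
        return True                # a full segment is always sent
    return unacked_bytes == 0      # small segment: only if nothing is in flight


MSS = 1460  # assumed MSS, for illustration only
print(nagle_should_send(1, MSS, unacked_bytes=0))     # idle connection: True
print(nagle_should_send(1, MSS, unacked_bytes=1))     # earlier byte unacked: False
print(nagle_should_send(1460, MSS, unacked_bytes=1))  # full MSS buffered: True
```

The second case is the coalescing step: the keystroke stays in the buffer, and whatever else is written before the ACK arrives rides out in the same packet.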

This is the thesis in one picture: the fixed per-op cost (the 40-byte header) dwarfs the variable payload, so amortizing it over many writes is pure win. Every batcher since — linger.ms, io_uring — fights the same lopsided ratio.
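To make the lopsided ratio concrete, here is the header arithmetic for a few payload sizes. It assumes the 40-byte header from the text (20 bytes of IPv4 plus 20 bytes of TCP, no options); the function name is illustrative.

```python
HEADER_BYTES = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def header_overhead(payload_bytes: int, header_bytes: int = HEADER_BYTES) -> float:
    """Fraction of the bytes on the wire that are header, for one packet."""
    return header_bytes / (header_bytes + payload_bytes)


for payload in (1, 10, 100, 1460):
    print(f"{payload} B payload: {header_overhead(payload):.1%} header")
# 1 B payload: 97.6% header
# 10 B payload: 80.0% header
# 100 B payload: 28.6% header
# 1460 B payload: 2.7% header
```

Coalescing ten keystrokes already cuts the overhead from about 98% to 80%, and a full segment makes the header nearly free.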

The cost lands on interactive and request/response traffic. If your application does a small write() and then waits for a reply, Nagle may sit on that last small segment, hoping for more data that never comes — so it waits for the ACK instead, adding up to a full round-trip of dead time. The escape hatch is the TCP_NODELAY socket option, which disables the algorithm so every write goes out immediately. This is why HTTP/2, gRPC, Redis clients, and basically every modern RPC stack set TCP_NODELAY at connect time and do their own batching at the application layer, where they actually know message boundaries.
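As a small illustration of that escape hatch, this is how the option is set with Python's standard socket module. The helper name is hypothetical; real clients usually set the option on the connection their library hands them.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 disables Nagle for this socket only.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Reading the option back returns a nonzero value when it is set.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

With Nagle off, the application owns the batching decision, so many tiny writes now mean many tiny packets unless the caller buffers them itself.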

The Nagle + delayed-ACK deadlock (the postmortem beat)

The famous failure mode is not Nagle alone — it is Nagle interacting with TCP delayed ACK. Delayed ACK is the receiver-side mirror of Nagle: instead of ACKing every segment, the receiver waits (Linux default: up to ~40ms) hoping to piggyback the ACK onto a reply or batch it with the next one. Now compose the two. The sender writes a response slightly larger than one MSS: the first full segment goes out, but the small trailing segment is held by Nagle because the first is still unacknowledged. The receiver got the first segment but holds its ACK under delayed-ACK, waiting to piggyback. Neither side will move. The deadlock breaks only when the 40ms delayed-ACK timer fires.

The symptom in production is unmistakable and infuriating: a protocol that should do thousands of transactions per second mysteriously caps near 25/sec, with latency histograms spiking at a suspiciously round 40ms (or 200ms on some stacks). Marc Brooker’s line — “It’s always TCP_NODELAY. Every damn time.” — is folklore for a reason. The fix is one socket option; the diagnosis is the hard part, because the 40ms is paid by nobody’s CPU and shows up only as wall-clock stall.
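The 25/sec figure is just the reciprocal of the stall. A rough model, assuming a single connection that completes one transaction before starting the next and pays the timer once per transaction (the function name is illustrative):

```python
def max_serial_tps(stall_seconds: float, rtt_seconds: float = 0.0) -> float:
    """Upper bound on transactions/sec for one connection handling one
    request at a time, when every transaction pays a fixed stall."""
    return 1.0 / (stall_seconds + rtt_seconds)


print(round(max_serial_tps(0.040), 1))  # 40 ms delayed-ACK timer: 25.0
print(round(max_serial_tps(0.200), 1))  # 200 ms timer: 5.0
```

The cap is independent of CPU speed and payload size, which is why adding hardware does not move it.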

The latency-throughput Pareto frontier

Strip away the layer-specific details and every batching system traces the same curve. On one axis: batch window (time or size). On the other two: throughput and per-item latency, which move in opposition.

Operating point	Window	Per-item latency	Throughput
No batching	window = 0	Minimum (send now)	Capped by fixed cost per op
SLO operating point	largest window with `p99 < SLO`	At the SLO ceiling	Near-max under that ceiling
Infinite batching	window = ∞	Unbounded (first item never flushes)	Max theoretical

The senior workflow is not “pick a number” — it is: define the latency SLO ceiling first (say p99 < 50ms), then find the largest batch window that still fits under it, because that window gives you the most throughput you can buy without breaking the contract. Static systems tune this knob once and live with it. Adaptive systems track the curve at runtime: under light load they shrink the window toward zero (latency matters, there is nothing to batch anyway); under heavy load they let batches fill (throughput matters, and items are arriving fast enough that the wait is cheap). Kafka 4.0 quietly encoded this wisdom: the producer linger.ms default moved from 0 to 5ms, because the efficiency win from fuller batches usually pays for the 5ms wait — frequently yielding lower end-to-end latency, not higher, by reducing per-request overhead.
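The static version of that workflow can be sketched as a lookup over load-test results. The measurements below are hypothetical placeholders, not benchmark data, and the function name is an assumption.

```python
from typing import Mapping, Optional


def largest_window_under_slo(p99_ms_by_window_ms: Mapping[float, float],
                             slo_ms: float) -> Optional[float]:
    """Return the largest tested batch window whose p99 stays under the SLO,
    or None if no tested window fits."""
    fitting = [w for w, p99 in p99_ms_by_window_ms.items() if p99 < slo_ms]
    return max(fitting) if fitting else None


# Hypothetical load-test results: batch window (ms) -> observed p99 (ms).
measured = {0: 14.0, 5: 11.0, 10: 19.0, 20: 37.0, 40: 68.0}
print(largest_window_under_slo(measured, slo_ms=50.0))  # 20
```

The function checks every tested window instead of stopping at the first failure, because p99 is not guaranteed to rise monotonically with the window: a small window can lower latency by cutting per-request overhead.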

Batch coalescing and request deduplication

What if the fixed cost is not a syscall or a network hop, but a backend query — and a hundred callers all want the same answer? Batching in time is not enough; you also need to batch across callers.

Cache and lookup workloads have a sharper variant: many callers want the same result, so the goal is to avoid duplicate work rather than to raise raw throughput. When concurrent requests miss the cache for the same key, you can collapse them into a single in-flight load instead of N duplicate loads. Worker A misses key K and starts the DB query; workers B, C, and D also miss K, see A’s request already pending, and attach to it rather than firing their own. One result fans out to all of them.

This is the cure for the cache stampede / thundering herd: a hot key expires, 100 requests arrive in the same millisecond, and without coalescing all 100 hammer the database at once — often enough to knock it over right when traffic is highest. With coalescing, those 100 misses become 1 query and 99 free-riders. The implementations are everywhere under different names: Go’s golang.org/x/sync/singleflight, Java’s Caffeine AsyncLoadingCache, request collapsing in Varnish and most CDNs. GraphQL’s DataLoader takes it one step further by combining coalescing with windowed batch loading: every distinct key requested within one tick is deduped and the unique keys are bundled into a single batched backend query, which is also how DataLoader kills the N+1 query problem.
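A minimal sketch of the coalescing idea using only Python's standard library. It mirrors the shape of singleflight but is not that library's API: the class, method, and helper names here are assumptions, and the 0.2-second sleep stands in for a slow backend query.

```python
import threading
import time
from typing import Any, Callable, Dict, Hashable, List, Optional


class _Call:
    """State shared by the leader and the followers of one in-flight load."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Collapses concurrent calls for the same key into one execution."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[Hashable, _Call] = {}

    def do(self, key: Hashable, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.value = fn()             # the single real load
            except BaseException as exc:
                call.error = exc              # followers see the same error
            finally:
                with self._lock:
                    del self._inflight[key]   # later callers start a fresh load
                call.done.set()
        else:
            call.done.wait()                  # free-rider: wait for the leader
        if call.error is not None:
            raise call.error
        return call.value


load_calls: List[int] = []


def slow_load() -> str:
    load_calls.append(1)
    time.sleep(0.2)                           # stand-in for a slow DB query
    return "value-for-K"


flight = SingleFlight()
barrier = threading.Barrier(100)
results: List[str] = []


def worker() -> None:
    barrier.wait()                            # all 100 callers miss together
    results.append(flight.do("K", slow_load))


threads = [threading.Thread(target=worker) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# When every caller overlaps the load, this reports 1 load and 100 results.
print(len(load_calls), len(results))
```

This sketch only deduplicates loads that overlap in time. It does not cache the result, so a caller arriving after the leader finishes triggers a new load; in practice coalescing usually sits in front of a cache.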

Why this works

Coalescing and Nagle look different but share a spine. Nagle merges writes in time on one connection to amortize header cost. Singleflight merges reads across callers on one key to amortize a backend query. Both answer the same question — “several small things want the same expensive operation; can one trip serve them all?” — just along different dimensions (time vs. identity).

Why the lineage matters

The whole point of seeing Nagle, Kafka, and io_uring as one family is that the tuning method transfers. Each is just a different fixed cost wrapped in the same lever, so the diagnostic question is identical every time.

Era	System	Fixed cost amortized	Window knob
1984	Nagle / RFC 896	40-byte TCP/IP header per segment	in-flight ACK or full MSS
2011	Kafka producer `linger.ms`	network round-trip + broker request	`linger.ms` + `batch.size`
2019	io_uring (Linux 5.1)	syscall + context switch into kernel	SQEs queued before one `io_uring_enter`

io_uring is the cleanest modern echo: instead of one syscall per I/O, you fill a ring of submission queue entries (SQEs) in shared memory and submit a whole batch with a single io_uring_enter — amortizing the kernel-boundary crossing across many operations, exactly as Nagle amortized the header across many keystrokes. Same lever, new fixed cost. Once you see the pattern, the work is always the same three steps: measure the fixed cost vs. the variable cost, confirm the fixed cost actually dominates, then size the window to the largest value your latency SLO allows.
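The first two of those steps reduce to a small cost model. The numbers below are hypothetical (1000 ns per kernel crossing, 100 ns of per-operation work), chosen only to show the shape of the curve; the function names and the 50% threshold are assumptions.

```python
import math


def total_cost(n_ops: int, batch_size: int, fixed: float, variable: float) -> float:
    """Cost of n_ops when each batch pays `fixed` once and each op pays `variable`."""
    batches = math.ceil(n_ops / batch_size)
    return batches * fixed + n_ops * variable


def fixed_cost_dominates(fixed: float, variable: float,
                         threshold: float = 0.5) -> bool:
    """True if the fixed cost is more than `threshold` of an unbatched op."""
    return fixed / (fixed + variable) > threshold


FIXED_NS = 1000    # hypothetical cost of one kernel crossing
VARIABLE_NS = 100  # hypothetical per-operation work

print(fixed_cost_dominates(FIXED_NS, VARIABLE_NS))   # True: batching should help
print(total_cost(1000, 1, FIXED_NS, VARIABLE_NS))    # 1100000: one crossing per op
print(total_cost(1000, 32, FIXED_NS, VARIABLE_NS))   # 132000: 32 crossings in total
```

If fixed_cost_dominates returns False for the measured numbers, batching buys little and the added latency is not worth paying.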

Quiz

A request/response service over TCP mysteriously caps near 25 transactions/sec, with latencies clustered at exactly 40ms. What's the most likely cause?

Quiz

A hot cache key expires and 100 concurrent requests miss it in the same millisecond. What does request coalescing (singleflight) do?

Order the steps

Order the senior workflow for tuning any batch window:

1 Measure the fixed cost per op vs. the variable cost of the payload
2 Confirm the fixed cost actually dominates (else batching buys little)
3 Define the latency SLO ceiling first (e.g. p99 under 50ms)
4 Find the largest batch window that still fits under that ceiling
5 Decide static vs. adaptive: fixed knob, or shrink/grow the window with load

Pick the best fit

A low-traffic internal RPC service does small request/response calls and is hitting a fixed ~40ms latency floor per call. Pick the fix a senior defends.

With io_uring, N operations cross the kernel boundary once, not N times: the same lever Nagle used for TCP headers. In SQPOLL mode a kernel thread polls the submission queue tail, so under sustained load submissions need no syscall at all.

Recall before you leave

01 What problem did Nagle's algorithm solve, what is the mechanism, and how does the classic deadlock arise?
02 Explain the lineage Nagle → Kafka linger.ms → io_uring as one pattern, and the workflow for tuning the window.

Recap

Batching is one lever applied across four decades: when a fixed per-operation cost (TCP header, round-trip, syscall) dominates the variable cost of the payload, group operations to pay it once. Nagle’s algorithm (1984) coalesced tiny TCP writes until an ACK or a full MSS, and its famous deadlock with delayed ACK pins latency at 40ms until TCP_NODELAY turns it off. Kafka’s linger.ms and io_uring’s batched SQE submission are the same move at higher layers, which is why the tuning workflow never changes: measure fixed vs. variable cost, define the latency SLO ceiling, then take the largest window that fits under it — statically or adaptively. Request coalescing (singleflight, Caffeine, DataLoader) applies the idea across callers instead of over time, collapsing a thundering herd of identical cache misses into a single backend load. Now when you hit a latency floor that burns no CPU and clusters at a suspiciously round number, reach for TCP_NODELAY before anything else — it is almost always Nagle.
