
Performance

Batching: amortize fixed cost per operation

Crux: When per-operation price is mostly fixed (syscall, network round-trip, ACK, log flush), grouping operations amortizes it. The window (size plus max-wait) decides where throughput meets tail latency.

A team chases a slow ingest pipeline for two weeks. The CPU is idle, the network is idle, the disk is idle, yet 1000 rows take 9 seconds to insert. Nothing is “busy.” The profiler finally points at the obvious: 1000 separate INSERT statements, 1000 separate round-trips to Postgres, each one waiting for an ACK before the next begins. They swap the loop for one COPY. Same hardware, same rows: 14 milliseconds. They never made anything faster; they just stopped paying the toll a thousand times.

The fixed-vs-variable cost model

Every operation has two prices. The variable price scales with the data: serializing a 4 KB payload costs more than a 40 byte one. The fixed price is paid per call no matter how small the payload, and it is usually the expensive half. A syscall pays the user→kernel mode transition and back. A network call pays a full round-trip plus the ACK. A database write pays query parsing and planning, a transaction commit, and a WAL flush. A durably written log line pays an fsync.

Write the total cost of N operations as total = N * (F + V), where F is the fixed cost per call and V is the variable cost per item. When F dominates V and N is large, almost everything you pay is F, repeated N times, and the data itself is rounding error. Batching everything into one call changes the shape to total = F + N * V: you pay F once and V per item. In general, with batches of B items, total = ceil(N / B) * F + N * V, so the fixed cost per item falls to roughly F / B. The fixed cost is amortized across the whole batch. That single algebraic move, pulling F outside the loop, is the entire idea.
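The cost model can be checked with a few lines of arithmetic. This is a minimal sketch, not a benchmark: the function names are hypothetical, and the values of F and V are assumed purely for illustration (they are chosen to land near the 9-second figure from the opening story, not measured from it).

```python
import math


def unbatched_cost(n: int, fixed: float, variable: float) -> float:
    """Total cost when every item pays the fixed cost: N * (F + V)."""
    return n * (fixed + variable)


def batched_cost(n: int, fixed: float, variable: float, batch_size: int) -> float:
    """Total cost when F is paid once per batch: ceil(N / B) * F + N * V."""
    batches = math.ceil(n / batch_size)
    return batches * fixed + n * variable


# Assumed, illustrative numbers: F = 9 ms per call, V = 0.01 ms per row.
N, F_MS, V_MS = 1000, 9.0, 0.01

one_by_one = unbatched_cost(N, F_MS, V_MS)
single_batch = batched_cost(N, F_MS, V_MS, batch_size=N)
batches_of_100 = batched_cost(N, F_MS, V_MS, batch_size=100)

print(f"unbatched:      {one_by_one:.0f} ms")
print(f"batches of 100: {batches_of_100:.0f} ms")
print(f"one batch:      {single_batch:.0f} ms")
```

With these assumed inputs the unbatched total is dominated almost entirely by F, and even a modest batch size removes most of it. Most of the win arrives well before the batch covers every item.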

This is why the slow pipeline was slow. The CPU, disk, and network all looked idle because the bottleneck was latency, not utilization: each operation spent its life waiting for the previous round-trip to complete. The fixed cost was never CPU time you could see in a flame graph — it was dead time on the wire and in the kernel.

Fixed cost lives at every layer

The reason batching shows up everywhere — TCP’s Nagle algorithm, Kafka’s linger.ms, Postgres COPY, Redis pipelining, io_uring submission queues, syslog buffering — is that every layer of the stack has its own per-operation toll. Knowing which fixed cost you are amortizing tells you how big the win will be.

| Layer | Fixed cost per op | Batched as | Reported gain |
| --- | --- | --- | --- |
| Syscall | user↔kernel mode transition (~hundreds of ns each) | io_uring / writev / batched submission | millions of IOPS without per-op syscall |
| Network (Redis) | full RTT + ACK, paid per command | pipelining (send many, read replies once) | 10k PINGs: 1.19s → 0.25s (~5x) |
| Broker (Kafka) | produce request + replication ACK | batch.size + linger.ms | ~8k → ~150k msg/s with batching on |
| Database (Postgres) | parse + plan + commit + WAL flush | COPY / multi-row INSERT | 10M rows: 9000s of single INSERTs → 14s COPY |
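The database row of the table can be sketched with Python's standard-library sqlite3 module, used here as a stand-in for Postgres so the example runs without a server. It is a swapped-in illustration: an in-memory SQLite database has no network round-trip and no disk flush, so it shows the shape of the change (one statement and one commit per row versus one call and one commit per batch), not the timings. The table and function names are hypothetical.

```python
import sqlite3

INSERT_SQL = "INSERT INTO events (id, payload) VALUES (?, ?)"


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def insert_one_by_one(conn: sqlite3.Connection, rows) -> None:
    """Pays a statement execution and a commit for every single row."""
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()


def insert_batched(conn: sqlite3.Connection, rows) -> None:
    """Sends all rows through one executemany call and commits once."""
    conn.executemany(INSERT_SQL, rows)
    conn.commit()


rows = [(i, f"row-{i}") for i in range(1000)]

slow_db, fast_db = make_db(), make_db()
insert_one_by_one(slow_db, rows)
insert_batched(fast_db, rows)

slow_count = slow_db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
fast_count = fast_db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(slow_count, fast_count)
```

Both paths store the same rows. The difference is how many times the per-statement and per-commit toll is paid, which against a networked database with a durable log is where the time goes.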

The Redis case is the cleanest illustration of the model. Over a 250 ms link, a server that can serve 100k requests/sec is capped at 4 requests/sec if the client waits for each reply — because the bottleneck is the RTT, paid per command. Pipeline the commands and you pay one RTT for the whole batch: throughput jumps back toward the server’s real ceiling. The hardware never changed; the fixed cost just stopped repeating.
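The arithmetic behind that cap is short enough to write down. The sketch below is a deliberately simplified model, assuming a single connection, one round-trip per batch of commands, and no bandwidth or server-side processing limits other than the stated ceiling. The function name and the pipeline depths are hypothetical.

```python
def max_throughput(rtt_seconds: float, server_rps: float, pipeline_depth: int = 1) -> float:
    """Upper bound on requests/sec for a client that waits one RTT per batch.

    pipeline_depth is how many commands are sent before waiting for replies.
    """
    latency_bound = pipeline_depth / rtt_seconds
    return min(latency_bound, server_rps)


RTT_S = 0.250          # the 250 ms link from the text
SERVER_RPS = 100_000   # the server's own ceiling from the text

waiting_per_command = max_throughput(RTT_S, SERVER_RPS)
pipelined_1k = max_throughput(RTT_S, SERVER_RPS, pipeline_depth=1_000)
pipelined_100k = max_throughput(RTT_S, SERVER_RPS, pipeline_depth=100_000)

print(waiting_per_command, pipelined_1k, pipelined_100k)
```

In this model the client is latency-bound until the pipeline is deep enough that the server's ceiling becomes the smaller of the two limits. Past that point a deeper pipeline buys nothing.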

The window: size and max-wait

A batch does not assemble itself for free — items have to accumulate before they ship. That accumulation is governed by a window with two knobs, and whichever fires first closes the batch:

  • Size — a count or byte cap. Kafka’s batch.size defaults to 16 KB; fill it and the batch flushes immediately.
  • Max-wait — a time cap. Kafka’s linger.ms (default 5 ms in modern versions) is the longest the producer will hold an under-full batch hoping more arrives.

Under heavy load, batches fill before the timer expires, so you ride the size cap and get near-maximum amortization for free. Under light load, the timer is what closes the batch — and that is where the cost hides. An item arriving into an empty window pays the full linger.ms of dead time even though the system is idle. Bigger windows buy more throughput per unit fixed cost but charge it to tail latency: the items at the front of the window wait the longest. Later lessons go deep on tuning this; for now, hold the shape — the window is the dial between throughput and tail latency, and the senior question is never “batch or not” but “what window keeps p99 under the SLO?”
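The two-knob window can be sketched as a small class. This is a minimal illustration under stated assumptions, not how Kafka or any particular library implements it: the class and method names are hypothetical, the caller passes the current time in explicitly so the behavior is deterministic (real code would typically read a monotonic clock and run poll on a timer), and there is no thread safety.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class WindowBatcher(Generic[T]):
    """Closes a batch on whichever fires first: the size cap or the max-wait timer."""

    def __init__(self, max_size: int, max_wait_s: float) -> None:
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self._items: List[T] = []
        self._opened_at: Optional[float] = None

    def add(self, item: T, now: float) -> Optional[List[T]]:
        """Buffer an item; return the batch if the size cap closes the window."""
        if not self._items:
            self._opened_at = now  # the window opens when the first item arrives
        self._items.append(item)
        if len(self._items) >= self.max_size:
            return self._flush()
        return None

    def poll(self, now: float) -> Optional[List[T]]:
        """Return the batch if the max-wait timer has expired, else None."""
        if self._items and now - self._opened_at >= self.max_wait_s:
            return self._flush()
        return None

    def _flush(self) -> List[T]:
        batch, self._items, self._opened_at = self._items, [], None
        return batch


# Heavy load: three items arrive within 2 ms, so the size cap closes the batch.
heavy = WindowBatcher(max_size=3, max_wait_s=0.005)
heavy.add("a", now=0.000)
heavy.add("b", now=0.001)
size_closed = heavy.add("c", now=0.002)

# Light load: one item arrives alone and waits out the full timer.
light = WindowBatcher(max_size=3, max_wait_s=0.005)
light.add("x", now=0.000)
too_early = light.poll(now=0.004)
timer_closed = light.poll(now=0.005)

print(size_closed, too_early, timer_closed)
```

The light-load case is the one to watch: the lone item ships only when the timer expires, so its latency includes the whole max-wait even though nothing else was competing for the system.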

Where NOT to batch

Batching is not free, and a senior knows the cases where it is a net loss:

  1. Rare operations. No queue depth means no items to amortize across — you just add linger.ms of pure latency to a single call. A batch of one is slower than no batch.
  2. Hard sub-millisecond SLO. If p99 < 1ms is the contract, a millisecond-scale wait window blows it on its own. The amortization math wins on throughput, but you cannot spend the latency.
  3. Causal per-op dependency. If operation N+1’s input depends on operation N’s acknowledged result, you cannot fire them as a group — they are serial by definition.
  4. Cannot tolerate partial-batch loss. A batch is often acknowledged or lost as a unit. If one record failing must not roll back its 999 neighbors, or a crash mid-batch must not lose buffered-but-unacked items, your failure model fights the batch boundary.
Pick the best fit

A payment service writes one ledger row per transaction. Volume is ~30 writes/sec, and the SLA is 'the row is durable before we return success to the user.' A teammate proposes buffering writes into 50 ms COPY batches to cut DB load. Pick the call a senior defends.

Quiz

A pipeline does 1000 single-row INSERTs and the CPU, disk, and network all sit near idle, yet it takes 9 seconds. What is the bottleneck?

Quiz

What does increasing a batching window (larger size, longer max-wait) trade away?

Order the steps

Order the senior's reasoning before deciding to batch an operation:

  1. Is the per-operation cost mostly fixed (syscall, RTT, commit) rather than variable payload?
  2. Is the operation rate high enough to create queue depth to amortize across?
  3. Can the producer tolerate the added wait (no hard sub-ms SLO, no causal per-op dependency)?
  4. Can the failure model survive batch-granular loss/rollback?
  5. Only then: pick a window (size + max-wait) that keeps p99 under the SLO.
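The ordered questions above can be sketched as a checklist function that stops at the first failing step. Everything here is hypothetical and for illustration only: the field names, the two example workloads, and in particular the threshold of two expected items per window, which is an assumed stand-in for "enough queue depth", not a rule from the lesson.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    fixed_cost_dominates: bool   # step 1
    ops_per_second: float        # step 2
    max_wait_s: float            # proposed window timer
    latency_budget_s: float      # slack left under the p99 SLO (step 3)
    causally_dependent: bool     # step 3
    tolerates_batch_loss: bool   # step 4


MIN_ITEMS_PER_WINDOW = 2  # assumed threshold for "enough queue depth"


def batching_verdict(w: Workload) -> str:
    """Walk the checklist in order and report the first reason not to batch."""
    if not w.fixed_cost_dominates:
        return "no: cost is mostly variable, nothing to amortize"
    if w.ops_per_second * w.max_wait_s < MIN_ITEMS_PER_WINDOW:
        return "no: too few operations per window to form a batch"
    if w.causally_dependent or w.max_wait_s > w.latency_budget_s:
        return "no: the producer cannot absorb the wait"
    if not w.tolerates_batch_loss:
        return "no: failure model cannot survive batch-granular loss"
    return "yes: pick a window that keeps p99 under the SLO"


# Hypothetical high-rate ingest: 50k ops/s, 5 ms timer, 50 ms of latency slack.
ingest = Workload(True, 50_000, 0.005, 0.050, False, True)
# Hypothetical low-rate writer: 30 ops/s with a 50 ms timer is about 1.5 items per window.
low_rate = Workload(True, 30, 0.050, 0.200, False, True)

print(batching_verdict(ingest))
print(batching_verdict(low_rate))
```

The ordering matters more than the thresholds: the window is only worth tuning once the first four checks pass.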
Why this works

An idle-looking system can still be slow because what it is paying is latency, not CPU, and latency is the cost batching attacks. A flame graph shows where CPU time goes; it is blind to a thread parked waiting for a round-trip. When utilization is low but throughput is bad, suspect serial fixed costs, and reach for a batch before you reach for bigger hardware.

Recall before you leave
  1. In one paragraph: explain why batching exists and where to use it versus where not to.
  2. What are the two dimensions of a batching window, and what closes the window?
  3. Why can a system look completely idle (idle CPU, disk, network) and still be slow, and why does batching fix it?
Recap

Batching exists to amortize the fixed cost of an operation — the syscall transition, network round-trip, ACK, transaction commit, or log flush — across many items, turning N*(F+V) into F+N*V. It pays off when fixed cost dominates variable cost, when the rate is high enough to create queue depth, and when there’s latency slack to spend. The window has two knobs, size and max-wait, and whichever fires first closes the batch: under load you ride the size cap, under light load the timer closes it and charges the wait to tail latency. Don’t batch rare operations, hard sub-ms SLOs, causally dependent operations, or systems that can’t survive partial-batch loss. The recurring senior trap is optimizing throughput nobody is paying for while breaking a latency or durability contract — so tune the window to the SLO, not to maximum throughput.


Trademarks belong to their respective owners. Editorial reference only.