Performance PERF · 06 · 01

Batching: amortize fixed cost per operation

When the per-operation price is mostly fixed (syscall, network round-trip, ACK, log flush), grouping operations amortizes it. The window (size plus max-wait) decides where throughput meets tail latency.

PERF Junior ◷ 8 min


A team chases a slow ingest pipeline for two weeks. The CPU is idle, the network is idle, the disk is idle, yet 1000 rows take 9 seconds to insert. Nothing is “busy.” The profiler finally points at the obvious: 1000 separate INSERT statements, 1000 separate round-trips to Postgres, each one waiting for an ACK before the next begins. They swap the loop for one COPY. Same hardware, same rows: 14 milliseconds. They never made anything faster; they just stopped paying the toll a thousand times.

By the end of this lesson you will know exactly which cost a batch removes, why idle hardware can still be slow, and when batching makes things worse.

The fixed-vs-variable cost model

Every operation has two prices. The variable price scales with the data: serializing a 4 KB payload costs more than serializing a 40-byte one. The fixed price is paid per call no matter how small the payload, and for small operations it is usually the expensive half. A syscall pays the user→kernel mode transition and back. A network call pays a full round-trip plus the ACK. A database write pays query planning, a transaction commit, and a WAL flush. A durably written log line pays an fsync.

Write the total cost of N operations as total = N * (F + V), where F is fixed and V is variable per item. When F dominates V and N is large, almost everything you pay is F, repeated N times — and the data itself is rounding error. Batching changes the shape to total = F + N * V: you pay F once and V per item. The fixed cost is amortized across the whole batch. That single algebraic move — pulling F outside the loop — is the entire idea.
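The two formulas can be checked with a few lines of arithmetic. This is a minimal sketch: the function names and the values of F and V are illustrative assumptions, not measurements from the pipeline in the story.

```python
def per_op_cost(n: int, fixed: float, variable: float) -> float:
    """Total cost when every operation pays the fixed cost: N * (F + V)."""
    return n * (fixed + variable)


def batched_cost(n: int, fixed: float, variable: float) -> float:
    """Total cost when one batch pays the fixed cost once: F + N * V."""
    return fixed + n * variable


# Assumed values for illustration only:
# F = 9 ms of fixed cost per call, V = 0.01 ms of variable cost per row.
N, F_MS, V_MS = 1000, 9.0, 0.01

unbatched_ms = per_op_cost(N, F_MS, V_MS)  # 1000 * (9 + 0.01)
batched_ms = batched_cost(N, F_MS, V_MS)   # 9 + 1000 * 0.01

print(f"per-op:  {unbatched_ms:.0f} ms")   # per-op:  9010 ms
print(f"batched: {batched_ms:.0f} ms")     # batched: 19 ms
```

With N = 1 the two functions return the same value, which is the arithmetic behind "a batch of one" buying nothing.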

This is why the slow pipeline was slow. The CPU, disk, and network all looked idle because the bottleneck was latency, not utilization: each operation spent its life waiting for the previous round-trip to complete. The fixed cost was never CPU time you could see in a flame graph — it was dead time on the wire and in the kernel.
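The shape of the fix can be sketched without a Postgres server. The example below swaps in Python's built-in sqlite3 module with an in-memory database as a stand-in, so it has no network round-trip and will not reproduce the 9 s versus 14 ms gap; the table name and helper functions are hypothetical. It only shows where the per-row commit sits and where the batched version pays it once.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Open an in-memory database with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def insert_per_row(conn: sqlite3.Connection, rows: list) -> None:
    """One statement and one commit per row: the fixed cost repeats len(rows) times."""
    for row in rows:
        conn.execute("INSERT INTO events (id, payload) VALUES (?, ?)", row)
        conn.commit()


def insert_batched(conn: sqlite3.Connection, rows: list) -> None:
    """One executemany call and one commit: the commit is paid once per batch."""
    conn.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", rows)
    conn.commit()


def count_rows(conn: sqlite3.Connection) -> int:
    """Return how many rows the table holds."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


rows = [(i, f"row-{i}") for i in range(1000)]

slow_db, fast_db = make_db(), make_db()
insert_per_row(slow_db, rows)
insert_batched(fast_db, rows)

# Both paths store the same data; only the number of commits differs.
print(count_rows(slow_db), count_rows(fast_db))  # 1000 1000
```

Against a real networked database the per-row version would also pay one round-trip per statement, which is the cost the COPY in the story removed.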

Fixed cost lives at every layer

Why does batching show up everywhere: TCP’s Nagle algorithm, Kafka’s linger.ms, Postgres COPY, Redis pipelining, io_uring submission queues, syslog buffering? Because every layer of the stack has its own per-operation toll, and each one is a separate opportunity to pay it once instead of N times. Knowing which fixed cost you are amortizing tells you how big the win will be.

Layer	Fixed cost per op	Batched as	Reported gain
Syscall	user↔kernel mode transition (~hundreds of ns each)	`io_uring` / `writev` / batched submission	millions of IOPS without per-op syscall
Network (Redis)	full RTT + ACK, paid per command	pipelining (send many, read replies once)	10k PINGs: 1.19s → 0.25s (~5x)
Broker (Kafka)	produce request + replication ACK	`batch.size` + `linger.ms`	~8k → ~150k msg/s with batching on
Database (Postgres)	parse + plan + commit + WAL flush	`COPY` / multi-row `INSERT`	10M rows: 9000s of single INSERTs → 14s COPY

The Redis case is the cleanest illustration of the model. Over a 250 ms link, a server that can serve 100k requests/sec is capped at 4 requests/sec if the client waits for each reply — because the bottleneck is the RTT, paid per command. Pipeline the commands and you pay one RTT for the whole batch: throughput jumps back toward the server’s real ceiling. The hardware never changed; the fixed cost just stopped repeating.
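The 4 requests/sec figure follows from a one-line bound. The sketch below is a simplified model, not a Redis client: the function name and the `pipeline_depth` parameter are assumptions made for illustration, and the model ignores server processing time and bandwidth limits.

```python
def max_requests_per_sec(rtt_s: float, server_rps: float, pipeline_depth: int = 1) -> float:
    """Upper bound on throughput for one connection.

    The client completes at most `pipeline_depth` requests per round-trip,
    and can never exceed what the server itself can serve.
    """
    return min(server_rps, pipeline_depth / rtt_s)


# One command per round-trip over a 250 ms link: the RTT is the cap.
serial = max_requests_per_sec(rtt_s=0.250, server_rps=100_000)

# 1000 commands per round-trip: the same link now allows 1000x more.
pipelined = max_requests_per_sec(rtt_s=0.250, server_rps=100_000, pipeline_depth=1000)

print(serial)     # 4.0
print(pipelined)  # 4000.0
```

Once `pipeline_depth / rtt_s` exceeds `server_rps`, the bound flips and the server becomes the limit.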

Same hardware — only the algebra changes: N×(F+V) becomes F+N×V, and each layer's speedup tracks how much its fixed cost F dominated.

The window: size and max-wait

A batch does not assemble itself for free — items have to accumulate before they ship. That accumulation is governed by a window with two knobs, and whichever fires first closes the batch:

Size: a count or byte cap. Kafka’s batch.size defaults to 16 KB (16384 bytes); fill it and the batch flushes immediately.
Max-wait: a time cap. Kafka’s linger.ms (default 5 ms since Kafka 4.0, 0 in earlier releases) is the longest the producer will hold an under-full batch hoping more arrives.

Under heavy load, batches fill before the timer expires, so you ride the size cap and get near-maximum amortization for free. Under light load, the timer is what closes the batch — and that is where the cost hides. An item arriving into an empty window pays the full linger.ms of dead time even though the system is idle. Bigger windows buy more throughput per unit fixed cost but charge it to tail latency: the items at the front of the window wait the longest. Later lessons go deep on tuning this; for now, hold the shape — the window is the dial between throughput and tail latency, and the senior question is never “batch or not” but “what window keeps p99 under the SLO?”
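The two-knob window can be sketched as a small class. This is an illustrative design, not how Kafka implements its producer: `WindowBatcher`, its parameters, and the `poll` method are hypothetical names, the size cap counts items rather than bytes, and the caller is assumed to call `poll` periodically where a real implementation would use a timer thread. The clock is injectable so the example runs deterministically.

```python
import time
from typing import Callable, List, Optional


class WindowBatcher:
    """Collects items and flushes when either the size cap or the max-wait fires."""

    def __init__(self, max_size: int, max_wait_s: float,
                 flush: Callable[[list], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.flush_fn = flush
        self.clock = clock
        self.buffer: List = []
        self.opened_at: Optional[float] = None

    def add(self, item) -> None:
        """Buffer one item; the first item into an empty buffer opens the window."""
        if not self.buffer:
            self.opened_at = self.clock()
        self.buffer.append(item)
        if len(self.buffer) >= self.max_size:  # size knob fires
            self._flush()

    def poll(self) -> None:
        """Close an under-full window once it has waited max_wait_s."""
        if self.buffer and self.clock() - self.opened_at >= self.max_wait_s:
            self._flush()  # max-wait knob fires

    def _flush(self) -> None:
        batch, self.buffer, self.opened_at = self.buffer, [], None
        self.flush_fn(batch)


# Deterministic demo with a fake clock (seconds).
now = [0.0]
shipped: List[list] = []
batcher = WindowBatcher(max_size=3, max_wait_s=0.005,
                        flush=shipped.append, clock=lambda: now[0])

for i in range(3):
    batcher.add(i)      # heavy load: the size cap closes the batch
batcher.add(3)          # light load: one item sits in the window
now[0] = 0.004
batcher.poll()          # too early, nothing ships
now[0] = 0.005
batcher.poll()          # the timer closes the under-full batch

print(shipped)          # [[0, 1, 2], [3]]
```

Item 3 waited the full 5 ms even though nothing else was happening, which is the light-load latency cost described above.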

Where NOT to batch

Batching is not free, and a senior knows the cases where it is a net loss:

Rare operations. No queue depth means no items to amortize across — you just add linger.ms of pure latency to a single call. A batch of one is slower than no batch.
Hard sub-millisecond SLO. If p99 < 1ms is the contract, any wait window blows it. The amortization math wins on throughput but you cannot spend the latency.
Causal per-op dependency. If operation N+1’s input depends on operation N’s acknowledged result, you cannot fire them as a group — they are serial by definition.
Cannot tolerate partial-batch loss. A batch is often acknowledged or lost as a unit. If one record failing must not roll back its 999 neighbors, or a crash mid-batch must not lose buffered-but-unacked items, your failure model fights the batch boundary.

Pick the best fit

A payment service writes one ledger row per transaction. Volume is ~30 writes/sec, and the SLA is 'the row is durable before we return success to the user.' A teammate proposes buffering writes into 50 ms COPY batches to cut DB load. Pick the call a senior defends.

Quiz

A pipeline does 1000 single-row INSERTs and the CPU, disk, and network all sit near idle, yet it takes 9 seconds. What is the bottleneck?

Quiz

What does increasing a batching window (larger size, longer max-wait) trade away?

Order the steps

Order the senior's reasoning before deciding to batch an operation:

1 Is the per-operation cost mostly fixed (syscall, RTT, commit) rather than variable payload?
2 Is the operation rate high enough to create queue depth to amortize across?
3 Can the producer tolerate the added wait (no hard sub-ms SLO, no causal per-op dependency)?
4 Can the failure model survive batch-granular loss/rollback?
5 Only then: pick a window (size + max-wait) that keeps p99 under the SLO
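The five steps above read naturally as a chain of early exits. The sketch below encodes them that way; `OperationProfile` and `batching_verdict` are hypothetical names, and the boolean answers are assumed to come from the engineer's own measurements rather than from anything the code can detect.

```python
from dataclasses import dataclass


@dataclass
class OperationProfile:
    """Hypothetical answers to the first four questions."""
    fixed_cost_dominates: bool          # step 1
    has_queue_depth: bool               # step 2
    tolerates_added_wait: bool          # step 3
    survives_batch_granular_loss: bool  # step 4


def batching_verdict(op: OperationProfile) -> str:
    """Walk the checks in order and stop at the first one that fails."""
    if not op.fixed_cost_dominates:
        return "skip: cost is mostly variable, little to amortize"
    if not op.has_queue_depth:
        return "skip: too rare, a window only adds wait"
    if not op.tolerates_added_wait:
        return "skip: hard latency SLO or causal dependency"
    if not op.survives_batch_granular_loss:
        return "skip: failure model fights the batch boundary"
    # Step 5 is reached only when every earlier check passed.
    return "batch: pick size + max-wait that keeps p99 under the SLO"


bulk_ingest = OperationProfile(True, True, True, True)
rare_admin_call = OperationProfile(True, False, True, True)

print(batching_verdict(bulk_ingest))
print(batching_verdict(rare_admin_call))
```

The ordering matters: window tuning is only worth doing once the cheaper disqualifying questions have been answered.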

Why this works

An idle-looking system can still be slow because the cost it is paying is latency, not CPU, and latency is what batching attacks. A flame graph shows where CPU time goes; it is blind to a thread parked waiting for a round-trip. When utilization is low but throughput is bad, suspect serial fixed costs, and reach for a batch before you reach for bigger hardware.
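The gap between wall-clock time and CPU time is easy to observe. In this sketch `time.sleep` stands in for waiting on a reply, which is an assumption made so the example needs no network; the function name and the 5 ms delay are illustrative.

```python
import time


def simulate_serial_round_trips(n: int, rtt_s: float):
    """Wait for n simulated replies one after another; return (wall, cpu) seconds."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(n):
        time.sleep(rtt_s)  # parked waiting: consumes wall time, almost no CPU
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


wall_s, cpu_s = simulate_serial_round_trips(n=20, rtt_s=0.005)
print(f"wall: {wall_s * 1000:.0f} ms, cpu: {cpu_s * 1000:.1f} ms")
```

The wall-clock figure comes out at roughly 20 × 5 ms while the CPU figure stays close to zero, so a tool that samples CPU has almost nothing to show for the time that was lost.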

Per-op: N * (F + V) — the fixed cost F repeats N times. Batched: F + N * V — F is paid once, amortized across the batch.

Recall before you leave

01 In one paragraph: explain why batching exists and where to use it versus where not to.
02 What are the two dimensions of a batching window, and what closes the window?
03 Why can a system look completely idle (idle CPU, disk, network) and still be slow, and why does batching fix it?

Recap

Batching exists to amortize the fixed cost of an operation (the syscall transition, network round-trip, ACK, transaction commit, or log flush) across many items, turning N*(F+V) into F+N*V. It pays off when fixed cost dominates variable cost, when the rate is high enough to create queue depth, and when there’s latency slack to spend. The window has two knobs, size and max-wait, and whichever fires first closes the batch: under load you ride the size cap, under light load the timer closes it and charges the wait to tail latency. Don’t batch rare operations, operations under a hard sub-ms SLO, causally dependent operations, or writes in systems that can’t survive partial-batch loss. The recurring senior trap is optimizing throughput nobody is paying for while breaking a latency or durability contract, so tune the window to the SLO, not to maximum throughput. Now when you see an idle-looking system that is still slow, your first question is: what fixed cost is being paid serially, and can a batch amortize it?


