io_uring and observability of batching

Syscalls cost 1-5µs each; at 100k ops/s that’s 100-500ms/s burned in transitions. io_uring’s shared rings remove per-op cost. Then four metrics (batch size, wait, depth, drops) tell you if the batcher is healthy or quietly dropping data.

The flamegraph from the ingest service made no sense. CPU was pinned at 90%, but no business function showed up hot. The widest band, eating a third of every core, was entry_SYSCALL_64 — the cost of entering and leaving the kernel, repeated millions of times a second. The service wrote each log line with its own write(). It wasn’t slow because of what it did. It was slow because of how often it crossed the boundary.

The syscall is a wall, and you keep paying to cross it

Traditional POSIX I/O is one syscall per operation: read(), write(), recv(), send(). Each one is a controlled trap into ring 0 — the CPU saves user registers, switches the page-table and stack to kernel context, runs the handler, then unwinds the whole thing on the way out. That round trip costs roughly 1-5µs, and that’s before the syscall does any actual work. It’s pure overhead, paid per call.

The arithmetic is brutal at scale. A service handling 100k I/O ops/s spends 100,000 × 1-5µs = 100-500ms of every wall-clock second just transitioning: a tenth to half a core gone before a single byte moves. Push to 1M ops/s and the same math gives 1-5 seconds of CPU time per wall-clock second, that is, one to five whole cores spent on transitions alone, plus the cache pollution from evicting parts of the L1/L2 working set on every crossing. This is exactly the flamegraph from the opening: the work was cheap, the boundary was not.
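The arithmetic can be checked in a few lines. This is a minimal sketch that assumes the 1-5µs per-crossing range quoted above; the function name transition_cost is made up for illustration.

```python
def transition_cost(ops_per_sec: float, cost_us: float) -> float:
    """CPU-seconds per wall-clock second spent purely on user/kernel transitions.

    A result of 1.0 means one full core is consumed by crossings alone.
    """
    return ops_per_sec * cost_us / 1_000_000


for rate in (100_000, 1_000_000):
    low = transition_cost(rate, 1.0)   # optimistic 1 µs per crossing
    high = transition_cost(rate, 5.0)  # pessimistic 5 µs per crossing
    print(f"{rate:>9,} ops/s -> {low:.1f} to {high:.1f} cores")
```

At 100k ops/s this prints 0.1 to 0.5 cores; at 1M ops/s it prints 1.0 to 5.0 cores.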

The classic fix is to cross less often. Buffer many small writes and flush them as one big writev(); one syscall now carries a thousand records. That’s the whole game of batching at the syscall layer: amortize a fixed per-crossing cost over a variable payload. The next step removes most of the crossings entirely.
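A rough illustration of the buffer-then-writev() idea, using Python's os.write and os.writev as thin wrappers over the underlying syscalls. The helper names and the 1,024-buffer cap per call (a common IOV_MAX value on Linux) are assumptions made for this sketch, not something the lesson prescribes.

```python
import os
import tempfile

records = [f"line {i}\n".encode() for i in range(1000)]


def write_one_by_one(fd: int, items: list[bytes]) -> int:
    """One write() per record; returns the number of syscalls issued."""
    calls = 0
    for item in items:
        os.write(fd, item)
        calls += 1
    return calls


def write_batched(fd: int, items: list[bytes], max_iov: int = 1024) -> int:
    """Flush records with writev(), at most max_iov buffers per call."""
    calls = 0
    for start in range(0, len(items), max_iov):
        chunk = items[start:start + max_iov]
        written = os.writev(fd, chunk)
        calls += 1
        # Regular files normally accept the whole vector. A production
        # writer would loop on short writes instead of raising.
        if written != sum(len(b) for b in chunk):
            raise OSError("short writev; retry logic omitted in this sketch")
    return calls


with tempfile.TemporaryFile() as f:
    per_op = write_one_by_one(f.fileno(), records)
    batched = write_batched(f.fileno(), records)

print(per_op, batched)  # 1000 syscalls versus 1
```

The same thousand records cross the boundary once instead of a thousand times; the price is that the first record waits in user space until the batch is flushed.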

io_uring: stop crossing the boundary at all

io_uring (Linux 5.1+) replaces “one syscall per op” with two ring buffers mmap’d into memory shared between user space and the kernel:

Submission Queue (SQ) — userspace writes operation descriptors (SQEs: read this fd, write this buffer) into a slot, then advances a tail pointer.
Completion Queue (CQ) — the kernel writes results (CQEs) back into a slot and advances its tail.

In the basic mode you still call io_uring_enter() to tell the kernel “I queued N ops” — but that’s one syscall for the whole batch instead of N. The dramatic mode is IORING_SETUP_SQPOLL: the kernel spawns a thread that continuously polls the SQ tail. Userspace submits work by writing memory and bumping a pointer, and the kernel thread picks it up on its own — zero syscalls per op. (One catch worth knowing: if the SQPOLL thread idles past sq_thread_idle, it sleeps and sets IORING_SQ_NEED_WAKEUP; you then owe one io_uring_enter() to wake it. So zero-syscall holds under sustained load, not on a trickle.)
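The head/tail mechanics are easier to see in a toy model. The sketch below is plain Python, not the real io_uring ABI: ToyRing, queue, enter and reap are hypothetical names, operations are ordinary callables, and enter() merely counts what would be one io_uring_enter() crossing.

```python
class ToyRing:
    """Toy model of the SQ/CQ pair. Not the real io_uring ABI."""

    def __init__(self, entries: int = 8):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("ring size must be a power of two")
        self.sq = [None] * entries
        self.cq = [None] * (2 * entries)  # the real CQ also defaults to 2x the SQ
        self.sq_head = self.sq_tail = 0   # user space advances tail, kernel advances head
        self.cq_head = self.cq_tail = 0   # kernel advances tail, user space advances head
        self.enter_calls = 0              # simulated boundary crossings

    def queue(self, op) -> bool:
        """User side: fill the next SQ slot and bump the tail. No crossing."""
        if self.sq_tail - self.sq_head == len(self.sq):
            return False  # ring full; submit before queuing more
        self.sq[self.sq_tail % len(self.sq)] = op
        self.sq_tail += 1
        return True

    def enter(self) -> int:
        """Kernel side: one crossing consumes every queued SQE.

        Assumes the caller reaps completions between submits, so the
        CQ never overflows in this toy.
        """
        self.enter_calls += 1
        consumed = 0
        while self.sq_head != self.sq_tail:
            op = self.sq[self.sq_head % len(self.sq)]
            self.sq_head += 1
            self.cq[self.cq_tail % len(self.cq)] = op()  # result becomes a CQE
            self.cq_tail += 1
            consumed += 1
        return consumed

    def reap(self) -> list:
        """User side: read CQEs and advance the CQ head. No crossing."""
        results = []
        while self.cq_head != self.cq_tail:
            results.append(self.cq[self.cq_head % len(self.cq)])
            self.cq_head += 1
        return results


ring = ToyRing(8)
for n in range(5):
    ring.queue(lambda n=n: n * n)  # five "operations" queued with zero crossings
submitted = ring.enter()           # one crossing for the whole batch
results = ring.reap()
print(submitted, results, ring.enter_calls)  # 5 [0, 1, 4, 9, 16] 1
```

In SQPOLL mode the loop inside enter() would run on a kernel thread that watches the SQ tail by itself, so enter_calls would stay at zero under sustained load.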

Approach	Syscalls per 1M ops	Transition cost per 1M ops	Catch
One `write()` per op	1,000,000	~1-5 s (1-5 cores at 1M ops/s)	Cache thrash on every crossing
Buffer + `writev()`	~1,000 (batch=1k)	~1-5 ms	Adds wait latency before flush
io_uring (one enter/batch)	~1,000	~1-5 ms	More complex API; CQE reaping
io_uring + SQPOLL	~0 (under load)	~0	Burns a poller core; needs privilege

Each batching step removes more user↔kernel crossings: one write() per op burns 1,000,000 syscalls per 1M ops, a buffered writev()/io_uring_enter() batch cuts it to ~1,000, and SQPOLL polling drives it to ~0 under sustained load.

Beyond removing crossings, io_uring unlocks patterns plain syscalls can’t express:

Linked operations (IOSQE_IO_LINK): chain SQEs so the next one runs only after the previous completes, e.g. accept → read → write submitted as one dependent unit.
Provided and registered buffers: two related features. Provided buffers hand the kernel a buffer pool once, and it picks a free buffer when data actually arrives instead of you committing one per op up front. Registered (fixed) buffers are pinned and mapped once, so the kernel skips that work on every op.
Fixed files: pre-register fds so the kernel skips the per-syscall descriptor-table lookup.

Adoption is now mainstream, not experimental. PostgreSQL 18 (released Sep 2025) shipped async I/O with three io_method modes — sync, worker (the default), and io_uring — where the io_uring backend cuts syscall overhead on cold-cache sequential and bitmap scans (benchmarks report 2-3x throughput gains in cloud-storage scenarios). Note the default is worker, not io_uring, precisely because of the dependency and security concerns below. On the networking side, io_uring shaves single-digit-to-low-double-digit CPU off TLS-proxy and high-fanout socket workloads (the socket layer is where epoll-based proxies spend 70-80% of cycles outside userspace), which is why low-overhead-proxy teams reach for it.

▸Why this works

Why isn’t io_uring the default everywhere if it’s faster? Security. It has been one of the most exploited kernel subsystems — CVE-2023-2598 (out-of-bounds access) and CVE-2024-0582 (use-after-free in buffer-ring registration) are both local privilege-escalation bugs with public exploits. Google reported that ~60% of kernel exploits submitted to its 2022 bug bounty targeted io_uring, and disabled it by default in several environments. The containerd default seccomp profile and GKE block the io_uring syscalls outright. So in a hardened container, your beautiful zero-syscall design may simply return EPERM. Always have a fallback path to epoll/threads.
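One way to choose the fallback at startup is to probe the syscall once and branch on the errno. This is a hedged sketch, not an official detection API: it assumes Linux on x86-64 or arm64, where io_uring_setup is syscall number 425, and it passes a NULL params pointer so the call can never actually create a ring. The function name probe_io_uring and the returned labels are invented here.

```python
import ctypes
import errno
import os
import sys

SYS_IO_URING_SETUP = 425  # assumption: x86-64 / arm64 Linux syscall numbering


def probe_io_uring() -> str:
    """Return 'available', 'blocked', 'unsupported', or 'unknown'."""
    if not sys.platform.startswith("linux"):
        return "unsupported"
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # entries=1 with params=NULL: a kernel that permits io_uring rejects the
    # bad pointer (EFAULT), while a filter or sysctl rejects the call itself.
    ret = libc.syscall(
        ctypes.c_long(SYS_IO_URING_SETUP),
        ctypes.c_long(1),
        ctypes.c_void_p(None),
    )
    if ret >= 0:
        os.close(ret)  # not expected with a NULL pointer; avoid leaking the fd
        return "available"
    err = ctypes.get_errno()
    if err in (errno.EFAULT, errno.EINVAL):
        return "available"
    if err in (errno.EPERM, errno.EACCES):
        return "blocked"
    if err == errno.ENOSYS:
        return "unsupported"
    return "unknown"


status = probe_io_uring()
backend = "io_uring" if status == "available" else "epoll"
print(status, backend)
```

Some seccomp profiles answer with ENOSYS rather than EPERM, so 'unsupported' can also mean 'filtered'. Either way the decision is the same: anything other than 'available' takes the epoll/threads path.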

You rarely call io_uring directly — your runtime batches for you

Most services never touch the raw rings; they lean on a runtime primitive that buffers in userspace and flushes as one crossing. The shapes rhyme across languages:

Node.js — stream.cork() buffers writes in memory; uncork() (deferred via process.nextTick) flushes them as a single _writev() — but only if the stream implements _writev; corking a stream without it can hurt. Pair with backpressure via the write() return value.
Go — bufio.Writer coalesces small writes; combine with a time.Ticker to flush on a max-wait, giving the classic size-or-time window.
Java — BufferedOutputStream accumulates until its buffer fills or you flush().
Python — asyncio.Queue feeding a consumer that drains in chunks (get until empty or count cap).
Rust — tokio::sync::mpsc channels with a batching loop (recv_many / drain-and-flush on a tick).

Every one of these is the same contract: a bounded buffer with a max-size trigger, a max-wait trigger, and an explicit flush. And every one of them is a place data can silently pile up or get dropped — which is why you instrument it.
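Stripped of any particular runtime, that contract fits in a small class. The sketch below is a single-threaded illustration with hypothetical names (Batcher, add, tick, flush); the clock is injectable only so the example runs deterministically.

```python
import time


class Batcher:
    """Size-or-time batching: flush on max_size, on max_wait, or on demand."""

    def __init__(self, flush_fn, max_size=100, max_wait=0.05, clock=time.monotonic):
        self.flush_fn = flush_fn   # called with one list per batch
        self.max_size = max_size   # max-size trigger (items)
        self.max_wait = max_wait   # max-wait trigger (seconds)
        self.clock = clock
        self.buf = []
        self.first_at = 0.0        # arrival time of the oldest buffered item

    def add(self, item) -> None:
        if not self.buf:
            self.first_at = self.clock()
        self.buf.append(item)
        if len(self.buf) >= self.max_size:
            self.flush()           # size trigger

    def tick(self) -> None:
        """Call periodically (e.g. from a timer) to enforce max_wait."""
        if self.buf and self.clock() - self.first_at >= self.max_wait:
            self.flush()           # time trigger

    def flush(self) -> None:
        """Explicit flush; also used by both triggers."""
        if self.buf:
            batch, self.buf = self.buf, []
            self.flush_fn(batch)


# Deterministic demo with a fake clock.
now = [0.0]
out = []
b = Batcher(out.append, max_size=3, max_wait=0.05, clock=lambda: now[0])
for i in range(4):
    b.add(i)        # items 0..2 flush on size; item 3 stays buffered
now[0] = 0.06
b.tick()            # item 3 flushes on time
b.add(4)
b.flush()           # explicit flush, e.g. at shutdown
print(out)          # [[0, 1, 2], [3], [4]]
```

Here the max-size trigger doubles as the bound because flushing is synchronous. Once the flush can be slower than the producer, the buffer needs a separate capacity and an overflow policy, which is where the metrics below come in.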

The four metrics that tell you the batcher is healthy

You have a batcher running in production. How do you know it is actually helping and not quietly dropping data? Without instrumentation, a buffer overflow looks exactly like healthy throughput — until your tail-latency metrics start disappearing because the events that carried them were silently discarded.

A batcher is a tiny queue with a flush policy, and like any queue it can fill, stall, or overflow without throwing an error. Production-grade observability tracks four per-batch metrics; together they let you tune the window and catch backpressure before it becomes data loss.

Metric	Type	What it reveals	Acts on
Batch-size histogram	Histogram (p50/p99, records & bytes)	Filling to max (good) vs flushing on timer (window too small / traffic light)	Tune max-size / max-wait
Batch wait time	Histogram (latency)	How long an item sat before shipping — your latency tax	Check against SLO; shrink window
Buffer-depth gauge	Gauge (current items / % cap)	Sustained spikes = downstream can’t keep up (backpressure building)	Alert at > 80% cap; scale/slow producer
Drop count	Counter	Items discarded on overflow — should be 0; nonzero = you are losing data	Page on `drops > 0`
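A minimal sketch of how the four metrics could hang off a bounded buffer. Plain lists and integers stand in for whatever metrics library you actually use; InstrumentedBuffer, offer, drain and alerts are hypothetical names, and the 80% threshold mirrors the table.

```python
import math
import time
from collections import deque


def percentile(samples, p):
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


class InstrumentedBuffer:
    def __init__(self, capacity, clock=time.monotonic):
        self.capacity = capacity
        self.clock = clock
        self.items = deque()     # (enqueue_time, item) pairs
        self.batch_sizes = []    # batch-size histogram samples
        self.wait_times = []     # batch wait time samples, one per item (seconds)
        self.depth = 0           # buffer-depth gauge
        self.drops = 0           # drop counter

    def offer(self, item) -> bool:
        """Producer side: drop (and count it) when the buffer is full."""
        if len(self.items) >= self.capacity:
            self.drops += 1
            return False
        self.items.append((self.clock(), item))
        self.depth = len(self.items)
        return True

    def drain(self, max_size) -> list:
        """Consumer side: take up to max_size items and record the metrics."""
        now = self.clock()
        batch = []
        while self.items and len(batch) < max_size:
            enqueued_at, item = self.items.popleft()
            self.wait_times.append(now - enqueued_at)
            batch.append(item)
        if batch:
            self.batch_sizes.append(len(batch))
        self.depth = len(self.items)
        return batch

    def alerts(self) -> list:
        fired = []
        if self.drops > 0:
            fired.append("drops > 0")
        if self.depth > 0.8 * self.capacity:
            fired.append("depth > 80% of cap")
        return fired


now = [0.0]
buf = InstrumentedBuffer(capacity=10, clock=lambda: now[0])
accepted = sum(buf.offer(i) for i in range(12))  # burst of 12 into 10 slots
alerts_during_burst = buf.alerts()
now[0] = 0.02
batch = buf.drain(max_size=4)
print(accepted, buf.drops, alerts_during_burst)
print(buf.batch_sizes, buf.depth, percentile(buf.wait_times, 99))
```

During the burst both alerts fire. After the drain the depth alert clears but the drop counter stays at 2, which is the point of a counter: the loss remains visible after the buffer recovers.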

The failure mode that hides without these is the quiet drop. Facebook’s Scribe log-delivery system is the canonical war story: a buffered, batching pipeline that — under downstream pressure — must choose between blocking the producer (back up the whole app) or dropping messages. If you only watch throughput, a bursty downstream looks fine right up until the buffer overflows and your tail latency p99 metrics start vanishing from the dashboard because the events that carried them got dropped. The dashboard says “healthy” because the survivors look healthy. The senior reflex: buffer depth and drop count are leading indicators; throughput is a lagging one. Alert on drops > 0 and depth > 80% of cap, and the overflow becomes a page you answer, not an incident you reconstruct.

▸Why this works

“Drop count should be zero” sounds obvious, but the deeper point is what a nonzero drop means. It is never a tuning nuance — it is the buffer’s last-resort signal that the producer is outrunning the consumer and the bounded queue chose to shed load rather than grow without limit (which would be OOM). A single drop is a backpressure event. Treat one as you would a dropped database write.

Quiz

A batching writer's batch-size histogram shows almost every flush is far below the configured max size, and batch wait time sits at exactly the max-wait value. What does this tell you?

The senior tradeoff: how aggressively to batch

Bigger batches and longer windows save more syscalls and CPU, but every item now waits longer before it ships — directly inflating tail latency. The whole skill is choosing the window against your SLO, and proving the choice with the metrics above rather than guessing.
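A back-of-the-envelope model makes the tradeoff concrete before any metrics exist. It assumes a perfectly steady arrival rate, which real traffic never has, so treat the output as a starting point to verify against the histograms. The function name and the example numbers (200k items/s, a 10 ms max-wait) are illustrative.

```python
def window_tradeoff(rate_per_sec: float, max_size: int, max_wait_s: float):
    """Return (worst-case added wait in seconds, flushes per second).

    Idealized: items arrive evenly, and a flush fires on whichever
    trigger comes first, size or time.
    """
    fill_time = max_size / rate_per_sec              # time to hit the size trigger
    worst_wait = min(fill_time, max_wait_s)          # oldest item's wait before flush
    items_per_flush = min(max_size, rate_per_sec * max_wait_s)
    flushes_per_sec = rate_per_sec / items_per_flush
    return worst_wait, flushes_per_sec


# Size-bound window: 1,000-item batches fill in 5 ms, well inside the 10 ms cap.
small = window_tradeoff(200_000, max_size=1_000, max_wait_s=0.010)

# Time-bound window: 10,000-item batches never fill; the 10 ms timer fires first.
large = window_tradeoff(200_000, max_size=10_000, max_wait_s=0.010)

print(small)  # roughly (0.005, 200.0)
print(large)  # roughly (0.010, 100.0)
```

Going from the first window to the second halves the syscall rate from about 200/s to about 100/s but doubles the worst-case added wait. Past the point where crossings stop showing up in the profile, a larger window only buys latency.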

Pick the best fit

A telemetry ingest path does ~200k small writes/s and is CPU-bound on syscall transitions. p99 end-to-end latency SLO is 200ms. Pick the approach a senior defends.

Order the steps

Order the diagnosis steps when a batching ingest path starts losing data under load:

1 Check drop count — nonzero means the bounded buffer is shedding load
2 Look at the buffer-depth gauge — pinned near cap confirms the buffer is overrun
3 Inspect batch wait time / size — is the flush policy keeping up or stalling?
4 Find the bottleneck downstream (the consumer that can't drain fast enough)
5 Apply backpressure or scale the consumer; only then retune the window

Per-op crossings pay the ~1–5µs trap N times; one batched writev (or io_uring_enter) carries all N records across the boundary once. SQPOLL polls the ring for near-zero crossings under load.

Recall before you leave

01 How does io_uring eliminate per-call syscall overhead, and what's the catch with SQPOLL mode?
02 What are the four batching observability metrics, and which two are leading indicators of trouble?

Recap

A syscall costs ~1-5µs in pure transition overhead, so at 100k-1M ops/s a service can burn anywhere from a tenth of a core to several cores just crossing the kernel boundary. The first fix is to cross less: buffer many small writes and flush as one writev(). io_uring goes further with two mmap’d rings (submission + completion): one io_uring_enter() submits a whole batch, and SQPOLL mode lets a kernel thread poll the queue for near-zero syscalls under load, at the cost of a poller core, privilege, and seccomp blocking in hardened containers. PostgreSQL 18’s io_uring backend and low-overhead TLS proxies are real adopters, but io_uring’s CVE history (and Google’s 60%-of-exploits finding) is why worker mode is Postgres’s default. In practice you batch through a runtime primitive (Node cork()/uncork(), Go bufio.Writer + ticker, Java BufferedOutputStream, Python asyncio.Queue, Rust tokio mpsc), each a bounded buffer with max-size, max-wait, and flush. Instrument all four metrics: batch size (filling vs timer), wait time (latency tax), buffer depth and drop count (the leading backpressure signals). Throughput lags; depth and drops lead. Page on drops > 0 and depth > 80% of cap, and a silent overflow becomes an alert instead of an archaeological dig. Now when you see entry_SYSCALL_64 eating a surprising share of your flame graph, you know the fix is not faster code; it is fewer crossings.
