Backend Architecture BE · 03 · 01

Blocking vs non-blocking I/O: two ways to wait

A server spends most of its life waiting on I/O. Blocking I/O parks a whole thread on each wait, so concurrency costs memory; non-blocking I/O hands the wait to the kernel and lets one thread juggle thousands of sockets through an event loop.

Time a typical request handler and the surprise is how little of it is your code. It reads a row from Postgres, calls a payment API, writes a log line, and can easily spend around 95% of its wall-clock time doing nothing but waiting for those calls to come back. The whole game of backend concurrency is: what does the program do while it waits? Two answers split the entire field. One parks a thread on every wait. The other refuses to park anyone and asks the kernel to tap it on the shoulder when data is ready.

Waiting is the job

A backend is mostly an I/O machine. Disk reads, database queries, outbound HTTP, socket writes — each is slow relative to the CPU (microseconds to milliseconds, while the CPU runs billions of instructions a second). So the design question is never “how fast is my code” first; it is “how does the runtime spend the wait.” Two I/O models give opposite answers, and the choice shapes how the server scales, how much memory it eats, and how it fails under load.
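To put rough numbers on that gap, here is a small sketch. The 3 GHz clock and the one-instruction-per-cycle simplification are assumptions chosen for illustration, not measurements of any particular machine, and the two wait durations are hypothetical.

```python
def instructions_forgone(wait_us, instructions_per_us=3_000):
    """Instructions a core could have retired during a wait.

    instructions_per_us=3_000 assumes a 3 GHz core retiring one
    instruction per cycle (an illustrative simplification).
    """
    return wait_us * instructions_per_us

# Hypothetical waits: a fast local read and a database round trip.
for label, wait_us in (("100 µs read", 100), ("1 ms query", 1_000)):
    print(f"{label}: ~{instructions_forgone(wait_us):,} instructions")
```

Even under these simplified assumptions, a single millisecond of waiting is room for millions of instructions, which is why the handling of the wait dominates the design.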

Blocking I/O: one thread per connection

In the blocking model, a thread calls read() and the operating system suspends that thread until bytes arrive. The thread is parked — consuming its stack and a scheduler slot — doing nothing useful. To serve a second connection concurrently you need a second thread, a third needs a third, and so on: thread-per-connection.

This is simple and easy to reason about: the code reads top to bottom and each line waits for the last. But it scales by adding threads, and threads are not free. Each OS thread reserves a stack, typically around 1–2 MB (the default depends on the platform and runtime), so 10,000 concurrent connections imply on the order of 10–20 GB of stack reservations, plus thousands of context switches per second as the scheduler parks and wakes threads. Much of that reservation is virtual address space that becomes resident only as it is touched, but it is still a per-connection cost that grows linearly. The model trades memory for simplicity.
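A minimal sketch of the thread-per-connection shape, using Python's standard `socket` and `threading` modules. The echo behaviour and the names `handle`, `serve`, and `demo` are illustrative assumptions; a real server would add timeouts, limits, and error handling.

```python
import socket
import threading

def handle(conn):
    """Runs on its own thread; recv() parks this thread until bytes arrive."""
    with conn:
        while True:
            data = conn.recv(4096)       # blocking call
            if not data:                 # empty read: the peer closed the connection
                break
            conn.sendall(data)           # echo the bytes back

def serve(listener):
    """Accept loop: one new OS thread per accepted connection."""
    while True:
        try:
            conn, _addr = listener.accept()   # accept() blocks too
        except OSError:                       # raised if the listener has been closed
            break
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

def demo():
    listener = socket.create_server(("127.0.0.1", 0))   # port 0: let the OS pick one
    port = listener.getsockname()[1]
    threading.Thread(target=serve, args=(listener,), daemon=True).start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"hello")
        reply = client.recv(4096)
    listener.close()
    return reply
```

Each handler reads as a straight line, which is the appeal. The cost is that every connected client, busy or idle, holds one thread for as long as it stays connected.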

At ~1–2 MB of stack per thread, memory grows linearly with connections — 10k reaches 10+ GB of stacks. This is the C10k wall the event loop sidesteps.
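The arithmetic behind that figure, as a short sketch. The function name is hypothetical, and the 1 MB and 2 MB stack sizes are the rough values used in this lesson rather than the default of any specific platform.

```python
def stack_reservation_gb(connections, stack_mb_per_thread):
    """Stack space reserved by one thread per connection, in decimal GB."""
    return connections * stack_mb_per_thread / 1000

for stack_mb in (1, 2):
    total = stack_reservation_gb(10_000, stack_mb)
    print(f"{stack_mb} MB stacks x 10,000 threads = {total:.0f} GB")
```

The relationship is linear, so doubling either the connection count or the per-thread stack doubles the reservation.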

Non-blocking I/O: one thread, many sockets

In the non-blocking model, a socket is set to non-blocking mode and read() returns immediately, either with data or with “not ready yet” (EAGAIN/EWOULDBLOCK). Instead of parking, the thread registers interest in many sockets with a kernel facility (epoll on Linux, kqueue on BSD/macOS) and asks one question: “which of these thousands of file descriptors are ready right now?” The kernel returns only the ready ones, at a cost that tracks the number of ready descriptors rather than the number being watched. The thread services those, then asks again. That loop is the event loop.
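A minimal sketch of one turn of such a loop, using Python's standard `selectors` module, whose `DefaultSelector` picks the most efficient facility available on the platform (epoll on Linux, kqueue on BSD/macOS). The helper names `try_read`, `run_once`, and `demo` are hypothetical, and `socket.socketpair()` stands in for real client connections.

```python
import selectors
import socket

def try_read(sock):
    """Non-blocking read: return bytes, or None for "not ready yet"."""
    try:
        return sock.recv(4096)
    except BlockingIOError:      # EAGAIN / EWOULDBLOCK: nothing to read right now
        return None

def run_once(sel, timeout=1.0):
    """One turn of the event loop: ask what is ready, service only that."""
    results = []
    for key, _events in sel.select(timeout):
        callback = key.data      # handler stored at registration time
        results.append(callback(key.fileobj))
    return results

def demo(n_connections=100, active_index=42):
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(n_connections)]
    try:
        for ours, _peer in pairs:
            ours.setblocking(False)
            sel.register(ours, selectors.EVENT_READ, data=try_read)
        idle_read = try_read(pairs[0][0])            # returns at once with None
        pairs[active_index][1].sendall(b"ping")      # exactly one peer speaks
        return idle_read, run_once(sel)
    finally:
        sel.close()
        for ours, peer in pairs:
            ours.close()
            peer.close()
```

In this sketch 100 sockets are registered but only one has data, so `run_once` services a single socket. A real server would call `run_once` in a `while True:` loop and register new connections as they are accepted.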

One thread can therefore drive tens of thousands of connections, because it only ever touches sockets that have actual work. The cost is a different shape of code: you cannot read top-to-bottom and “wait” — you register a callback (or await) and the loop calls you back later. Logic that was a straight line becomes a set of continuations.
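The `await` shape can be sketched with Python's `asyncio`. Here `asyncio.sleep` is a stand-in for a database or HTTP wait, and `handle`, `serve_all`, and `demo` are hypothetical names; the point is only that many waits overlap on one thread.

```python
import asyncio
import threading
import time

async def handle(request_id, wait_s):
    """Looks linear, but each await hands control back to the event loop."""
    await asyncio.sleep(wait_s)              # stand-in for an I/O wait
    return request_id, threading.get_ident()

async def serve_all(n, wait_s):
    # Start n handlers and wait for all of them to finish.
    return await asyncio.gather(*(handle(i, wait_s) for i in range(n)))

def demo(n=1000, wait_s=0.05):
    start = time.perf_counter()
    results = asyncio.run(serve_all(n, wait_s))
    elapsed = time.perf_counter() - start
    thread_ids = {thread_id for _request_id, thread_id in results}
    return len(results), len(thread_ids), elapsed
```

Run back to back, 1,000 waits of 0.05 s would take about 50 s. Here they overlap on a single thread, so the total should stay close to one wait plus scheduling overhead.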

Why this works

Why does the kernel facility matter so much? The naive way to watch many sockets is to hand over the whole list on every call and have each one checked: “ready? ready? ready?” That is select/poll, and it costs O(n) per call, so watching 10,000 sockets means scanning 10,000 every time even if only one is ready. epoll/kqueue invert this: you register the set once, and the kernel hands back only the descriptors that became ready, so the cost tracks the number of active connections, not the total. This is the mechanism that makes “one thread, 50,000 idle keep-alive connections” actually cheap: the idle ones cost almost no CPU because the loop never visits them until they have data.
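A toy cost model of that difference. It is not a benchmark and makes no kernel calls: the function name and the connection counts are hypothetical, and it only encodes the scaling described above.

```python
def descriptors_examined(facility, watched, ready):
    """Toy cost model: descriptors looked at per wait call."""
    if facility in ("select", "poll"):
        return watched          # the full set is scanned on every call
    if facility in ("epoll", "kqueue"):
        return ready            # only the ready descriptors are handed back
    raise ValueError(f"unknown facility: {facility}")

watched, ready = 50_000, 3      # hypothetical: mostly idle keep-alive connections
for facility in ("poll", "epoll"):
    print(facility, descriptors_examined(facility, watched, ready))
```

With mostly idle connections the two numbers differ by orders of magnitude, and the gap widens as the idle share grows.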

The C10k framing and the real tradeoff

This split was framed by the C10k problem (~1999): how do you serve 10,000 concurrent clients on one box? Thread-per-connection hit a memory and context-switch wall; the event-loop model, used by Nginx, Node.js, Netty, and Redis, was the answer. The honest summary:

Blocking / thread-per-connection trades memory and context-switch overhead for simplicity. Great when connection counts are modest or work is CPU-heavy; the code stays linear.
Non-blocking / event loop trades code complexity (callbacks, continuations, no parking) for scalability under many concurrent, mostly-idle connections.

Neither is universally “faster.” For I/O-bound workloads with high concurrency, the event loop wins decisively on memory and connection count. For CPU-bound work, a single event-loop thread is no faster than any other single thread — a limit the next lessons make sharp.
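The CPU-bound limit can be seen in a small `asyncio` sketch. The names `ticker`, `cpu_bound`, and `main` are hypothetical; the sketch counts how often a cooperative task gets to run while another coroutine computes without ever awaiting.

```python
import asyncio

async def ticker(state):
    """Cooperative task: yields to the loop on every iteration."""
    while not state["stop"]:
        state["ticks"] += 1
        await asyncio.sleep(0)

async def cpu_bound(state, iterations):
    """No await inside the loop, so nothing else can run until it returns."""
    before = state["ticks"]
    total = 0
    for i in range(iterations):
        total += i * i
    state["ticks_during_cpu_work"] = state["ticks"] - before
    return total

async def main(iterations=200_000):
    state = {"ticks": 0, "stop": False, "ticks_during_cpu_work": None}
    tick_task = asyncio.create_task(ticker(state))
    await asyncio.sleep(0)                   # let the ticker run once
    await cpu_bound(state, iterations)       # monopolises the loop thread
    state["stop"] = True
    await tick_task
    return state
```

While `cpu_bound` runs, the ticker makes no progress at all, because the single loop thread is busy computing. The same would hold for every other connection the loop is serving.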

	Blocking (thread-per-connection)	Non-blocking (event loop)
Waiting	Thread parked by OS	Kernel watches FDs, thread moves on
10k connections	~10+ GB stacks, many context switches	One thread, small per-connection state
Code shape	Linear, top-to-bottom	Callbacks / `await`, continuations
Scales by	Adding threads	Adding ready-event throughput
Best for	Modest concurrency, CPU-heavy	High concurrency, I/O-bound

Quiz

Why does a thread-per-connection server struggle to hold 50,000 mostly-idle keep-alive connections?

Quiz

What does `epoll`/`kqueue` give the event loop that a naive `select`/`poll` scan does not?

Order the steps

Order what a non-blocking server does to serve a read on one of many sockets:

1 Set the socket to non-blocking mode and register it with epoll/kqueue
2 Ask the kernel which of the watched descriptors are ready
3 Kernel returns only the ready descriptors
4 Run the callback for each ready socket, reading the available bytes
5 Loop back and ask the kernel again

Blocking parks the thread on every wait (~1–2 MB each, ~10 GB for 10k conns); non-blocking registers sockets with epoll/kqueue, gets back only the ready descriptors, and lets one thread service thousands.

Recall before you leave

01
Why is 'how the runtime spends the wait' the central question for a backend, rather than raw code speed?
02
How does blocking thread-per-connection work and what is its scaling cost?
03
How does non-blocking I/O with an event loop serve many connections on one thread, and what does epoll/kqueue contribute?

Recap

A backend spends most of its life waiting on I/O, so the model for how it waits decides everything downstream. Blocking I/O parks a thread on each wait: the code is linear and easy, but each thread costs roughly 1–2 MB of stack and a scheduler slot, so thread-per-connection turns 10,000 connections into 10+ GB of stacks and a storm of context switches. That is memory traded for simplicity. Non-blocking I/O sets sockets to non-blocking mode so reads return immediately, and registers them with epoll or kqueue so one thread can ask the kernel which descriptors are ready and service only those. It scales to tens of thousands of connections because idle ones cost almost nothing, at the price of callback- or await-shaped code. The C10k problem named this divide, and the event loop became the standard answer for high-concurrency I/O-bound servers. The next lesson opens that loop up: the ordered phases it runs, the microtask queue it drains between them, and why this concurrency is cooperative rather than parallel. Now when you see a thread-count limit or an epoll configuration, you know exactly which half of the tradeoff you are tuning.
