Backend Architecture BE · 03 · 04

Offloading CPU work: worker threads and the libuv pool

Two pools of threads hide behind the single loop, and confusing them wastes weeks. The libuv pool runs native work such as fs and crypto, never your JavaScript. CPU-bound JS needs worker threads or chunking, and each carries its own cost: message-copy overhead for workers, partitioning complexity for chunking.

BE Middle ◷ 15 min

A team profiles a slow image-resize endpoint, finds the CPU-heavy work, and “fixes” it by bumping UV_THREADPOOL_SIZE from 4 to 32. Nothing improves. The resize is JavaScript running on the loop thread, and the libuv pool they just enlarged does not run JavaScript at all — it runs native I/O. They tuned the wrong pool. Node hides two execution resources behind the one event loop, and knowing which work goes to which is the difference between a fix and a week lost.

Two pools, two jobs

When you reach for a knob to fix a CPU problem, you need to know which knob affects which work, or you will tune the wrong thing and lose a week, exactly like the team above. Behind the single loop there are two distinct sources of parallelism, and they serve different kinds of work:

The libuv thread pool is a small pool (default 4 threads, configurable up to 1024 via UV_THREADPOOL_SIZE) that runs native operations which would otherwise block the loop: file-system calls and DNS lookups (dns.lookup), which have no portable async OS primitive, plus the CPU-heavy async crypto and zlib functions. It does not run your JavaScript. When you call fs.promises.readFile, libuv does the blocking read on a pool thread and posts the result back to the loop. This is why async bcrypt does not freeze the loop: the hashing runs in native code on a libuv thread.
Worker threads (worker_threads) are real threads, each running a separate V8 isolate with its own event loop, giving true multicore parallelism for your CPU-bound JavaScript. This is where image resizing, large-payload parsing, compression, or any other heavy computation done in JavaScript belongs.
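
A small experiment can make the split visible. The sketch below assumes Node's built-in crypto.pbkdf2 as the pool-backed operation; the helper names raceTimerAgainstPoolWork and raceTimerAgainstSyncJs are hypothetical, and the iteration count and the 50 ms busy loop are arbitrary illustration values.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: start a 0 ms timer, then an async hash that libuv runs
// on a pool thread. Resolves with the order in which the two callbacks fired.
function raceTimerAgainstPoolWork() {
  return new Promise((resolve, reject) => {
    const order = [];
    setTimeout(() => order.push('timer'), 0);
    crypto.pbkdf2('secret', 'salt', 200000, 64, 'sha512', (err) => {
      if (err) return reject(err);
      order.push('pbkdf2');
      resolve(order);
    });
  });
}

// Hypothetical helper: same timer, but the heavy work is plain JavaScript,
// so it runs on the loop thread and the timer cannot fire until it returns.
function raceTimerAgainstSyncJs() {
  return new Promise((resolve) => {
    const order = [];
    setTimeout(() => {
      order.push('timer');
      resolve(order);
    }, 0);
    const deadline = Date.now() + 50; // arbitrary 50 ms of busy work
    while (Date.now() < deadline) { /* spin on the loop thread */ }
    order.push('sync-js');
  });
}

async function main() {
  console.log(await raceTimerAgainstPoolWork());
  console.log(await raceTimerAgainstSyncJs());
}

main();
```

On a typical machine the first call should report the timer ahead of the hash, because the hash is running on a pool thread while the loop stays free. The second call always reports the JavaScript work first, because the loop thread was busy and the timer had to wait.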

The image-resize bug is now obvious: resizing is JS, so it needs a worker thread, not a bigger libuv pool. Enlarging the libuv pool only helps when you are bottlenecked on fs/dns/crypto throughput, and even then, for CPU-heavy pool work such as crypto or zlib, more pool threads than CPU cores mostly adds contention.
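
Moving that kind of work off the loop can be sketched with worker_threads. This is a minimal sketch, not a resize implementation: a prime count stands in for the CPU-heavy JavaScript, countPrimesInWorker is a hypothetical helper, and the worker source is passed inline with eval: true only so the example fits in one file.

```javascript
const { Worker } = require('node:worker_threads');

// Code that runs inside the worker's own V8 isolate (stand-in for heavy JS).
const primeWorkerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  function countPrimes(limit) {
    let count = 0;
    for (let n = 2; n <= limit; n++) {
      let isPrime = true;
      for (let d = 2; d * d <= n; d++) {
        if (n % d === 0) { isPrime = false; break; }
      }
      if (isPrime) count++;
    }
    return count;
  }
  parentPort.postMessage(countPrimes(workerData.limit));
`;

// Hypothetical helper: run the count on another thread, resolve with the result.
function countPrimesInWorker(limit) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(primeWorkerSource, {
      eval: true,
      workerData: { limit },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error('worker exited with code ' + code));
    });
  });
}

// While the worker computes, the main loop keeps servicing this interval.
let ticks = 0;
const ticker = setInterval(() => { ticks++; }, 10);
countPrimesInWorker(500000).then((count) => {
  clearInterval(ticker);
  console.log('primes:', count, 'loop ticks while waiting:', ticks);
});
```

This version starts a fresh worker per call, which keeps the sketch short but pays startup cost every time; a real service would normally reuse workers.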

The cost of a worker: moving data

Worker threads are not free, and the bill is mostly data transfer. By default, anything you postMessage to a worker is deep-copied via the structured-clone algorithm — for a large buffer that copy is real work (a multi-MB ArrayBuffer copy was measured around 268 ms, versus about 29 ms when transferred). Two escape hatches matter:

Transferables — pass an ArrayBuffer in the transferList and ownership moves to the worker with no copy (the sender can no longer use it). Near-zero transfer cost.
SharedArrayBuffer — shared memory both threads see at once, with Atomics for safe coordination. No copy, no transfer of ownership; the right tool when both sides need the same bytes.
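
Both escape hatches can be seen in one round trip. The sketch below is an illustration under assumptions: sumInWorker and the message field names pixels and counter are hypothetical, and the inline eval: true worker is used only to keep the example in a single file.

```javascript
const { Worker } = require('node:worker_threads');

const sumWorkerSource = `
  const { parentPort } = require('node:worker_threads');
  parentPort.once('message', ({ pixels, counter }) => {
    // pixels is a transferred ArrayBuffer, now owned by this thread.
    const view = new Uint8Array(pixels);
    let sum = 0;
    for (const byte of view) sum += byte;
    // counter is a SharedArrayBuffer: this write is visible to the sender.
    Atomics.add(new Int32Array(counter), 0, view.length);
    parentPort.postMessage({ sum });
  });
`;

// Hypothetical helper: transfer `pixels`, share `counter`, resolve with the reply.
function sumInWorker(pixels, counter) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(sumWorkerSource, { eval: true });
    worker.once('message', (reply) => {
      worker.terminate(); // one-shot worker, release it after the reply
      resolve(reply);
    });
    worker.once('error', reject);
    // The second argument is the transfer list: pixels moves without a copy.
    // counter is not listed; SharedArrayBuffers are shared, never copied.
    worker.postMessage({ pixels, counter }, [pixels]);
  });
}

const pixels = new ArrayBuffer(4);
new Uint8Array(pixels).set([1, 2, 3, 4]);
const counter = new SharedArrayBuffer(4);

const pending = sumInWorker(pixels, counter);
console.log('sender byteLength after transfer:', pixels.byteLength); // 0

pending.then(({ sum }) => {
  console.log('sum from worker:', sum); // 10
  console.log('bytes seen, via shared memory:',
    Atomics.load(new Int32Array(counter), 0)); // 4
});
```

The detached buffer reporting a byteLength of 0 is the visible sign that ownership moved: the sender keeps the variable but no longer has the bytes.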

So the senior calculus for offloading is: the work must be CPU-heavy enough to dwarf the message-passing cost. Offloading a 2 ms computation to a worker can be slower than just running it, once you pay the clone and round-trip.

The default deep copy costs ~9x more than a transfer — which is why offloading only pays off when the CPU work dwarfs the message-passing cost.

Why this works

Why not spin up a fresh worker per request? Thread creation and isolate startup are expensive (tens of milliseconds and real memory per worker), so per-request workers turn a CPU problem into a thread-churn problem. The production pattern is a worker pool: create a fixed set of workers once (commonly ~one per CPU core), hand them tasks over a queue, and reuse them. This caps parallelism at the hardware that actually exists — eight cores cannot truly run nine CPU-bound tasks at once — and avoids paying startup on every call. It mirrors connection pooling: the resource is expensive to create, cheap to reuse, and dangerous to create unboundedly. Libraries like Piscina exist precisely to manage this so you do not hand-roll the queue and lifecycle.
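
The pool pattern can be sketched in a few dozen lines. This is a deliberately small hand-rolled illustration, not Piscina's API and not production code: TinyWorkerPool is a hypothetical class, a naive Fibonacci stands in for the CPU-bound task, and a crashed worker is simply dropped rather than replaced.

```javascript
const { Worker } = require('node:worker_threads');
const os = require('node:os');

// Long-lived worker: stays up and answers one message per task.
const fibWorkerSource = `
  const { parentPort } = require('node:worker_threads');
  function fib(n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
  parentPort.on('message', (n) => parentPort.postMessage(fib(n)));
`;

class TinyWorkerPool {
  // Default size: roughly one worker per CPU core.
  constructor(size = Math.max(1, os.cpus().length)) {
    this.workers = [];
    this.idle = [];
    this.queue = [];
    for (let i = 0; i < size; i++) {
      const worker = new Worker(fibWorkerSource, { eval: true });
      this.workers.push(worker);
      this.idle.push(worker);
    }
  }

  // Queue a task; resolves when some worker has computed the result.
  run(input) {
    return new Promise((resolve, reject) => {
      this.queue.push({ input, resolve, reject });
      this.dispatch();
    });
  }

  // Pair idle workers with queued tasks until one side runs out.
  dispatch() {
    while (this.idle.length > 0 && this.queue.length > 0) {
      const worker = this.idle.pop();
      const task = this.queue.shift();
      const onMessage = (result) => {
        worker.off('error', onError);
        this.idle.push(worker); // reuse instead of respawning
        task.resolve(result);
        this.dispatch();
      };
      const onError = (err) => {
        worker.off('message', onMessage);
        task.reject(err); // sketch only: the dead worker is not replaced
      };
      worker.once('message', onMessage);
      worker.once('error', onError);
      worker.postMessage(task.input);
    }
  }

  // Stop all workers so the process can exit.
  close() {
    return Promise.all(this.workers.map((w) => w.terminate()));
  }
}

async function main() {
  const pool = new TinyWorkerPool(2);
  const results = await Promise.all([10, 15, 20].map((n) => pool.run(n)));
  console.log(results); // [ 55, 610, 6765 ]
  await pool.close();
}

main();
```

With two workers and three tasks, the third task waits in the queue until a worker frees up, which is the bounded-parallelism behavior described above.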

When not to offload: chunk instead

Not every long computation needs a thread. If the work can be partitioned into small pieces, you can run a chunk, call setImmediate to yield the loop, then run the next chunk, letting I/O callbacks interleave between pieces. (Awaiting an already-resolved promise does not work here: it only queues a microtask, and microtasks run before the loop gets back to I/O.) This keeps everything on the loop thread with no transfer cost, trading total throughput (the work now competes with requests) for a responsive loop. Chunking suits work that is long but interruptible (iterating a big array, paginating a computation). Worker threads suit work that is monolithic and heavy (a single resize, parsing one large payload) or that you genuinely want running in parallel on another core.
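
The chunk-and-yield loop can be sketched as follows. The names yieldToLoop and sumOfSquaresChunked are hypothetical, a sum of squares stands in for the real computation, and the chunk size is an arbitrary value you would tune against your own latency budget.

```javascript
// Hypothetical helper: resolve on the next setImmediate, which lets the loop
// run pending I/O callbacks and other queued work before we continue.
function yieldToLoop() {
  return new Promise((resolve) => setImmediate(resolve));
}

// Process `values` in slices of `chunkSize`, yielding the loop after each slice.
async function sumOfSquaresChunked(values, chunkSize = 10000) {
  let total = 0;
  for (let start = 0; start < values.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, values.length);
    for (let i = start; i < end; i++) {
      total += values[i] * values[i];
    }
    await yieldToLoop(); // other callbacks get a turn here
  }
  return total;
}

// Another callback queued before the work starts still gets to run mid-way.
setImmediate(() => console.log('other callback ran between chunks'));

const values = Array.from({ length: 50000 }, (_, i) => i % 10);
sumOfSquaresChunked(values, 5000).then((total) => {
  console.log('total:', total);
});
```

Smaller chunks keep the loop more responsive but add more yields, so total run time goes up; that is the throughput trade described above.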

And sometimes the right answer is neither: if CPU is the bottleneck across the whole service, horizontal scale — more processes (cluster) or more machines — is the lever, because one Node process maps to one loop, and CPU-bound throughput is fundamentally a core-count problem.

| Approach | Runs what | True parallelism | Main cost |
| --- | --- | --- | --- |
| libuv pool (4, tunable) | Native fs/dns/crypto/zlib | Yes, for native I/O | Wrong tool for JS CPU work |
| Worker thread | Your CPU-bound JS | Yes, another core | Data copy / transfer + startup |
| Chunk + `setImmediate` | Your JS, in pieces | No (one loop) | Lower throughput, manual partitioning |
| Cluster / more machines | Whole process replicated | Yes, more loops | Ops complexity, shared-state coordination |

Quiz

A CPU-heavy image resize (pure JavaScript) blocks the loop. Why does raising `UV_THREADPOOL_SIZE` not help?

Quiz

You offload a large buffer to a worker and find the round-trip is dominated by copying. What is the most direct fix?

Quiz

When is chunking with `setImmediate` a better choice than a worker thread?

libuv pool handles native I/O (never your JS); CPU-bound JavaScript needs a worker thread with its own V8 isolate. Both post results back to the loop as callbacks.

Recall before you leave

01. What are the two thread pools behind the event loop and what does each run?
02. What does it cost to use a worker thread, and how do you reduce that cost?
03. When should you chunk work on the loop instead of offloading, and when is neither the answer?

Recap

Two execution resources hide behind the one event loop, and offloading well means sending each kind of work to the right one. The libuv thread pool (four threads by default, tunable to 1024) runs native I/O like fs, dns, crypto, and zlib, and never your JavaScript, so enlarging it does nothing for a JS CPU bottleneck (the image-resize trap). CPU-bound JavaScript belongs in worker threads, separate V8 isolates that deliver true multicore parallelism, but they bill you for moving data: structured-clone deep-copies by default (around 268 ms for a big buffer versus 29 ms transferred), so reach for transferables or a SharedArrayBuffer, and reuse a fixed worker pool rather than spawning one per request to avoid startup churn.

When work is long but splittable, chunk it and yield with setImmediate to keep the loop responsive at the cost of throughput; when CPU is the whole-service ceiling, scale horizontally because one process is one loop.

With heavy work moved off the critical path, the next lesson turns to controlling the I/O work that stays: backpressure and bounded concurrency, so the system matches its own consumption speed instead of drowning in unbounded fan-out. Now when you see UV_THREADPOOL_SIZE mentioned in a performance thread, your first check is: is the bottleneck native I/O, or is it JavaScript? Only one of those answers makes the knob relevant.

Connected lessons

Builds on: What blocks the loop: CPU work and sync calls (Middle)

Unlocks and deepens into: Backpressure and bounded concurrency (Senior)
