Backend Architecture

Offloading CPU work: worker threads and the libuv pool

Crux: Two separate sources of parallelism hide behind the single event loop, and confusing them wastes weeks. The libuv thread pool runs native work such as fs, dns.lookup, crypto, and zlib, never your JavaScript; CPU-bound JS needs worker threads or chunking, and each carries its own cost: message-copy overhead for workers, partitioning complexity for chunking.

A team profiles a slow image-resize endpoint, finds the CPU-heavy work, and “fixes” it by bumping UV_THREADPOOL_SIZE from 4 to 32. Nothing improves. The resize is JavaScript running on the loop thread, and the libuv pool they just enlarged does not run JavaScript at all — it runs native I/O. They tuned the wrong pool. Node hides two execution resources behind the one event loop, and knowing which work goes to which is the difference between a fix and a week lost.

Two pools, two jobs

Behind the single loop there are two distinct sources of parallelism, and they serve opposite kinds of work:

  • The libuv thread pool: a small pool (default 4 threads, configurable up to 1024 via UV_THREADPOOL_SIZE) that runs native operations with no non-blocking OS primitive to lean on: file-system calls, DNS lookups via dns.lookup, and the async crypto and zlib functions (which are CPU-bound native code rather than I/O). It does not run your JavaScript. When you call fs.promises.readFile, libuv does the blocking read on a pool thread and posts the result back to the loop. This is why the async API of a native bcrypt binding does not freeze the loop: the hashing runs on a libuv thread. A pure-JavaScript implementation such as bcryptjs still hashes on the loop thread.
  • Worker threads (worker_threads): separate V8 isolates, each with its own event loop, giving true multicore parallelism for your CPU-bound JavaScript. This is where image resizing, large-payload parsing, compression, or any other heavy computation written in JavaScript belongs.

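The image-resize bug is now obvious: this resize is JavaScript, so it needs a worker thread, not a bigger libuv pool. Enlarging the libuv pool only helps when you are bottlenecked on fs, dns.lookup, crypto, or zlib throughput. Even then, for CPU-bound pool work such as crypto and zlib, more pool threads than CPU cores mostly adds contention. Note also that UV_THREADPOOL_SIZE has to be set before the pool is first used, in practice in the environment at process start.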
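The difference between the two kinds of work can be observed directly. The sketch below is illustrative only: the helper loopTurnsDuring and the iteration count are assumptions made for this example, not part of any library. It counts how many setImmediate callbacks the loop manages to run while the same PBKDF2 hash is computed asynchronously (on a libuv pool thread) and synchronously (on the loop thread).

```javascript
const { pbkdf2, pbkdf2Sync } = require('node:crypto');

// Hypothetical helper: runs `start(done)` and counts event-loop turns
// (setImmediate callbacks) that happen before `done` is called.
function loopTurnsDuring(start) {
  return new Promise((resolve) => {
    let turns = 0;
    let finished = false;
    const tick = () => {
      if (finished) return;
      turns += 1;
      setImmediate(tick);
    };
    setImmediate(tick);
    start(() => {
      finished = true;
      resolve(turns);
    });
  });
}

// Native hashing dispatched to a libuv pool thread: the loop keeps turning.
const asyncHash = (done) => pbkdf2('secret', 'salt', 100_000, 64, 'sha512', done);

// The same hashing done synchronously on the loop thread: the loop is frozen.
const syncHash = (done) => {
  pbkdf2Sync('secret', 'salt', 100_000, 64, 'sha512');
  done();
};

async function main() {
  // The sync case reports 0 turns: nothing else ran until the hash finished.
  console.log('sync on loop thread:', await loopTurnsDuring(syncHash), 'loop turns');
  // The async case reports a large, machine-dependent number of turns.
  console.log('async on libuv pool:', await loopTurnsDuring(asyncHash), 'loop turns');
}

main();
```

A CPU-heavy JavaScript function behaves like the synchronous case here, which is why no libuv pool size can help it.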

The cost of a worker: moving data

Worker threads are not free, and the bill is mostly data transfer. By default, anything you postMessage to a worker is deep-copied via the structured-clone algorithm, and for a large buffer that copy is real work: in one measurement a multi-MB ArrayBuffer took around 268 ms to copy versus about 29 ms to transfer, though exact figures depend on buffer size and hardware. Two escape hatches matter:

  • Transferables — pass an ArrayBuffer in the transferList and ownership moves to the worker with no copy (the sender can no longer use it). Near-zero transfer cost.
  • SharedArrayBuffer — shared memory both threads see at once, with Atomics for safe coordination. No copy, no transfer of ownership; the right tool when both sides need the same bytes.
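The ownership change that comes with a transfer is easy to see without starting a worker. The following sketch uses a throwaway MessageChannel from worker_threads, on the assumption that a Worker's postMessage follows the same clone and transfer rules; the helper name bytesLeftAfterSend is made up for this example.

```javascript
const { MessageChannel } = require('node:worker_threads');

// Hypothetical helper: posts `buffer` over a throwaway channel and reports
// how many bytes the sender can still see afterwards.
function bytesLeftAfterSend(buffer, { transfer }) {
  const { port1 } = new MessageChannel();
  // With a transferList the buffer is moved; without one it is cloned.
  port1.postMessage(buffer, transfer ? [buffer] : []);
  port1.close();
  return buffer.byteLength;
}

const copied = new ArrayBuffer(8 * 1024 * 1024);
const moved = new ArrayBuffer(8 * 1024 * 1024);

// Cloned: the sender keeps its own 8 MiB copy.
console.log(bytesLeftAfterSend(copied, { transfer: false })); // 8388608
// Transferred: the sender's buffer is detached and reads as empty.
console.log(bytesLeftAfterSend(moved, { transfer: true })); // 0
```

The detached buffer is the price of the cheap handoff: any code on the sending side that still holds a reference sees a zero-length buffer.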

So the senior calculus for offloading is: the work must be CPU-heavy enough to dwarf the message-passing cost. Offloading a 2 ms computation to a worker can be slower than just running it, once you pay the clone and round-trip.
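When both sides need the same bytes, shared memory removes the message cost from that calculus entirely. Here is a minimal sketch, assuming an inline worker script (eval: true) and a made-up helper incrementInWorker: two workers increment one counter in a SharedArrayBuffer, and Atomics.add keeps the updates from being lost.

```javascript
const { Worker } = require('node:worker_threads');

// Inline worker source, kept in a string so the example is a single file.
const COUNTER_WORKER = `
  const { workerData, parentPort } = require('node:worker_threads');
  const counter = new Int32Array(workerData.shared);
  for (let i = 0; i < workerData.times; i++) {
    Atomics.add(counter, 0, 1); // atomic read-modify-write on shared memory
  }
  parentPort.postMessage('done');
`;

// Hypothetical helper: starts a worker that bumps slot 0 of `shared` `times` times.
function incrementInWorker(shared, times) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(COUNTER_WORKER, {
      eval: true,
      workerData: { shared, times }, // the SharedArrayBuffer is shared, not copied
    });
    worker.once('message', () => resolve());
    worker.once('error', reject);
  });
}

async function main() {
  const shared = new SharedArrayBuffer(4); // room for one Int32
  await Promise.all([
    incrementInWorker(shared, 100_000),
    incrementInWorker(shared, 100_000),
  ]);
  console.log(new Int32Array(shared)[0]); // 200000
}

main();
```

Replacing Atomics.add with a plain counter[0] += 1 would allow the two workers to overwrite each other's updates, which is the coordination problem Atomics exists to solve.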

Why this works

Why not spin up a fresh worker per request? Thread creation and isolate startup are expensive (tens of milliseconds and real memory per worker), so per-request workers turn a CPU problem into a thread-churn problem. The production pattern is a worker pool: create a fixed set of workers once (commonly ~one per CPU core), hand them tasks over a queue, and reuse them. This caps parallelism at the hardware that actually exists — eight cores cannot truly run nine CPU-bound tasks at once — and avoids paying startup on every call. It mirrors connection pooling: the resource is expensive to create, cheap to reuse, and dangerous to create unboundedly. Libraries like Piscina exist precisely to manage this so you do not hand-roll the queue and lifecycle.
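To make the pool idea concrete, here is a deliberately small sketch of a fixed-size pool with a task queue. It is not Piscina's API: the class TinyPool, its methods, and the summing task are hypothetical stand-ins, and it omits things a production pool needs, such as replacing crashed workers, timeouts, and cancellation.

```javascript
const { Worker } = require('node:worker_threads');
const os = require('node:os');

// Hypothetical task: a plain loop standing in for real CPU-bound work.
const WORKER_SOURCE = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', (n) => {
    let total = 0;
    for (let i = 1; i <= n; i++) total += i;
    parentPort.postMessage(total);
  });
`;

const defaultSize = () => os.availableParallelism?.() ?? os.cpus().length;

class TinyPool {
  constructor(size = defaultSize()) {
    // Workers are created once, up front, and reused for every task.
    this.all = Array.from({ length: size }, () => new Worker(WORKER_SOURCE, { eval: true }));
    this.idle = [...this.all];
    this.queue = [];
  }

  // Queues a task; resolves with the worker's reply.
  run(input) {
    return new Promise((resolve, reject) => {
      this.queue.push({ input, resolve, reject });
      this.#drain();
    });
  }

  // Hands queued tasks to idle workers until one of the two runs out.
  #drain() {
    while (this.idle.length > 0 && this.queue.length > 0) {
      const worker = this.idle.pop();
      const { input, resolve, reject } = this.queue.shift();
      const onMessage = (result) => {
        worker.off('error', onError);
        this.idle.push(worker); // back into rotation
        resolve(result);
        this.#drain();
      };
      const onError = (err) => {
        worker.off('message', onMessage);
        reject(err); // sketch limitation: a crashed worker is not replaced
      };
      worker.once('message', onMessage);
      worker.once('error', onError);
      worker.postMessage(input);
    }
  }

  // Workers keep the process alive until they are terminated.
  close() {
    return Promise.all(this.all.map((w) => w.terminate()));
  }
}

async function main() {
  const pool = new TinyPool(2);
  try {
    // Three tasks share two workers; the third waits in the queue.
    console.log(await Promise.all([10, 100, 1000].map((n) => pool.run(n)))); // [ 55, 5050, 500500 ]
  } finally {
    await pool.close();
  }
}

main();
```

The queue is what caps parallelism: no matter how many tasks arrive, at most `size` of them run at once and the rest wait their turn.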

When not to offload: chunk instead

Not every long computation needs a thread. If the work can be partitioned into small pieces, you can run a chunk, yield the loop with setImmediate, then run the next chunk, letting I/O callbacks interleave between pieces. Awaiting an already-resolved promise is not a substitute: its continuation runs as a microtask, before the loop moves on to timers or I/O, so nothing else gets a turn. Chunking keeps everything on the loop thread with no transfer cost, trading total throughput (the work now competes with requests) for a responsive loop. It suits work that is long but interruptible (iterating a big array, paginating a computation). Worker threads suit work that is monolithic and heavy (a single resize, a synchronous crypto call) or that you genuinely want running in parallel on another core.
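A chunked loop can be sketched in a few lines. The names yieldToLoop and sumInChunks and the default chunk size are assumptions made for illustration; the yield wraps setImmediate in a promise because promise microtasks alone run before the loop reaches pending I/O.

```javascript
// Resolves on the next pass through the loop's check phase, after pending
// I/O callbacks have had a chance to run.
const yieldToLoop = () => new Promise((resolve) => setImmediate(resolve));

// Hypothetical example task: sums a large array without holding the loop
// for longer than one chunk at a time.
async function sumInChunks(items, chunkSize = 10_000) {
  let total = 0;
  for (let start = 0; start < items.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, items.length);
    for (let i = start; i < end; i++) {
      total += items[i];
    }
    await yieldToLoop(); // let timers and I/O callbacks interleave here
  }
  return total;
}

const numbers = Array.from({ length: 1_000_000 }, (_, i) => i + 1);
sumInChunks(numbers).then((total) => console.log(total)); // 500000500000
```

The chunk size is the tuning knob: smaller chunks give a more responsive loop but spend more time on scheduling, so it is worth measuring against the latency budget of the requests sharing the loop.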

And sometimes the right answer is neither: if CPU is the bottleneck across the whole service, horizontal scale — more processes (cluster) or more machines — is the lever, because one Node process maps to one loop, and CPU-bound throughput is fundamentally a core-count problem.

Approach | Runs what | True parallelism | Main cost
--- | --- | --- | ---
libuv pool (4, tunable) | Native fs/dns/crypto/zlib | Yes, for native I/O | Wrong tool for JS CPU work
Worker thread | Your CPU-bound JS | Yes, another core | Data copy / transfer + startup
Chunk + setImmediate | Your JS, in pieces | No (one loop) | Lower throughput, manual partitioning
Cluster / more machines | Whole process replicated | Yes, more loops | Ops complexity, shared-state coordination
Quiz

  1. A CPU-heavy image resize (pure JavaScript) blocks the loop. Why does raising UV_THREADPOOL_SIZE not help?
  2. You offload a large buffer to a worker and find the round-trip is dominated by copying. What is the most direct fix?
  3. When is chunking with setImmediate a better choice than a worker thread?

Recall before you leave
  1. What are the two thread pools behind the event loop and what does each run?
  2. What does it cost to use a worker thread, and how do you reduce that cost?
  3. When should you chunk work on the loop instead of offloading, and when is neither the answer?
Recap

Two execution resources hide behind the one event loop, and offloading well means sending each kind of work to the right one. The libuv thread pool — four threads by default, tunable to 1024 — runs native I/O like fs, dns, crypto, and zlib, and never your JavaScript, so enlarging it does nothing for a JS CPU bottleneck (the image-resize trap). CPU-bound JavaScript belongs in worker threads, separate V8 isolates that deliver true multicore parallelism, but they bill you for moving data: structured-clone deep-copies by default (around 268 ms for a big buffer versus 29 ms transferred), so reach for transferables or a SharedArrayBuffer, and reuse a fixed worker pool rather than spawning one per request to avoid startup churn. When work is long but splittable, chunk it and yield with setImmediate to keep the loop responsive at the cost of throughput; when CPU is the whole-service ceiling, scale horizontally because one process is one loop. With heavy work moved off the critical path, the next lesson turns to controlling the I/O work that stays: backpressure and bounded concurrency, so the system matches its own consumption speed instead of drowning in unbounded fan-out.
