What blocks the loop: CPU work and sync calls

Cooperative concurrency has one fatal failure mode: any synchronous work on the loop thread freezes every connection at once. Sync file reads, JSON.parse on big payloads, sync crypto, and catastrophic regexes are the usual culprits; event-loop lag is the first signal that exposes them.

A health-check endpoint that does nothing but return 200 OK starts timing out. Nothing touches it; its code is two lines. The real cause is three routes away: a reporting endpoint calls JSON.parse on a 40 MB payload, and for the ~800 ms that parse runs, the single loop thread is busy and every other request — including the trivial health check — sits frozen in the queue. Nobody wrote a slow health check. Someone wrote one slow synchronous line, and cooperative concurrency spread the pain to the whole process.

One slow callback stalls everyone

The last lesson’s payoff and price are the same fact: callbacks run to completion with no preemption. As long as everything yields quickly — await on I/O, return fast — thousands of connections interleave smoothly. But the moment one callback does synchronous work that takes real time, the loop cannot advance to the poll phase, cannot run any other I/O callback, cannot fire any timer. The cost is not “this request is slow.” It is head-of-line blocking for the entire process: every concurrent request pays the full duration, because they are all waiting behind the one callback hogging the thread.

This is the defining failure mode of the model. In a thread-per-connection server, one slow request slows that thread; here, one slow synchronous span slows all of them.
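The stall described above can be reproduced in a few lines. The following is a minimal sketch, not production code: `busyFor` is a hypothetical stand-in for any slow synchronous span (a big parse, a sync hash), and the 0 ms timer plays the role of the trivial health check that is due immediately but has to wait.

```javascript
const { performance } = require('node:perf_hooks');

// Hypothetical stand-in for slow synchronous work: spins without yielding.
function busyFor(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) { /* spin: the loop cannot advance */ }
}

// Schedules a timer that is due immediately, then blocks the thread.
// Resolves with how long the timer actually waited before it fired.
function measureTimerLateness(blockMs) {
  return new Promise((resolve) => {
    const scheduled = performance.now();
    setTimeout(() => resolve(performance.now() - scheduled), 0);
    // The current callback must run to completion before any timer can fire.
    busyFor(blockMs);
  });
}

const lateness = measureTimerLateness(200);
lateness.then((lateMs) => {
  console.log(`timer due at 0 ms fired after ${lateMs.toFixed(0)} ms`);
});
```

The timer cannot fire before the spin finishes, so the reported delay is at least the blocked duration. Every pending socket callback in a real server would wait the same way.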

The usual culprits

Blocking work comes in two flavors: synchronous APIs that do I/O on the loop thread, and CPU-bound computation that simply takes too long between yields.

Sync I/O APIs: fs.readFileSync, fs.writeFileSync, child_process.execSync. These perform the I/O on the loop thread and can stall it for hundreds of milliseconds; in one measurement, a sync read of a large file took around 1200 ms. The async twins (such as fs.promises.readFile) hand the work off and let the loop continue.
Big JSON.parse / JSON.stringify: parsing or serializing a multi-megabyte payload is pure CPU on the loop thread; one large parse was measured at around 800 ms of frozen loop.
Synchronous crypto and compression: bcrypt.hashSync at a realistic cost factor blocks roughly 200–400 ms per call, so under login load that single line collapses throughput. The same applies to other synchronous hashing such as crypto.pbkdf2Sync, and to zlib.gzipSync on large inputs.
Catastrophic regex (ReDoS): a pattern with nested quantifiers like /A(B|C+)+D/ run against a crafted string can backtrack exponentially; one documented case spent ~3.7 seconds of pure CPU on a single input. Because the match runs on the loop thread, an attacker can freeze the whole server with one request, which makes it a denial-of-service vector.
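To make the difference between a sync API and its async twin concrete, here is a hedged sketch using Node's built-in crypto.pbkdf2Sync and crypto.pbkdf2 (chosen instead of bcrypt only so the example needs no third-party package). The iteration count, password, and salt are illustrative assumptions, and `timerLatenessWhile` is a hypothetical helper that reports how late a 0 ms timer fires while the hash is in progress.

```javascript
const crypto = require('node:crypto');
const { promisify } = require('node:util');
const { performance } = require('node:perf_hooks');

const pbkdf2Async = promisify(crypto.pbkdf2);
const ITERATIONS = 200_000; // illustrative work factor, not a recommendation
const KEY_LEN = 32;

// Same derivation twice: once on the loop thread, once handed off.
const hashSync = (pw, salt) =>
  crypto.pbkdf2Sync(pw, salt, ITERATIONS, KEY_LEN, 'sha256');
const hashAsync = (pw, salt) =>
  pbkdf2Async(pw, salt, ITERATIONS, KEY_LEN, 'sha256');

// Hypothetical helper: how late does a 0 ms timer fire while the work runs?
async function timerLatenessWhile(startWork) {
  const scheduled = performance.now();
  const lateness = new Promise((resolve) => {
    setTimeout(() => resolve(performance.now() - scheduled), 0);
  });
  // Sync work blocks right here; async work returns a promise at once.
  const key = await startWork();
  return { key, lateMs: await lateness };
}

async function compare() {
  const blocking = await timerLatenessWhile(() => hashSync('hunter2', 'demo-salt'));
  const offloaded = await timerLatenessWhile(() => hashAsync('hunter2', 'demo-salt'));
  return { blocking, offloaded };
}

const comparison = compare();
comparison.then(({ blocking, offloaded }) => {
  console.log(`sync hash:  timer late by ${blocking.lateMs.toFixed(1)} ms`);
  console.log(`async hash: timer late by ${offloaded.lateMs.toFixed(1)} ms`);
});
```

Both calls derive the same key. What should differ is the timer lateness: in the sync case the timer waits for the whole hash, while in the async case it can fire while the hash is still being computed off the loop thread.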

Routine sync calls freeze the loop for anywhere from a few hundred milliseconds to over a second; a single attacker-controlled ReDoS dwarfs them all at ~3.7 s, and unlike the others its cost keeps growing with the length of the crafted input.
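The growth of the regex cost can be observed directly. This sketch times the pattern from the list above against inputs that can never match (a run of `C` with no trailing `D`); the input lengths are deliberately small assumptions so the script finishes quickly, and `timeMatch` is a hypothetical helper.

```javascript
const { performance } = require('node:perf_hooks');

// Nested quantifier: (B|C+)+ can split a run of C's in exponentially many ways.
const pattern = /A(B|C+)+D/;

// Times one failed match against "A" + n C's + "X".
function timeMatch(cCount) {
  const input = 'A' + 'C'.repeat(cCount) + 'X'; // no D, so the match must fail
  const start = performance.now();
  const matched = pattern.test(input);
  return { cCount, matched, ms: performance.now() - start };
}

const timings = [14, 16, 18, 20, 22].map(timeMatch);
for (const { cCount, matched, ms } of timings) {
  console.log(`${cCount} C's -> matched=${matched}, ${ms.toFixed(2)} ms`);
}
```

With a backtracking engine, each additional `C` roughly doubles the work before the engine gives up, so a few more characters in the request are enough to move from milliseconds to seconds.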

Event-loop lag: seeing it before users do

You do not need users to report timeouts to find blocking. The direct signal is event-loop lag (a.k.a. event-loop delay): schedule a timer for t ms and measure how late it actually fires. If a setTimeout(fn, 0) consistently runs 200 ms late, the loop was busy 200 ms — that lateness is the blocking, quantified. Node exposes perf_hooks.monitorEventLoopDelay() for a histogram (p50/p99 of lag), and tools like clinic.js surface it; a common production alert threshold is around 100 ms of lag.
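A minimal sketch of reading the lag histogram follows. It uses perf_hooks.monitorEventLoopDelay(), whose samples are reported in nanoseconds; the 10 ms resolution, the simulated 150 ms blocking span, and the `ALERT_THRESHOLD_MS` constant are assumptions made for the demonstration.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

const ALERT_THRESHOLD_MS = 100; // assumed alert level, echoing the text above

// Hypothetical stand-in for a blocking callback.
function busyFor(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) { /* spin */ }
}

// Records loop delay for observeMs while one blocking span of blockMs occurs.
function sampleLoopDelay(blockMs, observeMs) {
  return new Promise((resolve) => {
    const histogram = monitorEventLoopDelay({ resolution: 10 });
    histogram.enable();
    setTimeout(() => busyFor(blockMs), 50); // the offending callback
    setTimeout(() => {
      histogram.disable();
      const toMs = (ns) => ns / 1e6; // histogram values are nanoseconds
      resolve({
        p50Ms: toMs(histogram.percentile(50)),
        p99Ms: toMs(histogram.percentile(99)),
        maxMs: toMs(histogram.max),
      });
    }, observeMs);
  });
}

const sample = sampleLoopDelay(150, 400);
sample.then(({ p50Ms, p99Ms, maxMs }) => {
  console.log(`p50=${p50Ms.toFixed(1)} ms p99=${p99Ms.toFixed(1)} ms max=${maxMs.toFixed(1)} ms`);
  if (maxMs > ALERT_THRESHOLD_MS) {
    console.warn('event-loop delay above alert threshold');
  }
});
```

In a real service the histogram would stay enabled and be read and reset on a reporting interval rather than sampled once.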

Why this works

Why is event-loop lag a better health signal than CPU usage? CPU can read 100% for a perfectly healthy reason: the loop is doing useful, well-chunked work and still yielding between units. What hurts users is not a busy CPU; it is the loop failing to return to the poll phase to service waiting sockets. Lag measures exactly that gap: the time between when a callback was due and when the loop actually got to it. A server can sit at 60% CPU with 500 ms of loop lag (one fat synchronous span, repeated over and over) and be far sicker than one at 95% CPU with 2 ms of lag (steady, yielding work). This is why senior teams alert on event-loop delay and event-loop utilization (ELU), not just CPU: lag is the metric that correlates with the timeouts users actually feel.
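The contrast between busy-and-yielding and busy-and-stalled can be sketched as follows. The same 200 ms of CPU work runs once as a single span and once in 5 ms slices separated by setImmediate, while a crude timer-based probe records the worst lag and performance.eventLoopUtilization() reports ELU. The helper names (`busyFor`, `busyChunked`, `startLagProbe`, `measure`) and all durations are assumptions for illustration.

```javascript
const { performance } = require('node:perf_hooks');

// Hypothetical stand-in for CPU-bound work: spins without yielding.
function busyFor(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) { /* spin */ }
}

// The same total CPU time, split into slices with a yield between each.
function busyChunked(totalMs, sliceMs) {
  return new Promise((resolve) => {
    let remaining = totalMs;
    const step = () => {
      busyFor(Math.min(sliceMs, remaining));
      remaining -= sliceMs;
      if (remaining > 0) setImmediate(step); // yield so timers and I/O can run
      else resolve();
    };
    step();
  });
}

// Crude lag probe: tracks the worst lateness of a repeating timer.
function startLagProbe(intervalMs) {
  let maxLagMs = 0;
  let expected = performance.now() + intervalMs;
  const timer = setInterval(() => {
    const now = performance.now();
    maxLagMs = Math.max(maxLagMs, now - expected);
    expected = now + intervalMs;
  }, intervalMs);
  return () => { clearInterval(timer); return maxLagMs; };
}

async function measure(work) {
  await new Promise((resolve) => setImmediate(resolve)); // make sure the loop is running
  const stopProbe = startLagProbe(5);
  const before = performance.eventLoopUtilization();
  await work();
  await new Promise((resolve) => setTimeout(resolve, 20)); // let the probe tick again
  const elu = performance.eventLoopUtilization(before).utilization;
  return { elu, maxLagMs: stopProbe() };
}

async function compare() {
  const fat = await measure(() => busyFor(200));
  const chunked = await measure(() => busyChunked(200, 5));
  return { fat, chunked };
}

const comparison = compare();
comparison.then(({ fat, chunked }) => {
  console.log(`one span: ELU=${fat.elu.toFixed(2)} worst lag=${fat.maxLagMs.toFixed(0)} ms`);
  console.log(`chunked:  ELU=${chunked.elu.toFixed(2)} worst lag=${chunked.maxLagMs.toFixed(0)} ms`);
});
```

Both runs keep the loop about equally busy, so a utilization-style number alone would not separate them. The worst-case lag should, because only the single span keeps timers from firing for its whole duration.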

The mental test

Before any line runs on the loop thread, the senior reflex is one question: is this bounded and fast, or could it run for tens of milliseconds on a big input? A synchronous read of a 2 KB config file at startup is fine. Parsing arbitrary user-supplied JSON of unknown size, hashing a password, or matching a user-controlled string against a backtracking regex on the request path is not; that work belongs off the loop, which is the next lesson.
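One way to turn the "bounded and fast" question into code is to refuse unbounded input before parsing it. The sketch below is an illustration under assumptions: `parseBoundedJson` is a hypothetical helper and the 1 MiB budget is an arbitrary example value, not a recommendation.

```javascript
// Assumed budget for what may be parsed on the loop thread.
const MAX_JSON_BYTES = 1024 * 1024; // 1 MiB, illustrative only

// Parses JSON only when the input size is known to be within the budget.
function parseBoundedJson(text, maxBytes = MAX_JSON_BYTES) {
  const bytes = Buffer.byteLength(text, 'utf8');
  if (bytes > maxBytes) {
    // Too large for the loop thread: reject, or route to an off-loop path.
    throw new RangeError(`JSON body is ${bytes} bytes; limit is ${maxBytes}`);
  }
  return JSON.parse(text);
}

// Usage: a small body parses; an oversized one is rejected before any parse work.
console.log(parseBoundedJson('{"status":"ok"}'));
try {
  parseBoundedJson('"' + 'x'.repeat(64) + '"', 16);
} catch (err) {
  console.log(`${err.name}: ${err.message}`);
}
```

The size check bounds the worst-case parse time on the loop thread; payloads over the budget are the ones that belong off the loop.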

| Blocking culprit | Rough frozen time | Why it blocks | Fix direction |
| --- | --- | --- | --- |
| `fs.readFileSync` (large file) | ~1200 ms | I/O on the loop thread | Async `fs.promises` |
| `JSON.parse` (multi-MB) | ~800 ms | Pure CPU on the loop thread | Stream / worker thread |
| `bcrypt.hashSync` | ~200–400 ms per call | CPU on the loop thread | Async bcrypt (libuv pool) |
| Catastrophic regex (ReDoS) | Seconds, attacker-controlled | Exponential backtracking on the loop thread | Safe regex / timeout / validate input |

Quiz

A trivial health-check endpoint times out whenever a reporting route runs `JSON.parse` on a 40 MB body. Why does the health check suffer?

Why is event-loop lag often a better health signal than CPU utilization?

Why is a catastrophic-backtracking regex on the request path a denial-of-service risk specifically in an event-loop runtime?

Head-of-line blocking: one synchronous span (fs.readFileSync ~1200 ms, JSON.parse ~800 ms, bcrypt.hashSync ~200–400 ms) prevents the loop from reaching poll, stalling every other connection for its full duration.

Recall before you leave

01. Why does one slow synchronous callback freeze the entire server rather than just its own request?
02. What are the common things that block the loop, and roughly how long do they freeze it?
03. What is event-loop lag, how do you measure it, and why is it better than CPU usage as a health signal?

Recap

The strength of cooperative concurrency — callbacks run to completion without preemption — is also its one fatal failure mode: any synchronous span on the loop thread freezes every connection at once, so a slow line three routes away can time out a two-line health check. The culprits fall into sync I/O on the loop (fs.readFileSync, around 1200 ms), heavy CPU between yields (JSON.parse of a multi-MB body near 800 ms, bcrypt.hashSync at 200–400 ms a call), and attacker-controllable catastrophic regexes that backtrack for seconds and turn one request into a denial of service. You see all of this before users do through event-loop lag — the lateness of a scheduled timer, surfaced by monitorEventLoopDelay and alerted near 100 ms — which is a truer health signal than CPU because busy-and-yielding is fine while busy-and-stalled is not. The reflex is to ask whether any line on the loop is bounded and fast or could run for tens of milliseconds on a big input; the slow ones belong off the loop entirely, which is the next lesson: worker threads, the libuv pool, and chunking CPU work. Now when you see event-loop lag spiking or a health check timing out for no obvious reason, your first question is: what synchronous span is hogging the thread?
