Async vs blocking: unfreeze the loop

Hands-on project: build a service that blocks its own event loop, instrument the lag, then offload CPU work, bound the fan-out, and prove with before/after numbers that the tail recovered.

Track: BE · Level: Senior · Time: 240 min

Reading about a frozen loop is not the same as pulling a service out of one. Build a small server that blocks itself in three different ways, watch a trivial health check time out, then apply the unit’s fixes — offload, bound, backpressure — until the tail comes back, with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible engineering loop: instrument event-loop lag and the tail, reproduce a self-inflicted freeze, move CPU work off the loop, bound the fan-out, and verify with before/after numbers under identical load.
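
The instrumentation step can be sketched with Node.js's built-in perf_hooks module. This is a minimal sketch, assuming the service runs on Node.js (the unit's references to libuv and UV_THREADPOOL_SIZE suggest it does). The names startLoopProbe, delayP99Ms, delayMaxMs, and elu are hypothetical, chosen for illustration only.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// Hypothetical helper: samples event-loop delay and event-loop utilization (ELU).
// The histogram reports nanoseconds; the snapshot converts to milliseconds.
function startLoopProbe(resolutionMs = 10) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let lastElu = performance.eventLoopUtilization();

  return {
    // Returns readings for the window since the previous read(), then resets.
    read() {
      const now = performance.eventLoopUtilization();
      const delta = performance.eventLoopUtilization(now, lastElu);
      lastElu = now;
      const snapshot = {
        delayP99Ms: histogram.percentile(99) / 1e6,
        delayMaxMs: histogram.max / 1e6,
        elu: delta.utilization,
      };
      histogram.reset();
      return snapshot;
    },
    stop() {
      histogram.disable();
    },
  };
}
```

Calling read() on a fixed interval (for example once per second) and logging the snapshot next to request latency gives one row per window, which makes a freeze visible as a spike in delayMaxMs while elu sits near 1.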

Project

Objective

Take a deliberately loop-blocking HTTP service (your own or a starter) with a trivial /health endpoint and drive it into event-loop lag and tail-latency collapse. Then apply the unit's fixes (offload CPU work, bound concurrency, honour backpressure) so that, at sustained load, /health p99 latency stays under its target and event-loop delay p99 stays under ~50 ms. Prove each step with measurements.
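
The "offload CPU" fix might look like the following sketch, which moves a CPU-bound span into a worker thread so the main loop stays free to answer /health. It assumes Node.js worker_threads. The naive Fibonacci is only a stand-in for whatever CPU offender the service has, and fibOffloaded and workerSource are hypothetical names.

```javascript
const { Worker } = require('node:worker_threads');

// Stand-in CPU offender, run inside the worker rather than on the main loop.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const fib = (n) => (n < 2 ? n : fib(n - 1) + fib(n - 2));
  parentPort.postMessage(fib(workerData.n));
`;

// Resolves with the worker's result; rejects on a worker error or non-zero exit.
function fibOffloaded(n) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { n } });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}
```

Spawning one worker per call keeps the sketch short but pays thread start-up cost on every request; a small fixed pool of long-lived workers is the more likely shape for sustained load, and it also bounds how much CPU work runs at once.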

Requirements

Acceptance criteria

A before/after table per route: event-loop delay p99, event-loop utilization (ELU), request p99 and p99.9, and in-flight concurrency, all measured under the same load, not estimated.
A demonstration that /health stays fast (p99 under target) while every offender is hammered, proving the loop no longer head-of-line-blocks the whole process.
Event-loop delay p99 holds under ~50 ms at sustained load and the freeze signature is gone from the lag histogram.
A one-paragraph write-up naming the fix used for each offender (async API vs worker thread vs concurrency cap vs backpressure) and why it ranked above tuning UV_THREADPOOL_SIZE or adding cores.
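
For the before/after table, request p99 and p99.9 can be computed from raw latency samples. The sketch below uses the nearest-rank definition of a percentile, which is one common choice and an assumption here, since the unit does not say which definition it expects. The names percentile and summarize are hypothetical.

```javascript
// Nearest-rank percentile over raw latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to limit floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// One table row for a run, e.g. summarize('before', latenciesMs).
function summarize(label, samples) {
  return {
    run: label,
    count: samples.length,
    p99: percentile(samples, 99),
    'p99.9': percentile(samples, 99.9),
  };
}
```

A p99.9 figure only means something with enough samples: with fewer than 1,000 requests in a run, nearest-rank p99.9 is simply the slowest request.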

Senior stretch

Add a ReDoS offender (a catastrophic-backtracking regex on a user-controlled field), show one crafted request freezes the loop for seconds, then fix it with a safe regex / input validation / match timeout and prove the freeze is gone.
Add a one-page on-call runbook: triage from the four panels, the question 'is this span bounded and fast or could it run tens of ms on a big input?', the offload-vs-chunk-vs-bound decision, and a verification checklist.
Run the service under the Node.js cluster module (or as multiple instances behind a load balancer) and show how 'one loop is one core' changes the saturation point and tail under the same load.
Reproduce the same blocking workload on a second runtime (Go goroutines or Java virtual threads) and compare how the identical CPU span and fan-out manifest under a preemptive, multicore scheduler.
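
The ReDoS stretch item can be illustrated with a textbook catastrophic-backtracking pattern. This is a sketch under assumptions: the pattern, the 256-character limit, and the names VULNERABLE, SAFE, MAX_FIELD_LENGTH, isValidField, and timeMs are all hypothetical, and the crafted input is kept short so the demo finishes quickly instead of freezing for seconds.

```javascript
// Vulnerable: the nested quantifier lets a backtracking engine try exponentially
// many ways to split the run of "a" characters before reporting a failed match.
const VULNERABLE = /^(a+)+$/;

// Accepts the same strings without nesting, so a failed match is cheap.
const SAFE = /^a+$/;

const MAX_FIELD_LENGTH = 256; // hypothetical limit for this user-controlled field

// Validate length first, then match with the non-nested pattern.
function isValidField(input) {
  if (typeof input !== 'string' || input.length > MAX_FIELD_LENGTH) return false;
  return SAFE.test(input);
}

// Wall-clock time of a synchronous call, in milliseconds.
function timeMs(fn) {
  const start = process.hrtime.bigint();
  fn();
  return Number(process.hrtime.bigint() - start) / 1e6;
}

// Each extra "a" roughly doubles the work for the vulnerable pattern,
// so keep the run short when experimenting.
const crafted = 'a'.repeat(20) + '!';
console.log('vulnerable ms:', timeMs(() => VULNERABLE.test(crafted)));
console.log('safe ms:', timeMs(() => isValidField(crafted)));
```

Lengthening the run of "a" characters a few at a time while watching event-loop delay shows the freeze growing from milliseconds toward seconds, which is the before measurement for this offender.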

Recap

This is the loop you will run in every real freeze incident: instrument event-loop delay and the tail first, reproduce the self-inflicted block, then fix at the right layer — async API or worker thread for CPU work, a concurrency cap for fan-out, pipeline backpressure for streams — never a bigger libuv pool for JS CPU and never more cores for one loop. Verify with before/after numbers under identical load, with a trivial /health route as the canary. Doing it once on a toy service makes the production version muscle memory.
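
The concurrency cap for fan-out can be as small as the sketch below: a fixed number of runners pull items from a shared cursor, so in-flight work never exceeds the limit. The name mapWithLimit is hypothetical, and this is one possible shape rather than the unit's prescribed implementation.

```javascript
// Run `task` over `items` with at most `limit` calls in flight at once.
// Results keep the order of `items`. If any task rejects, the returned promise
// rejects; tasks that were already started are not cancelled.
async function mapWithLimit(items, limit, task) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0; // shared cursor; safe because runners interleave only at await

  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Replacing an unbounded Promise.all(items.map(task)) with mapWithLimit(items, 8, task) turns a fan-out that scales with input size into one with a fixed ceiling; the value 8 is only a placeholder to be tuned against the measured tail.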
