Backend Architecture
Async vs blocking: unfreeze the loop
Reading about a frozen loop is not the same as pulling a service out of one. Build a small server that blocks itself in three different ways, watch a trivial health check time out, then apply the unit’s fixes (offload, bound, backpressure) until tail latency recovers, with evidence at every step.
Turn the unit’s mental model into a reproducible engineering loop: instrument event-loop lag and tail latency, reproduce a self-inflicted freeze, move CPU work off the loop, bound the fan-out, and verify with before/after numbers under identical load.
Take a deliberately loop-blocking HTTP service (your own or a starter) with a trivial /health endpoint and drive it into event-loop lag and tail-latency collapse. Then apply the unit's fixes (offload CPU work, bound concurrency, honour backpressure) so that p99 /health latency stays under the target and event-loop delay p99 stays under ~50 ms at sustained load, proving each step with measurements.
- A before/after table per route: event-loop delay p99, event-loop utilization (ELU), request p99 and p99.9, and in-flight concurrency, all measured under the same load, not estimated.
- A demonstration that /health stays fast (p99 under target) while every offender is hammered, proving the loop no longer head-of-line-blocks the whole process.
- Event-loop delay p99 holds under ~50 ms at sustained load and the freeze signature is gone from the lag histogram.
- A one-paragraph write-up naming the fix used for each offender (async API vs worker thread vs concurrency cap vs backpressure) and why it ranked above tuning UV_THREADPOOL_SIZE or adding cores.
- Add a ReDoS offender (a catastrophic-backtracking regex on a user-controlled field), show one crafted request freezes the loop for seconds, then fix it with a safe regex / input validation / match timeout and prove the freeze is gone.
- Add a one-page on-call runbook: triage from the four panels, the question 'is this span bounded and fast or could it run tens of ms on a big input?', the offload-vs-chunk-vs-bound decision, and a verification checklist.
- Run the service under the cluster module (or as multiple instances behind a load balancer) and show how 'one loop is one core' changes the saturation point and tail under the same load.
- Reproduce the same blocking workload on a second runtime (Go goroutines or Java virtual threads) and compare how the identical CPU span and fan-out manifest under a preemptive, multicore scheduler.
This is the loop you will run in every real freeze incident: instrument event-loop delay and tail latency first, reproduce the self-inflicted block, then fix at the right layer: an async API or worker thread for CPU work, a concurrency cap for fan-out, pipeline backpressure for streams. Never reach for a bigger libuv pool to fix JS CPU work, and never for more cores to fix one loop. Verify with before/after numbers under identical load, with a trivial /health route as the canary. Doing it once on a toy service makes the production version muscle memory.