Browser & Frontend Runtime
Worker pools, Comlink, and production observability
You moved the 400 ms image job to a worker. Then users start clicking rapidly, and the page accumulates fifty pending jobs, each spawning a new worker that is never cleaned up. The browser tab now uses 400 MB and gets killed on mobile.
When workers pay off — and when they do not
Spawning a worker is not free: a new realm, a new event loop, a fresh copy of any imported scripts — typically 5–20 ms of startup plus a few MB of memory. A worker is worth it when:
- The task is long (tens of ms or more), so startup is amortised.
- The result-transfer cost is small relative to the compute.
It is a net loss when the task is short — the postMessage round-trip and clone overhead can exceed the work itself.
For repeated small jobs: maintain a worker pool rather than spawning per job.
The worker pool
A pool amortises startup cost and bounds memory:
pool size = navigator.hardwareConcurrency − 1 // leave one core for main thread
Components:
- N workers, created once and reused.
- A job queue — pending work waiting for a free worker.
- A dispatcher — on job arrival: if a worker is free, send it; otherwise enqueue.
The pool pattern eliminates repeated startup cost and bounds total worker memory.
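The three components above can be sketched as follows. This is a minimal illustration, not a production pool: `WorkerLike` is a hypothetical stand-in interface so the sketch runs outside a browser (in a page, `createWorker` would wrap `new Worker(...)`), and error handling and timeouts are left out.

```typescript
// Hypothetical minimal worker shape; a thin wrapper around a browser Worker can provide it.
interface WorkerLike {
  postMessage(message: unknown): void;
  onmessage: ((event: { data: unknown }) => void) | null;
  terminate(): void;
}

interface Job {
  payload: unknown;
  resolve: (result: unknown) => void;
}

class WorkerPool {
  private all: WorkerLike[] = [];
  private idle: WorkerLike[] = [];
  private queue: Job[] = [];

  constructor(size: number, createWorker: () => WorkerLike) {
    // N workers, created once and reused for every job.
    for (let i = 0; i < size; i++) {
      const worker = createWorker();
      this.all.push(worker);
      this.idle.push(worker);
    }
  }

  // Dispatcher: send to a free worker if there is one, otherwise enqueue.
  run(payload: unknown): Promise<unknown> {
    return new Promise((resolve) => {
      const job: Job = { payload, resolve };
      const worker = this.idle.pop();
      if (worker) this.dispatch(worker, job);
      else this.queue.push(job);
    });
  }

  get pending(): number {
    return this.queue.length;
  }

  get idleCount(): number {
    return this.idle.length;
  }

  // Terminate every worker when the pool itself is no longer needed.
  dispose(): void {
    for (const worker of this.all) worker.terminate();
    this.all = [];
    this.idle = [];
  }

  private dispatch(worker: WorkerLike, job: Job): void {
    worker.onmessage = (event) => {
      job.resolve(event.data);
      // Hand the freed worker the next queued job, or park it as idle.
      const next = this.queue.shift();
      if (next) this.dispatch(worker, next);
      else this.idle.push(worker);
    };
    worker.postMessage(job.payload);
  }
}
```

Because workers are only created in the constructor, fifty rapid clicks produce fifty queued jobs but never more than `size` workers.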
Backpressure. Without a queue cap, the queue grows unboundedly under load. Two options:
- Drop the oldest enqueued jobs.
- Reject new jobs (return a rejected promise) so callers can back off.
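Both overflow options can be expressed as a policy on a capped queue. The sketch below is one possible shape; `BoundedQueue`, `OverflowPolicy`, and `offer` are illustrative names, not part of any library. `offer` returns whichever job lost its place, so the pool can reject that job's promise and let the caller back off.

```typescript
type OverflowPolicy = "drop-oldest" | "reject-new";

class BoundedQueue<T> {
  private items: T[] = [];
  private cap: number;
  private policy: OverflowPolicy;

  constructor(cap: number, policy: OverflowPolicy) {
    this.cap = cap;
    this.policy = policy;
  }

  // Returns the item discarded to honour the cap, or undefined if none was.
  // Under "reject-new" the discarded item is the one just offered.
  offer(item: T): T | undefined {
    if (this.items.length < this.cap) {
      this.items.push(item);
      return undefined;
    }
    if (this.policy === "reject-new") return item;
    const dropped = this.items.shift();
    this.items.push(item);
    return dropped;
  }

  take(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}
```

Dropping the oldest suits previews, where only the latest request matters; rejecting new jobs suits work that must not be silently lost.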
Priority routing. Not all jobs are equal — a filter preview the user is waiting for right now outranks background thumbnail generation. Use a priority-tagged queue: interactive jobs skip background ones.
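A priority-tagged queue can be as simple as two FIFO lanes. The following is a sketch under that assumption; the lane names and the `PriorityJobQueue` class are hypothetical.

```typescript
type Priority = "interactive" | "background";

class PriorityJobQueue<T> {
  private interactive: T[] = [];
  private background: T[] = [];

  enqueue(job: T, priority: Priority): void {
    if (priority === "interactive") this.interactive.push(job);
    else this.background.push(job);
  }

  // Interactive jobs always dequeue first; order is FIFO within each lane.
  dequeue(): T | undefined {
    if (this.interactive.length > 0) return this.interactive.shift();
    return this.background.shift();
  }

  get size(): number {
    return this.interactive.length + this.background.length;
  }
}
```

With strict two-lane ordering, a sustained stream of interactive jobs can starve background work indefinitely; whether that is acceptable depends on the application.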
Cancellation. If the user scrolls away before a job completes, cancel it. For a queued job: remove it. For an in-progress job: the clean way is an atomic cancellation flag in a SharedArrayBuffer — the worker checks it periodically and exits early, keeping itself alive for the next job. The blunt way is worker.terminate() — but that destroys the worker, forcing a new spawn for the next job (back to paying startup cost).
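The cooperative flag approach might look like the sketch below. It assumes the page is cross-origin isolated (browsers only expose `SharedArrayBuffer` in that case) and that the job can be split into chunks; the function names and the summing workload are placeholders for real pixel work.

```typescript
// One shared Int32 slot: 0 = keep running, 1 = cancel requested.
function createCancelFlag(): Int32Array {
  return new Int32Array(new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT));
}

// Main-thread side: flip the flag; the worker notices at its next check.
function requestCancel(flag: Int32Array): void {
  Atomics.store(flag, 0, 1);
}

// Reset before reusing the same worker for the next job.
function resetCancel(flag: Int32Array): void {
  Atomics.store(flag, 0, 0);
}

// Worker side: process in chunks and poll the flag between chunks.
function sumWithCancellation(
  data: Float64Array,
  flag: Int32Array,
  chunkSize: number = 1024
): { done: boolean; sum: number } {
  let sum = 0;
  for (let start = 0; start < data.length; start += chunkSize) {
    if (Atomics.load(flag, 0) === 1) {
      // Exit early but stay alive: the worker is still usable.
      return { done: false, sum };
    }
    const end = Math.min(start + chunkSize, data.length);
    for (let i = start; i < end; i++) sum += data[i];
  }
  return { done: true, sum };
}
```

The chunk size sets the cancellation latency: smaller chunks react faster but poll the flag more often.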
- Pool size rule of thumb: hardwareConcurrency − 1
- Worker startup: 5–20 ms + script parse
- postMessage task-hop latency: ~1 ms each direction
- Empty worker memory: a few MB
- Worker + large library: tens of MB
- Leaked worker (forgotten terminate): persists until tab close
The DOM-in-a-worker mistake
The single most common worker mistake is architectural: teams reach for a worker to “speed up rendering” and discover the worker cannot touch the DOM at all. A worker can compute what should render but never render it — the result must be posted back and applied by the main thread.
If the bottleneck is DOM mutation itself (10 000 nodes inserted, a giant React reconciliation), a worker does not help — the expensive part has to happen on the main thread regardless. Workers help when the bottleneck is pure computation that produces a small result:
- Parse 5 MB JSON (worker) → post back 200-row array (cheap) → main thread renders 200 rows (fast). ✓
- React reconciliation of 10 000 nodes (bottleneck is the commit, not derivation) — worker cannot help. ✗
The exception: OffscreenCanvas. Canvas rendering can be done from a worker. Transfer an OffscreenCanvas and the worker draws 2D or WebGL entirely off the main thread.
Comlink and the RPC illusion
Comlink makes await worker.heavyCompute(data) look like a local call by wrapping a worker in a Proxy. This is ergonomic, but the abstraction hides two costs that still matter:
- Every argument is structured-cloned unless explicitly wrapped with Comlink.transfer.
- Every call is a task hop each way: a round-trip message between threads.
The illusion breaks for chatty interfaces — a worker API with many small methods called in a tight loop pays a task hop per call and serialises the program on the round-trips. Design worker interfaces coarse: one call that does a batch of work and returns a batch of results, not many fine-grained calls. Same principle as network API design: minimise round-trips, maximise work per round-trip.
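The difference between a chatty and a coarse interface can be made concrete by counting round trips. The sketch below does not import Comlink; `CountingRemote` is a hypothetical stand-in that charges one round trip per method call, which is the cost a proxied worker method would incur. The brightening operation is an arbitrary placeholder.

```typescript
// Stand-in for a proxied worker API: each call costs one round trip.
class CountingRemote {
  roundTrips = 0;

  // Fine-grained method: one value per call.
  async brighten(value: number): Promise<number> {
    this.roundTrips++;
    return Math.min(255, value + 10);
  }

  // Coarse method: a whole batch per call.
  async brightenAll(values: number[]): Promise<number[]> {
    this.roundTrips++;
    return values.map((v) => Math.min(255, v + 10));
  }
}

// Chatty: N values cost N sequential round trips.
async function brightenChatty(remote: CountingRemote, values: number[]): Promise<number[]> {
  const out: number[] = [];
  for (const v of values) out.push(await remote.brighten(v));
  return out;
}

// Coarse: N values cost one round trip.
async function brightenBatched(remote: CountingRemote, values: number[]): Promise<number[]> {
  return remote.brightenAll(values);
}
```

At roughly 1 ms per hop in each direction, a 10 000-element loop over the chatty version spends seconds on messaging alone, while the batched version pays for a single round trip plus one clone of the array.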
Production observability
Each worker and each service worker is a separate context in DevTools. Web workers appear in the Sources panel thread list. Service workers have a dedicated panel under Application → Service Workers.
Telemetry across threads:
- Instrument both sides of every postMessage with timestamps. Measure real task-hop latency and clone cost in production — local dev on a fast machine systematically understates both.
- Track service worker fetch-handler duration. A slow handler delays every navigation and resource load on the page, and because it runs before the main thread sees the response, a regression there is invisible to ordinary main-thread profiling.
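Timestamping both sides of a message could be sketched as below. The envelope shape and helper names are assumptions for illustration. The clock is injected because `performance.now()` is measured from each thread's own time origin; in a browser, passing `() => performance.timeOrigin + performance.now()` on both sides gives values that can be compared across threads.

```typescript
// Clock returning milliseconds on a timeline shared by both threads.
type Clock = () => number;

interface Stamped<T> {
  payload: T;
  sentAt: number;
}

// Sender side: wrap the payload with the send time.
function stamp<T>(payload: T, now: Clock): Stamped<T> {
  return { payload, sentAt: now() };
}

// Receiver side: call first thing in the message handler.
// The result covers queueing, the task hop, and the structured clone.
function hopLatencyMs<T>(message: Stamped<T>, now: Clock): number {
  return now() - message.sentAt;
}

// Collects samples so telemetry can report a percentile instead of every hop.
class LatencySamples {
  private samples: number[] = [];

  record(ms: number): void {
    this.samples.push(ms);
  }

  percentile(p: number): number | undefined {
    if (this.samples.length === 0) return undefined;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[index];
  }
}
```

Stamping the reply the same way on the worker side yields the return-hop latency, and subtracting both hops from the total round trip leaves the compute time.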
Worker leak detection:
- A component that creates a worker in useEffect without terminating it in the cleanup function leaks a worker on every remount. After a dozen navigations, you have accumulated a dozen idle workers that nobody created intentionally.
- Profile with DevTools → Performance → Threads to see all active workers. Unexpected idle threads are leaks.
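One way to make the cleanup hard to forget is to put creation and termination in a single helper shaped like an effect body. The sketch below is framework-free so it runs anywhere; `trackedWorkerEffect` and `liveWorkerCount` are hypothetical names, and the registry is an assumption added so a leak shows up as a number that can be reported.

```typescript
// Anything with terminate(); a browser Worker fits this shape.
interface Terminable {
  terminate(): void;
}

// Registry of workers that have been created but not yet terminated.
const liveWorkers = new Set<Terminable>();

// Shaped like a React effect body: does the setup, returns the cleanup.
// Hypothetical usage inside a component:
//   useEffect(() => trackedWorkerEffect(() => new Worker(url)), [url]);
function trackedWorkerEffect(create: () => Terminable): () => void {
  const worker = create();
  liveWorkers.add(worker);
  return () => {
    worker.terminate();
    liveWorkers.delete(worker);
  };
}

// Report this from telemetry; a value that grows with navigation is a leak.
function liveWorkerCount(): number {
  return liveWorkers.size;
}
```

If the count after N route changes equals N rather than staying constant, some component is discarding the cleanup function.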
Design the threading architecture for a browser-based video editor: a 4K timeline, real-time filter previews, and export. It must stay at 60 fps during scrubbing and never freeze the UI.
- Main thread reserved for DOM, input, and the timeline UI only.
- Filter previews must update within 100 ms of a parameter change.
- Export of a multi-minute clip must not block the UI and must show progress.
- Large frame buffers must cross threads without per-frame clone cost.
- The app must load instantly on repeat visits and survive a page reload mid-edit.
- Multithreaded WASM is used for the codec.
- Reserve the main thread for DOM/input; all pixel work goes to workers.
- Move frame buffers as transferables or via SharedArrayBuffer — never by-value clone.
- OffscreenCanvas lets canvas rendering itself run off the main thread.
- Multithreaded WASM needs cross-origin isolation: COOP + COEP, with every cross-origin asset serving CORP.
- A service worker gives instant repeat loads; IndexedDB checkpoints survive a reload.
Why this works
Why is navigator.hardwareConcurrency − 1 the pool size rule? Using all N cores for workers starves the main thread, where rendering, input, and your JS all run. Leaving one core free keeps 60 fps animation and input handling smooth while the worker pool runs at full utilisation. On a device with 2 cores the pool is 1 worker; on an 8-core machine it is 7. On a device that reports a single core the formula gives 0, so clamp the result to at least 1. This is the same reasoning as leaving one CPU for the OS scheduler in server deployments. The − 1 is a heuristic, not a law: workloads with very short tasks may benefit from a smaller pool (less contention per core), and workloads with I/O-bound workers may benefit from a larger one. Profile first.
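As a small sketch, the rule can be written as a helper that also guards the edge cases. The fallback of 2 cores when the value is unavailable is an assumption of this example, not a platform default.

```typescript
// Pool size heuristic: one core fewer than reported, but never fewer than one worker.
function poolSize(hardwareConcurrency: number | undefined): number {
  // Assumption: treat an unreported core count as 2.
  const cores = hardwareConcurrency ?? 2;
  return Math.max(1, cores - 1);
}

// In a browser: const size = poolSize(navigator.hardwareConcurrency);
```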
- 01 A teammate proposes moving a slow React re-render into a web worker to fix jank. Explain why this will not work and what actually will.
- 02 What is the Comlink task-hop problem and how do you design around it?
- 03 How do you detect and prevent worker leaks in a React application?
Worker pools amortise the 5–20 ms startup cost and bound memory — size the pool to navigator.hardwareConcurrency − 1. Add backpressure (cap the queue, reject or drop when full) and priority routing (interactive jobs before background). Comlink removes postMessage boilerplate but hides clone and task-hop costs — keep worker APIs coarse to minimise round-trips. Workers cannot help DOM mutation — only pure computation. In production, instrument postMessage timestamps on both sides to measure real latency; track service worker fetch-handler duration as it is invisible to main-thread profiling; watch for worker leaks in SPA components.