Service worker edge cases: version skew, durability, and navigation traps

Crux: version skew from assets without content hashes, the kill-and-restart durability trap, and why a broken navigation-intercepting service worker is a stop-the-deploy incident.

You ship a service worker that serves the app shell cache-first. A week later you push a bug fix. Some users never get it — they keep hitting the broken cached shell on every reload, and there is no recovery button for an ordinary user.

The update-and-version-skew problem

A service worker's cache reflects the deploy that populated it. When you deploy version N+1, an open page may still be controlled by version N's worker while N+1's HTML has shipped, or the reverse. If the worker serves app.js cache-first from version N while the HTML expects version N+1's app.js, modules from two builds mix and the page fails at runtime.

The robust pattern:

  1. Content-hash every asset filename (app.4f3a1c.js). Old and new assets coexist in the cache with no collision.
  2. Version-tag cache names (cache-v3, cache-v4). Pre-cache each deploy’s full asset set under the version tag.
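  3. In activate, delete only stale caches: caches whose version tag is not the current one. In the default lifecycle a new worker activates only once no page is controlled by the old one, so the deletion cannot remove assets from under a running page. Calling clients.claim() first does not add that guarantee; if you force early activation with skipWaiting(), pages loaded under the old version may still request old hashed assets, so keep those files available on the server.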
  4. Serve navigation requests network-first (or a dedicated app-shell route), so users always land on HTML consistent with the active worker.
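
The cache-cleanup step of the pattern above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names CACHE_PREFIX, CURRENT_CACHE, staleCacheNames and deleteStaleCaches are hypothetical, and the CacheStorageLike interface is a stand-in declared only so the helpers can run outside a worker (the browser's `caches` object has the same two methods).

```typescript
// Structural stand-in for the browser's CacheStorage (`caches`), declared here
// so the helpers also run outside a worker.
interface CacheStorageLike {
  keys(): Promise<string[]>;
  delete(name: string): Promise<boolean>;
}

// Hypothetical naming scheme: "<prefix><version tag>", bumped on each deploy.
const CACHE_PREFIX = "app-cache-";
const CURRENT_CACHE = CACHE_PREFIX + "v4";

// Pure selection step: our caches whose version tag is not the current one.
// Caches created by other code on the same origin are left alone.
function staleCacheNames(
  names: string[],
  current: string = CURRENT_CACHE,
  prefix: string = CACHE_PREFIX,
): string[] {
  return names.filter((name) => name.startsWith(prefix) && name !== current);
}

// Deletes the stale caches and returns the names it removed.
async function deleteStaleCaches(storage: CacheStorageLike): Promise<string[]> {
  const stale = staleCacheNames(await storage.keys());
  await Promise.all(stale.map((name) => storage.delete(name)));
  return stale;
}

// Wiring inside the worker (shown as a comment; needs a service worker scope):
// self.addEventListener("activate", (event) => {
//   event.waitUntil(deleteStaleCaches(caches));
// });
```

Filtering by prefix is a deliberate choice in this sketch: it avoids deleting caches that some other library on the same origin may own.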

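The same class of bug applies to sw.js itself. Older browsers honored HTTP cache headers on the worker script, capped at 24 hours, so a long max-age could delay an update by up to a day. Current browsers bypass the HTTP cache for the top-level script on update checks by default (updateViaCache: 'imports'), but scripts loaded through importScripts() can still come from the HTTP cache. Serving sw.js with Cache-Control: no-cache remains the safe practice: the browser revalidates the file on every update check, which runs on navigation to an in-scope page.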
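
One way to express the header policy on the server side is a small path-to-header function. This is a sketch under stated assumptions: the function name cacheControlFor, the hash format in HASHED_ASSET, the /sw.js path, and the long-lived policy for hashed assets are illustrative choices, not something the lesson prescribes.

```typescript
// Assumed hash format: a dot, 6+ lowercase hex characters, then the extension
// (matches names such as app.4f3a1c.js).
const HASHED_ASSET = /\.[0-9a-f]{6,}\.(?:js|css|woff2|png|svg)$/;

// Returns the Cache-Control value to send for a given URL path.
function cacheControlFor(path: string): string {
  // The worker script must always be revalidated.
  if (path === "/sw.js") {
    return "no-cache";
  }
  // Content-hashed files never change under the same name, so they can be
  // cached for a long time (an assumption of this sketch).
  if (HASHED_ASSET.test(path)) {
    return "public, max-age=31536000, immutable";
  }
  // Everything else (HTML, manifests) is revalidated on each request.
  return "no-cache";
}

// On the client, registration can also opt every worker script out of the
// HTTP cache (shown as a comment; needs a browser):
// navigator.serviceWorker.register("/sw.js", { updateViaCache: "none" });
```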

Service workers are not durable

The browser terminates an idle service worker aggressively, often within seconds to tens of seconds after the last event finishes (the exact timing is implementation-defined), and restarts it on the next event. Any state held in a module-level variable is gone on restart. This is a frequent bug source:

  • A counter tracking requests in flight.
  • A cache of pending promises.
  • A WebSocket connection held in a global.

All evaporate. Durable state must live in IndexedDB or the Cache API; synchronous storage such as localStorage is not available inside a service worker.
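
A minimal sketch of the difference, assuming a hypothetical KeyValueStore interface: the counter is read from and written to a store on every update instead of living in a module-level variable. MemoryStore is only a stand-in that makes the sketch runnable anywhere; in a real worker the store would be backed by IndexedDB.

```typescript
// Hypothetical async key-value interface; an IndexedDB wrapper would fit it.
interface KeyValueStore {
  get(key: string): Promise<number | undefined>;
  set(key: string, value: number): Promise<void>;
}

// In-memory stand-in so the example runs outside a browser. It represents
// storage that outlives the worker, NOT a module-level variable.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, number>();
  async get(key: string): Promise<number | undefined> {
    return this.data.get(key);
  }
  async set(key: string, value: number): Promise<void> {
    this.data.set(key, value);
  }
}

// Fragile version, for contrast: `let requestCount = 0;` at module level
// resets to 0 every time the browser restarts the worker.

// Durable version: the value is loaded and saved on each update.
// Note: this read-then-write is not atomic; with IndexedDB, do both steps
// inside a single readwrite transaction.
async function incrementCounter(store: KeyValueStore, key: string): Promise<number> {
  const next = ((await store.get(key)) ?? 0) + 1;
  await store.set(key, next);
  return next;
}
```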

Long-running work inside an event handler must be wrapped in event.waitUntil(promise), which tells the browser not to terminate the worker until the promise settles (browsers still cap the total time). Call it synchronously while the event is being dispatched. Forgetting waitUntil means the browser may terminate the worker mid-operation, and background sync, push handling, and cache population silently fail to complete.
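
The shape of a correct handler can be sketched like this. ExtendableEventLike is a stand-in for the browser's ExtendableEvent, and handlePush and persist are hypothetical names; persist represents any asynchronous work such as writing a notification payload to IndexedDB.

```typescript
// Stand-in for the browser's ExtendableEvent, reduced to the one method used.
interface ExtendableEventLike {
  waitUntil(promise: Promise<unknown>): void;
}

// Hypothetical push handler body.
function handlePush(event: ExtendableEventLike, persist: () => Promise<void>): void {
  // Correct: start the work and hand its promise to waitUntil synchronously,
  // while the event is still being dispatched.
  event.waitUntil(persist());

  // Buggy variant, for contrast: calling `persist()` on its own returns
  // immediately, and the browser is free to stop the worker mid-write.
}

// Wiring inside the worker (shown as a comment; needs a service worker scope):
// self.addEventListener("push", (event) => handlePush(event, saveToIndexedDb));
```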

Service worker durability facts

  • Idle worker kill time: seconds to tens of seconds after the last event (implementation-defined)
  • sw.js HTTP cache cap: 24 hours (older browsers honored cache headers up to this limit)
  • Recommended sw.js cache header: Cache-Control: no-cache
  • Durable state options: Cache API or IndexedDB only
  • Work that fails silently without waitUntil: push, sync, cache population

Navigation interception traps

The most powerful and most dangerous service worker pattern is intercepting navigation requests: the fetch handler catches the request for the HTML document itself and returns a cached app shell. This gives instant loads, but it creates a class of bug that is otherwise impossible.

The trap: If you ship a bug in the app shell and cache it cache-first, every repeat visit serves the broken shell from cache, bypassing the network where the fix lives. The user cannot escape with an ordinary reload.

The defence is layered:

  1. Navigation requests should be network-first with a short timeout (~3 s, fall back to cache). This ensures a fix reaches users on their first successful load.
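  2. Keep a kill switch. One form is a small versioned endpoint the worker checks at startup or on navigation (activate alone is not enough, because it fires only once per worker version); on signal, the worker calls self.registration.unregister() and deletes its caches. Another form is a replacement sw.js that does nothing except unregister itself and clear caches, which works because the browser's update check for sw.js does not pass through the worker's fetch handler. Either lets you remotely detach a broken service worker from all clients.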
  3. Never cache navigation cache-only. Always have a network path.
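
The network-first-with-timeout strategy from the list above can be sketched as a generic helper. This is an illustration under assumptions: networkFirst and its parameters are hypothetical names, the 3000 ms default mirrors the ~3 s figure above, and the two callbacks stand in for fetch(event.request) and a cache lookup.

```typescript
// Try the network first; fall back to the cache on failure or timeout.
async function networkFirst<T>(
  fromNetwork: () => Promise<T>,
  fromCache: () => Promise<T | undefined>,
  timeoutMs: number = 3000,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("network timeout")), timeoutMs);
  });
  const network = fromNetwork();
  try {
    // Whichever settles first wins: the response or the timeout.
    return await Promise.race([network, timeout]);
  } catch (err) {
    const cached = await fromCache();
    if (cached !== undefined) return cached;
    // Nothing cached: keep waiting for the network after a timeout, or
    // surface the network's own error if it already failed.
    return network;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Wiring inside the worker (shown as a comment; "/index.html" is a
// hypothetical shell URL):
// self.addEventListener("fetch", (event) => {
//   if (event.request.mode !== "navigate") return;
//   event.respondWith(
//     networkFirst(() => fetch(event.request), () => caches.match("/index.html")),
//   );
// });
```

Because the cache is consulted only after the network fails or times out, a fixed shell reaches the user on the first load where the network responds in time.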

A broken service worker shipped widely is a stop-the-deploy incident because ordinary users have no recovery button: they will not open DevTools, and they do not know to clear site data. Your only recourse is the kill switch or a fresh deploy of sw.js, which the browser picks up on its next update check.
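
A kill switch of the endpoint kind might look like the sketch below. Everything named here is an assumption made for illustration: the payload shape ({ kill, minVersion }), the /sw-kill-switch.json URL, WORKER_VERSION, and the helper names. RegistrationLike and CacheDeleter are stand-ins for self.registration and caches.

```typescript
// Stand-ins for self.registration and caches, reduced to the methods used.
interface RegistrationLike {
  unregister(): Promise<boolean>;
}
interface CacheDeleter {
  keys(): Promise<string[]>;
  delete(name: string): Promise<boolean>;
}

// Decide from the (hypothetical) kill-switch payload whether this worker
// should remove itself. Anything malformed means "do nothing".
function shouldSelfDestruct(payload: unknown, workerVersion: number): boolean {
  if (typeof payload !== "object" || payload === null) return false;
  const p = payload as { kill?: unknown; minVersion?: unknown };
  if (p.kill === true) return true;
  return typeof p.minVersion === "number" && workerVersion < p.minVersion;
}

// Delete every cache on the origin, then unregister the worker.
async function selfDestruct(registration: RegistrationLike, storage: CacheDeleter): Promise<void> {
  const names = await storage.keys();
  await Promise.all(names.map((name) => storage.delete(name)));
  await registration.unregister();
}

// Wiring inside the worker (shown as a comment). A failed fetch should be
// caught and ignored so that offline users keep a working app:
// async function checkKillSwitch() {
//   const res = await fetch("/sw-kill-switch.json", { cache: "no-store" });
//   if (res.ok && shouldSelfDestruct(await res.json(), WORKER_VERSION)) {
//     await selfDestruct(self.registration, caches);
//   }
// }
```

Pages that are already open stay controlled by the old worker until they reload, so the effect is visible on the next navigation rather than immediately.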

Quiz

A service worker holds an in-flight request map in a module-level `const cache = new Map()`. After a few seconds of inactivity, entries vanish. Why?

Quiz

You deploy a service worker update and some users report a broken page: scripts fail to load with module-mismatch errors. What is the most likely cause and the robust fix?

Quiz

A user is stuck on a broken cached app shell and a normal reload does not fix it. What is the recovery mechanism you should have built in advance?

Why this works

Why is a broken service worker so hard to recover from? When a service worker intercepts navigation, it sits between the browser and the server for the HTML document itself: the page cannot load until the worker responds. A server-side bug is fixed for every user the moment you redeploy, but a broken service worker keeps answering successfully from the user's own device with the broken cached response, and the browser has no way to distinguish a correct cached response from a buggy one. This is why the kill switch must be built in before the incident: a URL the worker checks at startup or on navigation, whose response tells the worker whether to unregister itself. If you wait for users to report breakage, the broken worker is already installed on their devices.

Recall before you leave

  1. Why does a service worker's module-level state disappear between requests?
  2. What is the version-skew failure mode in service workers and how do you prevent it?
  3. Why is a broken navigation-intercepting service worker a stop-the-deploy incident, and what is the architectural defence?

Recap

Service workers have three major edge-case failure modes. Version skew: serving cached assets from the wrong version; prevent it with content-hashed filenames and version-tagged caches. Durability trap: module-level state evaporates between events because the browser kills idle workers; use event.waitUntil for long operations and IndexedDB or the Cache API for state. Navigation interception: caching the HTML document itself means a broken shell traps users until the worker is replaced; always use network-first for navigation and build a kill switch. All three become hard-to-reverse production incidents if deployed without the safeguards.

