
Composing the cache stack: one coherent strategy across CDN, proxy, Redis, and the DB

A multi-layer cache is only as correct as the way its layers compose. This capstone is the senior framework: which layer owns which data, how TTLs cascade, where a purge propagates, and how each layer fails open under origin loss.

On 25 December 2015, during a Christmas Day DoS attack, Valve deployed a caching config that “incorrectly cached web traffic for authenticated users.” For about an hour and a half, Steam Store pages built for one logged-in user were cached at the edge and served to others: billing addresses, purchase history, the last two digits of a credit card, the last four of a Steam Guard phone number, email addresses. Around 34,000 users were exposed before Valve took the store down. The application was correct. The bug lived in the seam between layers: an authenticated page that never said private, sitting in a shared cache that assumed it could store whatever looked cacheable.

Every layer is a cache; the question is who owns what

By the time a byte reaches a user it may have passed through four caches: the CDN edge, a reverse proxy (Varnish/nginx) in front of the app, an application cache (Redis), and the database’s own buffer/query cache. Each one is a correctness boundary, not just a speed trick. The senior move is to assign ownership before tuning anything: decide which layer is the source of truth for each kind of data, and let the others hold copies only with rules that respect that source.

The clean division most teams converge on:

CDN edge owns shared, public, cacheable-by-URL responses: static assets, anonymous HTML, public API reads. It must never hold anything user-specific.
Reverse proxy owns origin-shielding: collapsing duplicate requests (request coalescing) so a cold cache doesn’t dogpile the app, plus a short cache of hot public responses close to origin.
Redis owns computed application state: the expensive query result, the rendered fragment, the session — keyed by something the app controls, so the app can invalidate it precisely.
DB owns truth. Its query/buffer cache accelerates reads but is invalidated by writes automatically; you rarely tune it, you respect it.

Get ownership wrong and no TTL saves you. The Steam failure was an ownership violation: a personalized response (Redis/origin’s job) ended up owned by a shared CDN cache that had no way to know it was personalized.
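The request coalescing assigned to the reverse proxy above can be sketched in a few lines. This is a minimal, in-process illustration, not how Varnish or nginx implement it; the names SingleFlight, do, and load are hypothetical. Concurrent callers asking for the same key wait on one in-flight load instead of each hitting the app.

```python
import threading
from typing import Callable, Dict, Optional


class _Call:
    """State shared by the leader and the followers of one in-flight load."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: object = None
        self.error: Optional[Exception] = None


class SingleFlight:
    """Collapse concurrent loads of the same key into a single origin call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, load: Callable[[], object]) -> object:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call

        if leader:
            # Only the first caller for this key talks to the origin.
            try:
                call.result = load()
            except Exception as exc:  # followers see the same failure
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            # Everyone else waits for the leader's answer.
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.result
```

The sketch coalesces only while a load is in flight; it stores nothing afterwards, so it would sit in front of a cache rather than replace one.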

TTLs must cascade downward, not upward

The single most common composition bug: an outer layer holds a longer TTL than an inner one. If the CDN’s s-maxage=3600 but the app revalidates its Redis fragment every 60s, then for up to an hour the CDN serves a version the app already considers dead. Invalidation at the inner layer is invisible to the outer layer — the edge keeps shipping the corpse.

The rule: freshness windows should shrink as you move away from the origin, or you must purge the outer layer explicitly on every inner change. In Cache-Control: public, s-maxage=60, stale-while-revalidate=600, the shared cache (CDN) reads s-maxage. max-age applies to every cache, but s-maxage overrides it in shared caches, so when both are present max-age is effectively the browser's number. They are deliberately different numbers for different layers. Mixing them up (giving the browser a year and the CDN a minute, when you meant the reverse) is how a deploy goes live everywhere except the one cache users actually hit.
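One way to make the cascade rule checkable is a small lint over the configured TTLs. The sketch below is an assumption-laden illustration: Layer, purged_on_write, and cascade_violations are hypothetical names, and the layers are listed from the one nearest the origin outward. It flags any outer layer whose TTL exceeds the tightest TTL inside it, unless that layer is explicitly purged on write.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    name: str
    ttl_seconds: int
    purged_on_write: bool = False  # True if the write path purges this layer


def cascade_violations(layers_inner_to_outer: List[Layer]) -> List[str]:
    """Return names of outer layers that can outlive an inner invalidation."""
    if not layers_inner_to_outer:
        return []
    violations: List[str] = []
    tightest = layers_inner_to_outer[0].ttl_seconds
    for outer in layers_inner_to_outer[1:]:
        # A longer outer TTL is only safe when a purge covers the gap.
        if outer.ttl_seconds > tightest and not outer.purged_on_write:
            violations.append(outer.name)
        tightest = min(tightest, outer.ttl_seconds)
    return violations


stack = [Layer("redis", 60), Layer("proxy", 30), Layer("cdn", 3600)]
print(cascade_violations(stack))  # the CDN's hour-long TTL is flagged
```

Marking the CDN layer with purged_on_write=True clears the violation, which mirrors the second half of the rule: a long outer TTL is acceptable only when every inner change purges it.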

Resource	CDN (shared)	Browser (private)	Invalidation
Hashed JS/CSS asset	`public, max-age=31536000, immutable`	same — 1 year	New hash = new URL. Never purge.
Anonymous marketing HTML	`s-maxage=300, stale-while-revalidate=86400`	`max-age=0, must-revalidate`	Tag-based purge on publish.
Logged-in dashboard HTML	`private, no-store` (CDN must skip)	`private, no-cache`	Never enters a shared cache.
Public API read (price list)	`s-maxage=30, stale-if-error=3600`	`no-cache`	Surrogate-key purge on price change.
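To see how the same header yields different lifetimes per layer, the rows above can be fed through a simplified freshness calculation. This is a sketch only: parse_cache_control and freshness_seconds are hypothetical helpers, they ignore heuristic freshness, Expires, and Age, and they assume well-formed numeric values.

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into lowercase directive -> value."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip()] = value.strip().strip('"') or None
    return directives


def freshness_seconds(header: str, shared: bool) -> int:
    """Seconds a cache may serve the response without revalidating.

    shared=True models a CDN or proxy, shared=False models the browser.
    """
    d = parse_cache_control(header)
    if "no-store" in d or "no-cache" in d:
        return 0  # never stored, or stored but always revalidated
    if shared and "private" in d:
        return 0  # shared caches must not store private responses
    if shared and d.get("s-maxage") is not None:
        return int(d["s-maxage"])  # s-maxage overrides max-age when shared
    if d.get("max-age") is not None:
        return int(d["max-age"])
    return 0  # simplification: no explicit lifetime is treated as not fresh


print(freshness_seconds("s-maxage=300, stale-while-revalidate=86400", shared=True))
print(freshness_seconds("max-age=0, must-revalidate", shared=False))
```

Under these assumptions the marketing HTML row is fresh for 300 seconds at the CDN and 0 seconds in the browser, while the hashed asset row is fresh for a year in both.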

`private` is the seam where data leaks

The Steam lesson is that shared caches default to caching cacheable-looking GETs, and a 200 with no Cache-Control looks cacheable. The protection for personalized content is one token: private tells shared caches “this is for one user, do not store me” while still letting the browser cache it. For genuinely sensitive responses you escalate to no-store (nobody caches, anywhere). The dangerous middle ground is a logged-in page that omits the directive and trusts that the app delivers it straight to the client — true until a CDN, a proxy, or a misconfigured surrogate-key rule sits in the path. Developers rarely harden GET endpoints for intermediary caching because they assume direct-to-client delivery; that assumption is exactly what an edge config change can revoke without telling them.
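One defensive pattern is to stop relying on each handler to remember the directive and harden the header in one place. The following is a minimal sketch, assuming a hypothetical hook named harden_cache_control that runs on every response and already knows whether the request was authenticated; it is not tied to any particular framework.

```python
from typing import Optional


def harden_cache_control(
    existing: Optional[str], authenticated: bool, sensitive: bool = False
) -> Optional[str]:
    """Return a Cache-Control value that is safe behind a shared cache.

    Anonymous responses pass through untouched. Authenticated responses
    keep an explicit private/no-store policy if the handler set one, and
    otherwise get a conservative default.
    """
    if not authenticated:
        return existing
    if sensitive:
        return "no-store"  # nobody caches, anywhere
    names = {
        part.strip().lower().split("=")[0]
        for part in (existing or "").split(",")
        if part.strip()
    }
    if "private" in names or "no-store" in names:
        return existing  # the handler already scoped the response
    # Missing or public header on a per-user page: override it.
    return "private, no-cache"


print(harden_cache_control(None, authenticated=True))
print(harden_cache_control("public, max-age=300", authenticated=True))
```

The default here matches the browser-side policy for logged-in HTML in the table above; a stricter deployment might prefer no-store for every authenticated response.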

Why this works

Why not just cache everything and add Vary: Cookie? Because Vary: Cookie makes the cache key include the entire cookie, so every distinct session is a separate cache entry — hit rate collapses toward zero and you’ve paid CDN cost for a private cache. Worse, one missed cookie normalization and two users share a key. For per-user content the correct answer is almost always private/no-store, not a clever Vary.
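The hit-rate collapse is easy to reproduce with a toy model. The sketch below assumes a cache keyed on the URL plus the values of the request headers named in Vary; cache_key and hit_rate are hypothetical helpers and the cookie values are made up.

```python
from typing import Dict, List, Tuple

Request = Tuple[str, Dict[str, str]]  # (url, lowercase request headers)


def cache_key(url: str, headers: Dict[str, str], vary: Tuple[str, ...]) -> Tuple[str, ...]:
    """Build the cache key: the URL plus each header value listed in Vary."""
    return (url,) + tuple(headers.get(name.lower(), "") for name in vary)


def hit_rate(requests: List[Request], vary: Tuple[str, ...]) -> float:
    """Fraction of requests answered from cache, ignoring expiry."""
    seen = set()
    hits = 0
    for url, headers in requests:
        key = cache_key(url, headers, vary)
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / len(requests)


# Four users fetch the same public page, each with their own session cookie.
requests = [("/pricing", {"cookie": f"session=user{i}"}) for i in range(4)]

print(hit_rate(requests, vary=()))           # 0.75: one miss, then three hits
print(hit_rate(requests, vary=("cookie",)))  # 0.0: every session is its own entry
```

The page is identical for all four users, yet keying on the cookie turns every request into a miss.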

Invalidation is a propagation problem, not an event

When the source of truth changes, the change has to walk outward through every layer that holds a copy — and each layer invalidates differently. Flushing Redis does nothing to the CDN; purging the CDN does nothing to the reverse proxy still holding a stale object. Three mechanisms, in order of preference:

Versioned/immutable URLs for static assets: a new content hash is a new URL, so there is nothing to invalidate — old and new coexist and the old simply ages out. This is why hashed bundles ship immutable, max-age=31536000.
Tag / surrogate-key purge for dynamic content: tag every response with keys (product:42, category:shoes), then one purge call drops every cached response carrying that tag across the edge, regardless of URL. Cloudflare’s instant purge propagates globally in under 150ms; tag-based and broad purges typically take seconds to a few minutes to reach every edge node.
TTL + stale-while-revalidate as the eventual-consistency floor: even with no purge, the data self-heals within its freshness window.
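The tag-based mechanism in the list above boils down to a reverse index from tag to cached URLs. This is an in-memory sketch of the idea, not any CDN's API; TaggedCache, put, and purge_tag are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class TaggedCache:
    """Toy cache that can drop every object carrying a given tag."""

    def __init__(self) -> None:
        self._objects: Dict[str, str] = {}
        self._urls_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def put(self, url: str, body: str, tags: Iterable[str]) -> None:
        self._objects[url] = body
        for tag in tags:
            self._urls_by_tag[tag].add(url)

    def get(self, url: str) -> Optional[str]:
        return self._objects.get(url)

    def purge_tag(self, tag: str) -> int:
        """Remove every cached object tagged with `tag`; return how many."""
        removed = 0
        for url in self._urls_by_tag.pop(tag, set()):
            if url in self._objects:
                del self._objects[url]
                removed += 1
        return removed


cache = TaggedCache()
cache.put("/p/42", "<shoe page>", ["product:42", "category:shoes"])
cache.put("/c/shoes", "<listing>", ["category:shoes"])
cache.put("/p/7", "<hat page>", ["product:7"])

print(cache.purge_tag("product:42"))  # 1: only /p/42 carried that tag
print(cache.get("/c/shoes"))          # still cached
```

The caller never needs to know which URLs rendered product 42; the tag is the only handle the write path has to carry.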

Together these three mechanisms form a defence in depth, and the preference order follows propagation speed: use versioning to eliminate invalidation entirely where you can, purge to push a change outward when you can't (under 150ms for an instant purge, seconds to a few minutes for tag-based and broad purges), and TTL/SWR as the eventual-consistency floor for anything that slips through. Skip the purge step and your users read content the database already killed, and you won't know until someone files a bug.

The senior design wires the purge into the write path: the mutation that changes truth also enqueues the purge for every outer layer, in order. Skip a layer and you’ve built a system that is correct in the database and wrong at the edge — the hardest class of bug to reproduce, because it only shows up on the cached request.
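A write path wired this way might look like the sketch below. It runs the purges inline rather than through a queue to stay short, and every name in it (write_and_purge, the layer labels, the purge callables) is a hypothetical stand-in for whatever clients the real stack uses.

```python
from typing import Callable, List, Tuple

Purge = Callable[[str], None]


def write_and_purge(
    commit: Callable[[], None],
    purges_inner_to_outer: List[Tuple[str, Purge]],
    tag: str,
) -> List[str]:
    """Commit the change, then purge each layer from the inside out.

    Returns the names of layers whose purge failed so the caller can
    retry them; a failed layer must not stop the layers after it.
    """
    commit()  # truth changes first; a failed commit purges nothing
    failed: List[str] = []
    for name, purge in purges_inner_to_outer:
        try:
            purge(tag)
        except Exception:
            failed.append(name)
    return failed


log: List[str] = []
failed = write_and_purge(
    commit=lambda: log.append("db commit"),
    purges_inner_to_outer=[
        ("redis", lambda tag: log.append(f"redis del {tag}")),
        ("proxy", lambda tag: log.append(f"proxy purge {tag}")),
        ("cdn", lambda tag: log.append(f"cdn purge {tag}")),
    ],
    tag="product:42",
)
print(log)
print(failed)
```

Returning the failed layers instead of raising keeps a flaky edge API from hiding the fact that Redis and the proxy were purged, and gives a retry job an explicit list to work from.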

Fail open: stale beats down

Composition has to survive the origin disappearing. When you set up a new layer, ask yourself: what does a user see if origin disappears right now? stale-if-error lets each shared layer keep serving its last good copy when origin returns 5xx or is unreachable — s-maxage=60, stale-if-error=86400 means “fresh for a minute, but rather than 503 the user, serve up-to-a-day-old content while origin is down.” This turns an origin outage into a soft degradation instead of a wall of errors. The tradeoff a senior weighs: how stale is acceptable per resource. A price list served an hour stale during an outage is fine; a stock-availability flag served stale can oversell. Set stale-if-error long for things that tolerate staleness, short or zero for things that must be correct or absent.
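The decision a shared cache makes under stale-if-error can be sketched as follows. This is a simplified model with hypothetical names (Entry, OriginError, serve); it treats the stale-if-error window as starting when the copy stops being fresh, and it leaves out revalidation and storing the refreshed copy.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class OriginError(Exception):
    """Origin returned 5xx or could not be reached."""


@dataclass
class Entry:
    body: str
    stored_at: float  # seconds, same clock as `now`


def serve(
    entry: Optional[Entry],
    now: float,
    fetch: Callable[[], str],
    s_maxage: int,
    stale_if_error: int,
) -> Tuple[str, str]:
    """Return (body, source) for one request hitting this cache layer."""
    age = None if entry is None else now - entry.stored_at
    if entry is not None and age < s_maxage:
        return entry.body, "fresh"
    try:
        return fetch(), "origin"
    except OriginError:
        # Fail open: a stale copy beats an error page, within the window.
        if entry is not None and age < s_maxage + stale_if_error:
            return entry.body, "stale-if-error"
        raise


def origin_down() -> str:
    raise OriginError("503 from origin")


cached = Entry(body="price list v1", stored_at=0.0)
print(serve(cached, now=120.0, fetch=origin_down, s_maxage=60, stale_if_error=86400))
```

Setting stale_if_error to 0 for a resource such as a stock-availability flag makes the same function raise once the copy is stale, which is the "correct or absent" behaviour described above.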

Pick the best fit

A logged-in dashboard page renders per-user data and sits behind a CDN. Pick the caching policy for that HTML response.

Quiz

Your CDN has s-maxage=3600 on a page, but the app revalidates its Redis fragment for that page every 60s. After a content edit, what does a user see?

Quiz

You want a hashed asset bundle (app.a1b2c3.js) to be cached as aggressively as possible. What's the right policy?

Order the steps

A product is edited. Order how a correct invalidation propagates outward through the stack:

1 The write commits to the DB — the source of truth changes; its query cache invalidates automatically
2 The write path deletes/updates the affected Redis keys (the computed fragment, the cached query)
3 The same path purges the reverse proxy's stale object for that resource
4 It issues a tag/surrogate-key purge to the CDN (product:42), dropping every edge copy
5 Browsers self-heal on their next revalidate; stale-while-revalidate covers the gap

Reads fall inward to truth on each miss. Freshness must shrink outward, and a write-path purge must walk back out through Redis, the proxy, and the CDN, or the edge serves a copy the DB already killed.

Recall before you leave

01
A page is correct in the database and correct in Redis, but users still see stale content. Walk through where the bug is and how the layers should have been wired.
02
Why is a missing `private` directive on an authenticated page a security bug and not just a cache-hit-rate issue, and what's the correct policy spectrum?

Recap

A cache stack is correct only when its layers compose. Start by assigning ownership: the CDN holds shared public responses, the reverse proxy shields origin and coalesces requests, Redis holds computed application state under app-controlled keys, and the DB owns truth. TTLs must shrink as you move outward, or every inner invalidation needs an explicit outer purge — otherwise the edge keeps serving content the app already killed, the bug that only appears on the cached request. Scope per-user responses with private (or no-store when sensitive), because shared caches store an un-annotated 200 by default — the seam where authenticated data leaks, as the Steam 2015 incident showed. Wire invalidation into the write path so a mutation propagates outward through Redis, proxy, and a tag-based CDN purge in order; prefer versioned/immutable URLs so static assets never need purging at all. Finally, fail open: stale-if-error lets each layer serve a last-good copy through an origin outage, turning 5xx walls into soft degradation — tuned long for resources that tolerate staleness and short for those that must be correct or absent. Now when you see a page that is correct in the database but wrong at the edge, you know exactly which layer owns the stale copy and which write-path step failed to purge it.
