CDN operations and observability

Cache-tag purge, multi-CDN steering, WAF at edge, the key metrics that catch CDN incidents before users notice, and a news-publisher design case study.

NET Senior ◷ 14 min

You shipped a hotfix at 3 AM. Origin is updated. But 20 minutes later, 40% of users are still on the old version — different CDN regions have different cache states, a stale edge in Asia-Pacific is serving yesterday’s data, and your monitoring only shows aggregate error rates that look fine. CDN incidents are invisible until they are not.

Cache-tag purge: surgical invalidation at scale

For a site with thousands of URLs (news publisher, e-commerce catalogue), URL-based purge is operationally unmanageable. Cache tags solve this (the Cache-Tag header on Cloudflare Enterprise, Surrogate-Key on Fastly):

Origin sets Cache-Tag: article-1001, category-tech on article responses.
On article edit, the CMS calls POST /cdn/purge {"tag": "article-1001"}.
CDN invalidates all cached responses carrying that tag — across all POPs, within seconds.
The category tag allows purging all articles in a category (tag: category-tech) in one API call.

Without cache tags, every edit triggers O(URLs) purge calls. With cache tags, every edit triggers O(tags) calls, typically 1–5. The difference matters most during a redesign that touches thousands of pages: without tags you write a script to enumerate URLs; with tags you fire one API call and walk away.
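The tag mechanism can be sketched as a reverse index from tag to URLs. This is a toy, single-process model written for illustration, not how any CDN implements it; the class name TaggedCache and its methods are assumptions.

```python
from collections import defaultdict


class TaggedCache:
    """Toy single-POP cache that indexes entries by cache tag."""

    def __init__(self):
        self._entries = {}                    # url -> cached body
        self._urls_by_tag = defaultdict(set)  # tag -> urls carrying it
        self._tags_by_url = {}                # url -> tags, for cleanup

    def put(self, url, body, tags):
        self.purge_url(url)  # drop any old entry and its stale tag links
        self._entries[url] = body
        self._tags_by_url[url] = set(tags)
        for tag in tags:
            self._urls_by_tag[tag].add(url)

    def get(self, url):
        return self._entries.get(url)

    def purge_url(self, url):
        """Drop one URL; return True if it was cached."""
        for tag in self._tags_by_url.pop(url, ()):
            self._urls_by_tag[tag].discard(url)
        return self._entries.pop(url, None) is not None

    def purge_tag(self, tag):
        """Drop every entry carrying `tag`; return how many were removed."""
        urls = list(self._urls_by_tag.get(tag, ()))
        for url in urls:
            self.purge_url(url)
        return len(urls)
```

One purge_tag("article-1001") call removes every variant of the article (canonical page, AMP page, API response) without the caller knowing their URLs, which is the O(tags) property described above.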

Multi-CDN traffic steering

Large operators (Apple, large streaming services, major news sites) run two or more CDNs simultaneously for:

Vendor resilience: one CDN outage does not take the site down.
Regional optimisation: one CDN may have better peering in Asia, another in Latin America.
Commercial leverage: competing CDN contracts reduce per-GB costs.

Steering mechanism: DNS-based steering (NS1 Pulsar, Cedexis Openmix, custom). These aggregate real-user-monitoring (RUM) measurements and update DNS records every few seconds to route to the best-performing CDN per region. DNS TTL: 30 s for fast steering response.
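A minimal sketch of the steering decision, assuming RUM samples arrive as (region, cdn, latency_ms) tuples and that a separate health check supplies the set of usable CDNs. The function name pick_cdn, the median scoring, and the min_samples cut-off are illustrative choices, not the algorithm any commercial steering product uses.

```python
from statistics import median


def pick_cdn(rum_samples, healthy, min_samples=3):
    """Return {region: cdn}, choosing the lowest median latency per region.

    rum_samples: iterable of (region, cdn, latency_ms) tuples.
    healthy: set of CDN names currently passing health checks.
    """
    by_key = {}
    for region, cdn, latency_ms in rum_samples:
        if cdn in healthy:
            by_key.setdefault((region, cdn), []).append(latency_ms)

    best = {}  # region -> (cdn, median latency)
    for (region, cdn), latencies in by_key.items():
        if len(latencies) < min_samples:
            continue  # too little data to trust this pair
        score = median(latencies)
        if region not in best or score < best[region][1]:
            best[region] = (cdn, score)
    return {region: cdn for region, (cdn, _) in best.items()}
```

The returned mapping is what a steering service would publish as per-region DNS answers; with a 30 s TTL, clients pick up a change within roughly that window.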

Cost: operational complexity. Purges, headers, and edge-worker code must work identically on every CDN. A purge issued to CDN A does not automatically clear CDN B — each requires its own API call.
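Because each CDN needs its own purge call, a deploy hook usually fans the purge out and reports which vendors failed. The sketch below assumes each vendor's API is wrapped in a hypothetical callable that takes a tag and raises on failure; no real vendor SDK is shown.

```python
def purge_everywhere(tag, purgers):
    """Call every CDN's purge function; return {cdn_name: error} for failures.

    purgers: {cdn_name: callable(tag)}. Each callable is a hypothetical
    wrapper around that vendor's purge API and raises on failure.
    """
    failures = {}
    for name, purge in purgers.items():
        try:
            purge(tag)
        except Exception as exc:  # keep going so one vendor cannot block the rest
            failures[name] = str(exc)
    return failures
```

A non-empty result should fail the deploy step or trigger a retry, since a CDN that missed the purge keeps serving the old content until its TTL expires.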

CDN operations key numbers

Cache-tag purge propagation time (Cloudflare, Fastly): 1–5 seconds globally
Multi-CDN DNS TTL for steering: 30 s (fast failover)
mTLS edge-to-origin: origin requires the CDN's client certificate, so a leaked origin IP cannot be used to bypass the edge
WAF OWASP Top 10 block rate (typical production): 0.1–2% of requests (adjust per app)
Healthy cache hit rate (static assets): >90%
Healthy cache hit rate (HTML pages): >70%
Origin shield offload ratio target: >90% of edge misses never reach origin
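The hit-rate and shield targets combine multiplicatively. As a rough worked example, assuming every edge miss goes through the shield, the fraction of requests that reach origin is (1 - edge hit rate) × (1 - shield offload). The function name is an illustrative assumption.

```python
def origin_fraction(edge_hit_rate, shield_offload):
    """Fraction of all requests that reach origin behind edge + shield."""
    return (1 - edge_hit_rate) * (1 - shield_offload)
```

At the 90% / 90% targets, origin sees about 1% of requests. If shield offload falls to 50% with the same edge hit rate, origin load rises to about 5%, a fivefold increase that an aggregate hit-rate graph alone would not show.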

BGP-level optimisation: Argo and Global Accelerator

Anycast picks the BGP-closest POP, not the latency-closest. On intercontinental paths, BGP “closest” and “lowest latency” diverge significantly.

Cloudflare Argo Smart Routing and AWS Global Accelerator reduce the time traffic spends on the public internet: requests enter the provider's network at a nearby edge location and travel the long-haul leg over the provider's backbone. Argo additionally chooses among paths using live latency and congestion measurements. Typical saving: 30–50% reduction in p95 latency on intercontinental paths, though the gain depends on the route and should be measured per workload. Cost: per-GB premium pricing on backbone traversal. Worth it for latency-sensitive APIs; usually overkill for static assets, which are served from edge cache and rarely make the long-haul trip.

mTLS edge-to-origin

A CDN proxy hides the origin's IP address, but attackers can still discover it through DNS history, certificate transparency logs, or misconfigured direct-access paths. With mTLS (mutual TLS), the origin accepts a connection only if the client presents the CDN's certificate, so direct-to-origin requests fail at the TLS handshake and a leaked IP can no longer be used to bypass the edge WAF and cache. Volumetric attacks aimed straight at the IP still need network-level filtering, such as allowlisting the CDN's address ranges.
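On the origin side, requiring a client certificate is a small TLS configuration change. The sketch below uses Python's standard ssl module; the function name and the three file-path parameters are placeholders, and a real deployment would more often set the equivalent options in its web server or load balancer.

```python
import ssl


def build_origin_tls_context(cdn_ca_file=None, cert_file=None, key_file=None):
    """Server-side TLS context that rejects clients without a trusted cert."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # handshake fails without a client cert
    if cdn_ca_file:
        # Trust only the CA that signs the CDN's client certificates.
        ctx.load_verify_locations(cafile=cdn_ca_file)
    if cert_file:
        # The origin's own server certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Loading only the CDN's CA matters: trusting the system CA bundle here would let any client holding a publicly issued certificate through.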

WAF and bot management at edge

CDNs sit in the request path for all traffic — making them the cheapest layer for attack defence:

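WAF (Web Application Firewall): matches request patterns against managed rule sets such as the OWASP Core Rule Set, which target the attack classes in the OWASP Top 10 (SQL injection, XSS, path traversal, command injection). Requests are blocked at the edge, typically adding under 1 ms, with no origin involvement.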
Bot management: JA3/JA4 TLS fingerprinting (fingerprint the TLS ClientHello), behavioural analysis, IP reputation to distinguish human from automated traffic. Blocks credential stuffing, scraping, and API abuse.
Rate limiting: per-IP, per-token, per-route. Configured at edge; enforced without origin round-trips.
DDoS scrubbing: volumetric attacks (L3/L4) absorbed at edge before reaching origin. Cloudflare’s Anycast network spans 330+ cities, distributing attack traffic across all POPs.
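Edge rate limiting is commonly modelled as a token bucket per key (IP, token, or route). The sketch below is one such model, not any vendor's implementation; the class name and the injectable clock parameter are assumptions made so the behaviour can be tested deterministically.

```python
import time


class TokenBucket:
    """Per-key token bucket: `rate` tokens per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._state = {}  # key -> (tokens, last_refill_time)

    def allow(self, key):
        """Return True and spend one token if `key` is under its limit."""
        now = self.clock()
        tokens, last = self._state.get(key, (self.capacity, now))
        # Refill for the elapsed time, capped at the burst capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._state[key] = (tokens - 1, now)
            return True
        self._state[key] = (tokens, now)
        return False
```

Keying by client IP gives per-IP limits; keying by (route, API token) gives per-route, per-token limits with the same code.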

103 Early Hints

RFC 8297 defines the 103 Early Hints informational response, sent before the final response (for example 200 OK). The edge can send Link: </style.css>; rel=preload in a 103 response while the origin is still generating the main HTML. The browser starts fetching critical assets before the HTML arrives, so the asset download overlaps with origin think time instead of waiting for it. As of 2026: 93% browser support, ~5% real-world adoption. Vercel leads with ~2.8%; Cloudflare and Fastly remain below 1%. Adoption friction: the edge must know which resources to hint for each page, which is hard to automate without framework support.
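On the wire, a 103 response is an ordinary status line plus Link headers, followed later by the final response on the same connection. The helper below builds the HTTP/1.1 form as raw bytes for illustration; the function name and the (path, as_type) input shape are assumptions, and over HTTP/2 the same fields travel in a HEADERS frame with :status 103.

```python
def early_hints(preloads):
    """Build a raw HTTP/1.1 103 response from (path, as_type) pairs."""
    lines = ["HTTP/1.1 103 Early Hints"]
    for path, as_type in preloads:
        lines.append(f"Link: <{path}>; rel=preload; as={as_type}")
    # An empty line ends the informational response; the final 200 follows later.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

The hard part is not the encoding but producing the preloads list per page, which is why framework support matters.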

Key observability metrics

A CDN incident often starts as a metric drift before it becomes a user complaint:

Metric	Target	Alert threshold
Cache hit rate (static assets)	>90%	<80% triggers investigation
Cache hit rate (HTML pages)	>70%	<60% triggers investigation
Origin shield offload ratio	>90%	<80% — edges may be contacting origin directly
p95 edge response time per region	<50 ms	>100 ms — regional POP issue
p99 edge response time per region	<200 ms	>500 ms — severe regional degradation
Vary-key cardinality per URL	<100	>1000 — check for Vary: User-Agent footgun
WAF block rate	0.1–2%	>5% — possible attack; <0.01% — WAF rules too loose

Export CDN logs and metrics (through log streaming or the vendor's analytics API) to Prometheus/OTel for SLO alerting. CDN-native dashboards (Cloudflare Analytics, Fastly Real-Time) are useful for deep dives but cannot correlate across CDNs.

HTML carries a deliberately lower SLO than static assets — personalization and Vary fragmentation make it inherently harder to cache, so you hold it to a different bar instead of chasing 90%.
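The alert thresholds in the table can be encoded as data and evaluated against a metrics snapshot. This is a sketch for illustration: the metric key names are assumptions, rates are expressed as fractions, and in practice these rules would live in the alerting system (for example as Prometheus alert rules) instead of application code.

```python
# (metric, comparison, threshold, message), taken from the alert-threshold column
RULES = [
    ("static_hit_rate", "lt", 0.80, "static hit rate below 80%"),
    ("html_hit_rate", "lt", 0.60, "HTML hit rate below 60%"),
    ("shield_offload", "lt", 0.80, "shield offload below 80%"),
    ("p95_edge_ms", "gt", 100, "p95 edge latency above 100 ms"),
    ("p99_edge_ms", "gt", 500, "p99 edge latency above 500 ms"),
    ("vary_cardinality", "gt", 1000, "Vary-key cardinality above 1000"),
    ("waf_block_rate", "gt", 0.05, "WAF block rate above 5%"),
    ("waf_block_rate", "lt", 0.0001, "WAF block rate below 0.01%"),
]


def evaluate(metrics):
    """Return the alert messages triggered by a {metric: value} snapshot."""
    alerts = []
    for name, op, threshold, message in RULES:
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this snapshot
        if (op == "lt" and value < threshold) or (op == "gt" and value > threshold):
            alerts.append(message)
    return alerts
```

Running evaluate once per region, not on the global aggregate, is what surfaces a single degraded POP that the overall numbers hide.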

Debug this

curl -I output revealing a CDN misconfiguration

$ curl -I https://example.com/article/123
HTTP/2 200
date: Wed, 13 May 2026 14:33:00 GMT
content-type: text/html; charset=utf-8
cache-control: public, max-age=3600
cf-cache-status: MISS
vary: User-Agent, Accept-Encoding, Cookie, Authorization
age: 0
server: cloudflare

Cache hit rate is 5%. What is wrong with the response headers, and how do you fix it?

Trace it

Origin is down for 8 minutes during a database failover while users keep hitting the CDN. What do users experience with and without stale-if-error configured?

Step 1. Without stale-if-error: a user requests a product page during the outage. The cached copy was stored 20 minutes ago with max-age=300, so it has expired. What do they see?

Step 2. With stale-if-error=86400 (1 day): same scenario. What do users see?

Step 3. Origin recovers. How does the CDN know to stop serving stale content and return to serving fresh responses?

Step 4. What content types should NOT use stale-if-error?

Design challenge

Design CDN configuration for a news publisher: 50M monthly readers, articles with embedded paywalled content, real-time breaking-news banner, and reader comments.

Article body content (most page bytes) can be stale up to 5 minutes.
Breaking-news banner must update within 30 seconds globally.
Reader comments are user-specific (cannot share across users) but the comment list itself is shared.
Paywall: anonymous users see 3 free articles per month per IP, then a paywall block.

Quiz

Why does multi-CDN traffic steering use DNS rather than HTTP-level redirects?

Cache-tag purge: the CMS sends one tagged call, the CDN fans it out to every POP, and each drops all entries carrying that tag within 1–5 seconds globally — O(tags) calls instead of O(URLs).

Recall before you leave

01
Explain the difference between origin shield and the standard edge cache, and when shield is critical.
02
A deploy pipeline updates origin but doesn't purge the CDN. Users in Europe see stale content 30 minutes after deploy. Users in Asia see fresh content immediately. Why the discrepancy?
03
Your CDN cache hit rate drops from 92% to 45% over two days. List three possible causes in order of likelihood based on common production incidents.

Recap

CDN operations at production scale require four capabilities. (1) Cache-tag purge: assign semantic tags to cached responses and purge by tag on content updates, giving O(tags) API calls instead of O(URLs). (2) Deploy pipeline integration: every deploy purges the affected URL patterns or tags immediately after origin updates. (3) Multi-CDN resilience: DNS-based steering (30 s TTL) driven by RUM data routes users to the best-performing CDN per region; mTLS edge-to-origin prevents bypass after IP exposure. (4) Observability: cache hit rate per URL prefix, origin-shield offload ratio, p95/p99 edge response time per region, and Vary-key cardinality. Alert on metric drift, because a hit-rate drop precedes user complaints by minutes to hours. WAF and bot management at the edge stop attacks before they reach origin. When you put a CDN in front of a new service, wire up all four from day one: a purge hook in CI, cache-tag headers on every response, per-region dashboards, and mTLS. Retrofitting them after an incident costs more than building them in.
