Networking & Protocols

CDN operations and observability

Crux: Cache-tag purge, multi-CDN steering, WAF at edge, the key metrics that catch CDN incidents before users notice, and a news-publisher design case study.

You shipped a hotfix at 3 AM. Origin is updated. But 20 minutes later, 40% of users are still on the old version — different CDN regions have different cache states, a stale edge in Asia-Pacific is serving yesterday’s data, and your monitoring only shows aggregate error rates that look fine. CDN incidents are invisible until they are not.

Cache-tag purge: surgical invalidation at scale

For a site with thousands of URLs (news publisher, e-commerce catalogue), purging URL by URL is operationally unmanageable. Cache tags (the Cache-Tag header on Cloudflare Enterprise, the Surrogate-Key header on Fastly) solve this:

  1. Origin sets Cache-Tag: article-1001, category-tech on article responses.
  2. On article edit, the CMS calls POST /cdn/purge {"tag": "article-1001"}.
  3. CDN invalidates all cached responses carrying that tag — across all POPs, within seconds.
  4. The category tag allows purging all articles in a category (tag: category-tech) in one API call.

Without cache tags: every edit triggers O(URLs) purge calls. With cache tags: every edit triggers O(tags) calls, typically 1–5.

Multi-CDN traffic steering

Large operators (Apple, major streaming platforms and news sites) run two or more CDNs simultaneously for:

  • Vendor resilience: one CDN outage does not take the site down.
  • Regional optimisation: one CDN may have better peering in Asia, another in Latin America.
  • Commercial leverage: competing CDN contracts reduce per-GB costs.

Steering mechanism: DNS-based steering (NS1 Pulsar, Cedexis Openmix, custom). These aggregate real-user-monitoring (RUM) measurements and update DNS records every few seconds to route to the best-performing CDN per region. DNS TTL: 30 s for fast steering response.
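
The core of a steering decision can be sketched as follows: group RUM latency samples by region and CDN, then pick the CDN with the lowest median per region. This is a minimal sketch, not how NS1 or Cedexis score providers; the function name, the `cdn-a`/`cdn-b` labels, and the use of the median are assumptions for illustration.

```python
from statistics import median


def pick_cdn_per_region(samples, min_samples=3):
    """Return {region: cdn} choosing the lowest median latency per region.

    samples: iterable of (region, cdn, latency_ms) tuples from RUM beacons.
    CDNs with fewer than min_samples measurements in a region are ignored,
    so one lucky or unlucky beacon cannot flip the routing decision.
    """
    by_key = {}
    for region, cdn, latency_ms in samples:
        by_key.setdefault((region, cdn), []).append(latency_ms)

    best = {}  # region -> (cdn, median latency)
    for (region, cdn), latencies in by_key.items():
        if len(latencies) < min_samples:
            continue
        score = median(latencies)
        if region not in best or score < best[region][1]:
            best[region] = (cdn, score)
    return {region: cdn for region, (cdn, _) in best.items()}


rum = [
    ("apac", "cdn-a", 180), ("apac", "cdn-a", 200), ("apac", "cdn-a", 190),
    ("apac", "cdn-b", 90), ("apac", "cdn-b", 110), ("apac", "cdn-b", 100),
    ("latam", "cdn-a", 70), ("latam", "cdn-a", 80), ("latam", "cdn-a", 75),
    ("latam", "cdn-b", 20),  # too few samples to be trusted
]
routing = pick_cdn_per_region(rum)
```

A real steering service would publish the result as DNS answers with the short TTL mentioned above, and would usually weigh availability and cost alongside latency.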

Cost: operational complexity. Purges, headers, and edge-worker code must work identically on every CDN. A purge issued to CDN A does not automatically clear CDN B — each requires its own API call.
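
Because each CDN needs its own purge call, a deploy or CMS hook typically fans the purge out and records which vendors failed. The sketch below assumes each vendor's purge API is wrapped in a plain callable; `purge_everywhere` and the fake purgers are hypothetical names, and no real vendor API is called.

```python
from typing import Callable


def purge_everywhere(tag: str, purgers: dict[str, Callable[[str], None]]) -> dict[str, str]:
    """Issue the same tag purge to every CDN and report a per-CDN outcome.

    A failure on one CDN must not stop the purge on the others; otherwise
    a single vendor API error leaves every CDN serving stale content.
    """
    results: dict[str, str] = {}
    for name, purge in purgers.items():
        try:
            purge(tag)
            results[name] = "ok"
        except Exception as exc:  # report and continue with the next CDN
            results[name] = f"failed: {exc}"
    return results


# Fake purgers standing in for real vendor API clients.
purged_on_a: list[str] = []


def purge_cdn_a(tag: str) -> None:
    purged_on_a.append(tag)


def purge_cdn_b(tag: str) -> None:
    raise TimeoutError("purge API timed out")


outcome = purge_everywhere("article-1001", {"cdn-a": purge_cdn_a, "cdn-b": purge_cdn_b})
```

Any entry that is not "ok" should be retried or alerted on, since that CDN keeps serving the old content until its TTL expires.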

CDN operations key numbers

  • Cache-tag purge propagation time (Cloudflare, Fastly): 1–5 seconds globally
  • Multi-CDN DNS TTL for steering: 30 s (fast failover)
  • mTLS edge-to-origin (blocks direct-to-origin requests if the origin IP leaks): CDN client cert required by origin
  • WAF OWASP Top 10 block rate (typical production): 0.1–2% of requests (adjust per app)
  • Healthy cache hit rate (static assets): >90%
  • Healthy cache hit rate (HTML pages): >70%
  • Origin shield offload ratio target: >90% of edge misses never reach origin

BGP-level optimisation: Argo and Global Accelerator

Anycast picks the BGP-closest POP, not the latency-closest. On intercontinental paths, BGP “closest” and “lowest latency” diverge significantly.

Cloudflare Argo Smart Routing and AWS Global Accelerator take traffic off the public internet at a nearby POP and carry it across the provider's own network, choosing paths from measured network conditions rather than BGP path length alone. Typical saving: 30–50% reduction in p95 latency on intercontinental paths. Cost: per-GB premium pricing on backbone traversal. Worth it for latency-sensitive APIs; usually overkill for static-asset delivery where BGP is already efficient.

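mTLS edge-to-origin. A CDN hides the origin IP behind its own Anycast addresses, but attackers can still discover the origin IP via DNS history, certificate transparency logs, or misconfigured direct access paths. With mTLS (mutual TLS), the origin accepts connections only if the client presents the CDN's certificate, so direct-to-origin requests without it are rejected during the TLS handshake. A leaked origin IP then no longer lets attackers bypass the edge at the application layer, although the IP can still be flooded at L3/L4, so firewall allow-listing of CDN address ranges remains worthwhile.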
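
On the origin side, requiring a client certificate comes down to a TLS server context that refuses handshakes without one. The sketch below uses Python's standard `ssl` module; the function name and file-path parameters are hypothetical, and in practice this is usually configured in the web server or load balancer rather than in application code.

```python
import ssl


def build_origin_tls_context(cdn_ca_file=None, server_cert=None, server_key=None):
    """Build a server-side TLS context that requires a client certificate.

    cdn_ca_file: PEM file with the CA that signs the CDN's client certificate
                 (hypothetical path supplied by the operator).
    server_cert / server_key: the origin's own certificate chain and key.
    """
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Reject any handshake in which the client presents no valid certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cdn_ca_file:
        # Trust only the CDN's CA for client certificates, not the system store.
        ctx.load_verify_locations(cafile=cdn_ca_file)
    if server_cert:
        ctx.load_cert_chain(certfile=server_cert, keyfile=server_key)
    return ctx


origin_ctx = build_origin_tls_context()  # file paths omitted in this sketch
```

With no CA file loaded, every client certificate fails verification, so the context fails closed until the operator supplies the CDN's CA.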

WAF and bot management at edge

CDNs sit in the request path for all traffic — making them the cheapest layer for attack defence:

  • WAF (Web Application Firewall): matches request patterns against managed rule sets such as the OWASP Core Rule Set, which cover OWASP Top 10 attack classes (SQL injection, XSS, path traversal, command injection). Blocking happens at the edge in under 1 ms, with no origin involvement.
  • Bot management: JA3/JA4 TLS fingerprinting (fingerprint the TLS ClientHello), behavioural analysis, IP reputation to distinguish human from automated traffic. Blocks credential stuffing, scraping, and API abuse.
  • Rate limiting: per-IP, per-token, per-route. Configured at edge; enforced without origin round-trips.
  • DDoS scrubbing: volumetric attacks (L3/L4) absorbed at edge before reaching origin. Cloudflare’s Anycast network spans 330+ cities, distributing attack traffic across all POPs.
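
Per-IP rate limiting from the list above is commonly implemented as a token bucket. The sketch below is a single-process illustration with an injected clock; real edge limiters share counters across servers, and the class name and parameters are hypothetical.

```python
class TokenBucketLimiter:
    """Per-key token bucket: refills at rate_per_s, allows bursts up to burst."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._state: dict[str, tuple[float, float]] = {}  # key -> (tokens, last_ts)

    def allow(self, key: str, now: float) -> bool:
        """Return True if the request identified by key may pass at time now."""
        tokens, last_ts = self._state.get(key, (float(self.burst), now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(float(self.burst), tokens + (now - last_ts) * self.rate_per_s)
        if tokens >= 1.0:
            self._state[key] = (tokens - 1.0, now)
            return True
        self._state[key] = (tokens, now)
        return False


limiter = TokenBucketLimiter(rate_per_s=1.0, burst=2)
decisions = [
    limiter.allow("203.0.113.7", now=0.0),  # burst token 1
    limiter.allow("203.0.113.7", now=0.0),  # burst token 2
    limiter.allow("203.0.113.7", now=0.0),  # bucket empty
    limiter.allow("203.0.113.7", now=1.0),  # one token refilled after 1 s
]
```

The key can equally be an API token or a route name, matching the per-token and per-route variants mentioned above.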

103 Early Hints

RFC 8297 defines the 103 Early Hints informational response, sent before the final 200 OK. The edge can send Link: </style.css>; rel=preload in a 103 response while the origin generates the main HTML. The browser starts fetching critical assets before the HTML arrives, overlapping those downloads with origin think time instead of waiting for the final response. As of 2026: 93% browser support, ~5% real-world adoption. Vercel leads with ~2.8%; Cloudflare and Fastly remain below 1%. Adoption friction: the edge must know which resources to hint per page, which is not easily automatable without framework support.
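
On the wire, a 103 response is just a status line plus Link headers, followed later by the normal final response on the same connection. The helper below formats the HTTP/1.1 form as a sketch; the function name is hypothetical, and the `as=` attribute is added because preload hints need a resource type.

```python
def early_hints_response(preloads):
    """Format an HTTP/1.1 103 Early Hints response.

    preloads: list of (path, as_type) pairs, e.g. ("/style.css", "style").
    The final response (e.g. 200 OK) is sent separately once the origin
    has produced the page.
    """
    lines = ["HTTP/1.1 103 Early Hints"]
    for path, as_type in preloads:
        lines.append(f"Link: <{path}>; rel=preload; as={as_type}")
    # An empty line terminates the header block of the informational response.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


hint = early_hints_response([("/style.css", "style"), ("/app.js", "script")])
```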

Key observability metrics

A CDN incident often starts as a metric drift before it becomes a user complaint:

Metric | Target | Alert threshold
Cache hit rate (static assets) | >90% | <80% triggers investigation
Cache hit rate (HTML pages) | >70% | <60% triggers investigation
Origin shield offload ratio | >90% | <80% (edges may be contacting origin directly)
p95 edge response time per region | <50 ms | >100 ms (regional POP issue)
p99 edge response time per region | <200 ms | >500 ms (severe regional degradation)
Vary-key cardinality per URL | <100 | >1000 (check for Vary: User-Agent footgun)
WAF block rate | 0.1–2% | >5% (possible attack); <0.01% (WAF rules too loose)

Export from CDN dashboards to Prometheus/OTel for SLO alerting. CDN-native dashboards (Cloudflare Analytics, Fastly Real-Time) are useful for deep-dive but not for cross-CDN correlation.
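
As a sketch of what the exported data feeds, the snippet below computes cache hit rate per URL prefix from edge log records and flags prefixes under their alert threshold. The record shape `(path, cache_status)` and the function names are assumptions; in production this logic would normally live in Prometheus or OTel alert rules rather than in a script.

```python
from collections import Counter


def hit_rate_by_prefix(records):
    """Return {prefix: hit rate} where prefix is the first path segment.

    records: iterable of (path, cache_status) pairs from edge logs,
    with cache_status values such as "HIT" or "MISS".
    """
    hits, total = Counter(), Counter()
    for path, status in records:
        prefix = "/" + path.lstrip("/").split("/", 1)[0]
        total[prefix] += 1
        if status.upper() == "HIT":
            hits[prefix] += 1
    return {prefix: hits[prefix] / total[prefix] for prefix in total}


def prefixes_below_threshold(rates, thresholds, default=0.6):
    """Return the sorted prefixes whose hit rate is under their alert threshold."""
    return sorted(p for p, rate in rates.items() if rate < thresholds.get(p, default))


logs = (
    [("/static/app.js", "HIT")] * 9 + [("/static/app.js", "MISS")]
    + [("/article/123", "HIT")] + [("/article/123", "MISS")] * 3
)
rates = hit_rate_by_prefix(logs)
alerts = prefixes_below_threshold(rates, {"/static": 0.8, "/article": 0.6})
```

Breaking the rate down by prefix matters because an aggregate figure dominated by static assets can hide a collapse in HTML hit rate.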

Debug this

curl -I output revealing a CDN misconfiguration

```log
$ curl -I https://example.com/article/123
HTTP/2 200
date: Wed, 13 May 2026 14:33:00 GMT
content-type: text/html; charset=utf-8
cache-control: public, max-age=3600
cf-cache-status: MISS
vary: User-Agent, Accept-Encoding, Cookie, Authorization
age: 0
server: cloudflare
```

Cache hit rate is 5%. What is wrong with the response headers, and how do you fix it?

Trace it

Origin is down for 8 minutes during a database failover, and users keep hitting the CDN throughout the outage. What do users experience with and without stale-if-error configured?

  1. Without stale-if-error: a user requests a product page during the outage. Cache was fresh 20 minutes ago (max-age=300, now expired). What do they see?
  2. With stale-if-error=86400 (1 day): same scenario. What do users see?
  3. Origin recovers. How does the CDN know to stop serving stale content and return to serving fresh responses?
  4. What content types should NOT use stale-if-error?
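
To reason through the trace, it can help to write the edge's decision as a small function. This is a simplified model of the stale-if-error semantics in RFC 5861, not any vendor's implementation; the function name and return labels are hypothetical.

```python
def edge_decision(age_s, max_age_s, stale_if_error_s, origin_healthy):
    """Decide what an edge does with a cached object on an incoming request.

    age_s: seconds since the object was fetched from origin.
    max_age_s: freshness lifetime from Cache-Control: max-age.
    stale_if_error_s: stale-if-error window in seconds (0 if not configured).
    origin_healthy: False when revalidation fails with a 5xx or a timeout.
    """
    if age_s < max_age_s:
        return "serve-fresh"            # still fresh, origin is not contacted
    if origin_healthy:
        return "revalidate-and-serve"   # expired, origin supplies a fresh copy
    staleness_s = age_s - max_age_s
    if staleness_s <= stale_if_error_s:
        return "serve-stale"            # origin failed, within the stale window
    return "error-5xx"                  # origin failed and no usable stale copy


# The scenario from the trace: cached 20 minutes ago with max-age=300.
without_sie = edge_decision(1200, 300, 0, origin_healthy=False)
with_sie = edge_decision(1200, 300, 86400, origin_healthy=False)
after_recovery = edge_decision(1200, 300, 86400, origin_healthy=True)
```

The model makes one property visible: stale content is only served when a revalidation attempt fails, so the first successful origin response after recovery replaces it.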
Design challenge

Design CDN configuration for a news publisher: 50M monthly readers, articles with embedded paywalled content, real-time breaking-news banner, and reader comments.

  • Article body content (most page bytes) can be stale up to 5 minutes.
  • Breaking-news banner must update within 30 seconds globally.
  • Reader comments are user-specific (cannot share across users) but the comment list itself is shared.
  • Paywall: anonymous users see 3 free articles per month per IP, then a paywall block.
Quiz

Why does multi-CDN traffic steering use DNS rather than HTTP-level redirects?

Recall before you leave
  1. Explain the difference between origin shield and the standard edge cache, and when shield is critical.
  2. A deploy pipeline updates origin but doesn't purge the CDN. Users in Europe see stale content 30 minutes after deploy. Users in Asia see fresh content immediately. Why the discrepancy?
  3. Your CDN cache hit rate drops from 92% to 45% over two days. List three possible causes in order of likelihood based on common production incidents.
Recap

CDN operations at production scale require four capabilities. (1) Cache-tag purge: assign semantic tags to cached responses, purge by tag on content updates — O(tags) API calls instead of O(URLs). (2) Deploy pipeline integration: every deploy triggers a purge of affected URL patterns or tags immediately after origin updates. (3) Multi-CDN resilience: DNS-based steering (30 s TTL) with RUM data routes users to the best-performing CDN per region; mTLS edge-to-origin prevents bypass after IP exposure. (4) Observability: cache hit rate per URL prefix, origin-shield offload ratio, p95/p99 edge response time per region, Vary-key cardinality. Alert on metric drift — a hit-rate drop precedes user complaints by minutes to hours. WAF and bot management at edge stop attacks before they reach origin.
