Networking & Protocols
CDN operations and observability
You shipped a hotfix at 3 AM. Origin is updated. But 20 minutes later, 40% of users are still on the old version — different CDN regions have different cache states, a stale edge in Asia-Pacific is serving yesterday’s data, and your monitoring only shows aggregate error rates that look fine. CDN incidents are invisible until they are not.
Cache-tag purge: surgical invalidation at scale
For a site with thousands of URLs (news publisher, e-commerce catalogue), URL-based purge is operationally unmanageable. Cache tags (Cloudflare Enterprise, Fastly) solve this:
- Origin sets Cache-Tag: article-1001, category-tech on article responses.
- On article edit, the CMS calls POST /cdn/purge {"tag": "article-1001"}.
- CDN invalidates all cached responses carrying that tag, across all POPs, within seconds.
- The category tag allows purging all articles in a category (tag: category-tech) in one API call.
Without cache tags: every edit triggers O(URLs) purge calls. With cache tags: every edit triggers O(tags) calls, typically 1–5.
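The tag-to-URL bookkeeping behind a tag purge can be sketched as a toy in-memory cache. This is a minimal illustration, not any CDN's implementation: `TaggedCache`, `store`, `evict`, and `purge_tag` are hypothetical names, and a real CDN would apply the same idea across all POPs.

```python
from collections import defaultdict


class TaggedCache:
    """Toy single-node cache that indexes entries by cache tag (illustrative only)."""

    def __init__(self):
        self.entries = {}                     # url -> (body, set of tags)
        self.urls_by_tag = defaultdict(set)   # tag -> urls carrying that tag

    def store(self, url, body, cache_tag_header):
        # Replace any previous entry so the tag index never points at old tags.
        self.evict(url)
        # Parse a header value such as "article-1001, category-tech".
        tags = {t.strip() for t in cache_tag_header.split(",") if t.strip()}
        self.entries[url] = (body, tags)
        for tag in tags:
            self.urls_by_tag[tag].add(url)

    def evict(self, url):
        """Remove one URL and clean it out of every tag index it appears in."""
        entry = self.entries.pop(url, None)
        if entry is None:
            return False
        for tag in entry[1]:
            self.urls_by_tag[tag].discard(url)
        return True

    def purge_tag(self, tag):
        """Drop every entry carrying `tag`; return how many entries were removed."""
        urls = list(self.urls_by_tag.get(tag, ()))
        return sum(self.evict(url) for url in urls)
```

One `purge_tag("article-1001")` call removes every cached variant of that article, however many URLs carry the tag, which is where the O(tags) cost comes from.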
Multi-CDN traffic steering
Large operators (Netflix, Apple, major news sites) run two or more CDNs simultaneously for:
- Vendor resilience: one CDN outage does not take the site down.
- Regional optimisation: one CDN may have better peering in Asia, another in Latin America.
- Commercial leverage: competing CDN contracts reduce per-GB costs.
Steering mechanism: DNS-based steering (NS1 Pulsar, Cedexis Openmix, custom). These aggregate real-user-monitoring (RUM) measurements and update DNS records every few seconds to route to the best-performing CDN per region. DNS TTL: 30 s for fast steering response.
Cost: operational complexity. Purges, headers, and edge-worker code must work identically on every CDN. A purge issued to CDN A does not automatically clear CDN B — each requires its own API call.
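A rough sketch of the two ideas above: choosing a CDN per region from RUM samples, and fanning a purge out to every CDN. The function names, the use of median latency as the score, and the `purge_clients` callables are assumptions made for illustration; commercial steering products use their own scoring.

```python
from statistics import median


def pick_cdn_per_region(rum_samples):
    """rum_samples: iterable of (region, cdn, latency_ms) tuples.

    Returns {region: cdn} choosing the CDN with the lowest median latency.
    """
    samples_by_key = {}
    for region, cdn, latency_ms in rum_samples:
        samples_by_key.setdefault((region, cdn), []).append(latency_ms)

    best = {}  # region -> (cdn, median latency)
    for (region, cdn), samples in samples_by_key.items():
        score = median(samples)
        if region not in best or score < best[region][1]:
            best[region] = (cdn, score)
    return {region: cdn for region, (cdn, _) in best.items()}


def purge_everywhere(tag, purge_clients):
    """purge_clients: {cdn_name: callable(tag) -> bool}, one per CDN.

    Each CDN needs its own purge call; return the per-CDN outcome so a
    partial failure is visible to the caller.
    """
    return {name: purge(tag) for name, purge in purge_clients.items()}
```

The steering result would then be published as DNS answers with a short TTL; returning per-CDN purge results makes it possible to retry only the CDN that failed.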
| Item | Value |
|---|---|
| Cache-tag purge propagation time (Cloudflare, Fastly) | 1–5 seconds globally |
| Multi-CDN DNS TTL for steering | 30 s (fast failover) |
| mTLS edge-to-origin: protects against origin IP exposure | CDN client cert required by origin |
| WAF OWASP Top 10 block rate (typical production) | 0.1–2% of requests (adjust per app) |
| Healthy cache hit rate (static assets) | >90% |
| Healthy cache hit rate (HTML pages) | >70% |
| Origin shield offload ratio target | >90% of edge misses never reach origin |
BGP-level optimisation: Argo and Global Accelerator
Anycast picks the BGP-closest POP, not the latency-closest. On intercontinental paths, BGP “closest” and “lowest latency” diverge significantly.
Cloudflare Argo Smart Routing and AWS Global Accelerator measure actual end-to-end latency from all POPs continuously and route traffic over a private backbone (not the public internet) to the lowest-latency POP. Typical saving: 30–50% reduction in p95 latency on intercontinental paths. Cost: per-GB premium pricing on backbone traversal. Worth it for latency-sensitive APIs; usually overkill for static-asset delivery where BGP is already efficient.
mTLS edge-to-origin
Even with Anycast protecting origin by obscuring its IP, attackers can discover origin IP via DNS history, certificate transparency logs, or misconfigured direct access paths. mTLS (mutual TLS): origin accepts connections only if the client presents the CDN’s certificate. Without the CDN cert, direct-to-origin requests are rejected, so origin IP exposure no longer matters.
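On the origin side, requiring a client certificate can be sketched with Python's standard `ssl` module. The function name and the three file-path parameters are hypothetical; in practice this is usually configured in the origin's web server or load balancer rather than in application code.

```python
import ssl


def build_origin_tls_context(server_cert=None, server_key=None, cdn_ca_bundle=None):
    """Server-side TLS context that rejects clients without a trusted certificate.

    All three paths are placeholders: the origin's own certificate and key,
    and the CA bundle that signs the CDN's client certificate.
    """
    # PROTOCOL_TLS_SERVER starts with no trusted CAs loaded, so only the
    # CDN's CA (if provided) can validate a client certificate.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED  # handshake fails without a client cert
    if server_cert and server_key:
        ctx.load_cert_chain(certfile=server_cert, keyfile=server_key)
    if cdn_ca_bundle:
        ctx.load_verify_locations(cafile=cdn_ca_bundle)
    return ctx
```

With `CERT_REQUIRED`, a direct-to-origin connection that cannot present a certificate signed by the configured CA fails during the TLS handshake, before any HTTP request is processed.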
WAF and bot management at edge
CDNs sit in the request path for all traffic — making them the cheapest layer for attack defence:
- WAF (Web Application Firewall): matches request patterns against OWASP Top 10 rule sets (SQL injection, XSS, path traversal, command injection). Block in under 1 ms at edge, no origin involvement.
- Bot management: JA3/JA4 TLS fingerprinting (fingerprint the TLS ClientHello), behavioural analysis, IP reputation to distinguish human from automated traffic. Blocks credential stuffing, scraping, and API abuse.
- Rate limiting: per-IP, per-token, per-route. Configured at edge; enforced without origin round-trips.
- DDoS scrubbing: volumetric attacks (L3/L4) absorbed at edge before reaching origin. Cloudflare’s Anycast network spans 330+ cities, distributing attack traffic across all POPs.
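Per-key rate limiting from the list above is commonly built on a token bucket. The sketch below is one possible shape, assuming a hypothetical `TokenBucketLimiter` class and an explicit `now` argument so the behaviour is deterministic; it is not how any particular CDN implements its limiter.

```python
class TokenBucketLimiter:
    """Token bucket per key, where a key can be an IP, a token, or (ip, route)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec   # tokens added per second
        self.burst = burst         # maximum tokens a bucket can hold
        self.buckets = {}          # key -> (tokens, last_seen_timestamp)

    def allow(self, key, now):
        # A new key starts with a full bucket.
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

For example, `TokenBucketLimiter(rate_per_sec=1, burst=2)` lets a client send two requests back to back, rejects the third, and accepts another one a second later.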
103 Early Hints
RFC 8297 defines the 103 Early Hints informational response, sent before the final 200 OK. The edge can send Link: </style.css>; rel=preload in a 103 response while the origin generates the main HTML. The browser starts fetching critical assets before HTML arrives, saving one RTT from the critical render path. As of 2026: 93% browser support, ~5% real-world adoption. Vercel leads with ~2.8%; Cloudflare and Fastly remain below 1%. Adoption friction: the edge must know which resources to hint per page — not easily automatable without framework support.
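To make the message ordering concrete, the sketch below builds the raw bytes of a 103 response followed by the final 200. HTTP/1.1 text framing is used only because it is readable; the function names and the `as=` attribute values are illustrative assumptions, and a real edge would send these through its HTTP stack rather than by hand.

```python
def early_hints_response(preload_links):
    """preload_links: list of (path, as_type) pairs, e.g. ("/style.css", "style")."""
    lines = ["HTTP/1.1 103 Early Hints"]
    for path, as_type in preload_links:
        lines.append(f"Link: <{path}>; rel=preload; as={as_type}")
    # An informational response has headers only, terminated by a blank line.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def final_response(body):
    """The real 200 response, sent once the origin has produced the HTML."""
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body
```

The edge writes `early_hints_response(...)` to the connection immediately and `final_response(...)` later on the same connection, so the browser can start fetching the hinted assets in the gap.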
Key observability metrics
A CDN incident often starts as a metric drift before it becomes a user complaint:
| Metric | Target | Alert threshold |
|---|---|---|
| Cache hit rate (static assets) | >90% | <80% triggers investigation |
| Cache hit rate (HTML pages) | >70% | <60% triggers investigation |
| Origin shield offload ratio | >90% | <80% — edges may be contacting origin directly |
| p95 edge response time per region | <50 ms | >100 ms — regional POP issue |
| p99 edge response time per region | <200 ms | >500 ms — severe regional degradation |
| Vary-key cardinality per URL | <100 | >1000 — check for Vary: User-Agent footgun |
| WAF block rate | 0.1–2% | >5% — possible attack; <0.01% — WAF rules too loose |
Export from CDN dashboards to Prometheus/OTel for SLO alerting. CDN-native dashboards (Cloudflare Analytics, Fastly Real-Time) are useful for deep-dive but not for cross-CDN correlation.
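The alert thresholds in the table can be expressed as a small rule list. This is a sketch under assumptions: the metric key names are invented for the example, rates are fractions between 0 and 1, and in production these rules would more likely live in the alerting system than in application code.

```python
import operator

# (metric key, comparison, limit, message); limits come from the table above.
RULES = [
    ("static_hit_rate", operator.lt, 0.80, "static hit rate below 80%"),
    ("html_hit_rate", operator.lt, 0.60, "HTML hit rate below 60%"),
    ("shield_offload", operator.lt, 0.80, "shield offload below 80%"),
    ("p95_ms", operator.gt, 100, "p95 edge response time above 100 ms"),
    ("p99_ms", operator.gt, 500, "p99 edge response time above 500 ms"),
    ("vary_cardinality", operator.gt, 1000, "Vary-key cardinality above 1000"),
    ("waf_block_rate", operator.gt, 0.05, "WAF block rate above 5%"),
    ("waf_block_rate", operator.lt, 0.0001, "WAF block rate below 0.01%"),
]


def evaluate_cdn_metrics(metrics):
    """Return the alert messages triggered by one region's metrics snapshot."""
    return [
        message
        for key, breached, limit, message in RULES
        if key in metrics and breached(metrics[key], limit)
    ]
```

Running this per region rather than on global aggregates matters: a single stale or slow region can breach a threshold while the global average still looks healthy.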
curl -I output revealing a CDN misconfiguration
$ curl -I https://example.com/article/123
HTTP/2 200
date: Wed, 13 May 2026 14:33:00 GMT
content-type: text/html; charset=utf-8
cache-control: public, max-age=3600
cf-cache-status: MISS
vary: User-Agent, Accept-Encoding, Cookie, Authorization
age: 0
server: cloudflare
Cache hit rate is 5%. What is wrong with the response headers, and how do you fix it?
Origin is down for 8 minutes during a database failover, and users keep hitting the CDN during the outage. What do users experience with and without stale-if-error configured?
Design CDN configuration for a news publisher: 50M monthly readers, articles with embedded paywalled content, real-time breaking-news banner, and reader comments.
- Article body content (most page bytes) can be stale up to 5 minutes.
- Breaking-news banner must update within 30 seconds globally.
- Reader comments are user-specific (cannot share across users) but the comment list itself is shared.
- Paywall: anonymous users see 3 free articles per month per IP, then a paywall block.
- Three-layer cacheability: article body (5 min TTL + SWR), breaking news (30 s KV, not per-user fetch), per-user state (edge worker from session KV).
- stale-while-revalidate prevents cache stampede on popular articles at expiry.
- Cache-tag article-id enables surgical purge on edits without clearing other articles.
- Edge worker enforces paywall counting without origin round-trips — not bypassable by clearing cookies.
- Per-region observability catches regional cache issues before users notice.
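The paywall counting that the edge worker performs can be sketched as follows. A plain dict stands in for the edge KV store, and the key layout, the function name, and the choice not to count re-reads of an already opened article are assumptions made for this example; a real edge KV store may also be eventually consistent, which would make the count approximate.

```python
FREE_ARTICLES_PER_MONTH = 3


def paywall_decision(kv, client_ip, article_id, year_month):
    """Return "allow" or "paywall" for an anonymous reader.

    kv is any dict-like store; here it stands in for an edge KV namespace.
    The count is keyed by IP and month, so clearing cookies does not reset it.
    """
    key = f"paywall:{year_month}:{client_ip}"
    seen = kv.get(key, set())
    if article_id in seen:
        return "allow"    # assumption: re-reading a counted article stays free
    if len(seen) >= FREE_ARTICLES_PER_MONTH:
        return "paywall"
    kv[key] = seen | {article_id}
    return "allow"
```

Because the decision needs only the KV lookup, the cached article body can still be shared across all readers while the per-IP gate runs at the edge.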
Why does multi-CDN traffic steering use DNS rather than HTTP-level redirects?
1. Explain the difference between origin shield and the standard edge cache, and when shield is critical.
2. A deploy pipeline updates origin but doesn't purge the CDN. Users in Europe see stale content 30 minutes after deploy. Users in Asia see fresh content immediately. Why the discrepancy?
3. Your CDN cache hit rate drops from 92% to 45% over two days. List three possible causes in order of likelihood based on common production incidents.
CDN operations at production scale require four capabilities. (1) Cache-tag purge: assign semantic tags to cached responses, purge by tag on content updates — O(tags) API calls instead of O(URLs). (2) Deploy pipeline integration: every deploy triggers a purge of affected URL patterns or tags immediately after origin updates. (3) Multi-CDN resilience: DNS-based steering (30 s TTL) with RUM data routes users to the best-performing CDN per region; mTLS edge-to-origin prevents bypass after IP exposure. (4) Observability: cache hit rate per URL prefix, origin-shield offload ratio, p95/p99 edge response time per region, Vary-key cardinality. Alert on metric drift — a hit-rate drop precedes user complaints by minutes to hours. WAF and bot management at edge stop attacks before they reach origin.