Defense-in-depth architecture and attack economics

Crux: No single defense stops all DDoS vectors. Anycast edge absorption, rate limiting, WAF, mTLS, and adaptive load shedding form the layers; attack economics favor the attacker only if you defend alone.

You deployed a CDN, rate limiting, and a WAF. Then the attacker switches to cache-miss HTTP floods targeting your most expensive database query, spreading across 10,000 IPs each under your per-IP limit. No single defense catches it. You need to understand how the layers interact and when to escalate to humans.

Defense-in-depth architecture: the full stack. No single defense stops all attacks. The layered approach:

  1. Anycast edge scrubbing — every CDN PoP is an active scrubbing center. Attacks are ingested at the nearest PoP rather than concentrated on the origin. Combined capacity across 330+ Cloudflare PoPs exceeds 37 Tbps. A 10 Gbps attack becomes a rounding error spread across many nodes.
  2. Stateless L3/L4 rate limits — per-ASN, per-prefix rate limits drop obvious amplification sources and SYN flood sources before TCP state is created.
  3. WAF at the edge: detects application-layer patterns (SQLi, XSS, bot fingerprints). For a public API, run it at Paranoia Level 2 (PL2, a balance between false positives and coverage), not PL4 (paranoid).
  4. Token bucket per IP + per user — stops obvious botnets and authenticated user abuse.
  5. Adaptive concurrency limiting at origin: when in-flight requests exceed the capacity threshold, reject new requests with a fast 503. The service remains stable; users get immediate errors instead of timeouts.
  6. Observability and human escalation — when automated defenses fail, on-call engineers add custom rules.
| Layer | What it stops | What it misses |
|---|---|---|
| Anycast edge | Volumetric floods (Gbps-scale) | Intelligent low-rate attacks per IP |
| Stateless L3/L4 rate limits | Amplification, SYN floods | HTTP-level attacks on valid ports |
| WAF (PL2) | Known attack signatures, bot patterns | Zero-days, obfuscated payloads, business-logic abuse |
| Rate limit (per IP/user) | Obvious botnets, auth abuse | Distributed botnets with residential proxies |
| Adaptive concurrency | Distributed overload, cache-miss floods | Attacks below the overload threshold |
| mTLS | Lateral movement inside the network | External-facing attack vectors |

mTLS in service meshes with SPIFFE. Istio and Linkerd deploy sidecar proxies in every pod. In Istio, the control plane (istiod) runs a SPIFFE-compatible certificate authority. At startup, each sidecar receives a short-lived certificate (1–24 hours). Certificates rotate via an SDS (Secret Discovery Service) push, so no sidecar restart is needed. Every service-to-service call involves: (1) an mTLS handshake (20–50 ms overhead per new connection on older hardware), (2) an encrypted payload, and (3) certificate verification on both sides. This prevents lateral movement if the pod network is compromised. Cost: certificate rotation adds operational complexity and monitoring burden (an expired certificate is an infrastructure incident, not an application bug).
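As an illustration of step (3), here is a minimal sketch of the identity check a receiving proxy might apply after the handshake. It assumes the peer's SPIFFE ID has already been extracted from the certificate's URI SAN. The function names, the allow-list, and the example service accounts are hypothetical; the `spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>` layout is the one Istio uses.

```python
from urllib.parse import urlparse


def parse_spiffe_id(uri):
    """Split a SPIFFE ID into (trust_domain, path); raise ValueError if malformed."""
    parsed = urlparse(uri)
    if parsed.scheme != "spiffe" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    return parsed.netloc, parsed.path


def is_peer_allowed(uri, trust_domain, allowed_paths):
    """Accept a peer only if it is in our trust domain and on the allow-list."""
    try:
        domain, path = parse_spiffe_id(uri)
    except ValueError:
        return False
    return domain == trust_domain and path in allowed_paths


# Hypothetical policy: only the checkout service account may call this workload.
ALLOWED = {"/ns/shop/sa/checkout"}
print(is_peer_allowed("spiffe://cluster.local/ns/shop/sa/checkout", "cluster.local", ALLOWED))
print(is_peer_allowed("spiffe://other.example/ns/shop/sa/checkout", "cluster.local", ALLOWED))
```

In a real mesh this decision is made by the sidecar from authorization policy, not by application code; the sketch only shows why a valid certificate from the wrong identity is still rejected.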

Protocol/state-exhaustion in depth. SYN floods: each SYN allocates a half-open connection slot in the server's backlog. When the backlog overflows, the server drops new SYN packets, including those from legitimate clients. SYN cookies encode the connection state as a cryptographic cookie in the server's initial sequence number (ISN), so no memory is allocated; a legitimate client replies with an ACK whose acknowledgment number decodes back to a valid cookie. ACK floods: mitigated by RST rate limiting (a cap on RSTs sent per second) and firewall suppression of ACKs that match no known connection. TCP RST injection (blind, off-path): the RFC 5961 challenge-ACK mechanism forces the attacker to guess the exact expected sequence number rather than any value inside the receive window. An on-path attacker who can observe sequence numbers is not stopped by this.
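The SYN cookie idea can be sketched as follows. This is a simplified illustration, not the bit layout any real kernel uses: actual implementations also pack an MSS index and a coarse timestamp into the 32 bits. The secret, the function names, and the `counter` parameter (a slowly incrementing time slot) are assumptions made for the example.

```python
import hashlib
import hmac
import struct

SECRET = b"rotate-me-regularly"  # hypothetical server-side secret


def _mac32(src_ip, src_port, dst_ip, dst_port, counter):
    """32-bit keyed hash over the connection 4-tuple and a time counter."""
    msg = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}@{counter}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return struct.unpack("!I", digest[:4])[0]


def make_cookie(src_ip, src_port, dst_ip, dst_port, client_isn, counter):
    """ISN the server puts in its SYN-ACK. Nothing is stored per connection."""
    return (_mac32(src_ip, src_port, dst_ip, dst_port, counter) + client_isn) & 0xFFFFFFFF


def check_ack(src_ip, src_port, dst_ip, dst_port, client_isn, ack_number, counter, max_age=1):
    """A valid final ACK acknowledges cookie + 1; accept recent counters only."""
    for age in range(max_age + 1):
        cookie = make_cookie(src_ip, src_port, dst_ip, dst_port, client_isn, counter - age)
        if ack_number == (cookie + 1) & 0xFFFFFFFF:
            return True
    return False


# Handshake for one client (documentation-range addresses).
conn = ("203.0.113.7", 40000, "198.51.100.1", 443)
cookie = make_cookie(*conn, client_isn=12345, counter=100)
print(check_ack(*conn, client_isn=12345, ack_number=(cookie + 1) & 0xFFFFFFFF, counter=100))
```

The server recomputes the cookie from the ACK alone, which is why a flood of SYNs that never complete the handshake consumes CPU for hashing but no backlog memory.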

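Rate limiting internals: distributed systems complexity. Token bucket with Redis backing: T = min(C, T + R * delta_time), where T is the current token count, C the bucket capacity, and R the refill rate. A simpler distributed variant is a fixed-window counter: each request runs INCR key, with EXPIRE key window set when the key is first created (wrapped in a Lua script or MULTI/EXEC so the pair stays atomic). At 100k req/sec, a Redis round trip adds 0.5–1 ms to every request, and by Little's law keeps roughly 50–100 Redis calls in flight at any moment. Mitigation: a local per-server counter with periodic Redis sync (accepts slight inaccuracy, cuts decisions to microseconds). HyperLogLog gives approximate distinct counts (for example, unique source IPs per ASN or prefix) at ~1.6 kB per sketch and ~2% error, which suits ASN-level or IP-range limits.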
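A minimal in-process sketch of the token bucket update T = min(C, T + R * delta_time) is shown below. It is the kind of local counter a server could consult before any Redis sync; the class name, the `cost` parameter, and the injectable `clock` are assumptions made for the example, and the Redis synchronization itself is not shown.

```python
import time


class TokenBucket:
    """Local token bucket: refill lazily on each check, cap at capacity."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # C: maximum burst size
        self.refill_rate = refill_rate  # R: tokens added per second
        self.clock = clock              # injectable so the bucket can be tested
        self.tokens = capacity          # T: start full
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # T = min(C, T + R * delta_time)
        self.tokens = min(self.capacity, self.tokens + self.refill_rate * (now - self.last))
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage: a burst of 2, refilled at 1 token per second, driven by a fake clock.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_rate=1, clock=lambda: fake_now[0])
print([bucket.allow() for _ in range(3)])  # third call finds the bucket empty
fake_now[0] = 1.0
print(bucket.allow())                      # one token has been refilled
```

Because the refill is computed lazily from elapsed time, no background timer is needed, which is what keeps the per-request decision in the microsecond range.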

Adaptive concurrency: load shedding formula. Track in-flight requests Q and set a max_queue threshold. For each new request, survival probability = max(0, 1 - Q / max_queue); accept the request with that probability. This creates graceful degradation: when Q is at 50% of max_queue, 50% of new requests are rejected; when Q reaches max_queue, all new requests are rejected. Users see fast 503s instead of queue timeouts (which can be 30+ seconds). Circuit breakers complement this by rejecting requests to backends whose recent failure rate is above a threshold, which prevents cascading failure across the service graph.
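The shedding rule above can be sketched in a few lines. The function names and the injectable `rng` are assumptions for the example; a production limiter would also need to track Q atomically across worker threads.

```python
import random


def survival_probability(in_flight, max_queue):
    """max(0, 1 - Q / max_queue): chance that a new request is admitted."""
    return max(0.0, 1.0 - in_flight / max_queue)


def admit(in_flight, max_queue, rng=random.random):
    """Admit the request with the survival probability; otherwise return a fast 503."""
    return rng() < survival_probability(in_flight, max_queue)


for q in (0, 50, 100, 150):
    print(q, survival_probability(q, max_queue=100))
```

Since `random.random()` returns values in [0, 1), an idle service (probability 1.0) admits everything and a full queue (probability 0.0) admits nothing, with a linear ramp in between.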

Trace it

Trace a sophisticated attack using amplification + application-layer tactics.

  1. The attacker starts a 10 Gbps memcached amplification attack. What happens without a CDN?
  2. You migrate to a CDN (Cloudflare). The 10 Gbps is absorbed at the edge. What is the attack cost to the attacker?
  3. The attacker switches to HTTP floods: legitimate-looking GET requests, 100,000 req/sec from a botnet. What does rate limiting do?
  4. The attacker adds ?x=random to every request (bypassing cache), targeting expensive database endpoints. Cache-hit rate drops from 95% to 5%. What is the new problem?
  5. The attacker combines the random parameter with realistic User-Agents and spreads further. Service is still slow. What is the operational response?
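To make the cache-busting step concrete, the following sketch works out the origin load before and after the hit rate collapses. It assumes the 100,000 req/sec flood from the earlier step continues unchanged, which the scenario does not state explicitly; the function name is hypothetical.

```python
def origin_rps(total_rps, cache_hit_percent):
    """Requests per second that miss the cache and reach the origin."""
    return total_rps * (100 - cache_hit_percent) // 100


before = origin_rps(100_000, 95)  # 95% of requests served from cache
after = origin_rps(100_000, 5)    # cache-busting parameter drops hits to 5%

print(before, after, after // before)
```

Under that assumption the origin sees 19 times more requests with no change in total traffic volume, which is why edge-level volume metrics alone can miss this phase of the attack.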
Debug this

WAF anomaly scoring during an attack

log
2026-05-15 14:23:00 | requests=45000/sec | score-p50=0.5 | score-p95=3.2 | score-p99=8.1
2026-05-15 14:24:00 | requests=120000/sec | score-p50=6.1 | score-p95=14.2 | score-p99=28.5
2026-05-15 14:25:00 | requests=450000/sec | score-p50=12.3 | score-p95=19.8 | score-p99=32.1
2026-05-15 14:26:00 | blocked=380000 | allowed=70000 | score-threshold=5

The WAF is scoring traffic and blocking at threshold=5. What is happening in this timeline, and what should the operator do?
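One way to read the timeline is to parse it and compare the median score against the blocking threshold. The sketch below assumes a request is blocked when its anomaly score exceeds `score-threshold`; the parsing helper and variable names are hypothetical, and the log text is copied from the excerpt above.

```python
LOG = """\
2026-05-15 14:23:00 | requests=45000/sec | score-p50=0.5 | score-p95=3.2 | score-p99=8.1
2026-05-15 14:24:00 | requests=120000/sec | score-p50=6.1 | score-p95=14.2 | score-p99=28.5
2026-05-15 14:25:00 | requests=450000/sec | score-p50=12.3 | score-p95=19.8 | score-p99=32.1
2026-05-15 14:26:00 | blocked=380000 | allowed=70000 | score-threshold=5
"""


def parse_line(line):
    """Turn 'time | k=v | k=v' into a dict with numeric values."""
    timestamp, *fields = [part.strip() for part in line.split("|")]
    record = {"time": timestamp}
    for field in fields:
        key, value = field.split("=", 1)
        record[key] = float(value.replace("/sec", ""))
    return record


records = [parse_line(line) for line in LOG.strip().splitlines()]
threshold = records[-1]["score-threshold"]

# If p50 is above the threshold, more than half of all requests are being blocked.
median_over_threshold = [r["score-p50"] > threshold for r in records[:3]]
block_fraction = records[-1]["blocked"] / (records[-1]["blocked"] + records[-1]["allowed"])

print(median_over_threshold, round(block_fraction, 3))
```

A median score above the threshold means the majority of traffic is classified as hostile, which matches the roughly 84% block rate in the final line. It also means any legitimate user scoring above 5 is caught in the same net.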

Attack economics. Attacker cost: ~$50–500/month for a botnet service capable of 10 Gbps sustained. Origin defense cost: ~$500/month CDN bill for 10 Gbps sustained traffic. If the attacker can generate 100 Gbps, the origin cannot match. But a CDN with global PoPs ingests 100s of Tbps and spreads the cost across millions of customers — per-customer cost is tiny. The economics favor the attacker only if you defend alone. Sharing infrastructure (CDN) is the answer: you cannot out-scale a botnet; you can make the attack economically unattractive by making it fail.

Observability at attack time. Key signals: (1) request rate per second — 10x normal is suspicious; (2) geographic distribution of sources — all from one ASN is suspicious; (3) anomaly score p99 — normal users score <1, attack traffic scores >10; (4) cache-hit rate — attack traffic targeting unique parameters shows sudden drop. Alert thresholds: request rate 10x baseline, anomaly score p99 spike, source-IP entropy drop (100 IPs instead of 100,000), or cache-hit rate drop below 80% during a traffic spike. Human escalation if attack persists >5 minutes or exceeds 50 Gbps.
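The source-IP entropy signal and the escalation rule can be sketched as below. Shannon entropy over the observed source addresses is one reasonable way to measure source diversity; the function names are hypothetical, and the 5-minute and 50 Gbps limits are the ones given above.

```python
import math
from collections import Counter


def source_entropy_bits(source_ips):
    """Shannon entropy (in bits) of the source-IP distribution in a sample window."""
    counts = Counter(source_ips)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def needs_human(attack_minutes, attack_gbps):
    """Escalate when the attack persists beyond 5 minutes or exceeds 50 Gbps."""
    return attack_minutes > 5 or attack_gbps > 50


# Many distinct sources give high entropy; a handful of sources give low entropy.
diverse = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
narrow = [f"10.0.0.{i % 4}" for i in range(1000)]
print(round(source_entropy_bits(diverse), 2), round(source_entropy_bits(narrow), 2))
print(needs_human(attack_minutes=6, attack_gbps=10))
```

Alerting on a drop relative to a rolling baseline, not on an absolute entropy value, is usually more robust, because normal entropy differs widely between services.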

Pick the best fit

An e-commerce service is under sustained application-layer attack. You must pick a defense strategy.

Design challenge

Design the DDoS defense architecture for a 100 Gbps-capable video CDN serving global users. The CDN operates 50 PoPs in 30 countries.

  • Absorb 100+ Gbps attacks at the edge without exceeding 10% of any PoP's capacity.
  • Defend against volumetric (L3/L4), protocol (SYN floods), and application-layer (HTTP floods) attacks.
  • Maintain p50 latency < 50 ms and p99 < 200 ms for legitimate users during attack.
  • Detect and block new attack patterns within 60 seconds.
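As a starting point for the first requirement, this sketch sizes a single PoP from the share of the attack it attracts. The function name and the 20% skew figure are hypothetical; anycast distributes traffic by the attacker's network topology, not evenly, so the even-spread case is a lower bound.

```python
def required_pop_capacity_gbps(attack_gbps, pop_share, max_utilization):
    """Capacity one PoP needs so its share of the attack stays under max_utilization."""
    return attack_gbps * pop_share / max_utilization


# Attack spread evenly over 50 PoPs, each allowed to spend 10% of capacity on it.
even = required_pop_capacity_gbps(100, 1 / 50, 0.10)

# Hypothetical skew: one PoP sits close to the botnet and attracts 20% of the attack.
skewed = required_pop_capacity_gbps(100, 0.20, 0.10)

print(round(even), round(skewed))
```

The gap between the two results (about 20 Gbps versus about 200 Gbps for the hot PoP) suggests the design needs either uneven provisioning or a way to shift traffic away from overloaded PoPs.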
Why this works

Why is adaptive concurrency limiting the preferred answer for a Black Friday e-commerce attack, not WAF PL4? WAF PL4 makes preemptive decisions at the request level based on content patterns. If those patterns are wrong (5% false positive), you block paying customers. Adaptive concurrency limiting makes reactive decisions at the system level based on actual load. It never blocks a request preemptively — it only rejects when the system is already overloaded (which is bad regardless of attack). The tradeoff: adaptive limiting accepts that some attack requests go through until the system hits capacity, then rejects everything equally. That is acceptable when the alternative is blocking 5% of legitimate Black Friday customers.

Quiz

Your WAF is at Paranoia Level 2 and attacks get through (only 70% blocked). You raise it to PL4. Legitimate customers now complain (5% false positives). What is a better approach?

Recall before you leave
  1. Why is adaptive concurrency limiting preferable to WAF PL4 for a high-traffic production service under attack?
  2. During a Rapid Reset attack (CVE-2023-44487), why does the attack bypass HTTP/1.1-only rate limiting?
  3. What metrics should an on-call engineer monitor during a DDoS attack and what thresholds signal escalation?
Recap

Defense-in-depth against DDoS requires stacking multiple layers because no single defense stops all vectors. Anycast edge absorption distributes volumetric attacks across 330+ global PoPs; stateless L3/L4 filters drop amplification and SYN floods before they consume connection state; a WAF at PL2 detects known application-layer patterns with tolerable false positives; adaptive concurrency limiting at the origin rejects requests only on overload instead of blocking IPs preemptively. Attack economics favor the attacker only if you defend alone: a botnet rented for $50–500/month is defeated by a CDN that amortizes defense capacity across millions of customers. When automated defenses fail, observability (request rate, anomaly scores, cache-hit rate, source-IP entropy) gives on-call engineers the signal they need to add custom rules once an attack persists beyond 5 minutes or exceeds 50 Gbps.


Trademarks belong to their respective owners. Editorial reference only.