Networking & Protocols
Deployment tradeoffs and CPU cost
Your CDN rolls out HTTP/3. CPU utilization on edge servers spikes 25%, goodput on 1 Gbps paths drops by nearly half, and 4% of clients fall back silently to HTTP/2 with no error visible anywhere. QUIC is genuinely better for latency — and genuinely expensive to operate.
CPU cost: why user-space transport is expensive
QUIC runs in user space — every packet goes through application code, not the kernel’s NIC-to-socket fast path. The cost components are:
- Per-packet AES-GCM encryption: ~15–20 cycles per byte in software, and closer to 1 cycle per byte with AES-NI on modern x86 CPUs, paid in user space for every packet. Kernel TCP can hand TLS record encryption to the kernel or the NIC (kTLS, NIC TLS offload). QUIC generally cannot, because most NICs do not speak QUIC yet.
- Variable-length integer decoding: stream IDs, offsets, frame lengths, and the packet numbers carried in ACK frames use QUIC’s VarInt encoding. Each decode is a conditional branch in user space.
- Payload framing: STREAM, CRYPTO, and ACK frames are serialized/deserialized per packet in user space. Retransmitted frames must be re-serialized with new packet numbers.
- Syscall overhead: without batching, each UDP sendmsg() is a syscall. At 1 Gbps with 1500-byte packets that’s ~83 k syscalls/s on the sending core.
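The VarInt decode named in the list can be sketched as follows. This is a minimal Python illustration of the variable-length integer encoding from RFC 9000 Section 16, where the top two bits of the first byte select a 1, 2, 4, or 8 byte length. The function name `decode_varint` is illustrative and not taken from any particular QUIC library.

```python
def decode_varint(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one QUIC variable-length integer starting at buf[offset].

    Returns (value, number_of_bytes_consumed).
    """
    first = buf[offset]
    # The two most significant bits (00, 01, 10, 11) select 1, 2, 4, or 8 bytes.
    length = 1 << (first >> 6)
    if offset + length > len(buf):
        raise ValueError("buffer too short for varint")
    value = first & 0x3F  # drop the two length bits
    for b in buf[offset + 1 : offset + length]:
        value = (value << 8) | b
    return value, length
```

With the RFC's own examples, the single byte 0x25 decodes to 37 and the two-byte sequence 0x7bbd decodes to 15293. The branch on the length prefix is the per-field cost the bullet refers to.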
Measured impact: On a 1 Gbps LAN link at full rate, QUIC saturates a CPU core before the link fills. Goodput drops up to ~45% vs HTTP/2 on TCP. On slow or lossy paths (typical mobile) the network — not CPU — is the bottleneck, so the overhead is unmeasurable.
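The arithmetic behind the packet rate, and why the same per-packet cost matters at 1 Gbps but not at 10 Mbps, can be worked through in a few lines. This is a rough sketch: the 3 GHz clock is an assumed figure for illustration, and header overhead is ignored.

```python
def packets_per_second(link_bps: float, packet_bytes: int = 1500) -> float:
    """Packets needed per second to fill a link with fixed-size packets."""
    return link_bps / (packet_bytes * 8)


def cycles_per_packet(core_hz: float, link_bps: float, packet_bytes: int = 1500) -> float:
    """CPU cycles one core can spend on each packet while keeping the link full."""
    return core_hz / packets_per_second(link_bps, packet_bytes)


if __name__ == "__main__":
    CORE_HZ = 3e9  # assumed 3 GHz core, for illustration only
    for name, bps in [("1 Gbps LAN", 1e9), ("10 Mbps mobile", 10e6)]:
        pps = packets_per_second(bps)
        budget = cycles_per_packet(CORE_HZ, bps)
        print(f"{name}: {pps:,.0f} packets/s, {budget:,.0f} cycles available per packet")
```

Under these assumptions the per-packet budget at 10 Mbps is 100 times larger than at 1 Gbps, so the user-space cost disappears into idle CPU time on the slow link.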
Mitigations: UDP GSO, NIC offload, core affinity
UDP Generic Segmentation Offload (UDP GSO): Instead of one sendmsg() per 1500-byte packet, batch up to 64 KB of QUIC payload in a single syscall with a GSO hint. The kernel segments it into individual UDP datagrams before NIC DMA. Result: ~3–4 syscalls per 64 KB instead of 43. Cloudflare reports ~20% CPU gain from GSO alone.
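A minimal sketch of what the "GSO hint" looks like from user space on Linux, assuming a kernel with UDP GSO support (4.18 or later). The helper names `gso_cmsg` and `send_batched` are hypothetical. `UDP_SEGMENT` is the constant from `<linux/udp.h>`, defined by hand here because Python's `socket` module does not reliably export it.

```python
import socket
import struct

# Linux socket option / cmsg type from <linux/udp.h>.
UDP_SEGMENT = 103


def gso_cmsg(segment_size: int):
    """Ancillary-data tuple asking the kernel to split one large send
    into UDP datagrams of segment_size bytes each (a native uint16)."""
    return (socket.IPPROTO_UDP, UDP_SEGMENT, struct.pack("H", segment_size))


def send_batched(sock, packets, addr):
    """Send several already-encrypted QUIC packets with one sendmsg() call.

    All packets except the last must have the same length, because the
    kernel cuts the buffer at fixed segment_size boundaries. The combined
    buffer must also fit in a single UDP send (under 64 KB).
    """
    if not packets:
        raise ValueError("no packets to send")
    segment_size = len(packets[0])
    if any(len(p) != segment_size for p in packets[:-1]) or len(packets[-1]) > segment_size:
        raise ValueError("only the last packet may be shorter than the others")
    return sock.sendmsg([b"".join(packets)], [gso_cmsg(segment_size)], 0, addr)
```

On a kernel or NIC path without GSO support the `sendmsg()` call is expected to fail, so a real sender would keep a one-packet-per-call fallback.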
NIC QUIC offload (Intel E810 and newer): Parse QUIC long/short headers in silicon and steer packets to the correct QUIC connection by connection ID, without involving user-space demux. Stream IDs sit inside the encrypted payload, so header parsing alone cannot route by stream. Reduces per-packet interrupt overhead. Still experimental as of 2026 but available in cloud-optimized NICs.
Core affinity: Keep the QUIC process on the same physical core. QUIC state (connection table, congestion-control windows, key material) fits in L3 cache. Cross-core migrations cause cache misses on that state, adding ~50 ns per packet.
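Pinning can be sketched with the Linux scheduler affinity calls that Python exposes in `os`. This is an illustration only: production QUIC servers usually pin worker threads from their own runtime or via the service manager, and the helper names below are hypothetical.

```python
import os


def available_cores():
    """Cores this process may currently run on (Linux); [0] elsewhere."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return [0]


def pin_to_core(core: int):
    """Restrict the calling process to a single core.

    Returns the resulting affinity set, or None on platforms that do not
    provide sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {core})  # pid 0 means "this process"
    return os.sched_getaffinity(0)
```

A worker would call `pin_to_core(available_cores()[0])` once at startup, before it begins accepting connections.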
With GSO + affinity, QUIC’s per-byte CPU overhead vs kernel TCP drops from ~30% to 15–20%.
Why does QUIC's CPU overhead hurt throughput on a 1 Gbps LAN but not on a 10 Mbps mobile link?
UDP blocking and mandatory HTTP/2 fallback
~3–5% of networks block UDP outright — corporate proxies, certain ISP gateways, some LTE contexts. When QUIC is silently dropped (no ICMP error, just lost packets) the client’s only signal is a timeout — typically 1–3 seconds before fallback to TCP.
Browser racing: Modern browsers start both QUIC (UDP 443) and TCP (443) simultaneously. Whichever handshake completes first wins. On a UDP-blocked path TCP simply wins the race, so the UX penalty is close to zero instead of a full timeout. The downside: wasted effort on every connection if QUIC is consistently blocked on a path.
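The racing logic can be sketched with `asyncio`. The handshakes here are simulated with sleeps (`fake_handshake` is a stand-in, not a real QUIC or TCP client), and a blocked QUIC path is modeled as a handshake that never completes. A real implementation would also have to handle a winner that fails with an error, which this sketch does not.

```python
import asyncio


async def fake_handshake(name: str, delay: float, blocked: bool = False) -> str:
    """Stand-in for a real handshake: completes after `delay` seconds,
    or never if the path silently drops the packets."""
    if blocked:
        await asyncio.Event().wait()  # never set: models silent UDP drops
    await asyncio.sleep(delay)
    return name


async def race_handshakes(quic, tcp, timeout: float = 3.0) -> str:
    """Start both handshakes, return the first to finish, cancel the other."""
    tasks = [asyncio.create_task(quic), asyncio.create_task(tcp)]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    if not done:
        raise TimeoutError("neither handshake completed in time")
    return next(iter(done)).result()
```

Calling `asyncio.run(race_handshakes(fake_handshake("h3", 0.0, blocked=True), fake_handshake("h2", 0.01)))` models the blocked-UDP case: the TCP attempt finishes and the stalled QUIC attempt is cancelled.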
Alt-Svc discovery: HTTP/3 is advertised via Alt-Svc: h3=":443"; ma=3600 in a response served over TCP (typically HTTP/2). The browser caches this for ma seconds and attempts QUIC on the next connection. Without another hint, a first-time connection starts on TCP, discovers the header, and upgrades on subsequent connections.
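To make the header concrete, here is a small parser for the common Alt-Svc forms. It is a simplified sketch, not a full RFC 7838 parser: it splits on commas and semicolons without handling escaped characters, and the function name `parse_alt_svc` is illustrative. The 24-hour default for a missing `ma` parameter follows RFC 7838.

```python
def parse_alt_svc(value: str) -> list[dict]:
    """Parse an Alt-Svc header value such as  h3=":443"; ma=3600
    into a list of alternatives. The special value "clear" yields []."""
    value = value.strip()
    if value == "clear":
        return []
    alternatives = []
    for entry in value.split(","):
        if not entry.strip():
            continue
        parts = [p.strip() for p in entry.split(";")]
        protocol, _, authority = parts[0].partition("=")
        host, _, port = authority.strip('"').rpartition(":")
        alt = {
            "protocol": protocol,
            "host": host,        # empty string means "same host as the origin"
            "port": int(port),
            "max_age": 86400,    # default freshness: 24 hours
        }
        for param in parts[1:]:
            name, _, val = param.partition("=")
            if name == "ma":
                alt["max_age"] = int(val)
        alternatives.append(alt)
    return alternatives
```

A client would store the result keyed by origin and drop it after `max_age` seconds, which is why a short `ma` is a cheap way to limit the blast radius of a bad HTTP/3 rollout.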
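Fallback requirement: RFC 9114 (HTTP/3) says clients SHOULD attempt TCP-based versions of HTTP when a QUIC connection cannot be established; RFC 9000 itself defines only the transport. A deployment with no HTTP/2 path over TCP will break in blocked networks.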
A user on a corporate network reports that loading your site is 2 seconds slower than usual. QUIC is enabled. What is the most likely cause and how do you verify?
Deployment reality 2026
~21% of web traffic runs HTTP/3. ~35–40% of major sites advertise it via Alt-Svc. Adoption is bimodal:
- Mobile browsers default-enable HTTP/3 — latency and HoL elimination matter on cellular.
- Desktop/LAN still races QUIC vs TCP; TCP often wins because round-trips are short and HoL blocking is rare on fast paths.
Major CDNs (Cloudflare, Google, Akamai, Fastly) enable HTTP/3 by default. Browsers (Chrome, Safari, Firefox) support it. The adoption curve will steepen as UDP blocking becomes rarer, hardware offload reduces CPU cost, and fallback racing becomes universal.
QUIC packet trace — diagnose encryption-level and loss issues
$ quictrace capture.pcapng | head -20
timestamp=0.000 dcid=12345678 type=Initial pkt_num=0 frames=[Crypto[0..120], Padding]
timestamp=0.045 dcid=12345678 type=Initial pkt_num=1 frames=[Crypto[120..240], Padding] # Retransmit (no ACK in time)
timestamp=0.051 scid=87654321 dcid=12345678 type=Initial pkt_num=0 frames=[Crypto[0..200], Ack[0], Padding]
timestamp=0.052 dcid=87654321 type=Handshake pkt_num=0 frames=[Crypto[200..350]]
timestamp=0.100 scid=87654321 dcid=12345678 type=Handshake pkt_num=0 frames=[Crypto[350..400], Finished]
timestamp=0.101 dcid=87654321 type=1RTT pkt_num=0 frames=[Stream(0, fin, 4096 bytes)]
timestamp=0.151 scid=87654321 dcid=12345678 type=1RTT pkt_num=0 frames=[Stream(0, fin, [all bytes 0..4095]), Ack[0]]
The client sees one Initial retransmit before the server's Initial arrives. The Handshake then flows normally. What does this indicate?
Observability gaps
QUIC’s encryption prevents packet inspection — tcpdump shows only opaque blobs. Traditional network monitoring (per-flow HTTP request counts, slow client detection, misbehavior at flow level) breaks.
Adapting the stack:
- Applications export QUIC traces as JSON in the qlog format: connection lifecycle, packet numbers, CC events.
- Browsers report PerformanceResourceTiming.nextHopProtocol = "h3" for HTTP/3 connections.
- Cloud providers (AWS, GCP) are adding QUIC-aware flow metrics.
- eBPF probes on userspace QUIC sockets can reconstruct packet timing without decryption.
The trade-off is intrinsic: encryption buys privacy and security at the cost of operational opacity.
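One way to turn the browser signal into a metric is to aggregate `nextHopProtocol` values collected by real-user monitoring. The sketch below assumes each sample is a dict with a `nextHopProtocol` key, which is a hypothetical record shape rather than any specific RUM product's format. Empty values, which browsers can report for some resources, are counted as "unknown".

```python
from collections import Counter


def protocol_share(samples) -> dict:
    """Fraction of resource loads per negotiated protocol.

    samples: iterable of dicts carrying a "nextHopProtocol" value
    such as "h3", "h2", or "http/1.1".
    """
    counts = Counter(s.get("nextHopProtocol") or "unknown" for s in samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {protocol: n / total for protocol, n in counts.items()}
```

Tracking the "h3" share per device class over time gives the adoption view that packet inspection can no longer provide.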
A CDN must choose between deploying QUIC for a latency-sensitive API (small request/response, intercontinental) vs. a high-throughput static asset service (1 Gbps, LAN clients).
Design the deployment strategy for rolling out HTTP/3 and QUIC to a global CDN serving both mobile (90% traffic) and desktop (10% traffic), with current HTTP/2 TCP infrastructure.
- Existing HTTP/2 deployment is stable and well-tuned; no breaking changes to TCP paths.
- Mobile clients are diverse (iOS 14+, Android 5+, various browsers); some networks are QUIC-blocked.
- CPU budget for QUIC: no more than 20% overhead vs. current HTTP/2.
- Observability: measure QUIC adoption rate, fallback rate, and latency improvement per client device.
- Mobile benefits from QUIC's latency; desktop from TCP's throughput + familiarity. Different tiers get different protocols.
- Alt-Svc discovery requires a prior HTTP/2 request; race both protocols in parallel for new clients to avoid fallback latency.
- QUIC-blocked networks exist; explicit fallback after a short timeout (2–3s) avoids hanging the user.
- UDP GSO is critical for CPU cost control. Without it, QUIC is too expensive for high-throughput CDNs.
- Measure fallback rate continuously. If > 5%, investigate whether it is genuine UDP blocking or a deployment bug.
- Connection semantics differ: HTTP/2 over TCP has persistent TCP state; HTTP/3 over QUIC moves state to QUIC. Ensure your load balancer and observability stack understand both.
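The fallback-rate check from the list above can be sketched as a small aggregation. The record shape (a network identifier plus a flag for "QUIC was attempted but the request ended up on TCP") is an assumption for illustration. Breaking the rate down per network is one plausible way to tell concentrated UDP blocking apart from a fleet-wide deployment bug.

```python
from collections import defaultdict

FALLBACK_THRESHOLD = 0.05  # the 5% investigation threshold from the list above


def overall_fallback_rate(records) -> float:
    """records: list of (network_id, fell_back) pairs, one per QUIC attempt."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for _, fell_back in records if fell_back) / len(records)


def fallback_by_network(records) -> dict:
    """Fallback rate per network, to see whether failures are concentrated."""
    totals = defaultdict(lambda: [0, 0])  # network -> [attempts, fallbacks]
    for network, fell_back in records:
        totals[network][0] += 1
        totals[network][1] += int(fell_back)
    return {net: fallbacks / attempts for net, (attempts, fallbacks) in totals.items()}


def needs_investigation(records) -> bool:
    """True when the overall fallback rate exceeds the threshold."""
    return overall_fallback_rate(records) > FALLBACK_THRESHOLD
```

If a handful of networks sit near 100% while the rest are near zero, genuine UDP blocking is the more likely explanation; a uniform rise across networks points at the deployment itself.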
- CPU overhead vs kernel TCP: 15–30% per byte
- Goodput loss on 1 Gbps fast links: up to ~45% vs HTTP/2
- UDP GSO CPU gain (Cloudflare): ~20% per connection
- Networks blocking UDP: ~3–5%
- Web traffic running HTTP/3 (2026): ~21%
- Major sites advertising HTTP/3: ~35–40%
- Browser QUIC+TCP racing timeout: 1–3 s before TCP wins
Why this works
Why not add QUIC to the kernel? Linux has experimental in-kernel QUIC patches, but the community is divided. Kernel TCP benefits from NIC offload (kTLS, GRO, RSS) built over decades. Replicating this for QUIC would take years and couples QUIC’s evolution to the kernel release cycle — the opposite of what RFC 9000 intended. User-space QUIC can ship new CC algorithms weekly; kernel QUIC cannot. The CPU cost is the price of agility.
- 01. Why does QUIC's CPU overhead hurt 1 Gbps LAN throughput but not 10 Mbps mobile throughput?
- 02. What is UDP GSO and why does Cloudflare report ~20% CPU gain from it?
- 03. A CDN sees 4% of QUIC connections fail silently with no error. What is the likely cause and fix?
QUIC’s user-space architecture delivers latency and HoL wins at a real CPU cost: 15–30% more per byte than kernel TCP, and up to ~45% goodput loss on fast 1 Gbps links where the CPU, not the network, is the bottleneck. UDP GSO batches syscalls and recovers ~20% CPU; NIC offload and core affinity push it further. About 3–5% of networks block UDP silently, which makes browser-side TCP racing and an HTTP/2 fallback path essential. As of 2026, ~21% of web traffic runs HTTP/3 with bimodal adoption: mobile benefits clearly, while on desktop QUIC races TCP and often loses to it. QUIC encryption breaks traditional packet inspection; qlog, browser timing APIs, and eBPF probes are the replacement observability stack. The right deployment strategy: QUIC for latency-sensitive WAN and mobile paths, TCP for high-throughput LAN and static assets.