Networking & Protocols
Bufferbloat and congestion
Someone on the team is on a video call and their voice keeps freezing — but the speed test says 300 Mbps, green across the board. Meanwhile a cloud backup is uploading in the background. The throughput is fine. The latency is not. That gap has a name.
The buffer that ate your latency
A network buffer exists to smooth bursts: when packets arrive faster than the link drains them, they wait in a queue instead of being dropped. That is healthy in moderation. Bufferbloat is the failure that comes from too much of it.
The edge of the Internet, meaning DOCSIS cable modems, LTE basebands, and home routers, is where the fast side (your gigabit LAN) meets the slow side (the metered uplink). That mismatch is the bottleneck, and vendors over-provisioned the buffer at exactly that point. A modem can hold 100–500 ms of buffered data. Under saturation, every packet that arrives behind a full queue waits the full depth of that buffer.
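The link between buffer size and delay is plain arithmetic: a full buffer takes its size divided by the bottleneck rate to drain. The sketch below works that out for a few uplink speeds. The 256 KiB buffer and the uplink rates are hypothetical figures chosen for illustration, not measurements of any particular modem.

```python
def queue_delay_ms(buffer_bytes: int, link_rate_bps: float) -> float:
    """Time for a full buffer to drain at the bottleneck rate, in milliseconds."""
    return buffer_bytes * 8 / link_rate_bps * 1000


# Hypothetical figures: one 256 KiB buffer in front of uplinks of various speeds.
BUFFER_BYTES = 256 * 1024

for uplink_mbps in (5, 10, 20, 40):
    delay = queue_delay_ms(BUFFER_BYTES, uplink_mbps * 1_000_000)
    print(f"{uplink_mbps:>3} Mbps uplink: {delay:6.1f} ms of queueing when full")
```

Under these assumptions the same buffer costs roughly 210 ms at 10 Mbps and roughly 420 ms at 5 Mbps: the slower the uplink, the more latency a fixed-size buffer can hide.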
The symptom is unmistakable once you know it: idle ping is 20 ms, then someone starts an upload and ping climbs to 300–500 ms. Throughput looks perfect — the link is fully used — but every interactive packet (a DNS lookup, a SYN, a video frame) sits behind the bulk transfer. Calls stutter, games lag, web pages stall, all while the speed test reads green.
Why TCP needs the drop
The deeper cause is a mismatch between buffers and congestion control. Loss-based TCP (CUBIC, the Linux default for years) learns that the link is full only from packet loss, unless ECN marking is in play. It keeps increasing its sending window until a packet is dropped, reads that drop as “the pipe is full,” and backs off.
A correctly sized buffer drops a packet early, when the queue is shallow, so TCP gets the congestion signal while latency is still low. An oversized buffer absorbs the packet instead. TCP sees no loss, assumes there is room, and pushes more data — which the giant buffer also absorbs. The window inflates until the buffer is finally, completely full, hundreds of milliseconds deep. Vendors thought “never drop a packet” was the safe choice. It is the exact opposite: by hiding the congestion signal, over-buffering guarantees the queue runs deep.
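A toy model makes the effect of buffer depth concrete. The sketch below is not CUBIC. It is a simplified additive-increase, multiplicative-decrease loop that advances one step per round trip, with a hypothetical path that holds 50 packets in flight at a 20 ms base RTT. The only thing that changes between the two runs is the buffer size.

```python
def peak_queue_delay_ms(buffer_pkts: int, bdp_pkts: int = 50,
                        base_rtt_ms: float = 20.0, rounds: int = 5000) -> float:
    """Worst queueing delay seen by a loss-based sender, one step per round trip."""
    cwnd = 1.0   # congestion window, in packets
    peak = 0.0
    for _ in range(rounds):
        # Packets beyond what the link itself holds in flight sit in the buffer.
        queue = max(0.0, cwnd - bdp_pkts)
        if queue > buffer_pkts:
            cwnd *= 0.7   # buffer overflow means a drop: multiplicative back-off
            continue
        # The queue drains at bdp_pkts per base RTT, so delay scales with depth.
        peak = max(peak, queue / bdp_pkts * base_rtt_ms)
        cwnd += 1.0       # no loss seen: keep probing for more bandwidth
    return peak


shallow = peak_queue_delay_ms(buffer_pkts=10)
deep = peak_queue_delay_ms(buffer_pkts=1000)
print(f"shallow buffer: peak queueing delay {shallow:.0f} ms")
print(f"deep buffer:    peak queueing delay {deep:.0f} ms")
```

In this model the sender always grows its window until the buffer overflows, so the peak delay is set entirely by how much the buffer can absorb: about 4 ms with the 10-packet buffer and about 400 ms with the 1000-packet one.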
- Edge buffer depth (DOCSIS / LTE): 100–500 ms of queued data
- Idle vs saturated ping (no AQM): 20 ms → 300–500 ms
- Saturated ping with fq_codel / CAKE: stays under ~30 ms
- CUBIC over GEO (~600 ms RTT): starves, because several windows are sent before the loss signal returns
- BBR throughput recovery on GEO: most of link capacity, loss-independent
- LEO RTT (Starlink, ~550 km orbit): ~50–60 ms, so CUBIC behaves as it does on a terrestrial link
Why it persists
Bufferbloat has been understood since the 1980s, yet it is still everywhere. Three reasons:
- Wrong incentive. Vendors treated packet drops as a defect and over-provisioned buffers to “avoid drops” — the wrong fix, because the drop is the signal.
- Invisible metric. Residential users measure throughput, not latency under load. A speed test runs on an otherwise-idle link and never sees the bloat. The problem only surfaces when an interactive app fights a bulk transfer.
- Deployment lag. The cure ships in firmware. It needs the router to run Smart Queue Management (SQM), and a lot of carrier-issued kit never gets that update.
The fix itself is not exotic.
Active Queue Management
Active Queue Management (AQM) — marketed as Smart Queue Management on home routers — drops or ECN-marks packets early and fairly, before the queue grows deep, restoring the congestion signal TCP needs.
- fq_codel (RFC 8290): flow-queued CoDel. CoDel (“Controlled Delay”) watches how long packets dwell in the queue, not how many there are; once dwell time has stayed above a target (~5 ms) for a full interval (100 ms by default), it starts dropping. The flow-queue half hashes flows into separate sub-queues so one bulk upload cannot starve a latency-sensitive flow.
- PIE (RFC 8033): Proportional Integral controller Enhanced. Estimates queue delay and drops with a probability tuned to hold delay near a target. A DOCSIS-specific variant of PIE (RFC 8034) is the AQM mandated by DOCSIS 3.1, so it ships inside modern cable modems.
- CAKE: Common Applications Kept Enhanced. fq_codel plus built-in bandwidth shaping, per-host fairness, and DOCSIS/ATM framing compensation. It is the SQM most home-router projects (OpenWrt) reach for.
The shared idea: cap queueing delay, not queue length. Under full saturation, a link with CAKE or fq_codel keeps ping under ~30 ms instead of letting it balloon to 500 ms.
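To show what “cap queueing delay, not queue length” looks like in code, here is a deliberately simplified sketch of a CoDel-style drop decision made at dequeue time. It assumes the commonly cited 5 ms target and 100 ms interval. The class name and structure are illustrative, and several details of the real algorithm in RFC 8289 (how the drop count carries over between episodes, the minimum-backlog check) are left out.

```python
import math
from dataclasses import dataclass
from typing import Optional

TARGET_S = 0.005    # acceptable standing queue delay
INTERVAL_S = 0.100  # how long delay must stay above target before dropping


@dataclass
class SimplifiedCoDel:
    first_above: Optional[float] = None  # deadline set when delay first exceeds target
    dropping: bool = False
    drop_next: float = 0.0
    count: int = 0                       # drops in the current dropping episode

    def should_drop(self, now: float, enqueued_at: float) -> bool:
        sojourn = now - enqueued_at      # time this packet spent in the queue
        if sojourn < TARGET_S:
            # Delay is back under target: leave the dropping state.
            self.first_above = None
            self.dropping = False
            return False
        if self.first_above is None:
            # First packet above target: start the clock, do not drop yet.
            self.first_above = now + INTERVAL_S
            return False
        if not self.dropping:
            if now < self.first_above:
                return False             # above target, but not for long enough
            self.dropping = True
            self.count = 1
            self.drop_next = now + INTERVAL_S / math.sqrt(self.count)
            return True
        if now >= self.drop_next:
            # Still above target: drop again, at a rate that rises with count.
            self.count += 1
            self.drop_next = now + INTERVAL_S / math.sqrt(self.count)
            return True
        return False


codel = SimplifiedCoDel()
for now in (1.00, 1.05, 1.20, 1.25, 1.31):
    # Every packet here has waited 20 ms, well above the 5 ms target.
    print(f"t={now:.2f}s drop={codel.should_drop(now, now - 0.020)}")
```

Note that the decision never looks at how many packets are queued. A short burst that clears within the interval is left alone; only a standing queue triggers drops.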
Why this works
Why SQM has to shape below line rate. AQM can only manage a queue it actually owns. If the bottleneck buffer lives inside the ISP’s modem, your router’s queue is never the one that fills — packets sail through your router and pile up downstream where you have no control. So SQM deliberately shapes egress to ~90–95% of the real uplink rate. That moves the bottleneck back into your router, where fq_codel or CAKE governs it. You trade a sliver of peak throughput for a queue you can actually discipline — almost always the right trade for an interactive household.
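The shaping step can be sketched as a token bucket set slightly below the measured uplink rate. This is an illustration of the idea only, not how CAKE or any router firmware implements its shaper. The names `shaped_rate_bps` and `TokenBucketShaper`, the 0.92 fraction, and the burst size are all hypothetical.

```python
def shaped_rate_bps(measured_uplink_bps: float, fraction: float = 0.92) -> float:
    """Shape a little below the real uplink rate so the queue forms locally."""
    return measured_uplink_bps * fraction


class TokenBucketShaper:
    """Release packets no faster than rate_bps; anything else waits locally."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)
        self.last = 0.0

    def try_send(self, now: float, packet_bytes: int) -> bool:
        # Refill tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True
        return False  # packet stays in the router's queue, where the AQM sees it


# Hypothetical 20 Mbps uplink, shaped to 92% of that rate.
shaper = TokenBucketShaper(shaped_rate_bps(20_000_000), burst_bytes=3000)
print(shaper.try_send(0.000, 1500), shaper.try_send(0.000, 1500),
      shaper.try_send(0.000, 1500))
```

Because the shaper releases traffic slightly slower than the modem can carry it, the modem's own buffer stays nearly empty and the only standing queue is the one the router controls.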
When the buffer is not the problem: BBR vs CUBIC
Congestion control also breaks on long-RTT paths, and there the answer is a different algorithm rather than a different queue.
LEO satellite (Starlink, ~550 km orbit, ~50–60 ms total RTT) behaves like a terrestrial link — loss-based CUBIC works fine. GEO satellite (~36,000 km orbit, ~600 ms RTT) does not. With a 600 ms feedback loop, by the time a loss signal travels back to the sender, multiple full windows have already been transmitted. Loss-based control reacts far too late; CUBIC ramps slowly and never fills the pipe — it starves.
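The gap between LEO and GEO shows up clearly in the bandwidth-delay product (BDP), the amount of data that must be in flight to keep the link full. The sketch below compares the two for a hypothetical 50 Mbps link. It also counts how many loss-free round trips of window doubling from a 10-packet initial window would be needed to reach the BDP, which is an idealised lower bound and not a model of CUBIC itself.

```python
import math

MSS_BYTES = 1460  # typical TCP payload per packet with a 1500-byte MTU


def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link busy for one round trip."""
    return rate_bps * rtt_s / 8


def slow_start_rounds(bdp_pkts: float, initial_window: int = 10) -> int:
    """Round trips of loss-free doubling needed for the window to reach the BDP."""
    if bdp_pkts <= initial_window:
        return 0
    return math.ceil(math.log2(bdp_pkts / initial_window))


RATE_BPS = 50_000_000  # hypothetical link rate

for name, rtt_s in (("LEO", 0.055), ("GEO", 0.600)):
    bdp = bdp_bytes(RATE_BPS, rtt_s)
    rounds = slow_start_rounds(bdp / MSS_BYTES)
    print(f"{name}: BDP {bdp / 1e6:.2f} MB, {rounds} round trips "
          f"({rounds * rtt_s:.2f} s) of loss-free doubling to fill it")
```

At the same link rate, the GEO path needs roughly eleven times as much data in flight, and every step of the ramp costs 600 ms instead of about 55 ms. Any loss along the way resets progress that took seconds to build.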
BBR (Bottleneck Bandwidth and Round-trip propagation time) sidesteps loss entirely. It actively probes the path’s bandwidth and minimum RTT, builds a model of the bottleneck, and paces sends to that model. Because it does not wait for a drop, the 600 ms feedback delay no longer cripples it, and BBR recovers most of the GEO link’s capacity. The same loss-independence makes BBR strong on any lossy path — cellular, congested Wi-Fi.
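The model-building idea can be sketched in a few lines: keep the highest recent delivery rate as the bottleneck bandwidth estimate, keep the lowest observed RTT as the propagation delay estimate, and derive the pacing rate and in-flight target from those two numbers. This is a heavily simplified illustration, not the BBR state machine. The `PathModel` class, its window length, and the sample values are assumptions made for the example.

```python
from collections import deque


class PathModel:
    """Toy path model: max-filtered bandwidth and min-filtered RTT."""

    def __init__(self, bw_window: int = 10):
        self.bw_samples = deque(maxlen=bw_window)  # recent delivery rates, bytes/s
        self.min_rtt_s = float("inf")

    def on_ack(self, delivered_bytes: int, interval_s: float, rtt_s: float) -> None:
        # Delivery rate is measured from ACKs, so no packet loss is required.
        self.bw_samples.append(delivered_bytes / interval_s)
        self.min_rtt_s = min(self.min_rtt_s, rtt_s)

    @property
    def btl_bw(self) -> float:
        return max(self.bw_samples, default=0.0)

    def bdp_bytes(self) -> float:
        """Estimated data needed in flight to fill the path without queueing."""
        if not self.bw_samples:
            return 0.0
        return self.btl_bw * self.min_rtt_s

    def pacing_rate(self, gain: float = 1.0) -> float:
        """Send rate in bytes/s; a gain above 1 probes for more bandwidth."""
        return gain * self.btl_bw


model = PathModel()
model.on_ack(delivered_bytes=500_000, interval_s=0.5, rtt_s=0.75)
model.on_ack(delivered_bytes=250_000, interval_s=0.5, rtt_s=0.50)
print(model.btl_bw, model.min_rtt_s, model.bdp_bytes())
```

Because both estimates come from acknowledgements of delivered data, the sender has a usable target after a few round trips, and a stray loss does not change the model.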
These two threads meet at the cell tower: mobile operators deploy CAKE-style AQM at the radio cell so the buffer for users sharing one cell stays disciplined. The pattern to remember: BBR + AQM + small buffers wins on high-latency or lossy paths; CUBIC + a sanely sized buffer is fine on terrestrial wired links.
- A modem buffers 300 ms of data and 'never drops a packet.' Why does this make latency worse, not better?
- A household runs video calls that stutter whenever a cloud backup uploads. Pick the fix.
- A remote worker reports video calls freeze every afternoon. Speed tests look perfect. Trace the diagnosis.
- Explain bufferbloat and why it persists despite being a known problem since the 1980s.
- What does Active Queue Management do differently from a plain FIFO buffer? Name three AQM algorithms.
- Why does loss-based CUBIC starve over a GEO satellite link while BBR recovers most of the throughput?
Bufferbloat is oversized buffering at the network edge — DOCSIS modems, LTE basebands, home routers hold 100–500 ms of queued data. Loss-based TCP like CUBIC needs an early packet drop to sense congestion; a giant buffer absorbs the drop instead, so TCP keeps inflating its window until ping balloons from 20 ms to 300–500 ms while throughput still looks green. It persists because vendors chased “no drops,” users measure throughput not latency, and the fix ships in firmware. Active Queue Management — fq_codel (RFC 8290), PIE (RFC 8033), CAKE — drops or marks early and fairly, capping queueing delay under ~30 ms even at saturation; SQM shapes below line rate to own the bottleneck queue. On long-RTT paths the answer is a different algorithm: BBR probes bandwidth and RTT instead of waiting for loss, recovering throughput on GEO satellite where CUBIC starves.