Networking & Protocols
BBR, production observability, and beyond TCP
A video-streaming service ships 4K segments over intercontinental cellular paths (150 ms RTT, 1% random loss). CUBIC throughput collapses to a fraction of the link’s capacity. Switching to BBR sustains throughput near line rate on the same path. The difference is not bandwidth — it is a different answer to the question “what does a dropped packet mean?”
BBR vs CUBIC vs Reno
Reno (classic): halve cwnd on loss, additive increase per RTT. Simple, widely implemented.
CUBIC (Linux default since 2.6.19): a cubic-curve growth function with a concave probe below the previous maximum and a convex probe above it. On loss it cuts cwnd to about 0.7× its previous value, a gentler reduction than Reno's halving. Recovers throughput on high-BDP paths faster than Reno.
BBR (Bottleneck Bandwidth and RTT): abandons loss-based signalling entirely. It estimates the path’s bottleneck bandwidth (via delivered-byte rate) and minimum RTT (via packet timestamps) directly, then paces sends to match. Loss is treated as ambiguous noise — it could be congestion or random drop. On 1% random loss, BBR sustains near-line-rate throughput; CUBIC settles into a small fraction.
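To put rough numbers on the contrast above, here is a minimal sketch. It assumes the Mathis et al. steady-state approximation for Reno-style loss-based control (throughput ≈ MSS/RTT × 1.22/√p), which the text does not cite and which only approximates CUBIC, plus a hypothetical 100 Mbit/s bottleneck and a 1460-byte MSS. The second function computes the bandwidth-delay product, the amount of in-flight data a BBR-style model aims to keep on the path.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float,
                          c: float = 1.22) -> float:
    """Approximate ceiling for Reno-style AIMD under random loss.

    Assumption: Mathis et al. model, not CUBIC's exact response function.
    """
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


def bdp_bytes(bottleneck_bps: float, min_rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return bottleneck_bps * min_rtt_s / 8


if __name__ == "__main__":
    # Path from the opening scenario: 150 ms RTT, 1% random loss.
    loss_based = mathis_throughput_bps(mss_bytes=1460, rtt_s=0.150, loss_rate=0.01)
    print(f"loss-based ceiling: {loss_based / 1e6:.2f} Mbit/s")

    # Hypothetical 100 Mbit/s bottleneck (an assumption for illustration).
    print(f"BDP to fill 100 Mbit/s: {bdp_bytes(100e6, 0.150):,.0f} bytes")
```

Under these assumptions the loss-based ceiling comes out at roughly 0.95 Mbit/s regardless of link capacity, because the model depends only on MSS, RTT, and loss rate. A sender that paces to a measured bottleneck rate is not bound by that term.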
BBRv3 (Google’s 2023 release): fixes BBRv2’s premature-probe and convergence bugs. Deployed across google.com, YouTube, Cloudflare, and Netflix edges. Not merged into mainline Linux as of early 2025; requires a custom kernel or third-party backport. Mainline Linux 6.x ships BBR 1.x.
Practical guidance: stick with CUBIC for general-purpose servers on standard kernels, and for datacenter and regional ISP paths where loss is rare and genuinely signals congestion. Switch to BBR for cross-continental WAN, cellular, or satellite paths where 0.5–2% random loss is normal and CUBIC's window cuts collapse throughput.
Set per-socket: setsockopt(fd, SOL_TCP, TCP_CONGESTION, "bbr", 3), where the last argument is the length of the algorithm name. System-wide: sysctl -w net.ipv4.tcp_congestion_control=bbr.
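The per-socket option can be sketched from Python as follows. This assumes a Linux host where `socket.TCP_CONGESTION` is defined; the helper name `set_congestion_control` is hypothetical. If the requested algorithm is not loaded or not permitted for the calling user, the kernel rejects the request and the socket keeps the system default, so the helper reads the option back and reports what is actually in effect.

```python
import socket


def set_congestion_control(sock: socket.socket, name: str) -> str:
    """Try to select a congestion control algorithm; return the one in effect.

    Returns "unknown" on platforms that do not expose TCP_CONGESTION.
    """
    opt = getattr(socket, "TCP_CONGESTION", None)  # Linux-specific constant
    if opt is None:
        return "unknown"
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, name.encode("ascii"))
    except OSError:
        # Algorithm not available or not allowed: keep the current default.
        pass
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return "unknown"
    # The kernel returns a NUL-padded name.
    return raw.split(b"\x00", 1)[0].decode("ascii") or "unknown"


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("congestion control in effect:", set_congestion_control(s, "bbr"))
```

Reading the value back matters in practice: a silent fallback to the default would otherwise look like a successful switch.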
Trace a BBR congestion episode on a lossy cellular path and explain why BBR sustains throughput while CUBIC collapses.
A video-streaming service ships 4K segments over long-RTT cellular networks (RTT often greater than 150 ms, sporadic loss 0.5–2%). Pick the TCP congestion control algorithm + tuning combination.
Production observability
ss -tin dumps live cwnd, RTT, RTT variance, retransmits, and backoff state per connection. It reads state the kernel already maintains, so the overhead is negligible and it is safe to run on any production host.
$ ss -tin state established '( dport = :443 )'
# Output includes: cwnd:10 ssthresh:2147483647 rtt:42.8/8.5 acked:142 retrans:0/0
Key fields:
- cwnd: current congestion window, in MSS-sized segments
- rtt/rttvar: smoothed RTT and its variance, in milliseconds
- retrans: retransmissions currently outstanding / total retransmissions on the connection
- ssthresh: slow-start threshold (2147483647 = infinity = in slow start)
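The fields above can be pulled out of an ss -tin info line programmatically. The following is a minimal sketch, not a complete parser: it only handles `key:value` tokens, the function name and output keys are hypothetical, and the reading of `retrans:a/b` as outstanding/total is my understanding of the iproute2 output rather than something the text spells out.

```python
import re


def parse_ss_info(line: str) -> dict:
    """Extract a few congestion-related fields from one `ss -tin` info line."""
    # Collect every key:value token; values may contain '/', e.g. rtt:42.8/8.5
    tokens = dict(re.findall(r"(\w+):(\S+)", line))
    parsed = {}
    if "cwnd" in tokens:
        parsed["cwnd_segments"] = int(tokens["cwnd"])
    if "ssthresh" in tokens:
        parsed["ssthresh"] = int(tokens["ssthresh"])
    if "rtt" in tokens:
        srtt, rttvar = tokens["rtt"].split("/")
        parsed["srtt_ms"] = float(srtt)
        parsed["rttvar_ms"] = float(rttvar)
    if "retrans" in tokens:
        # Assumed order: currently outstanding / lifetime total.
        outstanding, total = tokens["retrans"].split("/")
        parsed["retrans_outstanding"] = int(outstanding)
        parsed["retrans_total"] = int(total)
    return parsed


if __name__ == "__main__":
    sample = "cwnd:10 ssthresh:2147483647 rtt:42.8/8.5 acked:142 retrans:0/0"
    print(parse_ss_info(sample))
```

Fields that ss omits for a given connection are simply absent from the result, so callers should use `.get()` rather than indexing.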
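ss -s summarises socket counts (established, closed, orphaned, TIME-WAIT). It does not break out CLOSE-WAIT, so count that state separately with ss -tan state close-wait and watch for spikes.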
nstat exposes kernel counters from /proc/net/snmp and /proc/net/netstat, such as TcpRetransSegs, TcpExtTCPSlowStartRetrans, and TcpExtTCPDSACKRecv. For long-term monitoring, per-connection data from tcp_diag can feed Prometheus exporters, and node_exporter exposes most of the aggregate counters; the SLO-relevant metrics are retransmission rate, p95/p99 RTT, and the CLOSE-WAIT:ESTABLISHED ratio.
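As an illustration of the retransmission-rate metric, the sketch below parses text in the /proc/net/snmp layout (a header line of names followed by a line of values per protocol) and computes retransmitted segments as a fraction of segments sent between two snapshots. The snapshot strings are made-up, trimmed samples rather than real host data, and the function names are hypothetical.

```python
def parse_proc_net_snmp(text: str) -> dict:
    """Parse /proc/net/snmp-style text into {"TcpOutSegs": 123, ...}."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Tcp: Name1 Name2" then "Tcp: 1 2".
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{proto.strip()}{name}"] = int(number)
    return counters


def retransmit_rate(before: dict, after: dict) -> float:
    """Retransmitted segments / segments sent over the sampling interval."""
    sent = after["TcpOutSegs"] - before["TcpOutSegs"]
    retrans = after["TcpRetransSegs"] - before["TcpRetransSegs"]
    return retrans / sent if sent > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical snapshots, trimmed to the two columns used here.
    t0 = parse_proc_net_snmp("Tcp: OutSegs RetransSegs\nTcp: 100000 150\n")
    t1 = parse_proc_net_snmp("Tcp: OutSegs RetransSegs\nTcp: 150000 400\n")
    print(f"retransmission rate: {retransmit_rate(t0, t1):.2%}")
```

Working from deltas rather than absolute counters is the important design choice: the counters are cumulative since boot, so a single reading says little about current path health.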
ss output during an outage — diagnose the issue
$ ss -tan state established | wc -l
12384
$ ss -tan state close-wait | wc -l
9821
$ ss -tan state time-wait | wc -l
1247
$ ss -s
Total: 12500
TCP: 23552 (estab 12384, closed 8920, orphaned 2, timewait 1247)
$ ps -p 1234 -o pid,stat,rss,vsz,cmd
PID STAT RSS VSZ CMD
1234 Ssl 8392000 12000000 /usr/bin/app-server
12k ESTABLISHED + 9.8k CLOSE-WAIT sockets and RSS is climbing. What is the bug and the fix?
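One plausible reading of the numbers above: a socket sits in CLOSE-WAIT when the peer has sent its FIN but the local application has not yet called close(), so a large CLOSE-WAIT count usually points at a handler that never closes on end-of-stream. The sketch below shows the closing pattern under that assumption. The function name is hypothetical, and the demo uses a local socketpair rather than real TCP, since the end-of-stream handling is the same.

```python
import socket


def serve_connection(conn: socket.socket) -> int:
    """Read until the peer closes, then always release the socket.

    Returns the number of bytes received.
    """
    total = 0
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                # b"" means the peer closed its side. Returning without
                # close() here is what leaves a TCP socket in CLOSE-WAIT.
                break
            total += len(chunk)
    finally:
        conn.close()  # runs on normal exit and on exceptions
    return total


if __name__ == "__main__":
    peer, local = socket.socketpair()
    peer.sendall(b"hello")
    peer.close()                        # peer finishes first
    print(serve_connection(local))      # bytes received
    print(local.fileno())               # -1 once the socket is closed
```

Putting close() in a `finally` block (or using the socket as a context manager) covers the error paths too, which is where leaked descriptors tend to originate.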
RST semantics
A TCP RST is an abrupt connection close — no FIN exchange, no TIME-WAIT, the receiver drops connection state immediately. It occurs when:
- A packet arrives for a port no one is listening on.
- The application calls close() on a socket that still has unread data, or on a socket with SO_LINGER enabled and a linger time of 0.
- The peer sends garbage that violates the state machine.
- A stateful firewall decides the connection is idle.
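The SO_LINGER case in the list above can be reproduced on loopback. This is a minimal sketch assuming Linux-style behaviour, where enabling SO_LINGER with a zero timeout makes close() send an RST instead of a FIN; the helper names are hypothetical.

```python
import socket
import struct


def close_with_rst(sock: socket.socket) -> None:
    """Abortive close: l_onoff=1, l_linger=0, then close()."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def peer_sees_reset() -> bool:
    """Return True if the accepting side observes a reset after the abort."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # ephemeral port on loopback
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    conn.settimeout(2.0)
    try:
        close_with_rst(client)
        conn.recv(1)                     # a FIN would return b"" here
        return False
    except ConnectionResetError:
        return True                      # RST surfaced as ECONNRESET
    except OSError:
        return False                     # timeout or other socket error
    finally:
        conn.close()
        listener.close()


if __name__ == "__main__":
    print("peer saw RST:", peer_sees_reset())
```

An orderly close() without the linger option would instead make recv() return an empty bytes object, which is how the receiving side tells the two shutdown styles apart.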
RST attacks: an attacker who can guess sequence numbers within the receive window can forge an RST and tear down an established connection. RFC 5961 tightens the acceptable RST window. Long-lived idle connections (BGP sessions, SSH) are most vulnerable.
MPTCP (RFC 8684)
Multipath TCP carries one logical connection across multiple paths (Wi-Fi + cellular, multi-NIC server). The MPTCP handshake adds an MP_CAPABLE option in SYN/SYN-ACK/ACK; if both ends support it the first sub-flow is established, and additional sub-flows can be opened on different interfaces via MP_JOIN. iOS has used MPTCP since iOS 7 for Siri. Linux 5.6+ ships RFC 8684. Adoption outside Apple is limited, in part because middleboxes that strip or do not understand the option force the connection to fall back to plain TCP.
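On Linux, an application opts in to MPTCP by requesting the MPTCP protocol when creating the socket. The sketch below assumes the Linux protocol number 262 for IPPROTO_MPTCP (exposed as `socket.IPPROTO_MPTCP` in recent Python versions) and falls back to plain TCP when the kernel refuses. The helper name is hypothetical, and this local fallback is separate from the on-the-wire fallback described above.

```python
import socket

# Linux value; newer Python versions define the constant directly.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def open_stream_socket(prefer_mptcp: bool = True):
    """Return (socket, used_mptcp), trying MPTCP first if requested."""
    if prefer_mptcp:
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
            return sock, True
        except OSError:
            # Kernel built without MPTCP, or MPTCP disabled by sysctl.
            pass
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
    return sock, False


if __name__ == "__main__":
    s, used_mptcp = open_stream_socket()
    print("MPTCP socket" if used_mptcp else "plain TCP socket")
    s.close()
```

Either way the application gets an ordinary stream socket, so the rest of the code (connect, send, recv) does not change.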
kTLS + TCP
kTLS (Linux 4.13+ TX, 4.17+ RX, NIC offload on supported hardware from 4.18) moves symmetric TLS record encryption into the kernel via setsockopt(SOL_TLS, ...). After the user-space TLS handshake completes, the kernel takes over record encryption; combined with sendfile(), files move from page-cache to NIC without entering user space. Netflix reports 8–29% CPU savings on static asset delivery. kTLS does not change TCP behaviour: congestion control, retransmits, and window management all remain standard.
TCP’s relationship to QUIC
TCP is one layer in the stack; TLS sits directly on top; HTTP/1.1 and HTTP/2 ride on TLS. HTTP/3 is the exception — it runs on QUIC, which uses UDP and reinvents reliability and congestion control in user space. The reason: evolving TCP in kernel space proved too slow. The lessons of TCP — sequence numbers, ACKs, congestion control, slow start, fast retransmit — all reappear in QUIC, just at a different layer. Understanding TCP makes QUIC mechanistically transparent; the inverse is not true.
Which RFC specifies RACK-TLP, the modern loss detection algorithm used by Linux to avoid waiting for the RTO timer?
Design the kernel-tunable set for a high-traffic API gateway terminating 200k HTTPS connections/second. Outbound traffic to ~50 backend pools, mostly short HTTP/1.1 requests with keep-alive.
- No external dependencies beyond Linux >= 6.0 sysctl.
- Resist SYN floods targeting the public-facing listener.
- Avoid TIME-WAIT exhaustion on outbound traffic to the backend pools.
- Keep latency p99 under 50 ms under steady-state load.
- tcp_syncookies + raised backlog defends against SYN flood without bricking legitimate traffic.
- tcp_tw_reuse on outbound, NOT tcp_tw_recycle (removed in 4.12, breaks NAT).
- Wide local port range + connection pooling are the real defence against TIME-WAIT exhaustion.
- BBR congestion control reduces p99 RTT and resists random loss better than CUBIC.
- TCP_NODELAY at the socket level eliminates Nagle/delayed-ACK stalls on small request payloads.
- Measure continuously; do not tune blind.
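The sysctl-level points in the list above could be captured as a small, reviewable settings table. This is a sketch only: the keys are real Linux sysctls, but every value is an illustrative assumption that would need to be measured against the actual workload, and the `render_sysctl_conf` helper is hypothetical. TCP_NODELAY is a per-socket option, so it belongs in the gateway's socket setup rather than in this table.

```python
# Illustrative values only (assumptions, not recommendations from the text).
GATEWAY_SYSCTLS = {
    "net.ipv4.tcp_syncookies": "1",             # SYN-flood defence
    "net.ipv4.tcp_max_syn_backlog": "65535",    # raised SYN queue
    "net.core.somaxconn": "65535",              # raised accept queue cap
    "net.ipv4.tcp_tw_reuse": "1",               # outbound TIME-WAIT reuse
    "net.ipv4.ip_local_port_range": "1024 65535",  # wide ephemeral range
    "net.ipv4.tcp_congestion_control": "bbr",
}


def render_sysctl_conf(settings: dict) -> str:
    """Render settings in sysctl.conf 'key = value' form, one per line."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


if __name__ == "__main__":
    # Output can be written to a file under /etc/sysctl.d/ and loaded
    # with `sysctl --system`.
    print(render_sysctl_conf(GATEWAY_SYSCTLS), end="")
```

Keeping the settings in one table makes the "measure continuously" point easier to act on, since each change to a value can be reviewed and correlated with the retransmission and latency metrics.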
Why this works
Why QUIC runs over UDP instead of extending TCP. Every TCP feature must be implemented in kernels worldwide — a process that takes decades due to the long tail of un-updated systems. QUIC runs in user space (or as a library), so features can be added and deployed with a browser or server update rather than a kernel upgrade. The price is reinventing everything TCP provides (reliability, ordering, congestion control) in user space, but the benefit is the ability to evolve at Internet speed rather than kernel-update speed. TCP is not going away — it carries the vast majority of Internet traffic and will for decades — but QUIC represents the acknowledgement that TCP’s kernel-baked protocol ossification is a genuine engineering constraint.
- 01 Explain why BBR sustains throughput on a path with 1% random loss where CUBIC collapses.
- 02 What does ss -tin tell you about a live TCP connection that netstat does not?
- 03 What is the relationship between TCP and QUIC, and why did QUIC not simply extend TCP?
BBR estimates the network’s bottleneck bandwidth and minimum RTT directly, ignoring loss as a congestion signal. CUBIC cuts the window on every loss event — on a path with 1% random loss, CUBIC settles at a fraction of capacity while BBR sustains near line rate. BBRv3 is deployed at Google, Cloudflare, and Netflix but is not yet in mainline Linux (early 2025). The production toolkit: ss -tin for live per-connection cwnd, RTT, and retransmit state; nstat for kernel counters; node_exporter for Prometheus SLOs. RST closes connections immediately without TIME-WAIT, enabling injection attacks on long-lived sessions. MPTCP spreads one connection across multiple network paths. kTLS moves TLS record encryption into the kernel for zero-copy static serving. QUIC runs TCP-like reliability in user space over UDP, decoupling protocol evolution from kernel upgrade cycles.