
Networking & Protocols

BBR, production observability, and beyond TCP

Crux: BBR's bandwidth-RTT model sustains throughput on lossy paths where CUBIC collapses. Production ss/tcpdump/nstat output, RST semantics, MPTCP, kTLS, and TCP's relationship to QUIC.

A video-streaming service ships 4K segments over intercontinental cellular paths (150 ms RTT, 1% random loss). CUBIC throughput collapses to a fraction of the link’s capacity. Switching to BBR sustains throughput near line rate on the same path. The difference is not bandwidth — it is a different answer to the question “what does a dropped packet mean?”

BBR vs CUBIC vs Reno

Reno (classic): halve cwnd on loss, additive increase per RTT. Simple, widely implemented.

CUBIC (Linux default since 2.6.19): a cubic-curve growth function with a concave probe below the previous maximum window and a convex probe above it. On loss it cuts cwnd to roughly 0.7× its previous value instead of halving it, and it recovers throughput on high-BDP paths faster than Reno.

BBR (Bottleneck Bandwidth and RTT): abandons loss-based signalling entirely. It estimates the path’s bottleneck bandwidth (via delivered-byte rate) and minimum RTT (via packet timestamps) directly, then paces sends to match. Loss is treated as ambiguous noise — it could be congestion or random drop. On 1% random loss, BBR sustains near-line-rate throughput; CUBIC settles into a small fraction.
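To put rough numbers on the gap, the sketch below uses the Mathis et al. approximation for Reno-style loss-based control, throughput ≈ (MSS / RTT) × sqrt(3/2) / sqrt(p), next to the bandwidth-delay product that a model-based sender aims to keep in flight. This is an illustration only: the formula is a steady-state approximation for Reno, not an exact model of CUBIC, and the 1460-byte MSS and 20 Mbps bottleneck are assumed figures.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state throughput of a Reno-style sender (Mathis et al.)."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8


if __name__ == "__main__":
    rtt = 0.150          # 150 ms intercontinental cellular path
    loss = 0.01          # 1% random loss
    bottleneck = 20e6    # assumed 20 Mbps bottleneck

    loss_based = mathis_throughput_bps(1460, rtt, loss)
    print(f"Loss-based ceiling: {loss_based / 1e6:.2f} Mbit/s")
    print(f"BDP to fill the link: {bdp_bytes(bottleneck, rtt) / 1000:.0f} kB")
```

With these inputs the loss-based ceiling comes out just under 1 Mbit/s regardless of how much capacity the bottleneck has, which is the collapse described above. A sender that paces at its bandwidth estimate is not bound by that ceiling.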

BBRv3 (Google’s 2023 release): fixes BBRv2’s premature-probe and convergence bugs. Deployed across google.com, YouTube, Cloudflare, and Netflix edges. Not merged into mainline Linux as of early 2025; requires a custom kernel or third-party backport. Mainline Linux 6.x ships BBR 1.x.

Practical guidance: stick with CUBIC on general-purpose servers running standard kernels, where traffic crosses datacenter and regional ISP paths, loss is rare, and a drop genuinely signals congestion. Switch to BBR for cross-continental WAN, cellular, or satellite paths where 0.5–2% random loss is normal and CUBIC's window cuts collapse throughput.

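Set per-socket: setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3). System-wide: sysctl -w net.ipv4.tcp_congestion_control=bbr. In both cases the tcp_bbr module must be available, and unprivileged processes can only select algorithms listed in net.ipv4.tcp_allowed_congestion_control.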
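The per-socket form can be sketched in Python as follows. This is a minimal example that assumes a Linux host; the helper name set_congestion_control is hypothetical, and the fallback constant 13 is the Linux value of TCP_CONGESTION for interpreters that do not export it. The helper reads the option back because the request fails when the algorithm is not loaded or not permitted.

```python
import socket

# Linux option number for TCP_CONGESTION; used only if Python does not export it.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)


def set_congestion_control(sock: socket.socket, algo: str) -> str:
    """Try to select `algo` on this socket and return the algorithm actually in use."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, algo.encode())
    except OSError:
        # Algorithm not loaded, not allowed for this user, or option unsupported.
        pass
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16)
    except OSError:
        return "unknown"
    return raw.split(b"\x00", 1)[0].decode() or "unknown"


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("active congestion control:", set_congestion_control(s, "bbr"))
```

If the printed name is not bbr, the host most likely lacks the module or restricts it through tcp_allowed_congestion_control.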

Trace it

Trace a BBR congestion episode on a lossy cellular path and explain why BBR sustains throughput while CUBIC collapses.

  1. A phone on cellular (1% random packet loss, 50 ms RTT, 20 Mbps bottleneck) runs CUBIC. What happens to cwnd?
  2. The same path switches to BBR. How does BBR's loss response differ?
  3. The phone loses 1% of packets. Does BBR's estimate change?
  4. CUBIC is now competing with BBR on the same link. What is the trade-off?
  5. What kernel setting enables BBR per-socket, and what is the caveat in 2026?
Pick the best fit

A video-streaming service ships 4K segments over long-RTT cellular networks (RTT often greater than 150 ms, sporadic loss 0.5–2%). Pick the TCP congestion control algorithm + tuning combination.

Production observability

ss -tin dumps live cwnd, RTT, RTT variance, retransmits, and backoff state per connection. It reads this state from the kernel over netlink, so the overhead is negligible and it is safe to run on any production host.

$ ss -tin state established '( dport = :443 )'
# Output includes: cwnd:10 ssthresh:2147483647 rtt:42.8/8.5 acked:142 retrans:0/0

Key fields:

  • cwnd: current congestion window in MSS
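  • rtt/rttvar: smoothed RTT and its mean deviation, in milliseconds
  • retrans:current/total: retransmitted segments still unacknowledged, then total retransmissions over the connection's lifetime
  • ssthresh: slow-start threshold in MSS (2147483647 is the kernel's initial "infinite" value, meaning the connection has not yet left its first slow start)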

ss -s summarises socket counts. Its TCP line does not break out CLOSE-WAIT, so count that state directly with ss -tan state close-wait and watch for spikes.

nstat exposes the kernel counters from /proc/net/snmp and /proc/net/netstat, for example TcpRetransSegs, TcpExtTCPSlowStartRetrans, and TcpExtTCPDSACKRecv. For long-term monitoring, tcp_diag feeds Prometheus exporters, and node_exporter exposes most of these counters. The SLO-relevant metrics are retransmission rate, p95/p99 RTT, and the CLOSE-WAIT:ESTABLISHED ratio.
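As a rough illustration of the retransmission-rate metric, the sketch below parses text in the /proc/net/snmp layout (a header line and a value line, both prefixed with "Tcp:") and divides RetransSegs by OutSegs. The function name and the sample counters are made up for the example. The counters are cumulative since boot, so a real monitor would compute the ratio over deltas between two samples.

```python
# Hypothetical sample in the /proc/net/snmp layout; the values are invented.
SAMPLE_SNMP = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 500 800 3 7 42 100000 200000 1000 0 12 0\n"
)


def tcp_retrans_ratio(snmp_text: str) -> float:
    """Return RetransSegs / OutSegs from /proc/net/snmp-style text."""
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    header, values = rows[0], rows[1]
    # Skip the leading "Tcp:" token on both lines and pair names with values.
    counters = dict(zip(header[1:], (int(v) for v in values[1:])))
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


if __name__ == "__main__":
    print(f"retransmission ratio: {tcp_retrans_ratio(SAMPLE_SNMP):.4f}")
```

On a live Linux host the same function can be fed the contents of /proc/net/snmp read with open().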

Debug this

ss output during an outage — diagnose the issue

$ ss -tan state established | wc -l
12384
$ ss -tan state close-wait | wc -l
9821
$ ss -tan state time-wait | wc -l
1247
$ ss -s
Total: 12500
TCP:   23552 (estab 12384, closed 8920, orphaned 2, timewait 1247)
$ ps -p 1234 -o pid,stat,rss,vsz,cmd
PID STAT  RSS    VSZ CMD
1234 Ssl 8392000 12000000 /usr/bin/app-server

12k ESTABLISHED + 9.8k CLOSE-WAIT sockets and RSS is climbing. What is the bug and the fix?

RST semantics

A TCP RST is an abrupt connection close — no FIN exchange, no TIME-WAIT, the receiver drops connection state immediately. It occurs when:

  • A packet arrives for a port no one is listening on.
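  • The application calls close() on a socket that still has unread data in its receive buffer, or on a socket with SO_LINGER enabled and a linger time of 0.
  • The peer sends a segment that violates the state machine.
  • A stateful firewall or NAT times out an idle connection and injects an RST (or drops its state, so the next packet is answered with one).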
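The SO_LINGER case is easy to reproduce on loopback. The sketch below is a minimal demonstration, assuming a Linux-style struct linger of two ints; the function name observe_rst is hypothetical. One side enables SO_LINGER with a zero timeout and closes, and the other side sees a reset instead of an orderly end-of-stream.

```python
import socket
import struct


def observe_rst() -> str:
    """Close one end with SO_LINGER {on, 0s} and report what the peer observes."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.create_connection(listener.getsockname(), timeout=2)
    server_side, _ = listener.accept()

    # l_onoff=1, l_linger=0: close() discards queued data and sends RST, not FIN.
    server_side.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    server_side.close()

    try:
        data = client.recv(1)
        # An empty read would mean an orderly FIN close.
        return "fin" if data == b"" else "data"
    except ConnectionResetError:
        return "reset"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print("peer observed:", observe_rst())
```

Removing the setsockopt call turns the close into a normal FIN exchange, and the client then reads an empty byte string.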

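RST attacks: an attacker who can guess a sequence number within the receive window can forge an RST and tear down an established connection. RFC 5961 narrows the attack: an RST is accepted only when its sequence number exactly matches the next expected one, and an in-window but inexact RST triggers a challenge ACK instead. Long-lived connections (BGP sessions, SSH) are most vulnerable.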

MPTCP (RFC 8684)

Multipath TCP carries one logical connection across multiple paths (Wi-Fi + cellular, multi-NIC server). The MPTCP handshake adds an MP_CAPABLE option in SYN/SYN-ACK/ACK; if both ends support it the first sub-flow is established, and additional sub-flows can be opened on different interfaces via MP_JOIN. iOS has used MPTCP since iOS 7 for Siri. Linux 5.6+ ships RFC 8684. Adoption outside Apple is limited, partly because middleboxes that do not understand the option strip it, which forces the connection to fall back to plain TCP.
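On Linux, an application opts in by creating the socket with the MPTCP protocol number instead of plain TCP. The sketch below is a hedged example: open_stream_socket is a hypothetical helper, 262 is the Linux value of IPPROTO_MPTCP for interpreters that do not export the constant, and the fallback covers kernels where MPTCP is missing or disabled.

```python
import socket

# Linux protocol number for MPTCP; used only if Python does not export it.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def open_stream_socket():
    """Return (socket, used_mptcp), falling back to plain TCP when MPTCP is unavailable."""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
        return sock, True
    except OSError:
        # Kernel built without MPTCP, net.mptcp.enabled=0, or a non-Linux host.
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM), False


if __name__ == "__main__":
    sock, used_mptcp = open_stream_socket()
    print("MPTCP socket" if used_mptcp else "plain TCP fallback")
    sock.close()
```

Getting an MPTCP socket only means the local kernel supports it. Whether a given connection actually uses MPTCP still depends on the peer and on the path preserving the MP_CAPABLE option.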

kTLS + TCP

kTLS (Linux 4.13+ TX, 4.17+ RX, NIC offload infrastructure since 4.18) moves symmetric TLS record encryption into the kernel. After the user-space TLS handshake completes, the application attaches the TLS upper-layer protocol with setsockopt(SOL_TCP, TCP_ULP, "tls") and hands the session keys to the kernel with setsockopt(SOL_TLS, TLS_TX, ...). From then on the kernel handles record encryption; combined with sendfile(), files move from page-cache to NIC without entering user space. Netflix reports 8–29% CPU savings on static asset delivery. kTLS does not change TCP behaviour: congestion control, retransmits, and window management all remain standard.
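A quick way to check whether a host can do kTLS at all is to try attaching the "tls" upper-layer protocol to an established socket. The sketch below does only that probe on a loopback connection. It assumes Linux, uses 31 as the fallback value of TCP_ULP, and the function name kernel_tls_available is hypothetical. It does not install keys, which requires the TLS_TX/TLS_RX crypto parameters from a completed handshake.

```python
import socket

# Linux option number for TCP_ULP; used only if Python does not export it.
TCP_ULP = getattr(socket, "TCP_ULP", 31)


def kernel_tls_available() -> bool:
    """Probe whether the kernel accepts the "tls" ULP on an established TCP socket."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=2)
    peer, _ = listener.accept()
    try:
        # The ULP can only be attached once the connection is established.
        client.setsockopt(socket.IPPROTO_TCP, TCP_ULP, b"tls")
        return True
    except OSError:
        # tls module missing, option unsupported, or a non-Linux host.
        return False
    finally:
        client.close()
        peer.close()
        listener.close()


if __name__ == "__main__":
    print("kTLS ULP available:", kernel_tls_available())
```

A False result usually means the tls kernel module is not loaded or not built.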

TCP’s relationship to QUIC

TCP is one layer in the stack; TLS sits directly on top; HTTP/1.1 and HTTP/2 ride on TLS. HTTP/3 is the exception — it runs on QUIC, which uses UDP and reinvents reliability and congestion control in user space. The reason: evolving TCP in kernel space proved too slow. The lessons of TCP — sequence numbers, ACKs, congestion control, slow start, fast retransmit — all reappear in QUIC, just at a different layer. Understanding TCP makes QUIC mechanistically transparent; the inverse is not true.

Which RFC?

Which RFC specifies RACK-TLP, the modern loss detection algorithm used by Linux to avoid waiting for the RTO timer?

Design challenge

Design the kernel-tunable set for a high-traffic API gateway terminating 200k HTTPS connections/second. Outbound traffic to ~50 backend pools, mostly short HTTP/1.1 requests with keep-alive.

  • No external dependencies beyond Linux >= 6.0 sysctl.
  • Resist SYN floods targeting the public-facing listener.
  • Avoid TIME-WAIT exhaustion on outbound traffic to the backend pools.
  • Keep latency p99 under 50 ms under steady-state load.
Why this works

Why QUIC runs over UDP instead of extending TCP. Every TCP feature must be implemented in kernels worldwide — a process that takes decades due to the long tail of un-updated systems. QUIC runs in user space (or as a library), so features can be added and deployed with a browser or server update rather than a kernel upgrade. The price is reinventing everything TCP provides (reliability, ordering, congestion control) in user space, but the benefit is the ability to evolve at Internet speed rather than kernel-update speed. TCP is not going away — it carries the vast majority of Internet traffic and will for decades — but QUIC represents the acknowledgement that TCP’s kernel-baked protocol ossification is a genuine engineering constraint.

Recall before you leave
  1. Explain why BBR sustains throughput on a path with 1% random loss where CUBIC collapses.
  2. What does ss -tin tell you about a live TCP connection that netstat does not?
  3. What is the relationship between TCP and QUIC, and why did QUIC not simply extend TCP?
Recap

BBR estimates the network’s bottleneck bandwidth and minimum RTT directly, ignoring loss as a congestion signal. CUBIC cuts the window on every loss event — on a path with 1% random loss, CUBIC settles at a fraction of capacity while BBR sustains near line rate. BBRv3 is deployed at Google, Cloudflare, and Netflix but is not yet in mainline Linux (early 2025). The production toolkit: ss -tin for live per-connection cwnd, RTT, and retransmit state; nstat for kernel counters; node_exporter for Prometheus SLOs. RST closes connections immediately without TIME-WAIT, enabling injection attacks on long-lived sessions. MPTCP spreads one connection across multiple network paths. kTLS moves TLS record encryption into the kernel for zero-copy static serving. QUIC runs TCP-like reliability in user space over UDP, decoupling protocol evolution from kernel upgrade cycles.


Trademarks belong to their respective owners. Editorial reference only.