Networking & Protocols
Latency math
A support ticket says “latency spiked from 30 ms to 500 ms.” Is that the database? The network? The CDN? Before you read a single log, you need the floor numbers: the minimum latency that physics imposes on every route. Once you know them, anomalies become obvious.
The latency formula
One-way propagation delay = distance / signal speed.
Light in vacuum: 300,000 km/s. Light in glass fibre: ~200,000 km/s (the glass slows it by ~33%). These numbers give you latency floors no software can beat.
- NYC → London (5,500 km): 28 ms one-way floor → 70–90 ms real RTT
- NYC → Sydney (16,000 km): 80 ms one-way floor → 200–220 ms real RTT
- Same continent (2,000 km): 10 ms one-way floor → 20–30 ms RTT
- LAN (100 m, Cat6): ~0.5 µs one-way propagation
- LEO satellite (550 km altitude): ~20–50 ms RTT total
- GEO satellite (36,000 km altitude): ~600 ms RTT total
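The fibre floors in the list above follow directly from the formula. A minimal sketch, assuming a straight path at ~200,000 km/s; the function name `one_way_floor_ms` and the `routes` dictionary are illustrative, not part of the lesson:

```python
FIBRE_KM_PER_S = 200_000  # approximate speed of light in glass fibre


def one_way_floor_ms(distance_km: float, speed_km_per_s: float = FIBRE_KM_PER_S) -> float:
    """One-way propagation delay in milliseconds for a path of the given length."""
    return distance_km / speed_km_per_s * 1000


# Distances are the rounded figures used in this lesson.
routes = {
    "NYC -> London": 5_500,
    "NYC -> Sydney": 16_000,
    "Same continent": 2_000,
}

for name, km in routes.items():
    one_way = one_way_floor_ms(km)
    # The RTT floor is simply twice the one-way floor.
    print(f"{name}: {one_way:.1f} ms one-way, {2 * one_way:.1f} ms RTT floor")
```

Doubling the one-way figure gives the RTT floor, which is the number to compare against a measured ping.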
Per-technology latency
| Technology | Typical RTT | Bandwidth | Notes |
|---|---|---|---|
| GEO satellite | ~600 ms | 25–100 Mbps | Physics: 36,000 km altitude |
| LEO satellite (Starlink) | 20–50 ms | 50–300 Mbps | Much closer orbit |
| DOCSIS cable (loaded) | 50–200 ms | 100–500 Mbps | Bufferbloat under saturation |
| 4G LTE | 30–60 ms | 10–100 Mbps | Scheduling adds to propagation |
| 5G sub-6 GHz | 15–30 ms | 100 Mbps–1 Gbps | Better scheduling |
| FTTH fibre | 2–10 ms | 1 Gbps symmetric | ISP edge to home |
| Gigabit LAN | <1 ms | 1 Gbps | Within building |
Real RTT is always higher than the propagation floor because routing adds distance, each router adds a small processing delay (~1 µs for modern hardware), and queuing can add milliseconds under load.
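The satellite rows can be sanity-checked the same way. This is a rough sketch under simplifying assumptions: the satellite is directly overhead, the signal travels at ~300,000 km/s, and a round trip crosses the altitude four times (user to satellite, satellite to ground station, and back). The name `bent_pipe_rtt_floor_ms` is hypothetical.

```python
RADIO_KM_PER_S = 300_000  # radio waves travel at roughly the vacuum speed of light


def bent_pipe_rtt_floor_ms(altitude_km: float, legs: int = 4) -> float:
    """RTT floor in ms when each of `legs` hops covers one satellite altitude."""
    return legs * altitude_km / RADIO_KM_PER_S * 1000


for name, altitude_km in [("LEO (550 km)", 550), ("GEO (36,000 km)", 36_000)]:
    print(f"{name}: {bent_pipe_rtt_floor_ms(altitude_km):.1f} ms RTT floor")
```

Under these assumptions the GEO floor comes out near 480 ms and the LEO floor under 10 ms; the gap up to the typical figures in the table comes from slant range, scheduling, and the terrestrial path beyond the ground station.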
Why real RTT exceeds the floor
Take NYC → London (theoretical floor: 55 ms RTT at 200,000 km/s). Real RTT is 70–90 ms, roughly 27–64% above the floor. The excess comes from:
- Routing overhead: cable routes are not straight lines; the actual cable path is longer than the great-circle distance.
- Router processing: each intermediate router reads the IP header and looks up the routing table (~microseconds each, dozens of hops).
- Serialisation delay: time to push a full packet (1500 bytes) onto the wire at the link rate. At 1 Gbps: 1500 × 8 / 10⁹ = 12 µs. Negligible for high-bandwidth links, significant at 10 Mbps.
- Queueing delay: at any bottleneck link, packets wait behind others. Under load this can add tens of milliseconds — covered in lesson 04.
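The serialisation term from the list above is easy to tabulate. A minimal sketch, assuming a 1500-byte packet as in the text; `serialisation_delay_us` is an illustrative name:

```python
def serialisation_delay_us(packet_bytes: int, link_bps: float) -> float:
    """Time in microseconds to clock one packet onto a link of the given bit rate."""
    return packet_bytes * 8 / link_bps * 1e6


PACKET_BYTES = 1500

for label, rate_bps in [("10 Mbps", 10e6), ("1 Gbps", 1e9), ("100 Gbps", 100e9)]:
    delay = serialisation_delay_us(PACKET_BYTES, rate_bps)
    print(f"{label}: {delay:.2f} us per {PACKET_BYTES}-byte packet")
```

A faster link shrinks only this term. Propagation delay depends on distance and signal speed, so it stays the same whatever the bit rate.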
The Internet’s submarine backbone
~500 submarine cables connect the continents. Key facts:
- Each fibre pair carries roughly 10–30 Tbps via DWDM (Dense Wavelength-Division Multiplexing): dozens of wavelengths share one fibre pair, each carrying ~100–400 Gbps, and a cable bundles several pairs.
- EDFAs (erbium-doped fibre amplifiers) amplify the optical signal every ~80 km without converting it to electrical form.
- Cable failures (ship anchors, fishing trawlers, undersea landslides) happen every month; redundancy and BGP rerouting keep traffic flowing.
- Hyperscalers (Google, Meta, Microsoft) own or co-own private cables such as MAREA, Dunant, and Curie to guarantee capacity for their traffic.
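The DWDM and amplifier figures above reduce to simple arithmetic. The sketch below uses made-up inputs (80 wavelengths at 200 Gbps, a 6,000 km cable), chosen only to fall inside the ranges quoted in the list; the function names are hypothetical.

```python
import math


def fibre_pair_capacity_tbps(wavelengths: int, gbps_per_wavelength: float) -> float:
    """Aggregate capacity of one fibre pair: wavelengths multiplied by per-wavelength rate."""
    return wavelengths * gbps_per_wavelength / 1000


def amplifier_count(cable_km: float, spacing_km: float = 80) -> int:
    """Amplifiers needed between the two landing points at the given spacing."""
    return max(0, math.ceil(cable_km / spacing_km) - 1)


# Illustrative inputs, not data for any real cable.
print(fibre_pair_capacity_tbps(80, 200))  # capacity in Tbps
print(amplifier_count(6_000))             # amplifiers along the route
```

Adding wavelengths or raising the per-wavelength rate increases capacity without laying new glass, which is why upgrades usually happen at the terminal equipment rather than on the seabed.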
A developer says 'we just upgraded our NYC→Sydney link to 100 Gbps and latency didn't improve.' Why not?
Practical debugging at the link layer
When a network path behaves unexpectedly, these tools locate the layer:
# Linux: interface counters — errors, drops, overruns
ip -s link show eth0
# NIC settings: speed, duplex, auto-neg
ethtool eth0
# Vendor NIC statistics: CRC errors, link restarts
ethtool -S eth0 | grep -E "rx_crc|rx_error|tx_error"
# Wi-Fi signal and rate
iw dev wlan0 link
# Traceroute with ICMP echo probes (shows per-hop RTT)
traceroute -I 8.8.8.8
mtr --report 8.8.8.8
Interpreting what you see:
- rx_crc_errors > 0: frames arriving garbled — bad cable, dirty SFP, or marginal signal. Replace cable or transceiver first.
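- Link auto-negotiated to 100 Mbps when you expected 1 Gbps: a cable or port fault forced the fallback. Gigabit Ethernet needs all four pairs while 100 Mbps needs only two, so one damaged pair in a Cat5e cable is enough. Replace the cable first.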
- Traceroute RTT jump at hop N: latency added at that router or the link between hop N-1 and N. Not necessarily the router’s fault — ICMP rate-limiting can make it look slow.
- Wi-Fi “connected at 54 Mbps”: client is using legacy 802.11g rates — very far from AP, or old device. Move AP or device closer.
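The CRC check in the list above can be scripted. This is a sketch only: the `SAMPLE` text is invented, and counter names in real `ethtool -S` output vary by NIC driver, so the `crc|error` pattern is an assumption to adjust per device.

```python
import re

# Hypothetical output in the usual "name: value" layout of `ethtool -S`.
SAMPLE = """\
NIC statistics:
     rx_packets: 1849201
     tx_packets: 1532877
     rx_crc_errors: 37
     rx_errors: 37
     tx_errors: 0
"""


def parse_counters(text: str) -> dict[str, int]:
    """Collect every 'name: integer' line into a dictionary."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\w.\-]+):\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def nonzero_error_counters(counters: dict[str, int]) -> dict[str, int]:
    """Keep only CRC/error counters that have a non-zero value."""
    return {k: v for k, v in counters.items() if v > 0 and re.search(r"crc|error", k)}


print(nonzero_error_counters(parse_counters(SAMPLE)))
```

On a live host the same functions could take the captured output of `ethtool -S eth0` instead of the sample string. Comparing two snapshots taken a minute apart shows whether the counters are still climbing.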
Order link technologies from highest to lowest typical real-world round-trip latency:
- 1. GEO satellite (~600 ms RTT; orbit is 36,000 km up)
- 2. DOCSIS cable modem under load with bufferbloat (~50–200 ms RTT)
- 3. 4G LTE (~30–60 ms RTT)
- 4. LEO satellite, Starlink (~20–50 ms RTT; orbit is only 550 km up)
- 5. FTTH fibre home connection (~2–10 ms RTT to ISP edge)
- 6. Gigabit LAN Ethernet (<1 ms RTT within a building)
Why traceroute lies. Many routers rate-limit ICMP packets used by traceroute — you may see * * * hops or unusually high RTT at a hop even though the path beyond it is fine. Use mtr (Matt’s Traceroute) for a live view that aggregates many probes, or traceroute -T (TCP mode) which router ACLs less often block. Never conclude “the problem is at hop N” just because traceroute shows high RTT there unless everything beyond it is also broken.
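The rule in the paragraph above (a slow hop only matters if everything after it is slow too) can be expressed as a small filter. A sketch under stated assumptions: the per-hop RTTs are invented, `None` stands for a `* * *` hop, and the 20 ms threshold and the name `persistent_jumps` are arbitrary choices for illustration.

```python
from typing import Optional


def persistent_jumps(hop_rtts_ms: list[Optional[float]], threshold_ms: float = 20.0) -> list[int]:
    """Return 1-based hop numbers where RTT jumps and stays elevated on all later hops."""
    flagged = []
    baseline = None  # RTT of the previous hop that answered
    for i, rtt in enumerate(hop_rtts_ms):
        if rtt is None:  # hop did not answer; skip it
            continue
        if baseline is not None and rtt - baseline > threshold_ms:
            later = [r for r in hop_rtts_ms[i + 1:] if r is not None]
            # Flag the hop only if every later responding hop is still elevated.
            if all(r - baseline > threshold_ms for r in later):
                flagged.append(i + 1)
        baseline = rtt
    return flagged


# Hop 4 looks slow but later hops recover; hop 8 is slow and stays slow.
hops = [1.2, 8.5, 9.1, 95.0, 11.0, 12.3, None, 88.0, 90.5, 91.2]
print(persistent_jumps(hops))
```

In this made-up trace only hop 8 is flagged: the spike at hop 4 disappears on the next hop, which is the typical signature of ICMP rate-limiting rather than a congested link.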
- 01. Light in glass travels at ~200,000 km/s. NYC → Sydney is ~16,000 km. What is the theoretical one-way propagation delay, and why is the real RTT 200–220 ms rather than 160 ms?
- 02. How does DWDM multiply fibre capacity without needing more fibre?
- 03. rx_crc_errors is non-zero on a 10G NIC. What does this indicate and what do you do?
Propagation delay = distance ÷ signal speed (~200,000 km/s in glass, ~300,000 km/s for radio links to satellites). The one-way floors: 28 ms NYC → London, 80 ms NYC → Sydney, ~2 ms up to a LEO satellite, ~120 ms up to GEO. Real RTT exceeds the floor by 25–60% due to routing geometry, router processing, and queuing. The Internet’s backbone is ~500 submarine cables using DWDM (dozens of wavelengths per fibre) with EDFA amplifiers every ~80 km. Key debugging tools: ip -s link show, ethtool -S (CRC errors, link restarts), mtr (traceroute with aggregated probes). When traceroute shows high RTT at a single hop but the path beyond is healthy, it is usually ICMP rate-limiting, not a real bottleneck.