Crux Read a BDP calculation, an ethtool counter dump, an SQM shaping config, and a leaf oversubscription sum — then pick the read a senior engineer would make.
Physical-link problems are diagnosed in counters, configs, and back-of-envelope sums — not prose. Read each artifact, predict the behaviour, and choose the move a senior engineer makes first.
Goal
Practise the loop you run on every link incident: read the number or the config, decide whether you are fighting physics or a misconfiguration, and reach for the right lever.
With BDP = 200 MB and a 64 KB receive window, what throughput does this connection actually get on the 10 Gbps link, and why?
Heads-up The receive window is a hard cap on bytes in flight, independent of NIC speed. With 64 KB you can never have more than 64 KB unacknowledged, so the link sits idle most of each RTT.
Heads-up On a high-BDP path the window directly bounds throughput: bytes-in-flight ÷ RTT. A window far below BDP is the classic under-utilisation bug — it costs throughput, not latency.
Heads-up Propagation delay sets the RTT, but here throughput is capped by the small window, not by the propagation floor. Enabling TCP window scaling (with a receive buffer large enough to use it) fixes it with no change to the path.
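The arithmetic behind that read can be sketched in a few lines. This is a back-of-envelope model, not a TCP simulation: it assumes "64 KB" means 65,536 bytes, derives the RTT from the stated BDP and link rate, and ignores loss and congestion control. The function name `window_limited_bps` is made up for illustration.

```python
def window_limited_bps(window_bytes: int, rtt_s: float, link_bps: float) -> float:
    """Throughput ceiling: at most one window per RTT, never above line rate."""
    return min(link_bps, window_bytes * 8 / rtt_s)


LINK_BPS = 10e9                      # 10 Gbps link, from the question
BDP_BYTES = 200e6                    # 200 MB bandwidth-delay product, from the question
RTT_S = BDP_BYTES * 8 / LINK_BPS     # RTT implied by BDP = rate x RTT
WINDOW_BYTES = 64 * 1024             # assumption: "64 KB" read as 65,536 bytes

throughput_bps = window_limited_bps(WINDOW_BYTES, RTT_S, LINK_BPS)
utilisation = throughput_bps / LINK_BPS

print(f"RTT implied by BDP: {RTT_S * 1e3:.0f} ms")
print(f"window-limited throughput: {throughput_bps / 1e6:.2f} Mbps")
print(f"link utilisation: {utilisation:.4%}")
```

Under these assumptions the connection gets roughly 3.3 Mbps on a 10 Gbps link. Passing a window equal to the BDP into the same function returns line rate, which is the whole argument for window scaling.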
The link negotiated 1 Gbps full-duplex, but rx_crc_errors is climbing steadily. What does this mean and what is the first action?
Heads-up CRC errors are about signal integrity on the medium, not CPU load. A busy CPU shows drops/overruns, not invalid frame check sequences.
Heads-up A failing CRC means the bits physically changed in transit before the driver ever saw them. Reloading the driver cannot fix a corrupted signal on the cable.
Heads-up A climbing CRC count means frames are being corrupted on the wire and discarded by the NIC, so upper layers have to retransmit them. A steadily rising counter is never normal on a healthy link and should be chased to root cause.
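"Climbing steadily" is a statement about two samples, not one. The sketch below parses two `ethtool -S`-style dumps and reports how much `rx_crc_errors` grew between them. The sample text and its numbers are invented for illustration, and counter names vary by driver, so treat `rx_crc_errors` and `rx_packets` as assumptions about your NIC.

```python
def parse_ethtool_stats(text: str) -> dict:
    """Parse 'name: value' lines from an `ethtool -S` style dump into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():      # skips the 'NIC statistics:' header line
            stats[name.strip()] = int(value)
    return stats


def counter_delta(before: dict, after: dict, name: str) -> int:
    """Growth of one counter between two samples (a missing counter counts as 0)."""
    return after.get(name, 0) - before.get(name, 0)


# Hypothetical samples taken some time apart; the values are made up.
SAMPLE_T0 = """NIC statistics:
     rx_packets: 1000000
     rx_crc_errors: 120
"""
SAMPLE_T1 = """NIC statistics:
     rx_packets: 1600000
     rx_crc_errors: 180
"""

t0 = parse_ethtool_stats(SAMPLE_T0)
t1 = parse_ethtool_stats(SAMPLE_T1)
crc_growth = counter_delta(t0, t1, "rx_crc_errors")
pkt_growth = counter_delta(t0, t1, "rx_packets")

print(f"rx_crc_errors grew by {crc_growth} over {pkt_growth} received packets")
if crc_growth > 0:
    print("counter is still climbing: suspect cable, connector, or optic first")
```

A counter that is non-zero but flat between samples points at a past event; one that keeps growing points at a live physical fault.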
Snippet 3 — the SQM shaping config
# tc / CAKE on the WAN egress of a 40 Mbps-uplink home router
tc qdisc replace dev wan root cake bandwidth 36Mbit docsis
#                                            ^^^^^^^ shaped BELOW the 40 Mbps line rate
Why is the CAKE shaper set to 36 Mbit on a 40 Mbps uplink instead of the full 40, and what does the `docsis` keyword do?
Heads-up Setting it to line rate leaves the bottleneck in the modem's oversized buffer, where CAKE has no control — and bufferbloat returns. The sub-line-rate shape is the whole point of SQM.
Heads-up CAKE does not statically reserve a slice for voice, and `docsis` is framing compensation, not a priority class. The headroom exists to keep the bottleneck inside the router.
Heads-up This is the egress (upload) qdisc — exactly the direction a backup saturates. Shaping egress below line rate is what tames upload bufferbloat.
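The 36 Mbit figure is just the line rate minus headroom. The sketch below reproduces it, assuming a 10% margin (which is what 36 of 40 works out to, not a universal rule) and using hypothetical helper names. It only builds the command string; it does not run `tc`.

```python
def shaped_rate_mbit(line_rate_mbit: int, headroom_pct: int) -> int:
    """Shaper rate after reserving headroom below line rate (integer Mbit)."""
    return line_rate_mbit * (100 - headroom_pct) // 100


def cake_command(dev: str, rate_mbit: int, overhead_keyword: str) -> str:
    """Build the tc command string from the snippet; nothing is executed."""
    return (f"tc qdisc replace dev {dev} root cake "
            f"bandwidth {rate_mbit}Mbit {overhead_keyword}")


LINE_RATE_MBIT = 40      # uplink rate from the snippet
HEADROOM_PCT = 10        # assumption: 10% margin, matching 36 of 40

rate = shaped_rate_mbit(LINE_RATE_MBIT, HEADROOM_PCT)
print(cake_command("wan", rate, "docsis"))
```

In practice the margin is tuned by measurement: raise the shaped rate until latency under load starts to climb, then back off.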
Snippet 4 — the leaf oversubscription sum
# One leaf switch in a GPU pod
downlinks_to_servers = 8 * 200e9   # 8 servers x 200 Gbps RoCE NICs
uplinks_to_spines = 2 * 400e9      # 2 x 400 Gbps to spines
oversub = downlinks_to_servers / uplinks_to_spines
# = 1.6 Tbps / 0.8 Tbps = 2.0 -> 2:1
The leaf is 2:1 oversubscribed. For a 64-GPU AllReduce job, what does this predict, and how do you reach non-blocking?
Heads-up GPU collectives are east-west and all-to-all — the worst case for oversubscription. 2:1 directly halves the bandwidth available to AllReduce, which is why the job stalls.
Heads-up More GPUs increase the all-to-all traffic, worsening the bottleneck. The fix is more uplink bandwidth per leaf, not more endpoints behind the same uplinks.
Heads-up Halving NIC speed lowers the demand but also halves per-GPU bandwidth — you hide the ratio by making everything slower. Non-blocking means raising uplink capacity to match the downlinks, not crippling the servers.
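The same sum can be turned around to answer "how many uplinks would make this leaf non-blocking?". This sketch assumes the extra uplinks are the same 400 Gbps speed as the existing ones and that spare switch and spine ports are available; the function names are illustrative.

```python
import math


def oversubscription(downlink_gbps: int, uplink_gbps: int) -> float:
    """Ratio of server-facing to spine-facing capacity on one leaf."""
    return downlink_gbps / uplink_gbps


def uplinks_for_nonblocking(downlink_gbps: int, uplink_speed_gbps: int) -> int:
    """Fewest uplinks of a given speed whose total matches the downlink capacity."""
    return math.ceil(downlink_gbps / uplink_speed_gbps)


DOWNLINK_GBPS = 8 * 200      # 8 servers x 200 Gbps, from the snippet
UPLINK_SPEED_GBPS = 400      # per-uplink speed, from the snippet
CURRENT_UPLINKS = 2

ratio = oversubscription(DOWNLINK_GBPS, CURRENT_UPLINKS * UPLINK_SPEED_GBPS)
needed = uplinks_for_nonblocking(DOWNLINK_GBPS, UPLINK_SPEED_GBPS)

print(f"current oversubscription: {ratio:.1f}:1")
print(f"400G uplinks needed for 1:1: {needed} (have {CURRENT_UPLINKS})")
```

Under these assumptions the leaf needs four 400 Gbps uplinks rather than two to reach 1:1.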
Recap
Every artifact pointed at the same separation: a 64 KB window on a 200 MB-BDP path is a throughput bug (raise the window, not the cable); climbing rx_crc_errors is a corrupted signal (swap the cable/SFP, never ignore it); CAKE shaped below line rate moves the bottleneck queue where you can discipline it; and a 2:1 leaf throttles an all-to-all collective by half. Read the number, decide physics-versus-config, then pick the matching lever.