The datacentre fabric

Clos spine-leaf topology, RoCE for GPU clusters, 800G optics, kernel-bypass NICs, and the power-and-cooling ceiling — the physical layer at hyperscale.

NET Senior ◷ 13 min

A GPU training job that should finish in 6 hours is taking 18. The GPUs sit idle for roughly two-thirds of every step, waiting. Nothing is wrong with the model or the code: the bottleneck is the datacentre fabric, the physical mesh of switches and fibre that moves gradients between machines.

Clos topology: spine and leaf

A classic two-tier or three-tier Clos fabric replaces a single large switch — which would be a single point of failure and a hardware engineering nightmare — with a mesh of commodity switches. In a spine-leaf design every top-of-rack (ToR) leaf switch has one uplink to every spine switch. No two leaves are connected directly; all traffic flows through the spines.

This gives you three things at once. Non-blocking bisection bandwidth: with N leaves and M spines, each leaf has M uplinks (one per spine), so cutting the fabric in half leaves N/2 × M uplinks on each side. As long as each leaf’s total uplink capacity equals its total server-facing capacity (1:1 oversubscription), every server can drive its NIC at line rate to any other server at the same time. No spanning-tree loops: ECMP (Equal-Cost Multi-Path) hashes each flow’s 5-tuple across the available spine uplinks, so all paths are active simultaneously and no ports are blocked by STP. Horizontal scaling: add more spines to add bisection bandwidth; add more leaves to add server ports. You scale without forklift upgrades.
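The ECMP behaviour described above can be sketched in a few lines. This is an illustration only: real switches use vendor-specific hardware hash functions, so the SHA-256 hash, the `ecmp_uplink` helper, and the sample flows below are all assumptions made for the example.

```python
import hashlib
from collections import Counter

def ecmp_uplink(flow: tuple, num_spines: int) -> int:
    """Pick a spine uplink for a flow by hashing its 5-tuple.

    SHA-256 is an assumption used only to get a stable, well-mixed hash;
    switch ASICs use their own (much cheaper) hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_spines

# Hypothetical flows: (src_ip, dst_ip, protocol, src_port, dst_port).
# Protocol 17 is UDP; 4791 is the UDP port used by RoCE v2.
flows = [("10.0.0.1", "10.0.1.1", 17, 49152 + i, 4791) for i in range(1000)]

# Count how many flows land on each of 4 spine uplinks.
spread = Counter(ecmp_uplink(f, num_spines=4) for f in flows)
print(dict(spread))
```

Because the hash depends only on the 5-tuple, every packet of a flow takes the same path and arrives in order. The flip side is that the balance is statistical: many small flows spread evenly, but a handful of very large flows can hash onto the same uplink.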

Modern hyperscale fabrics run 400G Ethernet between leaf and spine, and leading builds already run 800G (IEEE 802.3df, approved February 2024). Short runs inside a rack use Direct Attach Copper (DAC) cables; runs between racks and rows use pre-terminated MTP/MPO multi-fibre trunk cables, structured cabling at a scale that makes individual patch cables impractical.

Hyperscale fabric quick reference

Spine↔leaf link speed (2024): 400G Ethernet; 800G in leading builds
Server NIC (GPU cluster): 100–400 Gbps RDMA NIC
Oversubscription target (AI training): 1:1 (full bisection bandwidth)
PFC pause storm risk (RoCE): freezes a fabric region if mis-configured
GPU rack power draw: 50–100 kW (air cooling ceiling ~15 kW)
800G Ethernet standard: IEEE 802.3df, approved Feb 2024
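The oversubscription target in the quick reference is a simple ratio of server-facing capacity to uplink capacity on a leaf. A minimal sketch, using a hypothetical leaf with 32 × 100G server ports and 4 × 400G uplinks (these numbers and both function names are illustrative, not taken from a real deployment):

```python
import math

def oversubscription(server_ports: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing capacity to uplink capacity on one leaf."""
    downlink = server_ports * server_gbps
    uplink = uplinks * uplink_gbps
    return downlink / uplink

def uplinks_for_nonblocking(server_ports: int, server_gbps: float,
                            uplink_gbps: float) -> int:
    """Smallest number of uplinks that brings the ratio down to 1:1."""
    return math.ceil(server_ports * server_gbps / uplink_gbps)

ratio = oversubscription(server_ports=32, server_gbps=100,
                         uplinks=4, uplink_gbps=400)
print(f"{ratio:.1f}:1")                        # 2.0:1
print(uplinks_for_nonblocking(32, 100, 400))   # 8
```

A ratio above 1 means the servers on that leaf can collectively offer more traffic than the uplinks can carry, which is tolerable for general workloads but not for the 1:1 target of AI training.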

RoCE and lossless Ethernet

If you have ever wondered why ML training jobs on a cloud GPU cluster run three times slower than on the same hardware in a private lab, the answer is often here: the cloud fabric is running TCP-based collective communication through the kernel, while the private cluster uses RDMA directly over a lossless fabric. The difference is not the GPU; it is the network contract.

GPU training requires collective communication: every node needs to exchange gradient updates with every other node in operations like AllReduce. With normal TCP, data is staged through host memory and the kernel network stack on both ends: GPU → CPU → NIC → network → NIC → CPU → GPU. Those extra copies burn CPU cycles and add latency that serialises the collective, idling the GPUs.
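To see why the network sits on the critical path, it helps to estimate how long one AllReduce takes on bandwidth alone. The sketch below assumes the ring algorithm, in which each node sends and receives 2 × (N − 1) / N times the gradient size; collective libraries can choose other algorithms, and the 10 GB gradient size is a hypothetical figure.

```python
def ring_allreduce_seconds(gradient_bytes: float, nodes: int,
                           link_gbps: float) -> float:
    """Bandwidth-only lower bound for one ring AllReduce.

    Ignores per-hop latency, protocol overhead, and congestion, so real
    collectives take longer than this.
    """
    bytes_on_wire = 2 * (nodes - 1) / nodes * gradient_bytes
    return bytes_on_wire * 8 / (link_gbps * 1e9)

# Hypothetical job: 10 GB of gradients across 8 nodes.
for gbps in (100, 400):
    t = ring_allreduce_seconds(10e9, nodes=8, link_gbps=gbps)
    print(f"{gbps} Gbps: {t:.2f} s per AllReduce")
```

Under these assumptions the exchange takes about 1.4 s at 100 Gbps and about 0.35 s at 400 Gbps. If the compute part of a training step is shorter than that, the GPUs wait on the network no matter how fast they are.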

RDMA (Remote Direct Memory Access) lets a NIC write directly into a remote machine’s memory, bypassing the remote CPU and kernel and avoiding intermediate copies. Over InfiniBand this was standard; over standard Ethernet it became RoCE (RDMA over Converged Ethernet). The catch: RDMA transports were designed for a network that does not drop packets, and their loss recovery is far cruder than TCP’s. A single dropped packet stalls or aborts the RDMA operation, so the fabric must be lossless.

Two mechanisms make Ethernet lossless:

PFC (Priority Flow Control): when a switch buffer fills beyond a threshold, it sends a PAUSE frame for that traffic class to the upstream port, and the backpressure propagates hop by hop towards the sender. No packet is dropped; the sending port is held off instead.
ECN (Explicit Congestion Notification): as a switch queue builds, the switch marks packets with a congestion bit instead of dropping them. The receiver reflects the mark back to the sender, which reduces its rate before PFC has to act (the DCQCN scheme used with RoCE v2).

Lossless Ethernet is not automatic — it requires per-port, per-priority flow-control knobs to be tuned precisely. Over-aggressive PFC creates pause storms (backpressure propagates across the fabric, freezing unrelated flows) or deadlock (circular dependency of paused ports). This is why GPU clusters are typically on a physically isolated lossless fabric, separate from the general IP network.
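The ordering of the two mechanisms is the key tuning idea: ECN should fire first so that senders slow down, and PFC should remain the last resort. A toy model follows, with hypothetical thresholds in kilobytes and a single hard ECN threshold. Real switches typically mark probabilistically between a minimum and a maximum queue depth, so treat this as a sketch of the ordering, not of any vendor's configuration.

```python
ECN_MARK_KB = 200   # hypothetical: start marking ECN at this queue depth
PFC_XOFF_KB = 800   # hypothetical: send a PFC PAUSE at this queue depth

def thresholds_are_sane(ecn_kb: int, pfc_kb: int) -> bool:
    """ECN must trigger well before PFC, or PFC fires with no early warning."""
    return 0 < ecn_kb < pfc_kb

def switch_action(queue_depth_kb: int) -> str:
    """What the switch does to traffic in a lossless class at this depth."""
    if queue_depth_kb >= PFC_XOFF_KB:
        return "pfc-pause"   # hold off the upstream port
    if queue_depth_kb >= ECN_MARK_KB:
        return "ecn-mark"    # ask the sender to slow down
    return "forward"

assert thresholds_are_sane(ECN_MARK_KB, PFC_XOFF_KB)
for depth in (50, 300, 900):
    print(depth, switch_action(depth))
```

If the ECN threshold is set too close to the PFC threshold, senders get no time to react and the fabric leans on PAUSE frames, which is the condition that leads to pause storms.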

Why this works

Why SmartNICs and DPUs exist. As server NICs hit 400 Gbps, the host CPU can no longer keep up with packet processing for virtualised tenants: the NIC alone can generate more interrupts than a CPU can handle. The answer is to move the work off the host entirely. A SmartNIC (e.g. NVIDIA ConnectX-7) handles SR-IOV, VXLAN encapsulation/decapsulation, and traffic policing in NIC silicon. A DPU (Data Processing Unit, e.g. AWS Nitro, NVIDIA BlueField) goes further: it runs a full OS on Arm cores inside the NIC and handles the entire VM networking stack (security groups, VPC routing, encrypted overlay) without spending tenant CPU cycles. AWS Nitro moves networking and storage I/O off the host CPU, so tenants get essentially all of the CPU they pay for.
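A back-of-the-envelope packet rate shows why the host CPU falls behind. The sketch assumes worst-case minimum-size Ethernet frames (64 bytes) plus the 20 bytes of preamble and inter-frame gap that every frame occupies on the wire; the function name is illustrative.

```python
def max_packets_per_second(link_gbps: float, frame_bytes: int = 64) -> float:
    """Line-rate packets per second for a given Ethernet frame size."""
    wire_overhead_bytes = 20  # 8-byte preamble/SFD + 12-byte inter-frame gap
    bits_per_frame = (frame_bytes + wire_overhead_bytes) * 8
    return link_gbps * 1e9 / bits_per_frame

pps = max_packets_per_second(400)
print(f"{pps / 1e6:.0f} Mpps")            # 595 Mpps
print(f"{1e9 / pps:.2f} ns per packet")   # 1.68 ns per packet
```

At roughly 1.7 ns per packet, the budget is only a few clock cycles on a multi-GHz core, which is why per-packet work has to move into NIC hardware. Real traffic uses larger frames, so this is an upper bound on the packet rate, not a typical figure.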

Power and cooling: the real ceiling

A standard 42U rack with 1U servers draws 5–15 kW. A rack filled with eight-GPU H100 servers draws 50–100 kW. Air cooling moves heat by blowing air over heatsinks and out the back of the rack; in practice it handles roughly 15 kW per rack in a hall run within the ASHRAE Class A2 inlet-air envelope. A 100 kW rack needs liquid cooling.

A GPU rack overshoots the air-cooling ceiling by roughly 3–7×, so liquid cooling stops being optional.

Two approaches dominate. Direct-to-chip (DtC) runs coolant through cold plates bolted directly to the GPU die and voltage regulators, removing heat where it is generated. Air still handles the remaining ~20% of rack heat (storage, NICs, fans). Full immersion submerges server boards in a dielectric fluid (engineered oil or a fluorocarbon). Immersion handles 100% of the rack heat with no fans, and the thermal mass of the fluid smooths out demand spikes — but it complicates maintenance and cable management.
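The direct-to-chip split is worth checking against the air-cooling ceiling. The sketch below uses only the figures quoted in this section (about 15 kW of air cooling per rack, about 80% of heat captured by cold plates); both constants are approximations and the function name is illustrative.

```python
AIR_CEILING_KW = 15.0   # approximate per-rack air-cooling limit quoted above

def residual_air_load_kw(rack_kw: float, liquid_fraction: float = 0.8) -> float:
    """Heat still rejected to air when cold plates capture `liquid_fraction`."""
    return rack_kw * (1.0 - liquid_fraction)

for rack_kw in (50, 100):
    air_kw = residual_air_load_kw(rack_kw)
    fits = air_kw <= AIR_CEILING_KW
    print(f"{rack_kw} kW rack: {air_kw:.0f} kW left for air, fits={fits}")
```

Under these assumptions a 50 kW rack leaves about 10 kW for air, which fits, while a 100 kW rack leaves about 20 kW, which does not. At the top of the density range the capture fraction has to be higher than 80%, or the remaining air load needs its own solution.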

The hard constraint for new GPU deployments is often not switch ports or fibre but power capacity. A datacentre hall sized for 10 kW/rack is already paid for in HVAC and PDUs; retrofitting it for 100 kW racks means replacing power feeds, cooling loops, and possibly the building transformer. That is a multi-year capital project. The risk of stranded power is why hyperscalers now buy or build facilities expressly for GPU density from the ground up.
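The power constraint is easy to quantify. A minimal sketch, assuming a hypothetical hall with 200 rack positions designed for 10 kW each; the hall size and the function name are made up for illustration.

```python
def racks_supported(hall_racks: int, design_kw_per_rack: float,
                    new_kw_per_rack: float) -> int:
    """How many high-density racks fit inside the hall's existing power budget."""
    budget_kw = hall_racks * design_kw_per_rack
    return int(budget_kw // new_kw_per_rack)

positions = 200                                   # hypothetical hall size
gpu_racks = racks_supported(positions, 10, 100)
print(gpu_racks)                                  # 20
print(positions - gpu_racks)                      # 180 positions left without power
```

The same power feed that served 200 conventional racks runs out after 20 GPU racks at 100 kW each, so nine out of ten rack positions cannot be used without new power and cooling.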

Trace it

A distributed training job on 64 GPUs (8 servers × 8 GPUs) runs 3× slower than expected. Diagnose the fabric bottleneck.

Step 1: GPU utilisation is 35%. Is the network the bottleneck?

Step 2: NCCL wait time is 65% of each training step. You see PFC PAUSE frames on every leaf uplink. What does that mean?

Step 3: the leaf has 2 × 400G uplinks but 8 × 200G server NICs. What is the oversubscription ratio, and what is the fix?

Step 4: you fix the oversubscription. Pause frames stop but training is still 1.5× slower than expected. What next?

Design challenge

Design a non-blocking 512-GPU fabric for an AI training cluster. Servers have 400 Gbps RoCE NICs.

Every leaf uplinks to every spine; leaves never connect to each other. ECMP hashes each flow across the equal-cost uplinks, so all paths are active at once — non-blocking bisection bandwidth. Add spines for more cross-section bandwidth, add leaves for more server ports.
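One way to approach the sizing is to split each leaf's ports evenly between servers and spines. The sketch below assumes one 400G NIC per GPU (512 server-facing ports), every link at the same speed, and 64-port switches; the radix and the `size_leaf_spine` helper are assumptions for illustration, not a reference design.

```python
import math

def size_leaf_spine(server_ports: int, switch_radix: int) -> dict:
    """Size a non-blocking two-tier fabric where all ports run at one speed.

    Each leaf uses half its ports for servers and the rest for uplinks,
    with exactly one uplink to every spine.
    """
    down_per_leaf = switch_radix // 2           # server-facing ports per leaf
    spines = switch_radix - down_per_leaf       # one uplink from each leaf per spine
    leaves = math.ceil(server_ports / down_per_leaf)
    if leaves > switch_radix:
        # Each spine needs one port per leaf; beyond this a third tier is needed.
        raise ValueError("needs a third tier: spines run out of ports")
    return {
        "leaves": leaves,
        "spines": spines,
        "server_ports_per_leaf": down_per_leaf,
        "spine_ports_used": leaves,
    }

plan = size_leaf_spine(server_ports=512, switch_radix=64)
print(plan)
```

With these assumptions the fabric needs 16 leaves and 32 spines, and each spine uses only 16 of its 64 ports, which leaves room to add leaves later without touching the existing ones.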

Recall before you leave

01. What is non-blocking bisection bandwidth in a Clos fabric, and how does ECMP achieve it?
02. Why does RoCE require lossless Ethernet, and what two mechanisms provide it?
03. Why is power and cooling now the limiting factor for datacentre GPU deployment?

Recap

A hyperscale datacentre fabric is a Clos spine-leaf mesh: each top-of-rack leaf has one uplink to every spine switch, and ECMP hashes flows across those uplinks to deliver non-blocking bisection bandwidth — 400G/800G optical in the spine, 100–400G to servers. GPU clusters use RoCE to write directly into remote memory with no kernel copy, which forces lossless Ethernet via PFC and ECN; tune the flow control wrong and a pause storm freezes the fabric. The data plane is steadily moving off the host CPU into SmartNICs and DPUs — AWS Nitro offloads VM networking entirely. 800G Ethernet (IEEE 802.3df) ships today, but the binding constraint is now power: 50–100 kW GPU racks demand liquid cooling and enough provisioned power that capacity is not stranded. Now when you see a GPU training job running at 30% GPU utilisation, your first check is the fabric: measure the oversubscription ratio, look for PFC PAUSE counters on the leaf uplinks, and verify that RoCE traffic is on a dedicated lossless priority class.
