Chapter 12

Infrastructure: The Unsung Hero

Derived from and in dialogue with The Smol Training Playbook by Hugging Face (Ben Allal, Tunstall, Tazi et al., 2025). This chapter is my own synthesis and interpretation.

Everyone training models has opinions about architecture and data. Almost nobody understands the hardware their model runs on. The infrastructure layer — GPUs, memory hierarchies, interconnects — gets treated like a utility: plug in compute, get tokens per second out. This is a mistake, and an expensive one.

The HuggingFace team trained SmolLM3 on 384 H100s for nearly a month, processing 11 trillion tokens. By their own account, it was not a smooth ride. Node failures, storage issues, run restarts. The ability to handle those problems — to anticipate them, recover from them, and keep training low-maintenance — depends entirely on understanding the hardware layer. Not at the level of a chip architect, but enough to reason about where your bottlenecks are and what to do about them.

This chapter is about developing that understanding. We'll start inside a single GPU — what it's actually made of, how it computes, how it moves data. Then we'll zoom out to how GPUs talk to each other, through shared memory, through custom interconnects, across nodes over a network. At every level, the question is the same: where is the bottleneck, and what does it cost you?

Inside a GPU: Compute and FLOPs

A GPU is a massively parallel processor built for throughput, not latency. Where a CPU optimizes for executing a small number of complex instruction streams as fast as possible, a GPU achieves performance by running thousands of simpler operations simultaneously. The fundamental architectural difference is not speed — it is breadth.

The basic compute unit is the Streaming Multiprocessor, or SM. The H100 has 132 of them. Each SM runs groups of 32 threads called warps in lockstep — all threads in a warp execute the same instruction at the same time on different data. When one warp stalls waiting on memory, the SM switches to another warp instantly, hiding the latency. This is the core mechanism that lets GPUs achieve high utilization: rather than waiting for slow memory, they keep dozens of warps in flight.

Inside each SM are two types of cores. CUDA cores handle general floating-point arithmetic. Tensor cores are specialized for matrix multiplication — specifically the fused multiply-add operations that dominate transformer workloads. Tensor cores are why precision matters so much when discussing FLOPs. The H100 delivers 67 TFLOPs at FP32, 990 TFLOPs at BF16, and 3,960 TFLOPs at FP8. That is not a coincidence — it reflects how much more work the Tensor cores can pack into the same silicon at lower precision, when the operations can be mapped to their native format.

The practical caveat: theoretical peak FLOPs are rarely achieved. The HuggingFace team benchmarked their H100s against real Llama 70B training matrix shapes and found BF16 throughput of 714–758 TFLOPs — about 72–77% of the theoretical 990 TFLOPs peak. End-to-end training efficiency (model FLOPs utilization, or MFU) is lower still: Meta achieved 38–41% for Llama 3 405B; the SmolLM3 3B run achieved roughly 30%. Much of the gap comes from communication overhead in distributed training. When planning runs, use achievable numbers, not marketing specs.
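These utilization numbers are easy to sanity-check yourself. Here is a minimal sketch using the standard 6·N·D FLOPs approximation and rough SmolLM3-shaped figures (3B parameters, 11T tokens, 384 GPUs, ~30 days); the approximation undercounts attention FLOPs and ignores restarts, so it lands somewhat below the reported ~30%:

```python
def mfu(n_params, n_tokens, n_gpus, wall_seconds, peak_flops_per_gpu):
    # Standard approximation: ~6 FLOPs per parameter per token (fwd + bwd)
    total_training_flops = 6 * n_params * n_tokens
    achieved_flops_per_gpu = total_training_flops / (n_gpus * wall_seconds)
    return achieved_flops_per_gpu / peak_flops_per_gpu

# Rough SmolLM3-shaped numbers against the 990 TFLOPs BF16 peak.
print(round(mfu(3e9, 11e12, 384, 30 * 86400, 990e12), 3))  # ~0.20
```

The same arithmetic, run forward, is how you budget a planned run: pick a target MFU from comparable published runs, not from the datasheet, and solve for wall-clock time.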

The Memory Hierarchy

Here is the thing about GPU performance that takes time to internalize: for most practical deep learning workloads, the bottleneck is not compute. It is data movement. A GPU can have teraflops of theoretical compute, but if the memory system cannot feed it fast enough, those compute units sit idle. Understanding the memory hierarchy is really understanding where your time goes.

Modern GPUs organize memory in layers, each faster but smaller than the one below it. At the bottom is HBM — High Bandwidth Memory, the GPU's main memory. The H100 has 80 GB of HBM3 with a theoretical bandwidth of 3.35 TB/s. Above that is a 50 MB L2 cache shared across the GPU at around 13 TB/s. Each SM has its own L1 cache and shared memory, 256 KB combined, at roughly 31 TB/s. At the top, registers — private to individual threads — operate at effective bandwidths approaching 100 TB/s per SM.

Fig. 1 — H100 Memory Hierarchy: Size vs. Bandwidth

Each memory tier trades capacity for bandwidth. Registers are fastest but tiny. HBM is large but slow relative to compute throughput. The gap between HBM and registers is roughly 30×. Most kernels spend their time waiting on HBM.

The practical implication is captured by a principle from Horace He: for memory-bound operations, computation is effectively free. If your kernel reads data from HBM, does a few operations, and writes back — the time is dominated entirely by the HBM access. Adding more arithmetic in between costs nothing, as long as you're not changing the memory access pattern. This is why operator fusion works so well.

Flash Attention is the canonical example. Standard attention materializes the full N×N attention matrix in HBM: write scores, read for softmax, write again, read for the V multiplication. Flash Attention tiles the computation so intermediate results stay in fast SRAM, never touching HBM until the final output. The result is 2–4× speedups and an O(N) rather than O(N²) memory footprint — not because it does less compute, but because it does far less data movement.
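A back-of-envelope traffic comparison makes the point. This is simplified accounting (single attention head, BF16, counting only the dominant terms), not the exact access counts from the Flash Attention paper:

```python
def standard_attn_hbm_gb(seq_len, bytes_per_el=2):
    # The N x N score matrix is written after QK^T, read and rewritten
    # for softmax, then read again for the multiply with V: ~4 round trips.
    return 4 * seq_len**2 * bytes_per_el / 1e9

def flash_attn_hbm_gb(seq_len, head_dim, bytes_per_el=2):
    # Scores stay in on-chip SRAM; HBM only sees the Q, K, V reads and
    # the output write, each of size seq_len x head_dim.
    return 4 * seq_len * head_dim * bytes_per_el / 1e9

n, d = 32768, 128
print(standard_attn_hbm_gb(n))   # ~8.6 GB of score traffic alone
print(flash_attn_hbm_gb(n, d))   # ~0.03 GB
```

At a 32K context the score traffic alone is hundreds of times the Q/K/V/O traffic, which is why the speedup grows with sequence length.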

There is a diagnostic tool for understanding whether your kernel is compute-bound or memory-bound: the roofline model. Plot your kernel's arithmetic intensity (FLOPs per byte moved) on one axis, achieved performance on the other, and the roofline tells you which ceiling you're hitting. A kernel in the memory-bound region cannot be improved by using faster Tensor cores. It needs to move less data — through fusion, better access patterns, or higher arithmetic intensity.
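The roofline is simple enough to compute by hand. A sketch using the H100 numbers from this chapter (990 TFLOPs BF16 peak, 3.35 TB/s HBM); the single-ceiling model is a simplification that ignores caches:

```python
def attainable_tflops(ai_flops_per_byte, peak_tflops=990.0, hbm_tb_per_s=3.35):
    # Performance is capped by the lower of the compute ceiling and the
    # bandwidth ceiling at this arithmetic intensity (TB/s * FLOPs/byte = TFLOPs).
    return min(peak_tflops, hbm_tb_per_s * ai_flops_per_byte)

ridge = 990.0 / 3.35        # ~296 FLOPs/byte separates the two regimes

elementwise_ai = 1 / 6      # BF16 add: 1 FLOP per 6 bytes (2 reads, 1 write)
print(attainable_tflops(elementwise_ai))   # ~0.56 TFLOPs: hopelessly memory-bound
print(attainable_tflops(1000))             # 990.0: a big GEMM hits the compute roof
```

The ridge point near 300 FLOPs per byte is the number to remember: any kernel whose arithmetic intensity sits well below it will not benefit from faster Tensor cores.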

How GPUs Talk to the World

A single GPU is only part of the story. Before it can compute anything, data must be loaded from somewhere. In distributed training, GPUs must constantly exchange gradients, activations, and weights with each other. The bandwidth and latency of these external communication links determine how much of your theoretical GPU performance you can actually use.

There are four external links that matter: CPU to GPU, GPU to GPU within a node (intranode), GPU to GPU across nodes (internode), and GPU to storage. Each has completely different performance characteristics, and bottlenecking on any one of them can tank your overall throughput.

CPU to GPU: The PCIe Bottleneck

The CPU's job in training is to schedule work on the GPU — launching kernels, managing memory allocations, coordinating data transfers. It does this over a PCIe connection. In typical AWS P5 instances (H100 nodes), the CPU-to-GPU path runs through two PCIe hops: PCIe Gen4 x8 from the CPU to a PCIe switch (theoretical 15.75 GB/s), then PCIe Gen5 x16 from the switch to the GPU (63 GB/s). The bottleneck is the first hop.

Actual measured bandwidth for CPU-to-GPU transfers peaks around 14.2 GB/s — about 90% of the PCIe Gen4 x8 theoretical limit. Round-trip latency for a CPU-to-GPU operation is approximately 1.4 microseconds. That sounds small, but for workloads that launch many small kernels, or that require frequent CPU–GPU synchronization (some mixture-of-experts implementations fall into this trap), the accumulated latency becomes a real cost. CUDA Graphs can help by capturing a sequence of operations and replaying them as a single unit, eliminating per-kernel round-trip overhead.
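The bandwidth side of this cost is easy to estimate. A sketch using the measured numbers above, ignoring the per-operation latency and pinned-memory details:

```python
def transfer_ms(n_bytes, gb_per_s):
    # Pure bandwidth cost; ignores the ~1.4 us per-operation round-trip
    # latency and any host-side allocation overhead.
    return n_bytes / (gb_per_s * 1e9) * 1e3

one_gb = 1e9
print(transfer_ms(one_gb, 14.2))     # ~70 ms over the measured CPU-to-GPU path
print(transfer_ms(one_gb, 3350.0))   # ~0.3 ms at HBM bandwidth on the device
```

Two hundred times slower than on-device memory: the reason data should cross PCIe once, early, and asynchronously, not per step.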

On multi-socket systems, NUMA affinity matters. If your GPU process runs on CPU cores in a different NUMA node than the one physically connected to the GPU, memory accesses must traverse the CPU interconnect — in tested AMD EPYC configurations, the cross-NUMA memory access latency is about 3.2× higher than same-socket access. This is invisible in aggregate statistics and can cause mysterious throughput degradation.

GPU to GPU: The Interconnect Ladder

Within a single node, GPUs can communicate in three ways, with dramatically different performance characteristics. Understanding the gap between them is one of the more practically useful things in distributed training.

Fig. 2 — Intranode GPU-to-GPU Bandwidth (H100, 2-GPU SendRecv)

Measured NCCL SendRecv bandwidth between two H100 GPUs via three paths. NVLink is 9× faster than EFA and 112× faster than routing through the CPU. NCCL automatically selects NVLink when available.

The slowest path routes data through CPU memory: GPU1 → PCIe switch → CPU RAM → PCIe switch → GPU2. Measured throughput: ~3 GB/s. This path saturates both PCIe links and CPU memory buses, and gets worse when multiple GPUs compete for the same CPU memory bandwidth simultaneously.

GPUDirect RDMA over EFA (Amazon's Elastic Fabric Adapter) bypasses the CPU entirely, using direct memory access between GPU buffers. With four EFA NICs per GPU on P5 instances (each providing 100 Gbps), this path achieves ~38 GB/s — a meaningful improvement, but still a fraction of what NVLink offers.

NVLink is the answer. NVIDIA's direct GPU-to-GPU interconnect, NVLink 4.0 on H100, provides 900 GB/s theoretical bidirectional bandwidth per GPU. Measured bidirectional bandwidth across all GPU pairs on a DGX H100 node: 786 GB/s, about 87% of theoretical. The DGX topology uses four NVSwitches connecting eight GPUs, ensuring every GPU pair has a single-hop path. NCCL automatically prioritizes NVLink for intranode communication when available — the practical lesson is not to disable it accidentally with misconfigured environment variables.

There is a further optimization for collective operations (all-reduce, all-gather) called NVLink SHARP (NVLS), which offloads the reduction computation to the NVSwitches themselves rather than pulling data back to the GPUs. For all-reduce operations, this provides roughly a 1.3× throughput improvement. For all-to-all operations — as used in mixture-of-experts routing — NVLS does not help, which is one of the reasons MoE architectures face tighter communication constraints.

Internode: The Network Wall

When training spans multiple nodes, the interconnect drops from NVLink's 900 GB/s to whatever the cluster network provides. On AWS P5, this is EFA at 3,200 Gbps (400 GB/s) total per node — across 32 EFA NICs. Shared across eight GPUs, that is roughly 50 GB/s of internode bandwidth per GPU, more than an order of magnitude below intranode NVLink. This is the fundamental constraint that makes internode communication a bottleneck in large-scale training and why parallelism strategy matters so much.

Tensor parallelism — splitting a single transformer layer across GPUs — requires extremely fast communication between those GPUs because each forward pass involves all-reduces across the split. This is why tensor parallelism is usually kept within a node, where NVLink bandwidth is available. Pipeline parallelism — splitting layers across nodes — works at the granularity of full activations passed between stages, which can be overlapped with computation and tolerates higher latency. Data parallelism — running identical models on different data — requires gradient synchronization at the end of each backward pass, which can be overlapped with the next forward pass. These strategies compose, and understanding the bandwidth characteristics of each link level is what determines how to compose them.
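A rough cost model makes the composition concrete. In a ring all-reduce, each GPU moves about 2(n−1)/n of the payload over its link; the bandwidths below are illustrative stand-ins for an NVLink-class link versus a per-GPU share of internode network bandwidth, not measurements:

```python
def ring_allreduce_seconds(n_gpus, payload_bytes, link_gb_per_s):
    # Ring all-reduce: each GPU sends (and receives) roughly
    # 2 * (n - 1) / n of the payload over its own link.
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic / (link_gb_per_s * 1e9)

grad_bytes = 3e9 * 2   # a 3B-parameter model's gradients in BF16: 6 GB

# Illustrative link speeds (assumed, not measured):
print(ring_allreduce_seconds(8, grad_bytes, 340.0))   # ~0.03 s intranode-class
print(ring_allreduce_seconds(8, grad_bytes, 50.0))    # ~0.21 s internode-class
```

The absolute numbers matter less than the structure: the collective that runs once per step over the slow link is the one you overlap with compute, and the one that runs inside every layer is the one you keep on NVLink.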

Storage: The Forgotten Tier

Data has to come from somewhere. In large training runs, the storage layer — NVMe drives, distributed file systems, object stores — is a real bottleneck that often goes undiagnosed because it doesn't show up in GPU utilization metrics. A GPU that is waiting on data to load from disk looks the same as a GPU that is idle; both report low utilization.

The practical rule is that your data pipeline must deliver the next batch before the GPU finishes processing the current one. On P5 instances, local NVMe provides high sequential read bandwidth, but random access patterns or slow preprocessing can break this. The solution is to pre-tokenize data offline, write compact binary formats, and use multiple dataloader workers to prefetch aggressively. For multi-node training, shared storage becomes a contention point when hundreds of processes try to read from the same filesystem simultaneously.
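The prefetching idea can be sketched with nothing but the standard library; real pipelines use framework dataloaders with multiple worker processes, but the bounded-buffer structure is the same:

```python
import queue
import threading

def prefetch(batches, depth=4):
    """Decouple batch production from consumption with a bounded buffer."""
    buf = queue.Queue(maxsize=depth)   # bounded: producer runs at most `depth` ahead
    done = object()                    # sentinel marking end of the stream

    def producer():
        for batch in batches:
            buf.put(batch)             # blocks when the consumer falls behind
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item

# The consumer (the training step) sees batches that were loaded while
# the previous step was still computing.
for batch in prefetch(iter(range(4))):
    pass
```

The depth of the buffer is the knob: deep enough to hide storage latency spikes, shallow enough not to hoard host memory.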

Resilience: The Real Skill

Understanding benchmarks is the easy part. The hard part is keeping a month-long training run alive across hardware that will inevitably fail. At the scale of SmolLM3 — 384 GPUs, 30 days — node failures are not edge cases. They are scheduled events you have not yet received notice of.

The essentials: checkpoint frequently, at intervals short enough that losing a checkpoint is a minor inconvenience rather than days of lost work. Checkpoint asynchronously so the write doesn't block training. Design your training loop to resume cleanly from any checkpoint — including the dataloader state, so you don't re-train on data the model has already seen. Monitor your jobs actively; a stalled run with no alert can waste hours of cluster time before anyone notices.
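Here is a minimal sketch of the crash-safe part of that loop, using pickle purely for illustration; real runs use framework-native asynchronous checkpoint writers, but two points carry over: write to a temp file then atomically rename, so a crash never leaves a torn checkpoint, and save the dataloader position alongside the weights:

```python
import os
import pickle

def save_checkpoint(path, step, model_state, dataloader_offset):
    # Write to a temp file, then atomically replace: readers either see
    # the old complete checkpoint or the new complete one, never a torn file.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(
            {"step": step, "model": model_state, "offset": dataloader_offset}, f
        )
    os.replace(tmp, path)

def load_checkpoint(path):
    # Resume picks up the step counter AND the dataloader offset, so the
    # model does not re-train on data it has already seen.
    with open(path, "rb") as f:
        return pickle.load(f)
```

Making the write asynchronous is then a matter of snapshotting the state and handing the serialization to a background worker, so the training loop never blocks on storage.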

Node health checks before long runs are underrated. Hardware passes burn-in tests at purchase and then quietly develops problems over months of operation. A brief validation run — testing GPU-to-GPU bandwidth, verifying NVLink topology, confirming memory bandwidth on every node — is cheap insurance before committing to a multi-day job on potentially degraded hardware. The HuggingFace team learned to run these checks as standard practice after the SmolLM3 training experience.

What Infrastructure Actually Teaches

Working through this layer changes how you think about the abstractions above it. When you understand that your model's activations need to travel through multiple levels of cache, each with different bandwidth and latency, you start to see architectural choices differently. Flash Attention is not just a clever algorithm — it is a direct response to the specific shape of the H100 memory hierarchy. The choice of parallelism strategy is not just about scaling — it is about which communication links are cheap and which are expensive.

There is also something humbling about it. The gap between a 990 TFLOPs theoretical peak and a 30% MFU at the end of a real training run represents a lot of wasted potential — communication overhead, memory bandwidth limits, load imbalances, kernel launch latency. Some of that gap is recoverable with better engineering. Some of it is just the cost of doing distributed computation on hardware that was not designed specifically for your workload.

The previous chapters covered what your model learns and how to read its training signals. This one is about the substrate that learning happens on. The tokens flowing through transformer layers, the gradient updates propagating backward through the computation graph — all of that is ultimately electrons moving through silicon along paths constrained by physics and engineering tradeoffs made years before you started your training run. Understanding those constraints is not optional. It is what lets you use them well.