Data Parallelism splits the batch. Tensor Parallelism splits the layers horizontally. But what happens when the model is too large to fit on a single node, the sequence is 128k tokens long, or the architecture has 256 specialized experts? You need more dimensions.
The previous chapter covered the first two scaling strategies: DP (with ZeRO) and TP (with SP). Together they handle batch-dimension and hidden-dimension parallelism. This chapter introduces the remaining three: Pipeline Parallelism splits the model vertically across nodes; Context Parallelism splits long sequences across the attention computation; Expert Parallelism distributes MoE experts. Then we put all five together and face the real question: given your model, your cluster, and your target batch size, which combination should you actually use?
Pipeline Parallelism: Splitting by Depth
Tensor Parallelism's communication sits in the critical path of every layer — all-reduces that cannot be hidden behind computation. This is why TP struggles to scale past the intra-node NVLink boundary. For models too large to fit on a single node's worth of GPUs (70B+ parameters), we need a different strategy.
Pipeline Parallelism (PP) solves this by splitting the model's layers across GPUs vertically. GPU 1 takes layers 1–4, GPU 2 takes layers 5–8, and so on. Each GPU only stores and computes its assigned layers. The forward pass becomes a sequential pipeline: GPU 1 processes a micro-batch and sends the activations to GPU 2, which processes them through its layers and sends them forward.
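The mechanics reduce to function composition: each stage applies its layers and hands only the boundary activations to the next stage. A single-process sketch (NumPy stand-ins for GPUs; the layer shapes, stage split, and toy linear+ReLU "layers" are all illustrative assumptions, not a real training stack):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# 8 toy "layers" split across 2 hypothetical pipeline stages (GPUs).
layers = [rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(8)]
stages = [layers[:4], layers[4:]]   # GPU 1: layers 1-4, GPU 2: layers 5-8

def stage_forward(stage, acts):
    """One stage's forward pass: apply its layers locally."""
    for w in stage:
        acts = np.maximum(acts @ w, 0)   # toy layer: linear + ReLU
    return acts

x = rng.normal(size=(2, d))              # one micro-batch
acts = x
for stage in stages:                     # only the boundary activations
    acts = stage_forward(stage, acts)    # cross between stages
```

Only `acts` at each stage boundary would travel over the network; the weights never move.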
The bandwidth requirement is comparatively gentle. Unlike TP, which communicates several times within each layer, PP only sends activations at a handful of layer boundaries. This makes PP well-suited for inter-node communication, where bandwidth is lower but the communication is infrequent enough to tolerate it.
The problem is the bubble.
The Bubble Problem
In a naive implementation — forward through all stages, then backward through all stages — each GPU sits idle while waiting for others to finish. If you have p pipeline stages and the forward + backward time for one micro-batch through one stage is (t_f + t_b), the bubble overhead ratio is:

r_bubble = (p - 1)(t_f + t_b) / (t_f + t_b) = p - 1
With 8 pipeline stages, the bubble is 7× the useful computation time. Most of your GPUs are idle most of the time. This is obviously unacceptable. The history of pipeline parallelism is largely the history of clever scheduling to shrink this bubble.
Micro-batching (AFAB): The first tool is splitting the batch into m micro-batches. While GPU 2 processes micro-batch 1, GPU 1 can start micro-batch 2. In the All-Forward-All-Backward (AFAB) schedule, you run all forward passes first, then all backward passes. The bubble shrinks:

r_bubble = (p - 1) / m
With 8 stages and 32 micro-batches, the bubble drops to about 22% — painful but workable. The downside: you must store activations for all m micro-batches simultaneously, which can explode memory.
1F1B (One-Forward-One-Backward): The insight here is to start backward passes as soon as possible. Once the first micro-batch completes its forward pass through all stages, start the backward pass immediately — even while later micro-batches are still in their forward pass. The steady state alternates: one forward, one backward, one forward, one backward. The bubble is the same size as AFAB (same formula), but activation memory drops from m stored micro-batches to just p. This lets you safely increase m, which further shrinks the bubble.
Fig. 1 — Pipeline Bubble Ratio: Naive vs. Micro-batched Schedules
Bubble overhead as a fraction of total computation time, for p=8 pipeline stages. More micro-batches reduce the bubble proportionally. 1F1B and AFAB have the same bubble size, but 1F1B stores fewer activations, enabling more micro-batches in practice.
Interleaved stages: Rather than assigning contiguous layers to each GPU (layers 1–4 on GPU 1), assign them in a round-robin pattern (layers 1, 5 on GPU 1; layers 2, 6 on GPU 2; etc.). Each micro-batch loops through the pipeline multiple times, but each pass through a GPU is shorter by a factor of v (the number of interleaved chunks per GPU). The bubble shrinks further:

r_bubble = (p - 1) / (v · m)
The tradeoff: communication volume increases by v (more, smaller activation transfers) and scheduling becomes more complex. Llama 3.1 used a 1F1B schedule with interleaved stages and a tunable priority between "depth-first" (close loops fast) and "breadth-first" (fill the pipeline first).
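The bubble arithmetic in this section fits in one helper — p stages, m micro-batches, v interleaved chunks per GPU (v=1 recovers the contiguous AFAB/1F1B case):

```python
def bubble_ratio(p, m=1, v=1):
    """Pipeline bubble overhead relative to useful compute:
    (p - 1) / (v * m) for p stages, m micro-batches, and v
    interleaved chunks per GPU (v=1 means contiguous stages)."""
    return (p - 1) / (v * m)

print(bubble_ratio(8))         # -> 7.0   (naive: bubble is 7x the work)
print(bubble_ratio(8, m=32))   # -> 0.21875, i.e. ~22% (AFAB / 1F1B)
print(bubble_ratio(8, m=32, v=2))  # interleaving halves it again
```

The v=2 case shows why interleaving is worth its extra communication: the bubble halves again without needing more micro-batches.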
Zero Bubble and DualPipe: The most recent frontier. The key observation is that the backward pass through a matrix multiplication involves two separable operations: backward for the inputs (B) and backward for the weights (W). While B must happen before earlier layers can continue their backward pass, W only needs to happen before the optimizer step. By scheduling W operations to fill the bubble, you can theoretically reduce it to zero. DeepSeek V3's DualPipe implementation runs two streams propagating from both ends of the pipeline, interleaving them to eliminate nearly all idle time. Their technical report notes they achieved "near-zero all-to-all communication overhead."
The playbook's benchmarks confirm PP's strength at inter-node scaling: going from 1 node (PP=8) to 2 nodes (PP=16) only drops performance by ~14%, compared to ~43% for TP at the same boundary. PP communicates less frequently and tolerates higher latency better.
Context Parallelism: Taming Long Sequences
TP+SP already shards activations, but the attention computation is quadratic in sequence length. At 128k tokens, even with full activation recomputation, the memory for attention at layer boundaries can exceed what a single node provides. Context Parallelism (CP) addresses this by splitting the input sequence across GPUs — not just outside the TP region (as SP does), but through the entire model, attention included.
Most operations — MLPs, LayerNorm — process tokens independently, so splitting the sequence across GPUs is trivial. The challenge is attention: each token needs access to key/value pairs from all positions in the sequence (or all prior positions, for causal attention). The solution is Ring Attention.
Each GPU starts with its local chunk of queries, keys, and values. It computes attention for the K/V pairs it already has. Simultaneously, it sends its K/V pairs to the next GPU in the ring and receives K/V pairs from the previous GPU. When the new K/V pairs arrive, it computes the next partial attention, accumulating results using online softmax (the same mathematical trick behind Flash Attention). After p rounds, every GPU has attended to the full sequence.
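The accumulation step is the crux, so here is a single-process sketch of the ring with the online-softmax update. Everything is simulated: the "ring" is a loop over chunk indices rather than real sends and receives, and there is no causal mask, to keep the accumulation logic bare:

```python
import numpy as np

def softmax_attention(q, k, v):
    """Reference: standard full (non-causal) attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def ring_attention(q_chunks, k_chunks, v_chunks):
    """Each 'GPU' keeps its query chunk fixed while K/V chunks rotate
    around the ring; partial results accumulate via online softmax."""
    p = len(q_chunks)
    outputs = []
    for rank in range(p):
        q = q_chunks[rank]
        d = q.shape[-1]
        m = np.full(q.shape[0], -np.inf)      # running row max
        l = np.zeros(q.shape[0])              # running softmax normalizer
        acc = np.zeros_like(q)                # unnormalized output
        for step in range(p):
            # In a real ring these arrive from the previous rank;
            # here we just index the chunk that would arrive.
            src = (rank - step) % p
            k, v = k_chunks[src], v_chunks[src]
            scores = q @ k.T / np.sqrt(d)
            m_new = np.maximum(m, scores.max(axis=-1))
            correction = np.exp(m - m_new)    # rescale old partial sums
            w = np.exp(scores - m_new[:, None])
            l = l * correction + w.sum(axis=-1)
            acc = acc * correction[:, None] + w @ v
            m = m_new
        outputs.append(acc / l[:, None])
    return np.concatenate(outputs)
```

After p rounds every query chunk has seen every K/V chunk, and the concatenated result matches full attention exactly — the online-softmax rescaling makes the chunked accumulation lossless.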
A naive ring allocation — tokens 1–4 on GPU 1, tokens 5–8 on GPU 2 — causes load imbalance with causal masks. GPU 1 computes much less work than GPU 4 because early tokens attend to fewer positions. Zig-Zag Ring Attention fixes this by interleaving the token assignment (tokens 1, 8, 9, 16 on GPU 1; tokens 2, 7, 10, 15 on GPU 2; etc.), balancing the causal attention workload evenly across all GPUs.
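The chapter's example interleaves at token granularity; a common chunk-granularity variant of the same idea pairs each GPU's cheap early chunk with its expensive mirror chunk (the exact interleaving varies by implementation — this is a sketch, not the scheme of any particular library):

```python
def zigzag_assignment(n_tokens, n_gpus):
    """One standard zig-zag placement: split the sequence into 2p chunks
    and give GPU g chunk g plus chunk (2p - 1 - g). Under a causal mask,
    chunk i attends to roughly i earlier chunks, so pairing cheap chunk g
    with expensive chunk (2p - 1 - g) equalizes every GPU's workload."""
    p = n_gpus
    chunk = n_tokens // (2 * p)
    placement = {}
    for g in range(p):
        first = list(range(g * chunk, (g + 1) * chunk))
        mirror = 2 * p - 1 - g
        second = list(range(mirror * chunk, (mirror + 1) * chunk))
        placement[g] = first + second
    return placement
```

With 16 tokens on 4 GPUs, GPU 0 gets tokens {0, 1, 14, 15}: the cheapest and most expensive positions together, so each GPU ends up with the same causal-attention workload.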
Expert Parallelism: Distributing Specialized Modules
Mixture-of-Experts (MoE) models like DeepSeek-V3, GPT-4, and Mixtral replace the single MLP in each Transformer layer with multiple "expert" MLPs. A learned router sends each token to one or a few experts. The result is a model with massive parameter counts (DeepSeek-V3 has 256 experts) but manageable compute per token — each token only activates a fraction of the total parameters.
Expert Parallelism (EP) distributes these experts across GPUs. Since experts are fully independent (no shared computation), each GPU simply hosts a different subset of experts. The communication primitive is all-to-all: tokens are routed to the GPU holding their assigned expert, processed, and routed back. This is lightweight compared to TP — no splitting of matrix multiplications, just routing hidden states to the right place.
EP is typically combined with DP, because EP only affects the MoE layers. Without DP, the non-expert layers (attention, LayerNorm) would be redundantly computed on every GPU. DeepSeek-V3 further optimizes by constraining the router to send each token to at most 4 nodes, keeping most token traffic intra-node to reduce cross-node communication.
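The dispatch-process-combine pattern can be sketched in a single process. The router, expert weights, and top-1 routing here are all toy assumptions — real systems use learned top-k routing and an actual all-to-all collective — but the data movement is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 8, 4, 4
x = rng.normal(size=(n_tokens, d_model))

# Hypothetical router: score each token per expert, take the argmax
# (top-1 routing; DeepSeek-V3 uses top-k with per-node limits).
router_w = rng.normal(size=(d_model, n_experts))
assignment = (x @ router_w).argmax(axis=-1)    # expert id per token

# One tiny "expert" per GPU (a single weight matrix stands in for the MLP).
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

# Dispatch (the all-to-all): gather each expert's tokens, process them
# locally, and scatter the results back to their original positions.
out = np.empty_like(x)
for e in range(n_experts):
    idx = np.where(assignment == e)[0]         # tokens routed to expert e
    if idx.size:
        out[idx] = x[idx] @ experts[e]
```

Note that only hidden states move — the expert weights stay put, which is why EP's communication is so much lighter than TP's.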
The Five Dimensions, Composed
You now have the complete toolkit:
| Method | Shards what | Along which dimension | Primary bottleneck |
|---|---|---|---|
| Data Parallelism (DP) | Activations (different data per replica) | Batch | Limited by max batch size; communication at scale |
| Tensor Parallelism (TP/SP) | Parameters, activations | Hidden dimension / sequence | Requires high-bandwidth intra-node communication |
| Pipeline Parallelism (PP) | Model parameters (by layer) | Model depth | Idle bubble; complex schedules |
| Context Parallelism (CP) | Activations in attention | Sequence length | Communication in attention layers |
| Expert Parallelism (EP) | Expert parameters | Expert dimension | Requires MoE architecture; routing overhead |
And the three ZeRO stages, which layer on top of DP:
| ZeRO Stage | Shards | Communication cost vs. vanilla DP |
|---|---|---|
| ZeRO-1 | Optimizer states | Equivalent |
| ZeRO-2 | Optimizer states + gradients | Equivalent |
| ZeRO-3 (FSDP) | Optimizer states + gradients + params | ~1.5× (3Ψ vs 2Ψ) |
These dimensions are largely orthogonal and compose in predictable ways. TP occupies the fast intra-node NVLink. PP occupies the slower inter-node links for large models. DP spans whatever remains for batch-level scaling. CP adds sequence-dimension distribution when sequences are very long. EP distributes experts when you have MoE architectures.
The playbook's summary diagram captures this: for a single Transformer layer in an MoE variant, TP shards the hidden dimension within the TP group, SP/CP shard the sequence dimension, PP distributes layers across pipeline stages, EP distributes experts, and DP replicates across the remaining GPUs.
Finding the Right Configuration
Knowing the five dimensions is the easy part. Choosing the right combination for your specific model, cluster, and training target is the hard part. The playbook provides a systematic decision process, validated against thousands of benchmark configurations.
Step 1: Fit a training step in memory. Start by determining whether a single model instance fits on your available GPUs. For models under 10B, a single technique suffices — TP=8 or ZeRO-3 across 8 GPUs. For 10B–100B, combine TP=8 with PP (for depth sharding) or ZeRO-3 (for parameter sharding). For very long sequences, add CP. For MoE, add EP.
Step 2: Hit your target global batch size. Adjust DP degree and gradient accumulation steps to reach the desired batch size. Maximize DP over gradient accumulation (parallel is faster than sequential). For long sequences, CP can help increase the effective batch size.
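The bookkeeping in Step 2 is a single identity — global batch = DP degree × micro-batch size × accumulation steps:

```python
def grad_accum_steps(global_batch, dp_degree, micro_batch):
    """Accumulation steps needed so that
    dp_degree * micro_batch * steps == global_batch."""
    per_step = dp_degree * micro_batch
    assert global_batch % per_step == 0, "global batch must divide evenly"
    return global_batch // per_step

print(grad_accum_steps(2048, 64, 8))  # -> 4
```

Maximizing DP means pushing `dp_degree` up until this returns 1; anything above 1 is sequential work that parallelism could not absorb.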
Step 3: Optimize throughput. This is where empirical benchmarking matters. Scale TP within the node first (fast NVLink). Add DP with ZeRO-2 across nodes. When DP communication becomes a bottleneck (at ~512+ GPUs), transition to PP for the inter-node dimension. Experiment with micro-batch sizes — the optimal balance depends on your specific compute-to-communication ratio.
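The three steps can be condensed into a toy decision helper. The thresholds (10B parameters, 512 GPUs, TP=8) are illustrative defaults from this chapter, not prescriptive rules, and the real answer always comes from benchmarking:

```python
def suggest_parallelism(params_b, n_gpus, long_context=False, moe=False):
    """Toy sketch of the three-step process; returns a list of techniques."""
    plan = []
    # Step 1: fit a training step in memory.
    if params_b < 10:
        plan.append("TP=8 or ZeRO-3 within one node")
    else:
        plan.append("TP=8 intra-node")
        plan.append("PP or ZeRO-3 across nodes")
    if long_context:
        plan.append("CP for the sequence dimension")
    if moe:
        plan.append("EP for the experts")
    # Step 2: reach the target global batch size.
    plan.append("DP + gradient accumulation to hit the global batch")
    # Step 3: past ~512 GPUs, DP communication dominates; shift to PP.
    if n_gpus >= 512:
        plan.append("prefer PP over DP for the inter-node dimension")
    return plan
```

For example, a 70B MoE model with long context on 1024 GPUs picks up TP, PP/ZeRO-3, CP, EP, and the PP-over-DP inter-node preference, while a 3B model on one node reduces to a single technique.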
Fig. 2 — Optimal Parallelism Strategy by Model Size and Cluster Scale
Illustrative MFU (Model FLOPs Utilization) across model sizes and node counts. Larger models achieve higher utilization at scale. Small models on large clusters suffer from communication overhead dominating compute. At 1024+ GPUs, combining TP+PP+DP is typically required.
The playbook ran every possible distributed configuration for multiple model sizes across 1–64 nodes of 8×H100s and distilled the results into heatmaps of optimal configurations. Several patterns emerge:
First, efficiency drops as you add nodes, especially for smaller models. A 3B model on 64 nodes has far more communication overhead relative to its compute than a 70B model on the same cluster. You can partially compensate by increasing batch size, but you are ultimately constrained by your target global batch size.
Second, large models need enough GPUs to fit comfortably. An 80B model on 4 nodes barely fits and runs inefficiently near the GPU memory limit. Give it 8–16 nodes and efficiency jumps dramatically.
Third, implementation quality matters as much as strategy. The playbook team found that after optimizing their PP code, PP outperformed TP in scenarios where TP had previously won. After further optimizing TP's communication overlap, TP regained the lead. The theoretical ranking of strategies is constantly shifting based on how well each is implemented for your specific framework and hardware.
The Practical Reality
The playbook team ran over 4,100 distributed experiments (over 16,000 including test runs) on up to 512 GPUs. Even with the theoretical understanding and carefully planned configurations, the experience was humbling. PyTorch processes failed to clean up. Slurm forcefully terminated jobs, causing node failures. Simple benchmarks that should take minutes stretched into hours. Some jobs hung indefinitely.
This echoes what the SmolLM3 team experienced during their month-long training run (covered in Chapter 12): at scale, the challenge is not just choosing the right parallelism strategy. It is keeping the system alive while that strategy executes. Checkpoint frequently, checkpoint asynchronously, monitor everything, and expect failures.
The gap between theoretical peak performance and what you actually achieve in a training run — Meta achieved 38–41% MFU for Llama 3 405B — is the accumulated cost of all these realities: communication overhead, pipeline bubbles, load imbalances, kernel launch latency, hardware failures, and the imperfect overlap between computation and communication.
What This Chapter Teaches
We started this series (Chapters 12–14) at the hardware layer: what a GPU is made of, how memory hierarchies work, how GPUs talk to each other. Then we covered the first two parallelism dimensions: DP/ZeRO for batch scaling, TP/SP for intra-layer sharding. Now we have the complete picture: PP for depth, CP for sequences, EP for experts, and the framework for composing all five.
The Ultra-Scale Playbook's contribution is not just cataloging these techniques — the techniques are well-documented in their respective papers. The contribution is showing, through thousands of real experiments, how they interact in practice: where theoretical expectations hold, where they break down, and what the actual throughput and memory numbers look like on real hardware.
The conclusion from the playbook resonates: "Orchestrating large clusters of GPUs to train LLMs efficiently is no easy feat. It involves choosing the right parallelization strategy for a given model and cluster size, overlapping communication and computation where possible, and writing custom kernels that take into account the hardware layout."
None of these techniques is a silver bullet. Every one of them is a tradeoff between memory, computation, and communication. The art is in finding the right balance for your specific context — and having the infrastructure resilience to sustain it across days or weeks of continuous training. That balance is what separates a training run that finishes from one that never converges.