You have a model. You have a GPU. The model fits, barely. Training crawls. The obvious next step is to throw more GPUs at it. The non-obvious part is everything that happens after.
The previous chapter mapped the hardware layer — memory hierarchies, interconnects, the physics of data movement. This chapter is about the first two scaling strategies you reach for once a single GPU is no longer enough: Data Parallelism (with its memory-efficient extension, ZeRO) and Tensor Parallelism (with its complement, Sequence Parallelism). Together, they form the foundation of every large-scale training run happening today.
The Hugging Face team's Ultra-Scale Playbook ran over 4,000 experiments on up to 512 H100 GPUs, systematically measuring what works and what breaks. What follows draws heavily on those findings — but the goal here is not to reproduce their results. It is to build the mental model you need to reason about parallelism from first principles, so when you face your own cluster and your own model, you know where to start.
Data Parallelism: The First Dimension
The idea behind Data Parallelism (DP) is deceptively simple: replicate the entire model on every GPU, give each replica a different chunk of the batch, compute gradients independently, then average them. Each GPU sees different data but maintains an identical copy of the model. After the backward pass, an all-reduce operation synchronizes the gradients so every replica takes the same optimizer step.
This works because gradient computation is embarrassingly parallel across samples. If you have 8 GPUs, you process 8 micro-batches simultaneously and achieve roughly 8× throughput — assuming the gradient synchronization can be hidden behind the computation.
The first optimization is exactly that: overlapping the gradient all-reduce with the backward pass. Gradients for the last layer are ready before the backward pass reaches the first layer. By attaching all-reduce hooks to each parameter, you can start communicating gradients the moment they are computed, while the backward pass continues for earlier layers. PyTorch's DistributedDataParallel (DDP) does this automatically.
The second optimization is bucketing. GPU communication operations are more efficient on large tensors than on many small ones. Instead of launching an independent all-reduce for each parameter's gradient, you group gradients into buckets and synchronize each bucket in a single operation. Think of it as packing items into boxes before shipping — fewer, larger shipments beat many small ones.
The third optimization involves gradient accumulation. When combining gradient accumulation with DP, you should only synchronize gradients after the final accumulation step — not after every backward pass. PyTorch solves this with a model.no_sync() context manager that disables the all-reduce hooks during intermediate accumulation steps.
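The deferred-sync behavior is easy to verify with a toy, single-process simulation. The sketch below is not real DDP code — it mimics the semantics of no_sync with plain floats: each "rank" accumulates local gradients over its micro-batches, and gradients are averaged across ranks only once, before the optimizer step. With equal-size micro-batches, the result matches a single-process step over the full batch exactly.

```python
# Toy model: scalar linear regression, loss = (w*x - t)^2, d(loss)/dw = 2*x*(w*x - t).
def grad(w, x, t):
    return 2.0 * x * (w * x - t)

def train_step_dp(w, shards, lr):
    """One optimizer step with DP across len(shards) ranks.
    Each rank accumulates gradients over its own micro-batches
    (the no_sync phase); the average across ranks happens once,
    standing in for the single all-reduce."""
    per_rank = []
    for samples in shards:                      # one entry per rank
        g = sum(grad(w, x, t) for x, t in samples)
        per_rank.append(g / len(samples))       # mean over local samples
    g_sync = sum(per_rank) / len(per_rank)      # the one gradient sync
    return w - lr * g_sync

def train_step_single(w, batch, lr):
    """Reference: a single-GPU step over the full batch."""
    g = sum(grad(w, x, t) for x, t in batch) / len(batch)
    return w - lr * g
```

Splitting a batch of four samples across two simulated ranks produces the same updated weight as processing all four on one device, which is exactly why synchronizing only after the last accumulation step is safe.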
Given a target global batch size, you trade gradient accumulation steps for data-parallel processes to speed up training. In practice, people maximize the DP degree first — it is inherently parallel — and only add gradient accumulation on top when they run out of GPUs before hitting the target batch size.
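The tradeoff reduces to one identity: global batch = micro-batch × DP degree × accumulation steps. A small (hypothetical) helper makes the "maximize DP first, then add accumulation" rule concrete:

```python
def accumulation_steps(global_batch, micro_batch, dp_degree):
    """Gradient-accumulation steps needed to reach a target global
    batch size, given: global_batch = micro_batch * dp_degree * steps."""
    denom = micro_batch * dp_degree
    if global_batch % denom:
        raise ValueError("global batch must be divisible by micro_batch * dp_degree")
    return global_batch // denom
```

For a target global batch of 2048 sequences at micro-batch 4, 64 data-parallel ranks leave 8 accumulation steps; at 512 ranks, no accumulation is needed at all.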
Fig. 1 — Data Parallelism: Throughput vs. Communication Overhead
Idealized throughput scales linearly with GPU count until communication overhead dominates. At 512+ GPUs, ring latency and bandwidth saturation erode the overlap between gradient sync and backward computation.
But DP has limits. As you add hundreds of GPUs, the all-reduce becomes harder to hide: ring latency grows with the number of ranks, and the time spent synchronizing gradients begins to outweigh the gains from additional compute. The playbook's benchmarks show per-GPU throughput dropping noticeably beyond a few hundred GPUs, while per-GPU memory stays constant. And vanilla DP cannot shard the model; every replica holds the full copy. If the model does not fit on one GPU, DP alone cannot help you.
ZeRO: Eliminating Redundancy
Here is the thing about vanilla DP that should bother you: every GPU stores an identical copy of the optimizer states, gradients, and parameters. For a model trained with mixed precision and the Adam optimizer, the memory breakdown per GPU looks like this:
- Parameters (bf16): 2Ψ bytes
- Gradients (bf16): 2Ψ bytes
- Optimizer states (fp32 master weights + Adam momentum + Adam variance): 4Ψ + 4Ψ + 4Ψ = 12Ψ bytes

Total per GPU: ~16Ψ bytes (where Ψ = parameter count)
Every single GPU holds all 16Ψ bytes. For 8 GPUs, that is 8× redundancy in optimizer states, 8× in gradients, 8× in parameters. ZeRO (Zero Redundancy Optimizer), developed by Microsoft DeepSpeed, eliminates this waste by partitioning these tensors across the DP ranks.
ZeRO-1 partitions the optimizer states. Each GPU only stores 1/N of the optimizer states (where N is the DP degree). After the backward pass, instead of an all-reduce on gradients, you do a reduce-scatter — each GPU gets only the gradient shard it needs for its portion of the optimizer. After the optimizer step, an all-gather reconstructs the updated parameters on every GPU. Memory for optimizer states drops from 12Ψ to 12Ψ/N. The communication cost is equivalent to vanilla DP.
ZeRO-2 adds gradient partitioning. Since each GPU only needs 1/N of the gradients (the shard corresponding to its optimizer state shard), we can drop the rest immediately after the reduce-scatter. Memory for gradients drops from 2Ψ to 2Ψ/N. Communication cost is still equivalent to vanilla DP. There is essentially no overhead compared to ZeRO-1 — ZeRO-2 is almost always the better choice.
ZeRO-3 (also called FSDP — Fully Sharded Data Parallelism) goes all the way: optimizer states, gradients, and parameters are all sharded. Each GPU stores only 1/N of everything. Before each layer's forward pass, an all-gather reconstructs the parameters on demand; after use, they are immediately discarded. The backward pass works the same way in reverse. Total memory drops to (2Ψ + 2Ψ + 12Ψ)/N — you can drive per-GPU memory arbitrarily low by increasing the DP degree.
Fig. 2 — Per-GPU Memory Usage Across ZeRO Stages (7B Model, Adam, bf16)
ZeRO-1 through ZeRO-3 progressively shard optimizer states, gradients, and parameters. At DP=64, ZeRO-3 reduces per-GPU model memory by ~64× compared to vanilla DP. Activation memory (not shown) is not affected by ZeRO.
The catch with ZeRO-3 is communication cost: 3Ψ total (an extra Ψ from the additional all-gathers for parameters) compared to 2Ψ for ZeRO-1/2. In practice, this is manageable with prefetching — gathering the next layer's parameters while computing on the current layer — as long as you do not scale DP past roughly 512 ranks.
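The stage-by-stage memory math is mechanical enough to put in a short function. This is a sketch of the 2Ψ + 2Ψ + 12Ψ breakdown from above, with each ZeRO stage dividing one more term by the DP degree; activations are deliberately excluded, since ZeRO does not touch them.

```python
def per_gpu_model_memory_gb(params, dp, stage):
    """Per-GPU memory (GB) for parameters + gradients + optimizer states
    under mixed-precision Adam: 2Ψ (bf16 params) + 2Ψ (bf16 grads)
    + 12Ψ (fp32 master weights and moments).
    stage: 0 = vanilla DP, 1/2/3 = ZeRO stages."""
    p = 2 * params                  # bf16 parameters
    g = 2 * params                  # bf16 gradients
    o = 12 * params                 # fp32 optimizer states
    if stage >= 1:
        o /= dp                     # ZeRO-1: shard optimizer states
    if stage >= 2:
        g /= dp                     # ZeRO-2: shard gradients too
    if stage >= 3:
        p /= dp                     # ZeRO-3: shard parameters as well
    return (p + g + o) / 1e9
```

For a 7B-parameter model at DP=64, vanilla DP needs 112 GB per GPU just for model state — already beyond an 80 GB H100 — while ZeRO-3 brings it down to 1.75 GB, the 64× reduction shown in Fig. 2.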
Tensor Parallelism: Splitting Within a Layer
Data Parallelism and ZeRO replicate (or reconstruct) the full model on each GPU for every forward and backward pass. Tensor Parallelism (TP) takes a fundamentally different approach: it splits the matrix multiplications themselves across GPUs, so each device only ever computes a portion of each layer. This shards not just parameters, gradients, and optimizer states, but also activations — the one thing ZeRO cannot touch.
The math that makes this possible is the distributive property of matrix multiplication. Given Y = X × W, you can split the weight matrix W by columns and compute each sub-product independently:

Y = X × [W₁ W₂] = [X·W₁  X·W₂]
This is column-parallel: each GPU gets a column shard of W, receives the full input X (via broadcast), and produces a column shard of Y. An all-gather combines the partial outputs.
Alternatively, you can split W by rows and the input X by columns correspondingly:

Y = [X₁ X₂] × [W₁; W₂] = X₁·W₁ + X₂·W₂
This is row-parallel: each GPU gets a row shard of W, a corresponding shard of X (via scatter), and produces a partial result. An all-reduce sums the partial results into the final output.
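Both identities are easy to check numerically. The sketch below simulates two "GPUs" with NumPy array shards — concatenation stands in for the all-gather, and the sum of partials stands in for the all-reduce:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # (batch, in_features)
W = rng.standard_normal((6, 8))   # (in_features, out_features)
Y = X @ W                         # reference: unsharded matmul

# Column-parallel: each "GPU" holds a column shard of W, sees the
# full X, and produces a column shard of Y.
W_cols = np.split(W, 2, axis=1)
Y_col = np.concatenate([X @ w for w in W_cols], axis=1)   # "all-gather"

# Row-parallel: each "GPU" holds a row shard of W plus the matching
# column shard of X, and produces a partial sum of the full Y.
X_cols = np.split(X, 2, axis=1)
W_rows = np.split(W, 2, axis=0)
Y_row = sum(x @ w for x, w in zip(X_cols, W_rows))        # "all-reduce"

assert np.allclose(Y, Y_col) and np.allclose(Y, Y_row)
```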
The key insight from the Megatron-LM paper is how to compose these within a Transformer block to minimize communication. For the MLP (feedforward) layers: use column-parallel for the first linear, then row-parallel for the second. The nonlinearity between them is elementwise, so each GPU can apply it to its own hidden shard without synchronizing — the only communication is one all-reduce per MLP block in the forward pass. For multi-head attention: split the Q, K, V projections column-parallel (each GPU gets a subset of heads), and apply row-parallel to the output projection. Again, one all-reduce per attention block.
The result: 2 all-reduce operations per Transformer layer in the forward pass, and 2 in the backward pass. These sit directly in the computation's critical path — unlike DP's gradient sync, they cannot be fully overlapped with other work. This is the fundamental tradeoff of TP: you get activation sharding and memory savings, but you pay with exposed communication latency.
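The Megatron MLP pattern can also be simulated shard-by-shard. This sketch uses ReLU as a stand-in for GeLU (any elementwise nonlinearity behaves the same way): because the hidden dimension is sharded by columns, each rank applies the activation locally, and a single summed "all-reduce" at the end reproduces the unsharded MLP.

```python
import numpy as np

def mlp_ref(X, W1, W2):
    """Unsharded reference MLP (ReLU stands in for GeLU)."""
    return np.maximum(X @ W1, 0) @ W2

def mlp_tp(X, W1, W2, tp=4):
    """Column-parallel W1, row-parallel W2. Each rank computes its
    matmul-activation-matmul chain with no sync; the only forward
    communication is the final all-reduce (here, a sum)."""
    W1_shards = np.split(W1, tp, axis=1)     # column shards of W1
    W2_shards = np.split(W2, tp, axis=0)     # matching row shards of W2
    partials = [np.maximum(X @ w1, 0) @ w2   # per-rank local compute
                for w1, w2 in zip(W1_shards, W2_shards)]
    return sum(partials)                     # the single all-reduce

rng = np.random.default_rng(1)
X  = rng.standard_normal((2, 16))
W1 = rng.standard_normal((16, 64))
W2 = rng.standard_normal((64, 16))
assert np.allclose(mlp_ref(X, W1, W2), mlp_tp(X, W1, W2))
```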
Fig. 3 — Tensor Parallelism: Throughput Drop at Inter-Node Boundary
Per-GPU throughput (TFLOPS) as TP degree increases. The steep drop from TP=8 to TP=16 corresponds to crossing the NVLink → EFA boundary. Within a node, TP communication rides on 900 GB/s NVLink. Across nodes, it drops to ~38 GB/s per GPU over EFA.
The playbook's benchmarks confirm what the theory predicts: the biggest performance cliff occurs when TP crosses the node boundary. Within a single 8-GPU node, NVLink provides ~900 GB/s bidirectional bandwidth — the all-reduces are fast. Going from TP=8 to TP=16 forces communication across the much slower inter-node network (~38 GB/s via EFA), and throughput drops by roughly 43%. This is why, in practice, TP is kept within a single node: TP ≤ 8 (or whatever your intra-node GPU count is).
There is also a model-architecture constraint. The TP degree should not exceed the number of key/value heads in the attention mechanism, because each GPU needs intact heads to compute attention independently. For Llama-3 8B with 8 KV heads, TP should stay at 8 or below. Exceeding this requires duplicating KV heads across GPUs, adding synchronization complexity.
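The head-count constraint is simple divisibility. The helper below is hypothetical (not from any framework): a TP degree is valid when it divides both head counts — so each GPU gets whole heads — and does not exceed the number of KV heads.

```python
def valid_tp_degrees(num_heads, num_kv_heads, max_tp=8):
    """TP degrees that split attention cleanly without duplicating
    KV heads: the degree must divide both head counts and must not
    exceed the number of KV heads."""
    return [tp for tp in range(1, max_tp + 1)
            if num_heads % tp == 0
            and num_kv_heads % tp == 0
            and tp <= num_kv_heads]
```

For Llama-3 8B (32 query heads, 8 KV heads), the valid degrees are 1, 2, 4, and 8 — conveniently matching the per-node GPU count of most clusters.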
Sequence Parallelism: The Missing Complement
Tensor Parallelism splits activations along the hidden dimension inside matrix multiplications. But operations like LayerNorm and Dropout need the full hidden dimension — they operate on complete activation vectors. In vanilla TP, this means gathering the full activations for these operations, which partially negates the memory savings.
Sequence Parallelism (SP) fixes this by splitting activations along the sequence dimension for the non-TP operations. The idea: in the TP region (MLP, attention), activations are sharded across the hidden dimension. In the SP region (LayerNorm, Dropout), activations are sharded across the sequence dimension. Transitions between these regions use all-gather (SP→TP) and reduce-scatter (TP→SP) operations.
The result is that the maximum activation size on any GPU drops from (batch × sequence × hidden) to (batch × sequence × hidden / TP), because at every point in the forward pass, activations are sharded along one dimension or the other. The communication volume is mathematically equivalent to vanilla TP — an all-reduce can be decomposed into an all-gather followed by a reduce-scatter — so SP adds no communication overhead.
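The decomposition that makes SP "free" can be checked directly. In this simulated sketch, each of four "ranks" holds a partial tensor; a reduce-scatter leaves each rank with one summed shard (the SP region), and an all-gather of those shards reconstructs exactly what a single all-reduce would have produced (the TP region).

```python
import numpy as np

rng = np.random.default_rng(2)
ranks = 4
partials = [rng.standard_normal(8) for _ in range(ranks)]  # one partial per GPU

# Reference: one all-reduce (sum) leaves the full result on every rank.
full = sum(partials)

# Reduce-scatter: rank r ends up with only the summed r-th shard...
shards = [sum(np.split(p, ranks)[r] for p in partials) for r in range(ranks)]

# ...and an all-gather of the shards reconstructs the all-reduce result.
gathered = np.concatenate(shards)

assert np.allclose(full, gathered)
```

The same total bytes move in both cases, which is why switching between sequence-sharded and hidden-sharded layouts costs no more than vanilla TP's all-reduces.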
The playbook's benchmarks for a 70B model show dramatic memory savings: with TP/SP=16, you can fit sequence lengths of 16k tokens that would be impossible with TP alone. The throughput tradeoff is real — more parallelism means more communication — but within a single node, the NVLink bandwidth makes this practical.
How They Compose: The First Two Dimensions
DP and TP are orthogonal. They shard different things (data vs. model weights/activations), communicate different tensors (gradients vs. partial results), and have different scaling characteristics (DP scales wide across nodes, TP scales deep within a node). This means they compose naturally.
A typical starting configuration: TP=8 within each node (using NVLink), DP across nodes (using the cluster network). Each node holds one complete model replica; DP handles the gradient sync across replicas. If memory is tight, ZeRO-1 or ZeRO-2 can be layered on top of DP — they are complementary and add minimal overhead.
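That starting point can be written as a tiny (hypothetical) layout helper: cap TP at the intra-node GPU count so its all-reduces stay on NVLink, then let DP take the remaining factor.

```python
def dp_tp_layout(total_gpus, gpus_per_node, max_tp=8):
    """Starting-point layout: TP within a node (NVLink), DP across
    nodes (cluster network). Returns (tp, dp) with tp * dp == total_gpus."""
    tp = min(max_tp, gpus_per_node)
    if total_gpus % tp:
        raise ValueError("total_gpus must be divisible by the TP degree")
    return tp, total_gpus // tp
```

On a 512-GPU cluster with 8 GPUs per node, this yields TP=8 and DP=64: one model replica per node, 64 replicas synchronizing gradients over the inter-node fabric.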
ZeRO-3, on the other hand, sits in a more nuanced relationship with TP. Both shard model parameters, but through different mechanisms: ZeRO-3 gathers parameters on demand before computation, while TP partitions the computation itself so full parameters are never needed. You can combine them, but the communication costs compound. The playbook's guidance: for most scenarios, choose one or the other as your primary parameter-sharding strategy, and complement it with DP (plus ZeRO-1/2) for batch-dimension scaling.
But what happens when the model is too large to fit even with TP=8? What about very long sequences where activation memory explodes? And how do you handle models with hundreds of specialized experts? These questions require the next set of tools: Pipeline Parallelism, Context Parallelism, and Expert Parallelism — the remaining three dimensions of the 5D parallelism framework. That is the subject of the next chapter.
What This Chapter Teaches
Data Parallelism is the simplest form of distributed training — replicate, split the data, average the gradients. ZeRO eliminates the memory redundancy that vanilla DP introduces, so that each GPU holds only a shard of the complete training state. Tensor Parallelism goes further by splitting the computation itself, sharding activations along with everything else, but at the cost of exposed communication in the critical path.
The reason understanding these techniques matters is not because you will implement them from scratch. PyTorch, DeepSpeed, Megatron-LM, and Nanotron handle the implementation details. The reason is that choosing between them — and choosing how to combine them — depends on understanding the tradeoffs: memory vs. communication, overlap potential vs. critical-path exposure, intra-node bandwidth vs. inter-node bandwidth. Every parallelism strategy is a bet on which resource is cheap and which is expensive in your specific setup.
Chapter 12 gave you the hardware map. This chapter gave you the first two movement strategies. The next chapter completes the picture with the remaining three dimensions and the art of composing all five into a coherent training configuration.