DeepSpeed ZeRO Stage 3 Overview
- DeepSpeed ZeRO Stage 3 is a fully sharded data-parallel strategy that partitions parameters, gradients, and optimizer states to reduce per-GPU memory requirements.
- It enables training of large-scale models with hundreds of billions of parameters by efficiently managing collective communications like all-gather and reduce-scatter.
- ZeRO-3 integrates with high-performance clusters and supports dynamic extensions such as proactive prefetching to overcome static implementation challenges.
DeepSpeed ZeRO Stage 3 (often abbreviated as ZeRO-3) is a fully sharded data-parallel optimization strategy designed to enable the efficient training of extremely large-scale deep learning models—those with hundreds of billions of parameters—by optimally partitioning all model states across multiple GPUs. As of 2026, ZeRO-3 underpins many state-of-the-art LLM pretraining and fine-tuning workflows and has become foundational for high-throughput, memory-efficient distributed deep learning.
1. Model State Partitioning and Collective Communication
ZeRO-3 operates by fully sharding all components of the model state (parameters, gradients, and optimizer states) across GPUs. In contrast to classical data parallelism, which maintains redundant full copies of the model state on each device, ZeRO-3 divides each of these states into shards, with each device holding only its local shard:
$$M_{\text{GPU}} \approx \frac{M_P + M_G + M_O}{N_d} + M_A$$
where $M_P$ is parameter memory, $M_G$ is gradient memory, $M_O$ is optimizer state memory, $M_A$ is memory for activations and temporaries, and $N_d$ is the number of data-parallel devices (Rajbhandari et al., 2019, Benington et al., 2023, Tanaka et al., 14 Apr 2025).
Forward and Backward Passes:
- Before computing a layer in the forward or backward pass, each GPU issues an all-gather to reconstruct that layer's full parameter tensor from the shards held across all devices. The gathered buffer is released immediately after the layer's computation is complete.
- After the backward pass, gradients are reduce-scattered so that each GPU receives the shard of the gradient corresponding to its parameter shard. The local optimizer update is then performed using these shards alone.
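The gather, compute, reduce-scatter, and local-update sequence described above can be sketched as a toy, single-process simulation. This is only an illustration under stated assumptions: "ranks" are plain list entries, the layer is a dot product with loss equal to its output, and the helper names (`all_gather`, `reduce_scatter`, `zero3_step`) are hypothetical stand-ins, not DeepSpeed APIs.

```python
# Toy single-process simulation of one ZeRO-3 step for a single layer.
# "Ranks" are list entries; real systems use collective communication libraries.

def all_gather(shards):
    """Concatenate every rank's shard into the full tensor (one copy per rank)."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def reduce_scatter(per_rank_grads, num_ranks):
    """Sum gradients element-wise across ranks, then give each rank only its slice."""
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    size = len(summed) // num_ranks
    return [summed[r * size:(r + 1) * size] for r in range(num_ranks)]

def zero3_step(param_shards, inputs, lr=0.1):
    """One step for the layer y = sum(w_i * x_i) with loss = y on each rank."""
    num_ranks = len(param_shards)
    # Forward: gather the full weights just in time (a real system frees them afterwards).
    full_params = all_gather(param_shards)
    outputs = [
        sum(w * x for w, x in zip(full_params[r], inputs[r]))
        for r in range(num_ranks)
    ]
    # Backward: d(loss)/d(w_i) = x_i, computed per rank from its own micro-batch.
    per_rank_grads = [list(inputs[r]) for r in range(num_ranks)]
    grad_shards = reduce_scatter(per_rank_grads, num_ranks)
    # Optimizer: each rank updates only the shard it owns, using the averaged gradient.
    new_shards = [
        [w - lr * g / num_ranks for w, g in zip(param_shards[r], grad_shards[r])]
        for r in range(num_ranks)
    ]
    return outputs, new_shards

param_shards = [[1.0, 2.0], [3.0, 4.0]]                  # 4 parameters over 2 ranks
inputs = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0]]    # one micro-batch per rank
outputs, new_shards = zero3_step(param_shards, inputs)
```

Each rank only ever stores its own parameter and gradient shard between steps; the full tensor exists only transiently inside `zero3_step`.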
Communication Costs:
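Per iteration, the inter-device communication volume for parameter all-gathers is approximately
$$V_{\text{all-gather}} \approx 2\Psi$$
where $\Psi$ is the number of model parameters: the full parameter set is gathered once in the forward pass and once in the backward pass (additional communication for reduce-scatter of gradients is of comparable size, about $\Psi$). For large $N_d$ (the data-parallel degree), the reduction in memory per GPU can approach the theoretical $1/N_d$ scaling, while the communication cost increases by approximately $1.5\times$ compared to classic data-parallelism ($3\Psi$ in total versus the $2\Psi$ of a gradient all-reduce) due to the need for frequent all-gather operations (Rajbhandari et al., 2019, Benington et al., 2023).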
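The communication comparison can be checked with a short calculation. This sketch follows the accounting used in the ZeRO paper (an all-reduce counted as a reduce-scatter plus an all-gather, each moving roughly one parameter-set's worth of elements); the function names and the 7.5B parameter count are illustrative assumptions.

```python
def dp_allreduce_volume(num_params):
    """Classic data parallelism: one gradient all-reduce per iteration.

    An all-reduce is counted as reduce-scatter + all-gather, each moving
    roughly num_params elements per device.
    """
    return 2 * num_params

def zero3_volume(num_params):
    """ZeRO-3: parameter all-gather in forward and backward, plus gradient reduce-scatter."""
    forward_all_gather = num_params
    backward_all_gather = num_params
    grad_reduce_scatter = num_params
    return forward_all_gather + backward_all_gather + grad_reduce_scatter

psi = 7_500_000_000  # assumed parameter count, for illustration only
ratio = zero3_volume(psi) / dp_allreduce_volume(psi)
print(f"ZeRO-3 / DP communication volume: {ratio:.2f}x")
```

The ratio is independent of the parameter count, which is why the overhead is usually quoted as a fixed factor.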
2. Memory and Throughput Scaling Properties
By fully partitioning model state, ZeRO-3 unlocks memory savings that scale inversely with the number of devices:
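$$M_{\text{GPU}} \approx \frac{(2 + 2 + K)\,\Psi}{N_d} + M_{\text{other}}$$
with $K$ denoting the optimizer state multiplier (typically 12 for Adam in mixed-precision), $\Psi$ the number of parameters, $N_d$ the DP degree, and $M_{\text{other}}$ non-model memory (activations, workspace); the two leading 2s are the bytes per parameter for 16-bit parameters and 16-bit gradients. In one cited configuration, the per-device model-state memory is only 3.3 GB (vs. 218 GB for unsharded) (Rajbhandari et al., 2019, Smith et al., 2022).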
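As a worked instance of the scaling relation, the sketch below evaluates the model-state term for mixed-precision Adam (2 bytes for parameters, 2 bytes for gradients, and an optimizer multiplier of 12 bytes per parameter). The values used (7.5B parameters, 64 devices) are illustrative assumptions and are not meant to reproduce the figures quoted above; the function name is hypothetical.

```python
GB = 1e9  # decimal gigabytes

def model_state_bytes_per_gpu(num_params, dp_degree, k=12, param_bytes=2, grad_bytes=2):
    """Model-state memory per device under full sharding (activations excluded)."""
    return (param_bytes + grad_bytes + k) * num_params / dp_degree

psi = 7.5e9  # assumed parameter count
unsharded = model_state_bytes_per_gpu(psi, dp_degree=1)
sharded = model_state_bytes_per_gpu(psi, dp_degree=64)

print(f"unsharded: {unsharded / GB:.1f} GB per device")
print(f"ZeRO-3 on 64 devices: {sharded / GB:.3f} GB per device")
```

Activation and workspace memory are deliberately left out, since ZeRO-3 does not shard them.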
This scheme enables training models with hundreds of billions of parameters on practical clusters, as demonstrated in models such as MT-NLG 530B using ZeRO-3 with tensor and pipeline parallelism (Smith et al., 2022). With this strategy, large-scale transformer models whose full model state would otherwise require thousands of gigabytes per device can be trained on modest hardware by combining parameter, gradient, and optimizer sharding.
The saving in memory directly translates to the ability to train with larger batch sizes per GPU, which leads to higher throughput and better scaling efficiency as the number of devices increases. For example, in empirical studies, ZeRO-3 enabled per-GPU batch sizes of 16 or greater for very large models (versus 8 or 12 with ZeRO-1/2) and super-linear device utilization when scaling up the cluster size (Rajbhandari et al., 2019, Benington et al., 2023).
3. Integration and Algorithmic Workflow in High-Performance Clusters
In multi-parallel training schemes (3D parallelism: data, tensor, pipeline), ZeRO-3 operates in the data-parallel dimension, interleaving all-gather and reduce-scatter collectives with tensor and pipeline communications:
- Parameter All-Gather: Before a layer's forward computation, a distributed all-gather reconstructs the full weight matrix from its shards.
- Gradient Reduce-Scatter: After each backward pass, gradients are reduce-scattered so that each device stores only its corresponding shard.
- Optimizer Step: Local to each device, using only the optimizer states, parameters, and gradients for its assigned shard (Smith et al., 2022).
ZeRO-3 is instantiated in common frameworks (e.g., DeepSpeed) via configuration, allowing users to enable full state partitioning and configure parameters such as bucket sizes (for efficient collective operations), memory offload options, and communication overlap mechanisms. Checkpointing and restoration of model state require collecting distributed shards from all devices (Rajbhandari et al., 2019).
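A minimal configuration sketch for enabling ZeRO-3 in DeepSpeed is shown below as a Python dictionary. The key names follow the DeepSpeed configuration schema as commonly documented; the numeric values are illustrative assumptions rather than recommendations, and both should be checked against the installed DeepSpeed version.

```python
import json

# Illustrative ZeRO-3 configuration; all values are placeholders to be tuned.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                   # shard params, grads, optimizer states
        "overlap_comm": True,                         # overlap collectives with compute
        "contiguous_gradients": True,
        "reduce_bucket_size": 500_000_000,            # bucket size for gradient reduction
        "stage3_prefetch_bucket_size": 50_000_000,    # how far ahead to prefetch parameters
        "stage3_param_persistence_threshold": 100_000,  # small params stay unsharded
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "stage3_gather_16bit_weights_on_model_save": True,  # consolidate shards on save
    },
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

A dictionary like this is typically passed through the `config` argument of `deepspeed.initialize`, or written to a JSON file referenced at launch.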
4. Limitations in Static ZeRO-3 Implementations
Despite its strengths, static ZeRO-3/FSDP implementations encounter suboptimal performance due to:
- Non-adaptive Prefetching: Prefetch buffer sizes for all-gathers are preset and fail to dynamically respond to activation memory pressure. Over-aggressive prefetching may result in out-of-memory errors, while conservative settings miss optimal compute-communication overlap (Tanaka et al., 14 Apr 2025).
- Lack of Memory Usage Anticipation: The impact of future all-gather buffers on overall GPU memory cannot be predicted. This inhibits efficient coordination with unsharding strategies.
- Uncoordinated Unsharding: Static ZeRO-3 allows some parameters to remain unsharded to save on communication, but without runtime memory tracking, the benefits cannot be tuned or combined effectively with prefetching.
These limitations motivate compiler-driven and memory-aware modifications that outperform the static dependency-graph strategies of baseline ZeRO-3 systems (Tanaka et al., 14 Apr 2025).
5. Compiler-Driven and Algorithmic Extensions
5.1 DeepCompile: Profiling-Guided Optimizations
DeepCompile introduces a compilation pipeline atop PyTorch IR, implementing ZeRO-3 and three optimizations:
- Proactive Prefetching: Dynamically schedules all-gathers as early as possible to maximize communication-computation overlap, subject to profiling-based memory threshold constraints. Multiple all-gathers may be fused when bandwidth utilization is improved.
- Selective Unsharding: Allocates remaining memory headroom after prefetch scheduling to keep selected high-communication parameters permanently unsharded, eliminating associated all-gather calls in the backward pass and reducing communication.
- Adaptive Offloading: Slices optimizer states into small fragments and offloads only the minimal necessary subset to host memory when needed, scheduling asynchronous offloads and reloads so that they overlap with compute while memory headroom is monitored.
These passes are coordinated via iterative profiling and computation-graph rewrites, so the optimizer can exploit fine-grained dynamics in memory usage and operator timing that static ZeRO-3 cannot capture (Tanaka et al., 14 Apr 2025).
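The idea behind memory-constrained proactive prefetching can be illustrated with a toy greedy scheduler. This is a hypothetical sketch, not DeepCompile's actual algorithm: it assumes one layer per step, a profiled activation footprint per step, and a single memory limit, and the function name `schedule_prefetch` is invented for illustration.

```python
def schedule_prefetch(gather_bytes, activation_bytes, memory_limit):
    """Return, for each layer, the earliest step at which its all-gather can be issued.

    Layer i executes at step i. Its gathered buffer stays live from the issue
    step through step i, so hoisting the gather earlier costs memory at every
    step in between.
    """
    n = len(gather_bytes)
    # Baseline usage: profiled activations plus the buffer of the layer running at that step.
    usage = [activation_bytes[t] + gather_bytes[t] for t in range(n)]
    if any(u > memory_limit for u in usage):
        raise ValueError("baseline schedule already exceeds the memory limit")
    issue_step = list(range(n))
    for layer in range(n):
        step = layer
        # Hoist the all-gather one step earlier while the extra buffer still fits.
        while step > 0 and usage[step - 1] + gather_bytes[layer] <= memory_limit:
            step -= 1
            usage[step] += gather_bytes[layer]
        issue_step[layer] = step
    return issue_step

# Four equal-sized layers; activation pressure peaks at step 2.
plan = schedule_prefetch(
    gather_bytes=[4, 4, 4, 4],
    activation_bytes=[2, 6, 10, 4],
    memory_limit=16,
)
print(plan)
```

In this example the last layer cannot be prefetched because the activation peak at step 2 leaves no headroom, which is the kind of constraint a fixed prefetch buffer size cannot express.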
5.2 Quantitative Performance
In the context of large-scale models:
- Llama 3 70B: DeepCompile achieves throughput gains over ZeRO-3 when gradient accumulation is used.
- Mixtral 8×7B MoE: Throughput improvements are reported for this model as well.
- Memory Utilization: DeepCompile raises GPU utilization dynamically (e.g., up to 65 GB used on 80 GB H100s), compared to static ZeRO-3/FSDP baselines (∼40 GB).
- Extreme Memory Pressure: With adaptive offloading and 16 GPUs (Llama 3 70B), DeepCompile yields a throughput increase versus DeepSpeed CPU-offload baselines (Tanaka et al., 14 Apr 2025).
Correctness: Across tasks (e.g., AG News), DeepCompile's advanced ZeRO-3 matches baseline ZeRO-3 within nondeterminism bounds in final model accuracy.
6. Communication Bottlenecks and Hybrid Approaches
On commodity clusters, the major barrier to ZeRO-3 scalability is the volume of inter-node all-gather communication, exacerbated by limited network bandwidth. Recent advances such as FCDP (Fully Cached Data Parallel) propose reinterpreting host (CPU) memory as a caching layer rather than an overflow storage:
- Host Caching: Cache parameters after the forward all-gather in pinned CPU buffers and reuse them for the backward pass via fast intra-node all-gather, cutting inter-node all-gather volume by 50% (from two inter-node gathers of the parameters per iteration to one).
- Selective Communication for PEFT: In parameter-efficient fine-tuning, communicate only trainable parameter subsets, reducing inter-node traffic by 99%+.
- Implementation Impact: On clusters with 100 Gbps InfiniBand and 512 GB of CPU memory per node, FCDP matched ZeRO-3 batch size but achieved higher throughput for GPT-10B–30B. In LoRA-tuned settings, FCDP achieved further throughput gains over ZeRO-3 due to PEFT-aware selective communication (Park et al., 6 Feb 2026).
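The effect of host caching on inter-node traffic can be illustrated by counting gathers in a toy model. This sketch is an assumption-laden illustration, not FCDP's implementation: it treats every layer as one inter-node all-gather and models the host cache as a set of layer indices; the function name is hypothetical.

```python
def inter_node_gathers(num_layers, host_cache):
    """Count inter-node parameter all-gathers in one forward + backward iteration."""
    cache = set()
    inter_node = 0
    for layer in range(num_layers):              # forward pass
        inter_node += 1                          # parameters must cross nodes once
        if host_cache:
            cache.add(layer)                     # keep the gathered copy in host memory
    for layer in reversed(range(num_layers)):    # backward pass
        if layer in cache:
            continue                             # served node-locally from the cache
        inter_node += 1
    return inter_node

baseline = inter_node_gathers(num_layers=32, host_cache=False)
cached = inter_node_gathers(num_layers=32, host_cache=True)
print(baseline, cached, cached / baseline)
```

The halving comes entirely from the backward pass, so it holds regardless of the number of layers in this simplified model.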
A plausible implication is that, for bandwidth-limited clusters, further hybridization of ZeRO-3 with dynamic host caching and trainable-parameter sparsity is required for optimal scale-out.
7. Best Practices, Trade-offs, and Applicability
ZeRO-3 is optimal when:
- Model size or desired batch size exceeds per-GPU memory even after optimizer and gradient partitioning (typically for models at the multi-billion-parameter scale).
- High-bandwidth interconnects (InfiniBand or NVLink) are available to mitigate increased all-gather and reduce-scatter collectives.
- The user can tolerate increased complexity in checkpointing and state management due to sharded model states (Rajbhandari et al., 2019, Benington et al., 2023).
Less optimal scenarios:
- For small models (on the order of 2B parameters or fewer) or small node counts, ZeRO-3's overhead may exceed its benefits; ZeRO-2 may suffice.
- On commodity networks, FCDP or DeepCompile-like dynamic memory management may be essential to avoid severe communication bottlenecks (Park et al., 6 Feb 2026).
- For compute-bound workloads or when activations dominate memory, memory savings from ZeRO-3 are less critical.
Configuration guidance:
Key parameters for high-performance ZeRO-3 include bucket sizing (for efficient collectives), communication overlap enablement, offload strategies (CPU/NVMe), and careful batch size tuning to maximize per-GPU utilization (Rajbhandari et al., 2019).
Checkpointing and Fault Tolerance:
Model sharding requires all ranks’ shards for checkpoint save/restore; frameworks offer built-in mechanisms, but distribution across multiple devices increases operational complexity.
References
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al., 2019)
- Scaling Studies for Efficient Parameter Search and Parallelism for LLM Pre-training (Benington et al., 2023)
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B (Smith et al., 2022)
- DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training (Tanaka et al., 14 Apr 2025)
- FCDP: Fully Cached Data Parallel for Communication-Avoiding Large-Scale Training (Park et al., 6 Feb 2026)