DeepSpeed ZeRO Optimization
- DeepSpeed ZeRO Optimization is a memory- and communication-efficient system that partitions optimizer states, gradients, and parameters to enable scalable training of extremely large neural networks.
- It employs a three-stage partitioning process to significantly reduce per-device memory overhead and communication demands while maintaining high computational throughput.
- Integrated as a drop-in PyTorch optimizer, ZeRO facilitates training of models with hundreds of billions to trillions of parameters by eliminating redundant state replication.
DeepSpeed’s ZeRO Optimization (Zero Redundancy Optimizer) is a memory- and communication-efficient system for the distributed training of extremely large neural networks, developed to address the core bottlenecks of standard data and model parallelism. ZeRO enables multi-hundred-billion to trillion parameter model training by partitioning model states—parameters, gradients, and optimizer states—across devices, and orchestrating collective operations to minimize redundancy and maximize hardware utilization. Its techniques have become fundamental to large-scale language modeling and deep learning infrastructure, spawning a suite of derivative optimizations and alternate frameworks. DeepSpeed integrates ZeRO as a drop-in PyTorch optimizer wrapper, providing seamless support for state-of-the-art distributed training regimes and scaling transformer networks far beyond what replication-based methods can achieve (Rajbhandari et al., 2019).
1. Technical Foundations and Motivation
In mixed-precision Adam training, most GPU memory is consumed not by the fp16 parameters themselves but by the full set of model states (fp16 parameters and gradients plus fp32 weights, momentum, and variance) that data parallelism (DP) replicates on every device. For example, a 1.5B-parameter GPT-2 model needs only ∼3 GB for its fp16 weights, yet at 16 bytes per parameter its replicated model states come to roughly 24 GB per GPU before activations and temporary buffers are counted, so DP quickly exceeds available memory. Model parallelism (MP) alleviates memory pressure by slicing tensors, but incurs extensive fine-grained inter-GPU communication and reduced computational efficiency. ZeRO addresses this directly by eliminating redundant state through three stages of sharded storage, retaining the coarse-grained, efficient computation and low-latency communication of DP while achieving MP’s memory economy (Rajbhandari et al., 2019).
2. The ZeRO Three-Stage Memory Partitioning
ZeRO decomposes training state redundancy through three progressive stages, each targeting a distinct component:
Stage 1: Optimizer State Partitioning ($P_{os}$)
- Each of the $\Psi$ parameters has: 1 fp16 weight (2 bytes), 1 fp16 gradient (2 bytes), 1 fp32 weight, 1 fp32 momentum, 1 fp32 variance ($2 + 2 + 4 + 4 + 4 = 16$ bytes/param).
- Standard DP stores all of these per GPU: $16\Psi$ bytes.
- ZeRO-1 partitions optimizer state across the $N_d$ data-parallel GPUs, shrinking its footprint per device:
$$4\Psi + \frac{12\Psi}{N_d} \to 4\Psi$$
as $N_d \to \infty$; a 4× reduction vs. baseline.
Stage 2: Gradient Partitioning ($P_g$)
- Gradients are partitioned such that each replica holds only its slice post-backward; achieved by switching all-reduce to reduce-scatter.
- Memory per device, for data-parallel degree $N_d$:
$$2\Psi + \frac{14\Psi}{N_d} \to 2\Psi,$$
approaching an 8× saving.
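The switch from all-reduce to reduce-scatter can be illustrated with a small single-process simulation. This is only a sketch: the function names are hypothetical, ranks are modeled as plain Python lists rather than GPUs, and gradients are summed rather than averaged.

```python
from typing import List


def all_reduce(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Baseline DP: every rank ends up holding the full summed gradient."""
    summed = [sum(vals) for vals in zip(*grads_per_rank)]
    return [list(summed) for _ in grads_per_rank]


def reduce_scatter(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Gradient partitioning: rank r keeps only slice r of the summed gradient."""
    n_ranks = len(grads_per_rank)
    length = len(grads_per_rank[0])
    if length % n_ranks != 0:
        raise ValueError("this sketch assumes an evenly divisible gradient")
    shard = length // n_ranks
    summed = [sum(vals) for vals in zip(*grads_per_rank)]
    return [summed[r * shard:(r + 1) * shard] for r in range(n_ranks)]


# Four simulated ranks, each with a full local gradient of 8 elements.
local_grads = [[float(rank + 1)] * 8 for rank in range(4)]
full = all_reduce(local_grads)        # 8 elements stored per rank
shards = reduce_scatter(local_grads)  # 2 elements stored per rank
```

Concatenating the shards in rank order reproduces the all-reduce result, so no gradient information is lost; each rank simply stores 1/4 of it.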
Stage 3: Parameter Partitioning ($P_p$)
- Parameter shards are broadcast only for the current layer as needed and immediately discarded after use.
- Across $N_d$ devices, each device now holds only $4\Psi/N_d$ bytes (for fp16 params and grads) and $12\Psi/N_d$ for sharded fp32 state:
$$\frac{16\Psi}{N_d},$$
yielding a linear $N_d$-fold reduction.
| Stage | Memory per device | Communication per step | Asymptotic memory savings |
|---|---|---|---|
| ZeRO-1 | $4\Psi + 12\Psi/N_d$ | All-reduce on gradients (1× DP) | 4× |
| ZeRO-2 | $2\Psi + 14\Psi/N_d$ | Reduce-scatter on gradients (1× DP) | 8× |
| ZeRO-3 | $16\Psi/N_d$ | Reduce-scatter, all-gather (1.5× DP) | $N_d$× |
ZeRO maintains “coarse-grained” collective calls, avoiding high-frequency, fine-grained communication characteristic of naïve MP, and thus preserves high compute throughput and low latency (Rajbhandari et al., 2019).
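The per-device memory of each stage can be checked numerically with a short calculator. This is a sketch under the 16 bytes/param breakdown listed under Stage 1; the function name and the example sizes (7.5B parameters on 64 GPUs) are illustrative choices, and activations, buffers, and fragmentation are ignored.

```python
def zero_memory_bytes(num_params: float, dp_degree: int, stage: int) -> float:
    """Per-device model-state memory (bytes) for mixed-precision Adam.

    stage 0 is plain data parallelism; stages 1-3 shard optimizer state,
    then gradients, then parameters across dp_degree devices.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params   # fp16 parameters
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 weights + momentum + variance
    if stage >= 1:
        optim /= dp_degree
    if stage >= 2:
        grads /= dp_degree
    if stage >= 3:
        params /= dp_degree
    return params + grads + optim


GB = 1e9
for stage in range(4):
    per_gpu_gb = zero_memory_bytes(7.5e9, 64, stage) / GB
    print(f"stage {stage}: {per_gpu_gb:.1f} GB per GPU")
```

For these inputs the footprint falls from 120 GB under plain DP to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3 respectively.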
3. Communication Efficiency and Scaling
For fully sharded ZeRO-3, each training step per GPU incurs one reduce-scatter for gradients and two all-gathers (broadcasts) for parameters, one during the forward pass and one during the backward pass. While this increases per-step communication volume from $2\Psi$ elements (baseline DP) to $3\Psi$ elements (a 50% overhead), the elimination of per-replica state and strict collective minimization supports extreme scaling. Key empirical findings include:
- ZeRO-100B achieves smooth scaling from 1.5B to 170B parameters on 400 GPUs, with aggregate throughput of 15 PFLOPS (30–38 TFLOPS/GPU)—10× the previous SOTA.
- Superlinear scaling is observed (e.g., 60B model: doubling GPUs more than doubles throughput) because the per-GPU model-state footprint shrinks as GPUs are added, which leaves room for a larger per-GPU batch size.
- ZeRO-3 supports models up to 13B parameters on 128 GPUs using pure DP, with >40 TFLOPS/GPU performance, whereas naïve DP runs out of memory beyond roughly 1.5B parameters (Rajbhandari et al., 2019).
- Horizon for trillion-parameter scaling: with $N_d = 1024$, per-GPU model-state memory for a $10^{12}$-parameter model is $16\Psi/N_d \approx 16$ GB, which is feasible on contemporary hardware configurations (Rajbhandari et al., 2019).
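The communication overhead of full sharding can be tallied per collective. The sketch below assumes the usual accounting in which a reduce-scatter or an all-gather over $\Psi$ elements moves about $\Psi$ elements per GPU, so an all-reduce (reduce-scatter followed by all-gather) moves $2\Psi$; the function name and the lumping of stages 0 to 2 together are simplifications.

```python
def comm_volume_elements(num_params: float, stage: int) -> float:
    """Approximate elements moved per GPU per training step."""
    reduce_scatter = num_params  # gradient reduction
    all_gather = num_params      # one full gather of values
    if stage in (0, 1, 2):
        # Equivalent to one all-reduce: reduce-scatter + all-gather.
        return reduce_scatter + all_gather
    if stage == 3:
        # Parameters are gathered twice (forward and backward),
        # and gradients are reduce-scattered once.
        return 2 * all_gather + reduce_scatter
    raise ValueError("stage must be 0, 1, 2, or 3")


psi = 1e9
overhead = comm_volume_elements(psi, 3) / comm_volume_elements(psi, 0)
print(f"ZeRO-3 / DP communication ratio: {overhead:.2f}")
```

The ratio is independent of model size, which is why the overhead is quoted as a flat 50%.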
4. Advanced Derivatives and Extensions
ZeRO’s core methodology has catalyzed numerous enhancements and specializations, including:
- ZeRO-Infinity: Incorporates a three-tier (GPU/CPU/NVMe) memory hierarchy for extreme-scale model training. Key contributions are bandwidth-centric partitioning, tiered asynchronous prefetch (NVMe→CPU→GPU), and overlapping state movement with computation. This enables fine-tuning trillion-parameter models on a single DGX-2 node, and training up to 32 trillion parameter models across 512 GPUs, sustaining >25 PFLOPS and showing superlinear scaling (Rajbhandari et al., 2021).
- ZeRO++: Reduces total cross-node communication volume by 4× via three composable techniques: block-wise INT8 quantized all-gather of weights, intra-node parameter data remapping to save inter-node bandwidth, and two-hop block-quantized INT4 gradient averaging using all-to-all collectives. This doubles throughput on low-bandwidth clusters at 384-GPU scale with minimal loss in accuracy (Wang et al., 2023).
- AsyncHZP (Hierarchical ZeRO Parallelism): Decouples sharding group sizes for parameters, gradients, and optimizer states for further reduction of bandwidth and latency bottlenecks. Employs multi-stream asynchronous scheduling of all-gather/reduce-scatter collectives in dedicated background threads to optimize overlap and achieve up to 25% higher MFU on large-cluster runs; supports seamless pipelining with other forms of parallelism (Bai et al., 23 Oct 2025).
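To make the block-wise quantization idea behind the ZeRO++ all-gather concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with one scale per block. It is not the ZeRO++ kernel: the function names, block size, and rounding scheme are assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_blockwise(values: List[float], block_size: int) -> Tuple[List[int], List[float]]:
    """Quantize to integer codes in [-127, 127], with one scale per block."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(int(round(v / scale)) for v in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int) -> List[float]:
    """Recover approximate values using each code's block scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# The first block holds small weights, the second holds large ones.
weights = [0.01, -0.02, 0.03, 0.004, 5.0, -2.5, 1.25, 0.0]
codes, scales = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)
```

Because each block carries its own scale, the small weights in the first block are not rounded to zero by the large values in the second, which is the motivation for quantizing per block rather than per tensor.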
5. DeepSpeed Integration and Usability
ZeRO is implemented in DeepSpeed as a drop-in optimization layer requiring no changes to model architecture. Training runs through the engine returned by deepspeed.initialize, and ZeRO is selected with the stage key in the zero_optimization section of the DeepSpeed configuration, which supports all stages, with stage 3 intended for extreme regimes. DeepSpeed handles mixed-precision training, activation checkpointing, CPU/NVMe offload, integration with external model-parallel frameworks (e.g., Megatron-LM), and on-the-fly memory defragmentation (Rajbhandari et al., 2019, Rajbhandari et al., 2021). Configuration is controlled by straightforward JSON keys, and memory management primitives (e.g., circular buffer pools, event synchronization) mitigate NUMA and fragmentation issues (Bai et al., 23 Oct 2025).
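As a rough illustration of the JSON-driven configuration, the sketch below builds a DeepSpeed config dictionary with a selectable ZeRO stage. The helper build_zero_config and the specific values (batch size, learning rate) are hypothetical; the keys shown are commonly used DeepSpeed configuration keys, but the DeepSpeed documentation should be checked for the version in use.

```python
import json


def build_zero_config(stage: int, micro_batch_size: int = 4,
                      offload_optimizer_to_cpu: bool = False) -> dict:
    """Return a DeepSpeed-style config dict with the given ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    zero = {
        "stage": stage,
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
    if offload_optimizer_to_cpu:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        "fp16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
        "zero_optimization": zero,
    }


ds_config = build_zero_config(stage=3, offload_optimizer_to_cpu=True)
print(json.dumps(ds_config, indent=2))

# Typical use, assuming DeepSpeed is installed, `model` is a torch.nn.Module
# whose forward pass returns the loss, and the script is started with the
# DeepSpeed launcher:
#   import deepspeed
#   engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
#   loss = engine(batch)
#   engine.backward(loss)
#   engine.step()
```

Keeping the stage as a single parameter makes it easy to start at stage 1 or 2 and move to stage 3 only when model states no longer fit.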
6. Theoretical and Practical Analysis
ZeRO reduces per-device model-state memory to $\mathcal{O}(\Psi/N_d)$, a critical property for scaling. Bandwidth and latency costs are transparent, dictated by the group sizes in sharded collectives; overlap and fusion strategies (as in ZeRO++, AsyncHZP, DeepCompile) further compress communication costs. Weak scaling is often superlinear due to bandwidth amortization and increased aggregate I/O. In heterogeneous memory settings (ZeRO-Infinity), pipelined shard movement and device-parallel fetching achieve near-ideal overlap, minimizing wall-clock time (Rajbhandari et al., 2021).
7. Specialized Use Cases and Privacy Enhancements
ZeRO forms a foundation for privacy-preserving distributed learning at scale. For instance, in DP-ZeRO, per-sample gradient clipping and noise are inserted transparently into the ZeRO state shard pipeline, maintaining the same asymptotic communication and memory costs while enforcing $(\epsilon, \delta)$-differential privacy at state-of-the-art model scales (e.g., 100B-parameter GPT models on 256 nodes, memory per GPU 12–40 GB, <5% throughput penalty) (Bu et al., 2023). DP-ZeRO demonstrates that ZeRO-3 composes with differential privacy without undermining its scaling properties.
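The clipping-and-noise step can be sketched generically as follows. This is a plain DP-SGD-style illustration, not the DP-ZeRO implementation: the function names and the max_norm and noise_multiplier parameters are assumptions, and sharding is omitted entirely.

```python
import math
import random
from typing import List


def clip_per_sample(grad: List[float], max_norm: float) -> List[float]:
    """Scale one sample's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def private_gradient(per_sample_grads: List[List[float]], max_norm: float,
                     noise_multiplier: float, rng: random.Random) -> List[float]:
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_per_sample(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
noiseless = private_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = private_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

With the noise multiplier set to zero the result is just the mean of the clipped gradients, which makes the clipping behavior easy to verify before noise is enabled.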
References
- "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (Rajbhandari et al., 2019)
- "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning" (Rajbhandari et al., 2021)
- "ZeRO++: Extremely Efficient Collective Communication for Giant Model Training" (Wang et al., 2023)
- "Zero redundancy distributed learning with differential privacy" (Bu et al., 2023)
- "DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training" (Tanaka et al., 14 Apr 2025)
- "AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training" (Bai et al., 23 Oct 2025)