
Zero Redundancy Optimizer (ZeRO)

Updated 7 February 2026
  • ZeRO is a distributed training framework that reduces redundant memory usage by partitioning key model states across devices.
  • It operates in progressive stages—ZeRO-1, ZeRO-2, and ZeRO-3—to incrementally shard optimizer states, gradients, and parameters.
  • Innovative communication strategies in ZeRO enable near-linear scaling, facilitating efficient training of models with hundreds of billions of parameters.

The Zero Redundancy Optimizer (ZeRO) is a large-scale memory and communication optimization framework for distributed training of deep neural networks, designed to overcome the scalability barriers found in conventional data-parallel and model-parallel paradigms. ZeRO systematically partitions the three main categories of model states—parameters, gradients, and optimizer states—across devices, eliminating redundant memory consumption typical in legacy schemes, while retaining high arithmetic intensity and favorable communication patterns. The ZeRO methodology has proven instrumental for enabling efficient training of models from hundreds of billions to trillions of parameters on modern hardware infrastructures (Rajbhandari et al., 2019, Chen et al., 2023).

1. ZeRO Methodology: Core Principles and Partitioning Stages

ZeRO addresses redundant memory replication by decomposing data-parallel (DP) memory into components and partitioning them in successive optimization stages. Let $P$ denote the number of model parameters, $D$ the number of data-parallel ranks, $A$ the bytes per parameter, $G$ the bytes per gradient, and $S$ the bytes of optimizer state per parameter.

ZeRO-1 (Optimizer State Partitioning):

  • Only optimizer states (e.g. Adam momentum, variance) are sharded across DD ranks; parameters and gradients are fully replicated.
  • Memory per rank: $AP + GP + (SP)/D$
  • Communication involves AllReduce for gradient synchronization and AllGather for broadcasting updated parameters.

ZeRO-2 (Gradient Partitioning):

  • In addition to ZeRO-1, gradients are also partitioned across ranks. Each device retains $1/D$ of the gradient tensor.
  • Memory per rank: $AP + (GP)/D + (SP)/D$
  • Gradient synchronization leverages ReduceScatter; updated parameters are re-gathered with AllGather.

ZeRO-3 (Parameter Partitioning):

  • All three states (parameters, gradients, optimizer states) are partitioned, so each device holds only $1/D$ of each.
  • Memory per rank: $(AP + GP + SP)/D$
  • During forward, AllGather collects each layer’s parameters; backward involves ReduceScatter; optimizer updates are strictly local.
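The AllGather and ReduceScatter collectives that drive these stages can be illustrated with a small pure-Python simulation (an illustrative sketch only, not a distributed implementation; real systems run NCCL collectives on GPUs):

```python
def reduce_scatter(per_rank_grads):
    """Simulate ReduceScatter: sum the per-rank gradient vectors
    elementwise, then give each rank only the 1/D chunk it owns."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    assert n % world_size == 0
    chunk = n // world_size
    summed = [sum(col) for col in zip(*per_rank_grads)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def all_gather(per_rank_chunks):
    """Simulate AllGather: every rank receives the concatenation of all chunks."""
    full = [x for chunk in per_rank_chunks for x in chunk]
    return [list(full) for _ in per_rank_chunks]

# Two ranks, each holding a full 4-element gradient before synchronization.
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
shards = reduce_scatter(grads)
print(shards)                  # [[11.0, 22.0], [33.0, 44.0]]
print(all_gather(shards)[0])   # [11.0, 22.0, 33.0, 44.0]
```

After the reduce-scatter, each rank holds exactly the reduced shard it needs for its local optimizer update; the all-gather reconstructs the full tensor when every rank needs it again.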

Communication Volume per Iteration

Stage     Collectives                       Volume per iteration
DP        AllReduce                         $2GP$
ZeRO-1    ReduceScatter + AllGather         $(G+1)P$
ZeRO-2    ReduceScatter + AllGather         $(G+1)P$
ZeRO-3    ReduceScatter + AllGather (×2)    $(G+2)P$

The partitioning enables memory footprints to scale inversely with $D$, permitting the training of extremely large models on fixed hardware resources (Rajbhandari et al., 2019, Bai et al., 23 Oct 2025).
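As a worked example of the formulas above, the per-rank model-state memory can be computed directly (a sketch assuming the ZeRO paper's mixed-precision accounting: 2 bytes each for fp16 parameters and gradients, and 12 bytes of Adam state per parameter):

```python
def per_rank_memory_gb(P, D, A=2, G=2, S=12, stage=3):
    """Per-rank model-state memory in GiB for the sharding stages above.
    Defaults assume fp16 params/grads plus fp32 Adam states (16 bytes/param)."""
    if stage == 1:
        b = A * P + G * P + S * P / D          # only optimizer states sharded
    elif stage == 2:
        b = A * P + (G * P + S * P) / D        # gradients sharded as well
    elif stage == 3:
        b = (A * P + G * P + S * P) / D        # everything sharded
    else:                                      # plain data parallelism
        b = (A + G + S) * P
    return b / 2**30

# 7B parameters across 64 data-parallel ranks:
P, D = 7e9, 64
for stage in (0, 1, 2, 3):
    print(stage, round(per_rank_memory_gb(P, D, stage=stage), 1))
```

Plain DP replicates all ~104 GiB of model state on every rank, while ZeRO-3 drops the per-rank footprint to under 2 GiB in this configuration.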

2. Architectural Bottlenecks and Communication Complexity

ZeRO’s all-to-all sharding improves memory scalability but introduces unique communication bottlenecks, particularly for ZeRO-3:

  • Collective operations: Each micro-batch, per-layer execution requires AllGather (parameters) and ReduceScatter (gradients), both spanning the full DP group.
  • Network sensitivity: As cluster size grows, inter-node bandwidth (e.g., InfiniBand) lags behind intra-node links (e.g., NVLink), amplifying network bottlenecks.
  • Micro-batch scaling: Increasing cluster size for fixed global batch reduces per-device micro-batch size, reducing compute/comm ratio and Model FLOPs Utilization (MFU).
  • Effective bandwidth of collectives can degrade significantly at scale, and MFU drops notably at high rank counts (e.g., 63% → 36% MFU for LLaMA-7B scaling from 8 to 1024 GPUs under ZeRO-1) (Chen et al., 2023).
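MFU itself is just achieved FLOPs over peak FLOPs. A minimal sketch, assuming the common ~6N FLOPs-per-token approximation for dense transformer training (the constant and the throughput figures below are illustrative assumptions, not measurements from the cited papers):

```python
def mfu(params, tokens_per_sec, num_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization under the ~6*N FLOPs/token approximation
    for dense transformer training (an assumption for illustration)."""
    achieved = 6 * params * tokens_per_sec
    return achieved / (num_gpus * peak_flops_per_gpu)

# Hypothetical numbers: a 7B model on 8 GPUs at 312 TFLOP/s peak each.
print(round(mfu(7e9, 3.0e4, 8, 312e12), 2))  # → 0.5
```

Holding the global batch fixed while growing the cluster shrinks tokens_per_sec per device relative to peak, which is exactly the compute/communication-ratio degradation described above.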

These properties motivate further optimization—particularly communication/compute overlap, communication volume reduction, and flexible sharding strategies (Chen et al., 2023, Wang et al., 2023, Bai et al., 23 Oct 2025).

3. Enhancements: Flexible Sharding and Communication-Reducing Extensions

Recent research pursues flexibility and communication efficiency beyond canonical ZeRO-3, including:

  • Flexible Per-State Sharding: AMSP introduces parameterizable sharding for parameters ($s_p$), gradients ($s_g$), and optimizer states ($s_{os}$), each selecting among Full-Replica, Partial-Shard, or Full-Shard modes. These modes can be group-specific along hierarchical device meshes (e.g., intra-node vs. inter-node) (Chen et al., 2023).
  • Block-Quantized Collectives (ZeRO++): Communication collectives (e.g., AllGather on weights) are conducted on low-precision, block-quantized (INT8/INT4) tensors, trading negligible accuracy loss for 2×–4× lower bandwidth use (Wang et al., 2023).
  • Hierarchical Weight Partitioning: Local (intra-node) weight replication allows certain collectives to avoid cross-node communication by exploiting hardware topology.
  • All-to-All Quantized Gradient Averaging: Two-stage (intra-node then inter-node) all-to-all for quantized gradients further reduces the cross-node communication footprint, with full-precision summation preserving convergence (Wang et al., 2023).
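The block-quantization idea behind these collectives can be sketched in pure Python (per-block symmetric INT8 here; the real ZeRO++ kernels are fused GPU implementations, so this is illustrative only):

```python
def quantize_blocks(values, block=4):
    """Per-block symmetric INT8 quantization: each block stores one
    fp scale plus 1-byte integers instead of full-precision values."""
    out = []
    for i in range(0, len(values), block):
        blk = values[i:i + block]
        scale = max(abs(v) for v in blk) / 127 or 1.0  # guard all-zero blocks
        out.append((scale, [round(v / scale) for v in blk]))
    return out

def dequantize_blocks(blocks):
    """Reconstruct approximate full-precision values from (scale, ints)."""
    vals = []
    for scale, qs in blocks:
        vals.extend(q * scale for q in qs)
    return vals

grads = [0.1, -0.5, 0.25, 1.0]
restored = dequantize_blocks(quantize_blocks(grads))
print(max(abs(a - b) for a, b in zip(grads, restored)) < 1 / 127)  # True
```

Per-block scales bound the quantization error by the block's dynamic range, which is why fine-grained blocks preserve accuracy better than a single per-tensor scale.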

A generalized communication cost model for AMSP is $T_{comm} = T_p + T_g + T_{os}^0 + T_{os}^1$, with each component defined in terms of collective type, message size, and communicator configuration.

These mechanisms are selected and coordinated via heuristics or integer programming solvers subject to hardware and model memory/comm constraints (Chen et al., 2023).

4. Asynchronous and Hierarchical Scheduling

To address the cost of blocking collectives and pipeline stalls:

  • AsyncHZP and AMSP’s Executor orchestrate asynchronous, multi-stream execution schedules to overlap communication and computation. Key primitives (AllGather, ReduceScatter) are triggered in background streams, maximizing overlap across layer boundaries and reducing idle hardware cycles (Bai et al., 23 Oct 2025, Chen et al., 2023).
  • Hierarchical Sharding: AsyncHZP decouples sharding group sizes for different model states, allowing optimizer states, gradients, and parameters to use custom replica groups $(Z_1, Z_2, Z_3)$. This design tailors collective communication to hardware topology (e.g., sharding optimizer states across the entire cluster while keeping weights replicated within nodes for fast NVLink all-gathers) (Bai et al., 23 Oct 2025).
  • The cumulative memory and communication costs for AsyncHZP, with $N$ the number of model parameters, are

$M_{hzp} = 12N/Z_1 + 4N/Z_2 + 2N/Z_3$

$C_{hzp} = 2N \frac{Z_3 - 1}{Z_3} + 4N \frac{Z_2 - 1}{Z_2}$

  • Overlap strategies yield near-100% computation/communication concurrency with negligible memory fragmentation, enabled by contiguous cyclic buffer pools and two-stream asynchronous scheduling (Bai et al., 23 Oct 2025).
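The prefetch-and-overlap schedule can be sketched with a background "communication stream" (a pure-Python analogy using a worker thread; the function names and sleeps are hypothetical stand-ins for real collectives and layer compute on CUDA streams):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def all_gather_params(layer_id):
    time.sleep(0.01)              # stand-in for the AllGather collective
    return f"params-{layer_id}"

def compute(layer_id, params):
    time.sleep(0.01)              # stand-in for the layer's FLOPs
    return f"act-{layer_id}"

def forward_with_prefetch(num_layers):
    acts = []
    # One worker thread plays the role of a dedicated communication stream.
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        fut = comm_stream.submit(all_gather_params, 0)
        for i in range(num_layers):
            params = fut.result()                 # block only if gather lags
            if i + 1 < num_layers:
                fut = comm_stream.submit(all_gather_params, i + 1)
            acts.append(compute(i, params))       # overlaps next layer's gather
    return acts

print(forward_with_prefetch(3))  # ['act-0', 'act-1', 'act-2']
```

Because layer i+1's gather is launched before layer i's compute, communication latency hides behind computation whenever the two costs are comparable.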

5. Empirical Performance and Scalability

Key empirical results reported in original and follow-up ZeRO works include:

  • Hardware Scaling: ZeRO-100B demonstrates super-linear speedups up to 400 GPUs, training 100B-parameter models with >10× throughput gain over Megatron-LM model parallelism (Rajbhandari et al., 2019).
  • Large Model Training: ZeRO allows, for example, training up to 128B-parameter models on 64 GPUs (Stage 3), and up to 2T parameters on 1024 GPUs, under reasonable memory and bandwidth constraints (Rajbhandari et al., 2019).
  • MFU and Throughput: AMSP achieves 52% MFU on LLaMA-13B with 1024 GPUs, versus 4% for ZeRO-3 and 33% for MiCS; throughput improves by up to 12.7× over ZeRO++ on the largest models (Chen et al., 2023).
  • Bandwidth-Limited Scale: ZeRO++ delivers a 4× reduction in cross-node communication volume (from $3M$ to $0.75M$, where $M$ is the model size), translating into up to 2.16× real-world throughput improvement on 384 GPUs with 100 Gbps networking (Wang et al., 2023).
  • Cluster Linearity: AsyncHZP attains near-linear scaling on clusters of up to 1,000 devices, with a 10–20% throughput advantage over tensor- and pipeline-parallel baselines (Bai et al., 23 Oct 2025).

6. Practical Considerations and System Integration

Integration of ZeRO and its extensions involves several system design considerations:

  • DeepSpeed: ZeRO is natively implemented in DeepSpeed, with support for configuration-based selection of ZeRO stage, communication kernel overlap, and micro-batching strategies (Rajbhandari et al., 2019).
  • Communication/Memory Trade-offs: Users can fine-tune sharding group sizes, collective kernel precision (INT4/INT8 vs FP16), and memory-vs-bandwidth preferences to suit hardware topology and model scale. For example, ZeRO++’s hierarchical partitioning trades additional intra-node parameter storage for cross-node communication elimination (Wang et al., 2023).
  • Batch Sizing: Performance is sensitive to per-device batch size; small micro-batches (from large DP) worsen comm/compute ratio. Gradient accumulation and activation recomputation (ZeRO-R) help mitigate OOM at cost of additional compute and communication (Rajbhandari et al., 2019, Chen et al., 2023).
  • Overlap Strategies: Maximizing MFU and throughput depends on pipelining communication via asynchronous CUDA streams, prefetching parameters, and overlapping ReduceScatter operations with computation (Bai et al., 23 Oct 2025, Chen et al., 2023).
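A minimal DeepSpeed configuration sketch enabling ZeRO-3 with the ZeRO++ features discussed above (field names follow the DeepSpeed documentation at time of writing; the bucket sizes and hpZ partition size are placeholder values to tune per cluster):

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "zero_quantized_weights": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_gradients": true
  }
}
```

Here `zero_hpz_partition_size` is typically set to the number of GPUs per node so secondary weight replicas stay on fast intra-node links; consult the current DeepSpeed documentation before relying on exact key names.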

7. Extensions, Limitations, and Future Directions

  • Communication Boundaries: ZeRO’s primary scaling bottleneck at very large device counts is dominated by collective operation costs, with diminishing returns expected as networks approach their bisection bandwidth limit. Advanced algorithms (block quantization, partial sharding, hierarchical collectives) remain active areas of research (Wang et al., 2023, Chen et al., 2023).
  • Mixed Precision and Quantization: Migrating to INT4/INT8 collectives for gradients and weights respectively offers significant bandwidth savings with marginal (sub-0.1%) regression in task accuracy (Wang et al., 2023).
  • Adaptive Sharding: General algorithms for device-, layer-, and state-specific sharding constitute the next frontier. Solvers that adapt communication patterns to topological profiling and workload distribution maximize both hardware utilization and model capacity at scale (Chen et al., 2023, Bai et al., 23 Oct 2025).
  • API Stabilization: The trend in ZeRO system research is toward API simplification—offering data-parallel-like interfaces while hiding hierarchical, asynchronous implementation details (Bai et al., 23 Oct 2025).

ZeRO and its derivatives have fundamentally shifted the practical limits of large model training, enabling cluster-scale, memory-optimal, and communication-efficient regimes that underpin current and future advances in LLM development and deployment (Rajbhandari et al., 2019, Chen et al., 2023, Wang et al., 2023, Bai et al., 23 Oct 2025).
