
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (2104.07857v1)

Published 16 Apr 2021 in cs.DC, cs.AI, cs.LG, and cs.PF

Abstract: In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been supported primarily through system innovations that allow large models to fit in the aggregate GPU memory of multiple GPUs. However, we are getting close to the GPU memory wall. It requires 800 NVIDIA V100 GPUs just to fit a trillion parameter model for training, and such clusters are simply out of reach for most data scientists. In addition, training models at that scale requires complex combinations of parallelism techniques that put a big burden on the data scientists to refactor their model. In this paper we present ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. At the same time it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters. It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating super-linear scalability. An open source implementation of ZeRO-Infinity is available through DeepSpeed, a deep learning optimization library that makes distributed training easy, efficient, and effective.

ZeRO-Infinity: Advancing the Scalability of Deep Learning

The paper "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning" explores a significant advancement in the field of deep learning (DL) training systems. It introduces ZeRO-Infinity, a novel system designed to address the escalating demands of training large-scale DL models that are increasingly constrained by GPU memory limits. The contribution is a strategically engineered solution that leverages a heterogeneous memory setup, including GPU, CPU, and NVMe memory, to enable the training and fine-tuning of models with tens and even hundreds of trillions of parameters.

The authors argue that the growing size of DL models—prompted by improvements in model accuracy with increased parameters—remains bottlenecked by the limited memory scaling of GPUs. ZeRO-Infinity addresses this by exploiting the vast storage of CPU and NVMe memories in addition to GPU memory, thereby transcending the so-called "GPU memory wall." The paper states that ZeRO-Infinity successfully trains models with trillions of parameters across current GPU clusters while maintaining high efficiency and easing usability issues common in such scaling endeavors.
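
To make the memory wall concrete, here is a back-of-envelope sketch using the 16-bytes-per-parameter accounting for mixed-precision Adam training that the ZeRO line of work uses; the constants are illustrative, not measurements from the paper:

```python
# Why a trillion-parameter model needs hundreds of GPUs just to *fit*.
# 16 bytes/param is the usual mixed-precision Adam accounting:
# fp16 param (2) + fp16 grad (2) + fp32 param, momentum, variance (4+4+4).
params = 1e12                        # one trillion parameters
bytes_per_param = 2 + 2 + 4 + 4 + 4  # 16 bytes of model + optimizer state
model_states_tb = params * bytes_per_param / 1e12  # -> 16 TB

v100_hbm_gb = 32                     # per-GPU memory on a 32 GB V100
gpus_to_hold_states = model_states_tb * 1e12 / (v100_hbm_gb * 1e9)  # -> 500

print(f"model states: {model_states_tb:.0f} TB")
print(f"32 GB V100s just for model states: {gpus_to_hold_states:.0f}")
# Activations and working buffers push the practical requirement toward
# the ~800 GPUs cited in the abstract.
```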

Key Contributions and Results

  1. Unprecedented Model Scale:
    • ZeRO-Infinity extends the ZeRO family of technologies with an "infinity offload engine" that uses CPU and NVMe memory efficiently to surpass GPU memory constraints (a configuration sketch follows this list).
    • The paper reports training models as large as 32 trillion parameters on 512 NVIDIA V100 GPUs, roughly 50 times larger than what state-of-the-art 3D parallelism can fit on the same hardware.
  2. Training Efficiency:
    • The system sustains over 25 petaflops of throughput on the same 512-GPU cluster (40% of peak). This efficiency stems from a novel data-partitioning strategy, termed "bandwidth-centric partitioning," which leverages the aggregate memory bandwidth across all devices.
  3. Ease of Use:
    • By obviating the need for complicated model-code refactoring and hand-tuned combinations of parallelism techniques, ZeRO-Infinity makes large-model training more accessible. It automates data movement and model initialization, allowing researchers to scale models without manually managing complex dependency chains (see the initialization sketch after this list).
  4. Broad Accessibility:
    • Another core implication of ZeRO-Infinity is its ability to democratize access to DL model fine-tuning. For instance, fine-tuning GPT-3 level models on a single NVIDIA DGX-2 node is made feasible using this technology.
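
As a concrete illustration of points 1 and 3, the sketch below shows roughly how the open-source DeepSpeed implementation exposes these ideas. The `zero_optimization` fields follow DeepSpeed's documented ZeRO stage-3 configuration schema, but the paths, sizes, and toy model are placeholders, and exact field names can vary across DeepSpeed versions:

```python
import torch
import deepspeed

# Hypothetical ZeRO-Infinity-style configuration: ZeRO stage 3 partitions
# parameters, gradients, and optimizer states across ranks, and the offload
# sections push them out to NVMe. Paths and batch sizes are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder NVMe mount point
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
        },
    },
}

# deepspeed.zero.Init partitions parameters while the model object is being
# constructed, so a model far larger than any single device can be built
# without refactoring the model code itself.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = torch.nn.Sequential(
        *[torch.nn.Linear(8192, 8192) for _ in range(48)]  # toy "large" model
    )

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```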

Implications and Future Prospects

The implications of this research are substantial, potentially redefining the limits associated with training very large-scale models. By successfully leveraging slower, cheaper memory tiers, ZeRO-Infinity prompts a shift in focus from merely increasing GPU memory to more holistic memory-exploitation strategies that include CPU and NVMe.
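
The reason slow NVMe does not throttle training is the bandwidth-centric partitioning noted above: each of the N data-parallel ranks owns 1/N of every tensor and fetches its own slice in parallel, so effective fetch bandwidth scales with device count rather than being capped by a single link. A toy back-of-envelope (all numbers hypothetical):

```python
# Single-owner fetch vs. bandwidth-centric partitioned fetch (illustrative).
per_rank_nvme_read_gbs = 3.0   # hypothetical NVMe read bandwidth per rank
num_ranks = 512                # e.g., the paper's 512-GPU experiments

# One rank reading an entire layer's parameters is capped by its own link.
single_owner_gbs = per_rank_nvme_read_gbs

# With each rank reading only its 1/N slice in parallel (then all-gathering),
# aggregate bandwidth grows roughly linearly with the number of ranks.
partitioned_gbs = per_rank_nvme_read_gbs * num_ranks

print(f"single-owner fetch:  {single_owner_gbs:.0f} GB/s")
print(f"partitioned fetch: ~{partitioned_gbs:.0f} GB/s aggregate")
```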

With this capability, researchers can anticipate continued growth in model size well beyond the 1000x increase of the last three years, pending advances in compute efficiency and device-to-device bandwidth. ZeRO-Infinity effectively prepares for future landscape shifts by decoupling achievable model scale from raw GPU memory capacity.

With the open-sourcing of ZeRO-Infinity through integration into DeepSpeed, broader adoption within the DL community could catalyze further innovation in how GPUs are utilized alongside CPUs and NVMe memory to handle increasingly complex workloads.

Conclusion

ZeRO-Infinity marks a pivotal development in advancing DL capabilities, both in terms of scale and practical application. By addressing the dual bottlenecks of memory and compute through novel aggregation and distribution strategies, the researchers deliver a tool that promises to make large-scale DL model training more feasible, efficient, and accessible than previously achievable. The implications of this work may further inform the design of future hardware systems optimized for DL tasks, furthering both the theoretical and practical avenues of artificial intelligence advancement.

Authors (5)
  1. Samyam Rajbhandari (21 papers)
  2. Olatunji Ruwase (20 papers)
  3. Jeff Rasley (10 papers)
  4. Shaden Smith (7 papers)
  5. Yuxiong He (59 papers)
Citations (318)