Overview of ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
The paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" tackles the challenge of scaling deep learning models to trillions of parameters, a task that is currently constrained by hardware memory limitations. Existing parallelism techniques often fall short either due to inefficiencies in memory or communication overheads. The proposed solution, Zero Redundancy Optimizer (ZeRO), presents a novel approach to optimize memory utilization in distributed training, enabling significantly larger models to be trained efficiently by leveraging existing hardware.
Key Contributions
The paper introduces three cumulative memory optimizations that form the core of the ZeRO methodology:
- Optimizer State Partitioning (P_os): Traditional data-parallel training replicates the optimizer states (for mixed-precision Adam, the fp32 copy of the parameters plus the momentum and variance buffers) on every device, leading to significant memory redundancy. ZeRO partitions these states so that each device stores and updates only its own shard, substantially reducing per-device memory and allowing larger models to fit within the same hardware constraints.
- Gradient Partitioning (P_g): Extending partitioning to gradients reduces the memory footprint further while preserving compute efficiency. Each gradient is reduced only to the device that owns the corresponding optimizer-state partition (effectively a reduce-scatter), so the communication volume stays on par with standard data parallelism.
- Parameter Partitioning (P_p): ZeRO also partitions the model parameters themselves. Each device keeps only its own parameter shard, and the shards needed for a given layer are broadcast just before that layer's forward or backward pass and discarded afterwards. This reduces parameter memory linearly with the data-parallel degree at the cost of roughly a 1.5x increase in total communication volume. The combined memory effect of the three stages, and the underlying communication pattern, are sketched below.
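For concreteness, here is a minimal sketch of the per-device memory accounting the paper uses for mixed-precision Adam training: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and K = 12 bytes of optimizer state (the fp32 parameter copy, momentum, and variance). The function name is an illustrative helper, not part of any ZeRO codebase; the 7.5B-parameter, 64-GPU example mirrors the paper's running example.

```python
def zero_memory_per_device(num_params, dp_degree, k=12):
    """Per-device memory (bytes) for mixed-precision Adam under each ZeRO-DP stage.

    Follows the accounting in the ZeRO paper: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, and k = 12 bytes/param of optimizer state
    (fp32 weight copy, momentum, variance). Illustrative helper only.
    """
    params = 2 * num_params       # fp16 parameters
    grads = 2 * num_params        # fp16 gradients
    opt = k * num_params          # fp32 Adam optimizer states
    return {
        "baseline DP": params + grads + opt,
        "P_os":        params + grads + opt / dp_degree,
        "P_os+g":      params + (grads + opt) / dp_degree,
        "P_os+g+p":    (params + grads + opt) / dp_degree,
    }

# Example: 7.5B parameters on 64 data-parallel GPUs.
for stage, n_bytes in zero_memory_per_device(7.5e9, 64).items():
    print(f"{stage:>11}: {n_bytes / 1e9:.1f} GB per device")
```

With these assumptions the baseline needs 120 GB per device, P_os brings it to roughly 31 GB, P_os+g to about 17 GB, and P_os+g+p to under 2 GB, matching the trend reported in the paper.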
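The communication pattern behind gradient and parameter partitioning can be sketched with standard torch.distributed collectives. The helper names below are hypothetical and this is not the paper's implementation (a real implementation would bucket and overlap these operations); it only illustrates that gradients are reduced to their owning rank and parameter shards are broadcast on demand. It assumes the process group is already initialized and tensor sizes divide evenly across ranks.

```python
import torch
import torch.distributed as dist


def partitioned_grad_reduce(local_grads, world_size, rank):
    """Illustrative reduce-scatter for gradient partitioning (P_g).

    `local_grads` is this rank's full, flattened gradient tensor. After the
    call, each rank holds only the fully reduced gradient shard it owns,
    instead of a complete replicated gradient. Hypothetical helper.
    """
    assert local_grads.numel() % world_size == 0, "assume even sharding"
    chunks = list(local_grads.chunk(world_size))
    owned_shard = torch.empty_like(chunks[rank])
    # Sum chunk i across all ranks and deposit the result on rank i.
    dist.reduce_scatter(owned_shard, chunks, op=dist.ReduceOp.SUM)
    return owned_shard


def fetch_params_for_layer(param_buffer, owner_rank):
    """Illustrative broadcast for parameter partitioning (P_p).

    The rank owning a layer's parameters broadcasts them right before that
    layer's forward or backward pass; other ranks receive a temporary copy
    into `param_buffer` and can free it once the pass is done.
    """
    dist.broadcast(param_buffer, src=owner_rank)
    return param_buffer
```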
The implementation of these optimizations allows ZeRO to train models of over 100 billion parameters with super-linear scalability, as demonstrated on a cluster of 400 GPUs. The reported results show an 8x increase in trainable model size and a 10x throughput gain over state-of-the-art frameworks.
Implications and Future Directions
ZeRO's approach marks a pivotal shift in large-scale model training. By making data parallelism memory-efficient, it allows training models that were previously infeasible on current hardware. This has significant implications for domains that depend on very large models, particularly natural language processing, where models with tens to hundreds of billions of parameters are becoming common.
Theoretically, ZeRO reduces the reliance on model-parallel strategies, which are more complex to implement and harder to scale. Practically, it democratizes large-model training by making it feasible without heavy dependence on expensive, high-bandwidth interconnects such as NVLink, which efficient model parallelism currently requires.
Looking forward, ZeRO opens avenues for further exploration and optimization in distributed training frameworks. As computational resources expand, so too will the scale at which models can be trained, potentially reaching the sought-after trillion-parameter mark. With the projected advancements in GPU memory capacity and compute capabilities, ZeRO forms a foundational component that could integrate seamlessly with upcoming exa-scale computing systems.
Conclusion
ZeRO represents a significant step toward overcoming one of the key bottlenecks in deep learning research: the ability to train massive models efficiently. By eliminating redundancy and optimizing memory usage, ZeRO expands the boundary of what can be achieved with existing resources, paving the way for future innovations in AI model scaling. Researchers and practitioners should consider ZeRO a viable path for exploring the next generation of model architectures without being constrained by memory limitations.