ZeRO++: Extremely Efficient Collective Communication for Giant Model Training (2306.10209v1)

Published 16 Jun 2023 in cs.DC, cs.AI, cs.LG, and cs.PF

Abstract: Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of LLMs on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at scale which forces batch size per GPU to be small, ZeRO's effective throughput is limited because of high communication volume from gathering weights in forward pass, backward pass, and averaging gradients. This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO. First is block-quantization based all-gather. Second is data remapping that trades off communication for more memory. Third is a novel all-to-all based quantized gradient averaging paradigm as replacement of reduce-scatter collective, which preserves accuracy despite communicating low precision data. Collectively, ZeRO++ reduces communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384 GPU scale.

The paper "ZeRO++: Extremely Efficient Collective Communication for Giant Model Training" presents enhancements to the ZeRO optimizer aimed at improving the training efficiency of LLMs on GPU clusters. The innovations introduced are critical given the increased communication bottlenecks encountered when scaling model training across diverse and large-scale distributed systems.

Core Contributions

The authors introduce ZeRO++, a set of communication volume reduction techniques designed to optimize ZeRO’s performance, particularly in resource-constrained environments. The three main strategies are:

  1. Quantized Weight Communication for ZeRO (qwZ): Model weights are quantized from FP16 to INT8 on the fly during the forward-pass all-gather, halving that collective's communication volume. Block-based quantization keeps the precision loss small enough to preserve training accuracy (a simplified sketch of this block-wise scheme appears right after this list).
  2. Hierarchical Partitioning for ZeRO (hpZ): A secondary partition of the FP16 weights is kept within each compute node, trading a modest amount of extra GPU memory for reduced communication: the backward-pass all-gather then runs entirely over high-bandwidth intra-node links (e.g., NVLink/NVSwitch), eliminating cross-node traffic for that collective.
  3. Quantized Gradient Communication for ZeRO (qgZ): A novel all-to-all based quantized gradient averaging scheme replaces the traditional reduce-scatter collective. Gradients are quantized to INT4 for communication and dequantized back to full precision before reduction, substantially cutting cross-node communication volume without degrading training accuracy (a toy simulation of this pipeline also follows the list).
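
Both qwZ and qgZ rest on block-based quantization: each block of a tensor gets its own scale derived from its maximum absolute value, so an outlier only distorts its own block rather than the whole tensor. The sketch below is a minimal single-process, PyTorch-based illustration of that idea, not the paper's fused CUDA kernels; the block size, the symmetric max-abs scaling, and the use of int8 storage for INT4 values are assumptions made for clarity.

```python
import torch

def blockwise_quantize(x: torch.Tensor, block_size: int = 256, bits: int = 8):
    """Symmetric block-wise quantization: one scale per block of `block_size` values."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for INT8, 7 for INT4
    pad = (-x.numel()) % block_size                  # pad so the tensor splits evenly
    flat = torch.nn.functional.pad(x.flatten().float(), (0, pad))
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(blocks / scales).to(torch.int8)  # INT4 values also fit in int8 storage
    return q, scales, x.numel()

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor, numel: int,
                         dtype: torch.dtype = torch.float16) -> torch.Tensor:
    """Rescale each block and drop the padding to recover an approximate tensor."""
    return (q.float() * scales).flatten()[:numel].to(dtype)

# Round-trip error stays small because every block is scaled independently.
w = torch.randn(10_000, dtype=torch.float16)
q, s, n = blockwise_quantize(w, bits=8)              # roughly half the FP16 payload
w_hat = blockwise_dequantize(q, s, n)
print((w - w_hat).abs().max())
```

With bits=4 the payload shrinks to roughly a quarter of FP16, which is the regime qgZ targets for gradients; the paper's contribution is performing these steps inside fused kernels and overlapped communication so the quantization overhead does not erase the bandwidth savings.

To make qgZ's communication pattern concrete, the following toy single-process simulation stands in for a real torch.distributed all-to-all. The per-chunk max-abs INT4 scaling and the rank/chunk sizes are illustrative assumptions, not the paper's implementation.

```python
import torch

def quantize(x: torch.Tensor, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

ranks, chunk = 4, 1024
# Each simulated rank holds a full-size FP16 gradient split into `ranks` chunks.
grads = [torch.randn(ranks * chunk, dtype=torch.float16) for _ in range(ranks)]

# Step 1: every rank quantizes each of its chunks to INT4 (stored here in int8).
sent = [[quantize(g.float().view(ranks, chunk)[dst]) for dst in range(ranks)]
        for g in grads]

# Step 2: all-to-all exchange -- rank `dst` receives chunk `dst` from every rank.
received = [[sent[src][dst] for src in range(ranks)] for dst in range(ranks)]

# Step 3: each rank dequantizes what it received and reduces in full precision,
# so every gradient value goes through only one quantize/dequantize round trip.
reduced = [torch.stack([dequantize(q, s) for q, s in received[dst]]).mean(dim=0)
           for dst in range(ranks)]

# Compare rank 0's result against the exact reduce-scatter output.
exact = torch.stack([g.float().view(ranks, chunk)[0] for g in grads]).mean(dim=0)
print((reduced[0] - exact).abs().max())
```

The property this toy version preserves is the one the paper emphasizes: each gradient value is quantized and dequantized exactly once before full-precision reduction, unlike a naively quantized ring reduce-scatter, which would re-quantize partial sums at every hop and compound the error.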

Performance and Implications

ZeRO++ reduces cross-node communication volume by 4x relative to baseline ZeRO, which translates into up to 2.16x higher training throughput at 384-GPU scale. This matters most in low-bandwidth settings typical of many cloud environments, and whenever scale forces the per-GPU batch size to be small.
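
The 4x figure can be reproduced with a simple per-step accounting of cross-node traffic, expressed relative to the model size M in FP16 elements and following the same breakdown as the technique list above:

```latex
% Baseline ZeRO-3: three collectives, each moving the full model size M
\underbrace{M}_{\text{fwd all-gather}} + \underbrace{M}_{\text{bwd all-gather}}
  + \underbrace{M}_{\text{grad reduce-scatter}} = 3M

% ZeRO++: qwZ halves the forward all-gather (INT8), hpZ keeps the backward
% all-gather intra-node (no cross-node volume), qgZ quarters gradients (INT4)
\underbrace{0.5M}_{\text{qwZ}} + \underbrace{0}_{\text{hpZ}}
  + \underbrace{0.25M}_{\text{qgZ}} = 0.75M = \tfrac{3M}{4}
```

The measured end-to-end gain (up to 2.16x at 384 GPUs) is smaller than 4x because computation and intra-node communication are left unchanged.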

These improvements extend ZeRO’s applicability: by lowering the interconnect bandwidth needed to train massive models efficiently, ZeRO++ brings such training within reach of organizations with more limited cluster infrastructure.

Future Directions

The techniques introduced in ZeRO++ could lay the groundwork for further innovations in distributed training. Future research could explore finer-grained quantization, adaptive communication strategies driven by real-time bandwidth availability, and integration with other optimizations such as gradient sparsification. As hardware configurations continue to evolve, adapting these techniques could also help exploit emerging interconnect technologies.

Conclusion

ZeRO++ represents a significant advance in communication optimization for large-scale model training. By making distributed training more bandwidth-efficient, it addresses a key scalability bottleneck and makes training very large models practical on a wider range of hardware, aligning with the broader goal of scaling AI systems efficiently.

Authors (9)
  1. Guanhua Wang
  2. Heyang Qin
  3. Sam Ade Jacobs
  4. Connor Holmes
  5. Samyam Rajbhandari
  6. Olatunji Ruwase
  7. Feng Yan
  8. Lei Yang
  9. Yuxiong He