ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology (2402.03791v3)
Abstract: Large-scale models rely heavily on 3D parallelism for distributed training, which uses tensor parallelism (TP) as the intra-operator parallelism to partition model states across GPUs. However, TP introduces significant communication overhead and complicates the modification of single-GPU code. In this paper, we propose ZeroPP, a TP-free distributed training framework that combines scalable inter-operator pipeline parallelism with intra-operator fully sharded data parallelism to train models at scale, reducing memory consumption and enabling high training efficiency. Through extensive experiments, we demonstrate that ZeroPP achieves performance gains of up to 33% over conventional 3D parallelism while maintaining comparable GPU memory consumption.
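The hybrid the abstract describes, inter-operator pipeline parallelism across stages combined with intra-operator fully sharded data parallelism (FSDP) within each stage and no tensor-parallel groups, can be illustrated with a minimal PyTorch sketch. This is not the ZeroPP implementation: the (pipeline, data-parallel) grid layout, the number of layers per stage, and the model dimensions below are illustrative assumptions, and the pipeline schedule itself is only indicated in a comment.

```python
# Hypothetical sketch (not ZeroPP itself): a TP-free layout that composes
# inter-operator pipeline parallelism with intra-operator FSDP in PyTorch.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

PP_SIZE, DP_SIZE = 4, 2  # assumed 8-GPU layout: 4 pipeline stages x 2 FSDP shards


def build_groups(rank):
    """Arrange ranks in a (pp, dp) grid; every rank must create every group."""
    pp_groups, dp_groups = {}, {}
    for dp in range(DP_SIZE):      # one pipeline group per data-parallel replica
        ranks = [stage * DP_SIZE + dp for stage in range(PP_SIZE)]
        pp_groups[dp] = dist.new_group(ranks)
    for stage in range(PP_SIZE):   # one FSDP (data-parallel) group per stage
        ranks = [stage * DP_SIZE + dp for dp in range(DP_SIZE)]
        dp_groups[stage] = dist.new_group(ranks)
    stage, dp = divmod(rank, DP_SIZE)
    return stage, dp, pp_groups[dp], dp_groups[stage]


def main():
    dist.init_process_group("nccl")
    assert dist.get_world_size() == PP_SIZE * DP_SIZE
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    stage, dp, pp_group, dp_group = build_groups(rank)

    # Each pipeline stage owns a contiguous slice of layers (sizes are illustrative).
    layers_per_stage = 6
    stage_module = nn.Sequential(
        *[nn.TransformerEncoderLayer(d_model=1024, nhead=16) for _ in range(layers_per_stage)]
    ).cuda()

    # Intra-operator sharding: parameters, gradients, and optimizer state of this
    # stage are sharded only across its data-parallel group -- no tensor parallelism.
    stage_module = FSDP(stage_module, process_group=dp_group)

    # A microbatched pipeline schedule (e.g., 1F1B) would then exchange activations
    # and activation gradients between adjacent stages over pp_group via
    # point-to-point communication.


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8`, each rank holds one shard of one pipeline stage; the absence of any tensor-parallel process group is what the abstract refers to as a TP-free configuration.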
- Ding Tang
- Lijuan Jiang
- Minxi Jin
- Jiecheng Zhou
- Hengjie Li
- Xingcheng Zhang
- Zhilin Pei
- Jidong Zhai