ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology (2402.03791v3)
Abstract: Large-scale models rely heavily on 3D parallelism for distributed training, which uses tensor parallelism (TP) as the intra-operator parallelism to partition model states across GPUs. However, TP introduces significant communication overhead and complicates the modification of single-GPU code. In this paper, we propose ZeroPP, a TP-free distributed training framework that combines scalable inter-operator pipeline parallelism with intra-operator fully sharded data parallelism to train models at scale, reducing memory consumption and enabling high training efficiency. Through extensive experiments, we demonstrate that ZeroPP achieves performance gains of up to 33% over conventional 3D parallelism while maintaining comparable GPU memory consumption.
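The hybrid the abstract describes, inter-operator pipeline parallelism across stages combined with intra-operator fully sharded data parallelism (FSDP) within each stage and no tensor-parallel groups, can be illustrated with a minimal PyTorch sketch. This is not the ZeroPP implementation: the (pipeline, data-parallel) grid layout, the number of layers per stage, and the model dimensions below are illustrative assumptions, and the pipeline schedule itself is only indicated in a comment.

```python
# Hypothetical sketch (not ZeroPP itself): a TP-free layout that composes
# inter-operator pipeline parallelism with intra-operator FSDP in PyTorch.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

PP_SIZE, DP_SIZE = 4, 2  # assumed 8-GPU layout: 4 pipeline stages x 2 FSDP shards


def build_groups(rank):
    """Arrange ranks in a (pp, dp) grid; every rank must create every group."""
    pp_groups, dp_groups = {}, {}
    for dp in range(DP_SIZE):      # one pipeline group per data-parallel replica
        ranks = [stage * DP_SIZE + dp for stage in range(PP_SIZE)]
        pp_groups[dp] = dist.new_group(ranks)
    for stage in range(PP_SIZE):   # one FSDP (data-parallel) group per stage
        ranks = [stage * DP_SIZE + dp for dp in range(DP_SIZE)]
        dp_groups[stage] = dist.new_group(ranks)
    stage, dp = divmod(rank, DP_SIZE)
    return stage, dp, pp_groups[dp], dp_groups[stage]


def main():
    dist.init_process_group("nccl")
    assert dist.get_world_size() == PP_SIZE * DP_SIZE
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    stage, dp, pp_group, dp_group = build_groups(rank)

    # Each pipeline stage owns a contiguous slice of layers (sizes are illustrative).
    layers_per_stage = 6
    stage_module = nn.Sequential(
        *[nn.TransformerEncoderLayer(d_model=1024, nhead=16) for _ in range(layers_per_stage)]
    ).cuda()

    # Intra-operator sharding: parameters, gradients, and optimizer state of this
    # stage are sharded only across its data-parallel group -- no tensor parallelism.
    stage_module = FSDP(stage_module, process_group=dp_group)

    # A microbatched pipeline schedule (e.g., 1F1B) would then exchange activations
    # and activation gradients between adjacent stages over pp_group via
    # point-to-point communication.


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8`, each rank holds one shard of one pipeline stage; the absence of any tensor-parallel process group is what the abstract refers to as a TP-free configuration.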
- Ding Tang
- Lijuan Jiang
- Minxi Jin
- Jiecheng Zhou
- Hengjie Li
- Xingcheng Zhang
- Zhilin Pei
- Jidong Zhai