Rethinking Memory and Communication Cost for Efficient Large Language Model Training (2310.06003v2)
Abstract: Recently, various distributed strategies for LLM training have been proposed. However, these methods provide only limited solutions to the trade-off between memory consumption and communication cost. In this paper, we rethink the impact of memory consumption and communication cost on the training speed of LLMs, and propose Partial Redundancy Optimizer (PaRO), a set of memory-communication balanced strategies. Through fine-grained sharding, PaRO offers comprehensive options that reduce the amount and frequency of inter-group communication at the cost of minor memory redundancy, thereby improving training efficiency across various training scenarios. Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring) communication topology to enhance communication efficiency between nodes or across switches in LLM training. Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method and achieves near-linear scalability. The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
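To make the memory-communication trade-off concrete, below is a minimal, self-contained sketch. It is not the paper's implementation: the bytes-per-parameter figures, the cluster layout, and the traffic estimate are illustrative assumptions. It only shows how sharding optimizer states within a group of GPUs (and replicating across groups) trades a little extra memory per GPU for much less gradient traffic crossing group boundaries, which is the slow inter-node or cross-switch path the abstract refers to.

```python
# Illustrative sketch of group-wise ("partial redundancy") sharding.
# All constants below are assumptions for illustration, not values from the paper.

def per_gpu_optimizer_bytes(num_params, group_size, bytes_per_param_state=12):
    """Optimizer-state memory per GPU when states are sharded only within a
    group of `group_size` GPUs and replicated across groups.

    group_size == world_size -> full sharding (ZeRO-3-like, least memory)
    group_size == 1          -> full replication (most memory)
    """
    return num_params * bytes_per_param_state / group_size


def inter_group_traffic_per_step(num_params, world_size, group_size,
                                 bytes_per_grad=2):
    """Rough per-step gradient volume that must cross group boundaries.

    Larger groups mean fewer replicas of each shard, so less data has to be
    synchronized *between* groups (the cross-node / cross-switch path).
    Assumes a ring-style reduction over the `num_groups` replicas.
    """
    num_groups = world_size // group_size
    return num_params * bytes_per_grad * (num_groups - 1) / num_groups


if __name__ == "__main__":
    P = 7_000_000_000        # 7B-parameter model (illustrative)
    world = 64               # e.g. 8 nodes x 8 GPUs (illustrative)
    for g in (1, 8, 64):     # replication, node-level sharding, full sharding
        mem = per_gpu_optimizer_bytes(P, g) / 2**30
        net = inter_group_traffic_per_step(P, world, g) / 2**30
        print(f"group_size={g:3d}  optimizer mem/GPU ~ {mem:7.1f} GiB  "
              f"inter-group traffic ~ {net:6.1f} GiB/step")
```

Running the sketch shows the two extremes the abstract contrasts: group_size equal to the world size gives the lowest memory but leaves all gradient reduction on the inter-group path, while an intermediate group size (for example, one group per node) keeps a modest memory overhead and moves most communication onto the fast intra-group links.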
- Chan Wu
- Hanxiao Zhang
- Lin Ju
- Jinjing Huang
- Youshao Xiao
- Zhaoxin Huan
- Siyuan Li
- Fanzhuang Meng
- Lei Liang
- Xiaolu Zhang
- Jun Zhou