Rethinking Memory and Communication Cost for Efficient Large Language Model Training (2310.06003v2)

Published 9 Oct 2023 in cs.LG and cs.AI

Abstract: Recently, various distributed strategies for LLM training have been proposed. However, these methods offer only limited solutions to the trade-off between memory consumption and communication cost. In this paper, we rethink the impact of memory consumption and communication cost on the training speed of LLMs, and propose a memory-communication balanced strategy set, the Partial Redundancy Optimizer (PaRO). PaRO provides comprehensive options that reduce the amount and frequency of inter-group communication, at the cost of minor memory redundancy, through fine-grained sharding strategies, thereby improving training efficiency across various training scenarios. Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring) communication topology to enhance communication efficiency between nodes or across switches in LLM training. Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method and achieves near-linear scalability. The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
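The abstract centers on a trade-off: accepting a small amount of memory redundancy in exchange for less (and less frequent) inter-group communication. The minimal sketch below is not from the paper; the group sizes, model size, and the 12-bytes-per-parameter optimizer-state estimate are illustrative assumptions. It only shows how sharding optimizer states within a node, rather than across the whole cluster, raises per-GPU memory but keeps gather traffic on fast intra-node links, which is the kind of balance PaRO's fine-grained sharding options are described as targeting.

```python
# Illustrative sketch (not the paper's code): memory cost of sharding optimizer
# states within a node vs. across the whole cluster. All numbers are assumptions.

def shard_bytes_per_gpu(param_count, shard_group_size, bytes_per_param=12):
    """Optimizer-state bytes held by one GPU when states are sharded across
    `shard_group_size` ranks (~12 B/param for an fp32 master copy + Adam moments)."""
    return param_count * bytes_per_param / shard_group_size

params = 7e9                      # e.g. a 7B-parameter model (assumed)
gpus_per_node, nodes = 8, 16      # assumed cluster layout
world = gpus_per_node * nodes

# Global sharding (ZeRO-3 style): least memory, but gathers cross node boundaries.
global_shard = shard_bytes_per_gpu(params, world)

# Intra-node sharding with replicas across nodes (partial redundancy): more
# memory per GPU, but optimizer-state gathers never leave the node.
intra_node_shard = shard_bytes_per_gpu(params, gpus_per_node)

print(f"global shard    : {global_shard / 2**30:.2f} GiB per GPU")
print(f"intra-node shard: {intra_node_shard / 2**30:.2f} GiB per GPU "
      f"({nodes}x redundancy across nodes, no inter-node gather)")
```

Running the sketch with these assumed numbers prints roughly 0.61 GiB per GPU for global sharding versus about 9.8 GiB for intra-node sharding, making the memory-versus-communication trade-off concrete.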

Authors (11)
  1. Chan Wu (2 papers)
  2. Hanxiao Zhang (24 papers)
  3. Lin Ju (10 papers)
  4. Jinjing Huang (1 paper)
  5. Youshao Xiao (6 papers)
  6. Zhaoxin Huan (10 papers)
  7. Siyuan Li (140 papers)
  8. Fanzhuang Meng (2 papers)
  9. Lei Liang (37 papers)
  10. Xiaolu Zhang (39 papers)
  11. Jun Zhou (370 papers)
Citations (4)