
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey (2407.20018v1)

Published 29 Jul 2024 in cs.DC

Abstract: LLMs like GPT and LLaMA are revolutionizing the AI industry with their sophisticated capabilities. Training these models requires vast GPU clusters and significant computing time, posing major challenges in terms of scalability, efficiency, and reliability. This survey explores recent advancements in training systems for LLMs, including innovations in training infrastructure with AI accelerators, networking, storage, and scheduling. Additionally, the survey covers parallelism strategies, as well as optimizations for computation, communication, and memory in distributed LLM training. It also includes approaches to maintaining system reliability over extended training periods. By examining current innovations and future directions, this survey aims to provide valuable insights towards improving LLM training systems and tackling ongoing challenges. Furthermore, traditional digital circuit-based computing systems face significant constraints in meeting the computational demands of LLMs, highlighting the need for innovative solutions such as optical computing and optical networks.
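
To make the "parallelism strategies" mentioned in the abstract concrete, here is a minimal, self-contained sketch (not taken from the paper; the `Worker` class and `all_reduce_mean` helper are illustrative stand-ins). It simulates the core loop of data parallelism: each worker computes gradients on its own data shard, gradients are averaged across workers in an all-reduce-style step, and every replica then applies the same synchronized update.

```python
# Toy illustration of data-parallel training (assumed example, not the paper's code).
# Each simulated worker holds a replica of the model parameters, computes
# gradients on its local data shard, and an all-reduce-style average of the
# gradients precedes a synchronized parameter update on every replica.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Worker:
    """One data-parallel replica: a scalar 'model' w fit to y = 2x."""
    params: float = 0.0
    shard: List[Tuple[float, float]] = field(default_factory=list)  # local (x, y) pairs

    def local_gradient(self) -> float:
        # Gradient of the mean squared error 0.5*(w*x - y)^2 over the local shard.
        return sum((self.params * x - y) * x for x, y in self.shard) / len(self.shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for the collective all-reduce real clusters run over the network."""
    return sum(values) / len(values)


def train(workers: List[Worker], steps: int = 50, lr: float = 0.01) -> float:
    for _ in range(steps):
        grads = [w.local_gradient() for w in workers]   # computation phase
        avg_grad = all_reduce_mean(grads)               # communication phase
        for w in workers:                               # synchronized update
            w.params -= lr * avg_grad
    return workers[0].params


if __name__ == "__main__":
    # Shard a tiny dataset for y = 2x across two workers.
    data = [(float(x), 2.0 * x) for x in range(1, 9)]
    workers = [Worker(shard=data[:4]), Worker(shard=data[4:])]
    print(f"learned weight ~ {train(workers):.3f} (target 2.0)")
```

Real systems replace `all_reduce_mean` with network collectives (e.g., ring all-reduce over NCCL), and combine data parallelism with tensor, pipeline, and sequence parallelism, which is the design space the survey maps out.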

Authors (16)
  1. Jiangfei Duan (8 papers)
  2. Shuo Zhang (256 papers)
  3. Zerui Wang (12 papers)
  4. Lijuan Jiang (3 papers)
  5. Wenwen Qu (2 papers)
  6. Qinghao Hu (31 papers)
  7. Guoteng Wang (6 papers)
  8. Qizhen Weng (5 papers)
  9. Hang Yan (86 papers)
  10. Xingcheng Zhang (29 papers)
  11. Xipeng Qiu (257 papers)
  12. Dahua Lin (336 papers)
  13. Yonggang Wen (84 papers)
  14. Xin Jin (285 papers)
  15. Tianwei Zhang (199 papers)
  16. Peng Sun (210 papers)
Citations (2)