MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (2402.15627v1)

Published 23 Feb 2024 in cs.LG and cs.DC

Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training LLMs at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.

Scaling LLM Training with MegaScale: Achievements at 10,000 GPU Scale

Introduction

MegaScale is a production system for training LLMs at a scale of more than 10,000 GPUs, designed to maximize training efficiency and stability. Through a comprehensive design and implementation effort, MegaScale addresses the dual challenges of sustaining high training efficiency and ensuring stability throughout the long training runs typical of LLMs.

Design Principles and System Overview

MegaScale embodies a full-stack approach, optimizing across several axes including model block and optimizer design, computation and communication overlapping, operator optimization, the data pipeline, and network performance tuning. Central to its design philosophy are the principles of algorithm-system co-design and in-depth observability, which enable optimizations spanning the entire system stack and provide both the efficiency and the robustness required for large-scale deployments.

Algorithmic and System-Level Optimizations

The system introduces several key innovations:

  • Parallel Transformer Block and Sliding Window Attention are adopted as efficiency-oriented architectural modifications that preserve model accuracy (a minimal parallel-block sketch follows this list).
  • LAMB Optimizer adjustments allow the batch size to be scaled up significantly, improving throughput and reducing pipeline bubbles, a critical factor in large-scale model training.
  • Mixed Parallelism Strategies are utilized to strike an optimal balance between data parallelism, pipeline parallelism, tensor parallelism, and sequence parallelism, ensuring maximum hardware utilization.
  • Advanced Communication Overlapping Techniques are deployed to minimize the latency introduced by the heavy communication demands inherent in distributed LLM training, significantly improving Model FLOPs Utilization (MFU).
  • Custom Network Topology and Performance Tuning are undertaken to address the unique network performance challenges presented by the scale of the deployment.
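
To make the first item concrete, the sketch below shows the GPT-J/PaLM-style parallel transformer block formulation the paper builds on: the attention and MLP branches read the same pre-normalized input and their outputs are summed with the residual, shortening the critical path relative to the sequential block. The module layout, layer choices, and dimensions are illustrative assumptions, not MegaScale's actual implementation.

```python
# Sketch of a parallel transformer block (as in GPT-J/PaLM): attention and MLP
# read the same pre-normalized input and their outputs are summed with the
# residual. Layer choices and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn

class ParallelTransformerBlock(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # single shared pre-norm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sequential block: x + MLP(LN(x + Attn(LN(x))))
        # Parallel block:   x + Attn(LN(x)) + MLP(LN(x))
        # Both branches depend only on norm(x), so their input projections can
        # be fused and the two branches computed concurrently.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)

x = torch.randn(2, 128, 1024)               # (batch, sequence, hidden)
print(ParallelTransformerBlock()(x).shape)  # torch.Size([2, 128, 1024])
```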

Stability and Fault Tolerance

In terms of stability and fault tolerance, MegaScale provides a robust training framework suited to the demands of LLM training at scale:

  • Automated Diagnostic and Recovery Mechanisms let the system identify, diagnose, and recover from a wide array of faults with minimal human intervention, keeping effective training time high (a minimal heartbeat-style sketch follows this list).
  • In-Depth Observability Tools have been developed to provide granular insights into system performance and behavior, enabling rapid identification and resolution of both anticipated and unforeseen issues.
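
As an illustration of how such monitoring might look in code, the sketch below implements a simple heartbeat monitor in the spirit of the driver-side fault detection the paper describes: each training executor reports heartbeats, and ranks whose heartbeats go stale are flagged for diagnostics and recovery. All names and thresholds (HeartbeatMonitor, timeout_s, and so on) are hypothetical, not MegaScale's actual interface.

```python
# Minimal heartbeat-based fault detection sketch, in the spirit of MegaScale's
# driver-side monitoring: executors report heartbeats, and ranks whose
# heartbeats go stale are suspected faulty and handed off to diagnostics.
# Class and parameter names here are hypothetical, not MegaScale's API.
import time

class HeartbeatMonitor:
    def __init__(self, world_size: int, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self.last_seen = {rank: time.time() for rank in range(world_size)}

    def record(self, rank: int) -> None:
        """Call whenever a heartbeat message arrives from a rank."""
        self.last_seen[rank] = time.time()

    def stale_ranks(self) -> list[int]:
        """Ranks whose last heartbeat is older than the timeout."""
        now = time.time()
        return [r for r, t in self.last_seen.items() if now - t > self.timeout_s]

def supervise(monitor: HeartbeatMonitor, check_interval_s: float = 5.0) -> None:
    # A real driver would run hardware/NCCL self-checks on the suspects,
    # cordon off bad nodes, and resume training from the latest checkpoint.
    while True:
        suspects = monitor.stale_ranks()
        if suspects:
            print(f"suspected faulty ranks: {suspects}; triggering diagnostics")
            break
        time.sleep(check_interval_s)
```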

Performance and Operational Experience

MegaScale's design and optimizations have led to notable practical achievements in the training of LLMs:

  • Efficiency Improvement: In comparative benchmarks, MegaScale achieved 55.2% MFU when training a 175-billion-parameter model across 12,288 GPUs, a 1.34× improvement over the Megatron-LM baseline (a back-of-the-envelope MFU calculation follows this list).
  • Stability in Long-Term Runs: Real-world deployment scenarios demonstrate the system's capability to maintain model convergence and effectively manage faults over extended periods, showcasing the maturity of its fault tolerance mechanisms.
  • Operational Insights: The system's operational deployment yielded valuable insights, particularly concerning the diagnosis and resolution of computational stragglers and network performance issues, underscoring the practical benefits of its diagnostic tools and robust training framework.
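
For context on what the 55.2% figure means, MFU is the ratio of the model FLOPs the cluster actually delivers per second to its aggregate peak FLOPS. The sketch below uses the common 6N FLOPs-per-token approximation for a dense transformer (attention FLOPs omitted); the token throughput and per-GPU peak are illustrative assumptions chosen only to land near the reported figure, not measurements from the paper.

```python
# Back-of-the-envelope MFU estimate using the common 6 * N FLOPs-per-token
# approximation for a dense transformer (attention terms omitted for brevity).
# The throughput and per-GPU peak below are illustrative assumptions, not
# MegaScale's measured numbers.
def model_flops_utilization(params: float, tokens_per_sec: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    model_flops_per_token = 6.0 * params            # forward + backward pass
    achieved_flops = model_flops_per_token * tokens_per_sec
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

mfu = model_flops_utilization(
    params=175e9,               # 175B-parameter model
    tokens_per_sec=2.0e6,       # hypothetical cluster-wide token throughput
    num_gpus=12288,
    peak_flops_per_gpu=312e12,  # e.g., A100 BF16 dense peak
)
print(f"MFU ≈ {mfu:.1%}")       # ≈ 54.8% with these assumed numbers
```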

Implications and Future Directions

The achievements of MegaScale in LLM training represent a significant step forward in the field of AI systems research, providing a scalable, efficient, and robust framework for the development of next-generation AI models. The experiences and insights derived from the MegaScale project also highlight areas for future research, particularly in the realms of fault diagnosis and recovery in vast distributed systems, further optimizations in communication strategies, and the continuous need for innovations in model and optimizer design.

With the ongoing rapid evolution of LLMs and their applications, MegaScale not only sets new benchmarks for large-scale model training but also opens up pathways for future advancements in AI systems design and implementation.

Authors (32)
  1. Ziheng Jiang
  2. Haibin Lin
  3. Yinmin Zhong
  4. Qi Huang
  5. Yangrui Chen
  6. Zhi Zhang
  7. Yanghua Peng
  8. Xiang Li
  9. Cong Xie
  10. Shibiao Nong
  11. Yulu Jia
  12. Sun He
  13. Hongmin Chen
  14. Zhihao Bai
  15. Qi Hou
  16. Shipeng Yan
  17. Ding Zhou
  18. Yiyao Sheng
  19. Zhuo Jiang
  20. Haohan Xu
Citations (51)