MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (2402.15627v1)

Published 23 Feb 2024 in cs.LG and cs.DC

Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training LLMs at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.

Scaling LLM Training with MegaScale: Achievements at 10,000 GPU Scale

Introduction

MegaScale is a production system for training LLMs at a scale of more than 10,000 GPUs, designed to maximize training efficiency and stability. Through a comprehensive design and implementation effort, MegaScale addresses the dual challenges of sustaining high training efficiency and ensuring stability throughout the long training runs typical of LLMs.

Design Principles and System Overview

MegaScale embodies a full-stack approach, optimizing across several axes including model block and optimizer design, computation and communication overlapping, operator optimization, the data pipeline, and network performance tuning. Central to its design philosophy are the principles of algorithm-system co-design and in-depth observability, which enable optimizations spanning the entire system stack and provide both the efficiency and the robustness required for large-scale deployments.

Algorithmic and System-Level Optimizations

The system introduces several key innovations:

  • Parallel Transformer Block and Sliding Window Attention are adopted as efficiency-oriented architectural modifications that preserve model accuracy (a minimal parallel-block sketch follows this list).
  • LAMB Optimizer adjustments allow the batch size to be scaled up significantly, improving throughput and reducing pipeline bubbles, a critical factor in large-scale model training.
  • Mixed Parallelism Strategies are utilized to strike an optimal balance between data parallelism, pipeline parallelism, tensor parallelism, and sequence parallelism, ensuring maximum hardware utilization.
  • Advanced Communication Overlapping Techniques are deployed to minimize the latency introduced by the heavy communication demands inherent in distributed LLM training, significantly improving Model FLOPs Utilization (MFU).
  • Custom Network Topology and Performance Tuning are undertaken to address the unique network performance challenges presented by the scale of the deployment.
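
To make the first item concrete, the sketch below shows the GPT-J/PaLM-style parallel transformer block formulation the paper builds on: the attention and MLP branches read the same pre-normalized input and their outputs are summed with the residual, shortening the critical path relative to the sequential block. The module layout, layer choices, and dimensions are illustrative assumptions, not MegaScale's actual implementation.

```python
# Sketch of a parallel transformer block (as in GPT-J/PaLM): attention and MLP
# read the same pre-normalized input and their outputs are summed with the
# residual. Layer choices and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn

class ParallelTransformerBlock(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # single shared pre-norm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sequential block: x + MLP(LN(x + Attn(LN(x))))
        # Parallel block:   x + Attn(LN(x)) + MLP(LN(x))
        # Both branches depend only on norm(x), so their input projections can
        # be fused and the two branches computed concurrently.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)

x = torch.randn(2, 128, 1024)               # (batch, sequence, hidden)
print(ParallelTransformerBlock()(x).shape)  # torch.Size([2, 128, 1024])
```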

Stability and Fault Tolerance

In terms of stability and fault tolerance, MegaScale provides a robust training framework suited to the demands of LLM training at scale:

  • Automated Diagnostic and Recovery Mechanisms let the system identify, diagnose, and recover from a wide array of faults with minimal human intervention, keeping effective training time high (a minimal heartbeat-style sketch follows this list).
  • In-Depth Observability Tools have been developed to provide granular insights into system performance and behavior, enabling rapid identification and resolution of both anticipated and unforeseen issues.
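
As an illustration of how such monitoring might look in code, the sketch below implements a simple heartbeat monitor in the spirit of the driver-side fault detection the paper describes: each training executor reports heartbeats, and ranks whose heartbeats go stale are flagged for diagnostics and recovery. All names and thresholds (HeartbeatMonitor, timeout_s, and so on) are hypothetical, not MegaScale's actual interface.

```python
# Minimal heartbeat-based fault detection sketch, in the spirit of MegaScale's
# driver-side monitoring: executors report heartbeats, and ranks whose
# heartbeats go stale are suspected faulty and handed off to diagnostics.
# Class and parameter names here are hypothetical, not MegaScale's API.
import time

class HeartbeatMonitor:
    def __init__(self, world_size: int, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self.last_seen = {rank: time.time() for rank in range(world_size)}

    def record(self, rank: int) -> None:
        """Call whenever a heartbeat message arrives from a rank."""
        self.last_seen[rank] = time.time()

    def stale_ranks(self) -> list[int]:
        """Ranks whose last heartbeat is older than the timeout."""
        now = time.time()
        return [r for r, t in self.last_seen.items() if now - t > self.timeout_s]

def supervise(monitor: HeartbeatMonitor, check_interval_s: float = 5.0) -> None:
    # A real driver would run hardware/NCCL self-checks on the suspects,
    # cordon off bad nodes, and resume training from the latest checkpoint.
    while True:
        suspects = monitor.stale_ranks()
        if suspects:
            print(f"suspected faulty ranks: {suspects}; triggering diagnostics")
            break
        time.sleep(check_interval_s)
```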

Performance and Operational Experience

MegaScale's design and optimizations have led to notable practical achievements in the training of LLMs:

  • Efficiency Improvement: In comparative benchmarks, MegaScale achieved 55.2% MFU when training a 175-billion-parameter model across 12,288 GPUs, a 1.34× improvement over the Megatron-LM baseline (a back-of-the-envelope MFU calculation follows this list).
  • Stability in Long-Term Runs: Real-world deployment scenarios demonstrate the system's capability to maintain model convergence and effectively manage faults over extended periods, showcasing the maturity of its fault tolerance mechanisms.
  • Operational Insights: The system's operational deployment yielded valuable insights, particularly concerning the diagnosis and resolution of computational stragglers and network performance issues, underscoring the practical benefits of its diagnostic tools and robust training framework.
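
For context on what the 55.2% figure means, MFU is the ratio of the model FLOPs the cluster actually delivers per second to its aggregate peak FLOPS. The sketch below uses the common 6N FLOPs-per-token approximation for a dense transformer (attention FLOPs omitted); the token throughput and per-GPU peak are illustrative assumptions chosen only to land near the reported figure, not measurements from the paper.

```python
# Back-of-the-envelope MFU estimate using the common 6 * N FLOPs-per-token
# approximation for a dense transformer (attention terms omitted for brevity).
# The throughput and per-GPU peak below are illustrative assumptions, not
# MegaScale's measured numbers.
def model_flops_utilization(params: float, tokens_per_sec: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    model_flops_per_token = 6.0 * params            # forward + backward pass
    achieved_flops = model_flops_per_token * tokens_per_sec
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

mfu = model_flops_utilization(
    params=175e9,               # 175B-parameter model
    tokens_per_sec=2.0e6,       # hypothetical cluster-wide token throughput
    num_gpus=12288,
    peak_flops_per_gpu=312e12,  # e.g., A100 BF16 dense peak
)
print(f"MFU ≈ {mfu:.1%}")       # ≈ 54.8% with these assumed numbers
```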

Implications and Future Directions

The achievements of MegaScale in LLM training represent a significant step forward in the field of AI systems research, providing a scalable, efficient, and robust framework for the development of next-generation AI models. The experiences and insights derived from the MegaScale project also highlight areas for future research, particularly in the realms of fault diagnosis and recovery in vast distributed systems, further optimizations in communication strategies, and the continuous need for innovations in model and optimizer design.

With the ongoing rapid evolution of LLMs and their applications, MegaScale not only sets new benchmarks for large-scale model training but also opens up pathways for future advancements in AI systems design and implementation.

Authors (32)
  1. Ziheng Jiang
  2. Haibin Lin
  3. Yinmin Zhong
  4. Qi Huang
  5. Yangrui Chen
  6. Zhi Zhang
  7. Yanghua Peng
  8. Xiang Li
  9. Cong Xie
  10. Shibiao Nong
  11. Yulu Jia
  12. Sun He
  13. Hongmin Chen
  14. Zhihao Bai
  15. Qi Hou
  16. Shipeng Yan
  17. Ding Zhou
  18. Yiyao Sheng
  19. Zhuo Jiang
  20. Haohan Xu
Citations (51)