Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
This paper explores how to optimize the training of large language models (LLMs) on GPU clusters using a combination of parallelism techniques, addressing the computational and memory challenges of training models with up to a trillion parameters.
Context and Challenges
Training large transformer models is computationally demanding: the largest models no longer fit in the memory of a single GPU, and even when they do, the sheer number of compute operations leads to impractically long training times. Naive parallelism across multiple GPUs does not scale well to these model sizes, so more sophisticated strategies are needed.
Parallelism Techniques
The paper proposes a composite parallelism strategy named PTD-P, which integrates tensor, pipeline, and data parallelism. The primary innovations and contributions are:
- Tensor, Pipeline, and Data Parallelism: Combining the three allows training to scale efficiently to thousands of GPUs. Tensor parallelism is kept within a multi-GPU server, where intra-node bandwidth is high, while pipeline parallelism spans servers and data parallelism scales out further, keeping the most communication-intensive operations on fast intra-node links (see the sketch after this list).
- Interleaved Pipelining Schedule: A new schedule assigns each device several non-contiguous model chunks rather than one contiguous stage, shrinking the pipeline flush ("bubble") and boosting throughput by more than 10% over existing schedules at a comparable memory footprint (see the bubble-fraction formula after this list).
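As a concrete illustration of the intra-node tensor parallelism described above, the following is a minimal PyTorch sketch of a Megatron-style tensor-parallel MLP block. It assumes a `torch.distributed` process group is already initialized with one rank per tensor-parallel GPU; the class name and weight shapes are illustrative rather than Megatron-LM's actual API, and backward-pass communication is omitted.

```python
# Sketch of Megatron-style tensor parallelism for a transformer MLP block.
# Assumes torch.distributed is initialized and each rank holds one weight shard.
import torch
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelMLP(torch.nn.Module):
    def __init__(self, hidden: int, ffn_hidden: int, tp_world_size: int):
        super().__init__()
        assert ffn_hidden % tp_world_size == 0
        shard = ffn_hidden // tp_world_size
        # First GEMM: weight split by columns, so each rank produces its own
        # slice of the FFN activations and can apply GeLU with no communication.
        self.w_in = torch.nn.Parameter(0.02 * torch.randn(hidden, shard))
        # Second GEMM: weight split by rows, so each rank produces a partial sum
        # of the output that is combined with a single all-reduce.
        self.w_out = torch.nn.Parameter(0.02 * torch.randn(shard, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.gelu(x @ self.w_in)                  # local slice, no communication
        z = y @ self.w_out                         # partial result on each rank
        dist.all_reduce(z, op=dist.ReduceOp.SUM)   # one all-reduce per MLP block
        return z                                   # (backward comm. omitted here)
```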
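The benefit of the interleaved schedule can be summarized with the paper's pipeline-bubble analysis, where \(p\) is the pipeline-parallel size, \(m\) the number of microbatches per batch, and \(v\) the number of model chunks assigned to each device:

```latex
\[
\text{bubble fraction (default schedule)} = \frac{p-1}{m},
\qquad
\text{bubble fraction (interleaved)} = \frac{1}{v}\cdot\frac{p-1}{m}.
\]
```

Increasing \(v\) shrinks the bubble by a factor of \(v\), at the cost of proportionally more inter-stage communication.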
Results
The approach scales well in practice, sustaining an aggregate throughput of 502 petaFLOP/s on 3072 A100 GPUs spread across 384 multi-GPU servers, about 52% of per-GPU theoretical peak, with close-to-linear scaling as GPUs are added. At this rate, end-to-end training of a trillion-parameter GPT-style model is estimated to take roughly three months, making models at and beyond GPT-3 scale practical to train.
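As a sanity check on the headline number, assuming the A100's 312 teraFLOP/s half-precision peak (the hardware used in the paper's cluster):

```latex
\[
\frac{502~\text{petaFLOP/s}}{3072~\text{GPUs}} \approx 163~\text{teraFLOP/s per GPU},
\qquad
\frac{163}{312} \approx 52\%.
\]
```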
Analytical Insights
The paper explores the interactions between different parallelism dimensions:
- Optimal Configuration: Empirical and analytical results both suggest using tensor parallelism up to the number of GPUs within a node (typically 8) and pipeline parallelism across nodes for any remaining model-parallel dimension, since this keeps the most frequent, bandwidth-hungry communication on fast intra-node links (a heuristic sketch follows this list).
- Microbatch Size and Hyperparameters: The microbatch size strongly affects throughput: larger microbatches raise per-GPU arithmetic intensity, while more (smaller) microbatches per batch shrink the relative size of the pipeline bubble, so the best value balances the two.
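The configuration guidance above can be condensed into a rough heuristic. The sketch below is illustrative only: the function name and the way the model's memory requirement is expressed are assumptions, not part of Megatron-LM.

```python
# Rough heuristic for choosing parallelism degrees: keep tensor parallelism
# inside a node (fast links), use pipeline parallelism across nodes until the
# model fits in memory, and spend the remaining GPUs on data parallelism.
def choose_parallelism(total_gpus: int, gpus_per_node: int,
                       min_model_parallel: int) -> tuple[int, int, int]:
    """Return (tensor, pipeline, data) parallel degrees.

    min_model_parallel: smallest tensor * pipeline product that fits the model
    in GPU memory (determined empirically for a given model size).
    """
    tensor = min(gpus_per_node, min_model_parallel)
    # Round the pipeline degree up so tensor * pipeline >= min_model_parallel.
    pipeline = -(-min_model_parallel // tensor)
    data = total_gpus // (tensor * pipeline)
    return tensor, pipeline, data

# Example: 3072 GPUs, 8 GPUs per node, a model needing 64-way model parallelism
# -> 8-way tensor, 8-way pipeline, 48-way data parallelism.
print(choose_parallelism(3072, 8, 64))  # (8, 8, 48)
```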
Practical Implications and Future Directions
The methods described pave the way for more efficient scaling of LLMs, which matters for both research and production use. The open-source Megatron-LM library provides a platform for further experimentation with these strategies. Future work could automate the search for parallelism configurations and extend the approach to heterogeneous hardware.
The work underscores that different forms of parallelism must be integrated and balanced carefully to make full use of compute resources while avoiding communication and memory bottlenecks. This provides a practical foundation for training increasingly large models and for applying LLMs across NLP and other domains.