An Overview of Megatron-Turing NLG 530B: Training a Large-Scale Generative Language Model
The rapidly growing field of large language models (LLMs) has introduced a need for robust hardware, sophisticated software, and innovative algorithmic methods. A recent contribution to this evolving landscape is Megatron-Turing NLG 530B (MT-NLG), a monolithic transformer language model with 530 billion parameters. The push toward models of this scale has been driven by the pursuit of state-of-the-art accuracy across a variety of NLP tasks. The paper detailing MT-NLG offers insight into the infrastructure, methodology, and performance evaluations involved in training a model of this magnitude.
Infrastructure and Methodological Insights
The joint effort between Microsoft and NVIDIA has produced significant advances in training methodology, particularly the combination of DeepSpeed and Megatron-LM to enable 3D parallelism. Training MT-NLG required a careful balance of data, tensor (model), and pipeline parallelism to optimize memory and compute efficiency across thousands of NVIDIA A100 GPUs. The paper discusses topology-aware mapping strategies that minimize communication overhead, keeping the most communication-intensive parallel groups on high-bandwidth intra-node links and limiting the data-parallel gradient traffic that must cross slower inter-node links, underscoring the importance of bandwidth optimization in a high-performance computing environment.
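To make the 3D layout concrete, here is a minimal Python sketch of how a cluster of GPU ranks can be partitioned into tensor-, pipeline-, and data-parallel groups, with tensor-parallel groups kept on consecutive ranks so they stay within a node. The grouping order and the 8-GPUs-per-node assumption are illustrative; the actual DeepSpeed/Megatron-LM group construction differs in its details.

```python
# Sketch: carving a GPU cluster into tensor-, pipeline-, and data-parallel groups.
# Assumes tensor_parallel <= GPUs per node, so consecutive tensor-parallel ranks
# land on the same node and keep the heaviest traffic on fast intra-node links.
# This illustrates the idea, not the exact code used for MT-NLG.

def build_3d_groups(world_size, tensor_parallel, pipeline_parallel):
    assert world_size % (tensor_parallel * pipeline_parallel) == 0
    data_parallel = world_size // (tensor_parallel * pipeline_parallel)

    tensor_groups, pipeline_groups, data_groups = [], [], []

    # Tensor-parallel groups: consecutive ranks, i.e. GPUs within one node.
    for start in range(0, world_size, tensor_parallel):
        tensor_groups.append(list(range(start, start + tensor_parallel)))

    # Data-parallel groups: ranks holding the same model shard in different replicas.
    for pp in range(pipeline_parallel):
        for tp in range(tensor_parallel):
            data_groups.append(
                [pp * tensor_parallel * data_parallel + dp * tensor_parallel + tp
                 for dp in range(data_parallel)])

    # Pipeline-parallel groups: ranks forming one end-to-end copy of the layer stack.
    for dp in range(data_parallel):
        for tp in range(tensor_parallel):
            pipeline_groups.append(
                [pp * tensor_parallel * data_parallel + dp * tensor_parallel + tp
                 for pp in range(pipeline_parallel)])

    return tensor_groups, pipeline_groups, data_groups


if __name__ == "__main__":
    # Example: 64 GPUs, 8-way tensor parallelism (one node each), 4-way pipeline,
    # which leaves 2-way data parallelism.
    tp_groups, pp_groups, dp_groups = build_3d_groups(64, 8, 4)
    print(len(tp_groups), len(pp_groups), len(dp_groups))  # 8, 16, 32
```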
Training Dataset and Process
The selection of training data is pivotal to model performance, and the authors leveraged a comprehensive mix of curated datasets, including segments from Common Crawl, to ensure diversity and quality. Detailing their preprocessing strategies, the paper highlights crucial steps such as text extraction, filtering of low-quality data, fuzzy deduplication, and careful dataset blending to maintain representativeness across billions of tokens. The use of a transformer decoder architecture informed by prior advancements like GPT-2 further solidifies the model's design for scalable learning.
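As a toy-scale illustration of fuzzy deduplication, the sketch below flags documents whose word 5-gram Jaccard similarity exceeds a threshold. A production pipeline operating over billions of documents would typically use MinHash-based locality-sensitive hashing rather than the quadratic comparison shown here; the threshold and example documents are illustrative assumptions.

```python
# Toy fuzzy deduplication: keep the first document from each near-duplicate
# cluster, where "near-duplicate" means word 5-gram Jaccard similarity >= threshold.

def ngrams(text, n=5):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def fuzzy_dedup(documents, threshold=0.8):
    """O(n^2) toy version: compare each document against everything already kept."""
    kept, kept_ngrams = [], []
    for doc in documents:
        grams = ngrams(doc)
        if all(jaccard(grams, seen) < threshold for seen in kept_ngrams):
            kept.append(doc)
            kept_ngrams.append(grams)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "an entirely different document about large scale language model training",
]
print(len(fuzzy_dedup(docs)))  # 2: the second document is a near-duplicate of the first
```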
Numerical Results and Evaluations
MT-NLG demonstrates significant improvement over previous models, achieving superior performance in zero-shot, one-shot, and few-shot settings across multiple NLP benchmarks. Its zero-shot results challenge existing state-of-the-art models, with strong performance on tasks such as LAMBADA and BoolQ. Notably, while a gap with supervised, task-specific models persists, MT-NLG makes promising progress in narrowing this disparity without any task-specific fine-tuning.
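The zero-, one-, and few-shot settings differ only in how the prompt is assembled before the model generates or scores a completion. The sketch below shows that assembly for a LAMBADA-style last-word prediction task; the passages and formatting are illustrative assumptions, not the exact templates used in the paper's evaluation.

```python
# Minimal sketch of zero-/few-shot prompt construction for a cloze-style task
# such as LAMBADA, where the model must predict the final word of a passage.

def build_prompt(test_passage, demonstrations=(), separator="\n\n"):
    """k-shot prompt: k completed demonstrations followed by the incomplete test passage."""
    parts = [f"{ctx} {target}" for ctx, target in demonstrations]
    parts.append(test_passage)  # the model should continue with the missing last word
    return separator.join(parts)

demos = [
    ("She opened the door, looked around the empty room, and sighed. There was nobody", "there."),
    ("He trained for months, and when the race finally came he was more than", "ready."),
]
test = "The orchestra fell silent as the conductor slowly raised his"

print(build_prompt(test))          # zero-shot: the bare passage
print(build_prompt(test, demos))   # two-shot: demonstrations prepended as in-context examples
```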
Addressing and Evaluating Social Biases
Acknowledging the challenge of bias in LLMs, the authors conduct a preliminary examination of biases related to gender, ethnicity, and religion within MT-NLG. They employ association tests, adjective co-occurrence analyses, and sentiment evaluations to probe for these biases. The analyses underscore the need for bias-mitigation strategies in future work, emphasizing that current large-scale models inherently reflect biases present in their training data.
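The adjective co-occurrence analysis can be illustrated with a short sketch: continuations are sampled for prompt templates that differ only in a demographic term, and the descriptive words appearing alongside each group are tallied and compared. The templates, the canned generate() stand-in, and the token-level counting are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch of an adjective co-occurrence analysis in the spirit of the paper's bias
# evaluation: compare which descriptive words co-occur with each demographic group.

from collections import Counter

TEMPLATES = ["The {group} was described as", "People said the {group} was very"]

def generate(prompt, n_samples):
    # Stand-in for sampling n_samples continuations from the language model;
    # canned text is returned here so the sketch runs end to end.
    return [" confident and hardworking.", " quiet but friendly."][:n_samples]

def descriptor_counts(group, n_samples=2):
    counts = Counter()
    for template in TEMPLATES:
        for continuation in generate(template.format(group=group), n_samples):
            # A real analysis would POS-tag and keep only adjectives; this sketch
            # simply counts lower-cased tokens in each continuation.
            counts.update(w.strip(".,").lower() for w in continuation.split())
    return counts

# Skewed associations show up as descriptors that are frequent for one group
# but rare for another.
for group in ["woman", "man"]:
    print(group, descriptor_counts(group).most_common(5))
```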
In-Context Learning and Model Understanding
A critical component of the paper is the examination of MT-NLG's language understanding via the HANS dataset. The results illustrate the model's growing proficiency with grammatical and syntactic structure as model scale and the number of training tokens increase. However, the analysis also reveals residual biases and a reliance on shallow heuristics such as lexical overlap, prompting further investigation into optimizing few-shot learning strategies and into how the distribution of in-context examples affects performance.
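To see what such a heuristic looks like, the sketch below implements HANS's lexical-overlap criterion: a model leaning on this shortcut predicts entailment whenever every hypothesis word also appears in the premise, and HANS deliberately pairs such examples with both entailing and non-entailing labels to expose it. The sentence pairs are illustrative.

```python
# HANS "lexical overlap" heuristic: predict entailment if all hypothesis words
# appear in the premise, regardless of syntax.

def lexical_overlap(premise, hypothesis):
    """True if every hypothesis word also occurs in the premise (case/punctuation ignored)."""
    def words(s):
        return {w.strip(".,").lower() for w in s.split()}
    return words(hypothesis) <= words(premise)

# The heuristic fires on both pairs, but only the first is actually entailed.
print(lexical_overlap("The doctor near the lawyer saw the actor.",
                      "The doctor saw the actor."))       # True, and entailment holds
print(lexical_overlap("The lawyer was advised by the doctor.",
                      "The lawyer advised the doctor."))  # True, but entailment does NOT hold
```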
Implications and Future Directions
The development of MT-NLG is a substantial step forward in large-scale language modeling, showcasing both the potential and the challenges of training models at this scale. The detailed breakdown of infrastructure, training process, and evaluations provides a basis for further work on efficient model training, bias mitigation, and improved few-shot learning. The insights from this work chart a path for future research aimed at refining large-scale pretraining frameworks and exploring their applications and limitations.