Efficient Training of Multi-Billion Parameter LLMs with Model Parallelism
Introduction
Recent advances in NLP have been propelled largely by growth in model size, computational power, and data availability. The paper "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" explores the frontier of training extremely large transformer models, presenting a simple, efficient intra-layer model-parallel approach that enables transformers with up to 8.3 billion parameters to be trained. The approach requires no new compiler or library changes and can be implemented by inserting a few communication operations into native PyTorch, making efficient use of existing hardware and software infrastructure.
Methodology
The paper addresses the memory limits encountered when training very large models by partitioning model parameters across multiple GPUs, making it possible to train models far larger than a single GPU's memory can hold. The proposed intra-layer model parallelism splits the matrix multiplications inside each transformer block across GPUs, enabling the training of GPT-2 and BERT style models with billions of parameters.
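To make the core idea concrete, here is a minimal numerical sketch (not the paper's code; shapes and names are arbitrary) of the two ways a single GEMM can be split: by the weight matrix's columns, where each worker produces a slice of the output, or by its rows, where each worker produces a partial sum.

```python
# Toy illustration of the two GEMM partitioning schemes the approach builds on:
# splitting the weight matrix by columns (outputs are concatenated) or by rows
# (partial products are summed). Shapes are arbitrary toy values.
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)          # activations: (batch, hidden)
A = torch.randn(8, 6)          # weight matrix

# Column split: each "GPU" holds half of A's columns and produces half of Y.
A1, A2 = A[:, :3], A[:, 3:]
Y_col = torch.cat([X @ A1, X @ A2], dim=1)

# Row split: each "GPU" holds half of A's rows and the matching slice of X,
# and the partial results are summed (an all-reduce in the distributed case).
X1, X2 = X[:, :4], X[:, 4:]
B1, B2 = A[:4, :], A[4:, :]
Y_row = X1 @ B1 + X2 @ B2

assert torch.allclose(Y_col, X @ A, atol=1e-6)
assert torch.allclose(Y_row, X @ A, atol=1e-6)
```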
A central element of the approach is how both the feed-forward (MLP) block and the self-attention mechanism are partitioned: the first MLP weight matrix is split along its columns and the second along its rows, while self-attention is split across attention heads. This arrangement keeps communication to just two all-reduce operations in the forward pass and two in the backward pass of each transformer layer, which is critical for training efficiency (see the sketch below). The authors also manage random-number-generator state so that dropout outside the model-parallel regions uses the same seed on every worker, while dropout inside those regions is seeded independently per worker, preserving consistent regularization across the model-parallel group.
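A rough sketch of the resulting model-parallel MLP follows. It assumes torch.distributed has already been initialized with one process per GPU; the class and helper names are ours, not the paper's, and for brevity the conjugate operator at the block input (identity in the forward pass, all-reduce of the input gradient in the backward pass) is omitted. Splitting the first GEMM along its output dimension and the second along its input dimension is what lets the GeLU nonlinearity run on each GPU without any synchronization in between.

```python
# Sketch of a model-parallel transformer MLP: the first weight is split by
# columns, the second by rows, so GeLU runs independently on each GPU and a
# single all-reduce combines the partial outputs. Assumes torch.distributed
# is initialized with one process per GPU; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist


class _AllReduceSum(torch.autograd.Function):
    """Sum partial outputs in the forward pass; pass gradients through unchanged."""

    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ParallelMLP(nn.Module):
    def __init__(self, hidden: int, ffn_hidden: int):
        super().__init__()
        world = dist.get_world_size()
        assert ffn_hidden % world == 0
        shard = ffn_hidden // world
        # Each rank stores a column shard of the first weight and a row shard
        # of the second (F.linear weights are laid out as (out, in)).
        self.w1 = nn.Parameter(torch.randn(shard, hidden) * 0.02)
        self.w2 = nn.Parameter(torch.randn(hidden, shard) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.gelu(F.linear(x, self.w1))   # no communication needed here
        y_partial = F.linear(h, self.w2)   # each rank's partial output
        return _AllReduceSum.apply(y_partial)
```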
Results
Empirically, the paper demonstrates impressive scaling: an 8.3-billion-parameter transformer sustains 15.1 PetaFLOPs across 512 GPUs, a 76% scaling efficiency relative to a strong single-GPU baseline that sustains 39 TeraFLOPs. Importantly, the resulting models achieve state-of-the-art results on several NLP benchmarks, including WikiText103, LAMBADA, and RACE, indicating that scaling up does not compromise model quality.
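As a quick sanity check on those numbers, the per-GPU throughput and the resulting scaling efficiency can be reproduced with a few lines of arithmetic, taking the paper's reported 39 TeraFLOPs single-GPU baseline as given:

```python
# Back-of-the-envelope reproduction of the reported 76% scaling efficiency.
sustained_pflops = 15.1      # aggregate throughput reported for 512 GPUs
num_gpus = 512
baseline_tflops = 39.0       # the paper's strong single-GPU baseline

per_gpu_tflops = sustained_pflops * 1000 / num_gpus    # ~29.5 TeraFLOPs
efficiency = per_gpu_tflops / baseline_tflops          # ~0.76
print(f"{per_gpu_tflops:.1f} TFLOPs/GPU -> {efficiency:.0%} scaling efficiency")
```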
The paper also examines different configurations, showing how the number of attention heads affects scaling efficiency: splitting more heads across workers shrinks the per-GPU matrix multiplications and slightly reduces efficiency. It further contrasts model parallelism with scaling through data parallelism alone, emphasizing that model parallelism enables larger models rather than merely larger batches, which run into diminishing returns.
Practical Implications and Theoretical Contributions
This research contributes both practically and theoretically to deep learning and NLP. Practically, it gives researchers and practitioners a workable way to train models of unprecedented size on available hardware, potentially unlocking further advances in machine understanding and generation of human language. Theoretically, it probes the boundaries of model scaling, offering insight into the trade-offs between model size, training efficiency, and performance on downstream tasks.
Speculations on Future Developments
The paper concludes by proposing directions for future research, including further scaling of model sizes, efficiency improvements in optimizers, and the exploration of hybrid model parallelism strategies. It also suggests investigating other model architectures and the potential of knowledge distillation from large models to smaller, more computationally efficient ones.
This important work opens many avenues for future exploration, potentially leading to more sophisticated, nuanced natural language processing capabilities. As models continue to grow, the methodologies developed in this research will likely serve as a foundational component in the training of next-generation NLP systems.