Large Scale Language Modeling: Converging on 40GB of Text in Four Hours (1808.01371v2)

Published 3 Aug 2018 in cs.LG, cs.CL, and stat.ML

Abstract: Recent work has shown how to train Convolutional Neural Networks (CNNs) rapidly on large image datasets, then transfer the knowledge gained from these models to a variety of tasks. Following [Radford 2017], in this work, we demonstrate similar scalability and transfer for Recurrent Neural Networks (RNNs) for Natural Language tasks. By utilizing mixed precision arithmetic and a 32k batch size distributed across 128 NVIDIA Tesla V100 GPUs, we are able to train a character-level 4096-dimension multiplicative LSTM (mLSTM) for unsupervised text reconstruction over 3 epochs of the 40 GB Amazon Reviews dataset in four hours. This runtime compares favorably with previous work taking one month to train the same size and configuration for one epoch over the same dataset. Converging large batch RNN models can be challenging. Recent work has suggested scaling the learning rate as a function of batch size, but we find that simply scaling the learning rate as a function of batch size leads either to significantly worse convergence or immediate divergence for this problem. We provide a learning rate schedule that allows our model to converge with a 32k batch size. Since our model converges over the Amazon Reviews dataset in hours, and our compute requirement of 128 Tesla V100 GPUs, while substantial, is commercially available, this work opens up large scale unsupervised NLP training to most commercial applications and deep learning researchers. A model can be trained over most public or private text datasets overnight.

Authors (4)
  1. Raul Puri (12 papers)
  2. Robert Kirby (14 papers)
  3. Nikolai Yakovenko (4 papers)
  4. Bryan Catanzaro (123 papers)
Citations (29)

Summary

A Comprehensive Analysis of Large Scale Language Modeling

The paper "Large Scale Language Modeling: Converging on 40GB of Text in Four Hours" presents significant advancements in training recurrent neural networks (RNNs) for NLP tasks on large-scale datasets. Authored by a team from NVIDIA, the paper leverages state-of-the-art hardware, namely NVIDIA Tesla V100 GPUs, together with advanced optimization techniques to address the scalability issues associated with RNN language models.

This research explores techniques that accelerate the training of language models on an extensive corpus, the 40 GB Amazon Reviews dataset. The authors employ a 4096-dimension multiplicative LSTM (mLSTM) for character-level unsupervised text reconstruction. Notably, the model converges in just four hours, thanks to mixed precision arithmetic, distributed data parallelism, and a carefully designed learning rate schedule.
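
The mLSTM itself follows the multiplicative LSTM formulation of Krause et al., in which the recurrent input to the LSTM gates is replaced by an intermediate state that mixes the current input with the previous hidden state. The sketch below is a minimal PyTorch rendering of such a cell, written purely for illustration; the class and parameter names are mine, not the paper's code, and the 4096-dimension model in the paper additionally relies on the mixed precision and data-parallel machinery discussed next.

```python
import torch
import torch.nn as nn

class MultiplicativeLSTMCell(nn.Module):
    """Minimal mLSTM cell in the style of Krause et al. (2016): a standard
    LSTM whose recurrent input is replaced by an input-modulated state m_t.
    Illustrative sketch only, not the paper's implementation."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # Multiplicative branch: m_t = (W_mx x_t) * (W_mh h_{t-1})
        self.wmx = nn.Linear(input_size, hidden_size, bias=False)
        self.wmh = nn.Linear(hidden_size, hidden_size, bias=False)
        # Gates (i, f, o) and candidate u, all computed from x_t and m_t
        self.wx = nn.Linear(input_size, 4 * hidden_size)
        self.wm = nn.Linear(hidden_size, 4 * hidden_size, bias=False)

    def forward(self, x, state):
        h_prev, c_prev = state
        m = self.wmx(x) * self.wmh(h_prev)                    # multiplicative state
        i, f, o, u = (self.wx(x) + self.wm(m)).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c_prev + i * torch.tanh(u)                    # cell update
        h = o * torch.tanh(c)                                 # new hidden state
        return h, (h, c)
```

For the character-level setup described in the abstract, input_size would be the character embedding dimension and hidden_size would be 4096.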

Key Contributions and Methodologies

  1. Mixed Precision Training: The researchers use mixed precision arithmetic, combining FP16 and FP32 computation, to improve throughput on Tesla V100 GPUs. This significantly speeds up training, delivering a 4.2x performance gain over single precision training (see the training-step sketch after this list).
  2. Distributed Data Parallelism: By employing 128 GPUs, the paper demonstrates near-linear scaling for RNN training, achieving a 109x increase in data throughput. This reflects the efficiency of NCCL2 for inter-GPU communication over NVLink and InfiniBand interconnects, also illustrated in the sketch after this list.
  3. Learning Rate Scheduling: The paper emphasizes the difficulty of training RNNs with large batch sizes, highlighting the need for specialized learning rate schedules to avoid divergence or poor generalization. The authors introduce a tailored schedule that keeps optimization stable at the 32k batch size; a schematic example follows the list.
  4. Large Batch Training Challenges: The paper confronts the difficulties of large batch training, noting that more data or additional training epochs are needed to match the generalization of smaller batch sizes. This is partially addressed through larger datasets and alternative unsupervised tasks, though the paper notes that public datasets make it hard to satisfy the B ≪ N condition (batch size much smaller than the number of training samples) needed for efficient scaling.
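
Taking items 1 and 2 together, the training step described in the paper boils down to running the forward and backward passes in FP16 while applying weight updates in FP32, scaling the loss so small gradients survive the reduced precision, and all-reducing gradients across GPUs via NCCL. The sketch below expresses that recipe with PyTorch's current amp and DistributedDataParallel utilities; it is a schematic stand-in rather than the authors' 2018 implementation (which kept explicit FP32 master weights), and the optimizer choice and the model-returns-loss interface are assumptions.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_epoch(model, loader, base_lr):
    """Illustrative mixed precision + data-parallel training loop. Assumes the
    NCCL process group has already been initialized (e.g. via torchrun) and
    that model(inputs, targets) returns the reconstruction loss -- both are
    assumptions for this sketch, not the paper's API."""
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    ddp_model = DDP(model.to(device), device_ids=[device.index])  # gradients all-reduced over NCCL
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=base_lr)
    scaler = torch.cuda.amp.GradScaler()                          # loss scaling keeps FP16 gradients representable

    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():                           # FP16 compute on V100 tensor cores
            loss = ddp_model(inputs, targets)
        scaler.scale(loss).backward()                             # backward pass on the scaled loss
        scaler.step(optimizer)                                     # unscale, skip step on overflow, update in FP32
        scaler.update()
```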
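
For item 3, the paper's point is that naively scaling the learning rate linearly with batch size either diverges or converges poorly for this model, so the schedule itself has to be shaped carefully. The function below is a purely schematic warmup-then-decay multiplier of the kind commonly used for large-batch training; it is not the paper's exact schedule or constants.

```python
def warmup_then_linear_decay(step, warmup_steps, total_steps):
    """Multiplier for the base learning rate: ramp up linearly over
    warmup_steps, then decay linearly to zero by total_steps.
    Schematic only; the paper tunes its own schedule for the 32k batch."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

# Usage with any optimizer, e.g. the one from the previous sketch:
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda s: warmup_then_linear_decay(s, warmup_steps=2000, total_steps=100000))
# ...then call scheduler.step() once per optimizer step.
```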

Implications and Future Directions

The practical implications of this research are substantial, offering a pathway for commercial applications and researchers to perform large-scale unsupervised NLP training within feasible timeframes using commercially available resources. Model generalization and transfer learning benefits are underscored by experiments on sentiment classification tasks, yielding competitive accuracy levels.

For future work, the authors suggest training on larger corpora, potentially drawn from a variety of web content sources. Alternative architectures such as Transformer networks, further hyperparameter tuning, and improved learning rate schedules also present avenues for exploration. Techniques such as gradient checkpointing could enable training of even larger models by trading recomputation for memory (see the sketch below).
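
Gradient (activation) checkpointing is straightforward to sketch: discard the intermediate activations of each chunk of the unrolled sequence during the forward pass and recompute them when the backward pass needs them. A minimal PyTorch illustration, assuming a hypothetical cell(x_t, h, c) step function rather than the paper's code:

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_chunk(cell, xs, h, c):
    """Unroll cell over one chunk of time steps; xs: (batch, chunk_len, features)."""
    outs = []
    for x_t in xs.unbind(dim=1):
        h, c = cell(x_t, h, c)
        outs.append(h)
    return torch.stack(outs, dim=1), h, c

def forward_checkpointed(cell, inputs, h, c, chunk_len=64):
    """Activation checkpointing over an unrolled RNN: each chunk's intermediate
    activations are dropped after the forward pass and recomputed during the
    backward pass, trading compute for memory. cell(x_t, h, c) -> (h, c) is a
    hypothetical step function, not the paper's code."""
    outputs = []
    for xs in inputs.split(chunk_len, dim=1):           # split the sequence along time
        out, h, c = checkpoint(run_chunk, cell, xs, h, c, use_reentrant=False)
        outputs.append(out)
    return torch.cat(outputs, dim=1), h, c
```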

Conclusion

This paper meaningfully contributes to the understanding and potential of RNNs in language modeling, demonstrating the synergy between hardware capability and algorithmic innovation. While challenges remain in scaling batch sizes effectively on available datasets, the insights and methodologies presented pave the way for continued progress in the field, enhancing our capacity to model complex natural language tasks at a vast scale.