Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes
The paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" by Yang You et al., presents an innovative approach towards efficient training of deep neural networks using large batch sizes. The major contribution lies in introducing LAMB (Layerwise Adaptive Moments for Batch training) optimizer, which extends the principles of LARS (Layer-wise Adaptive Rate Scaling) to enhance training efficiency across various neural network architectures.
Key Contributions
The paper makes several significant contributions:
- General Adaptation Strategy: The authors propose a principled layerwise adaptation strategy that addresses the shortcomings of LARS, particularly on attention models such as BERT. The strategy scales each layer's learning rate adaptively based on the ratio of the layer's parameter norm to the norm of its update (see the rule sketched just after this list).
- LAMB Optimizer: Built on this adaptation strategy, LAMB combines layerwise adaptive learning rates with element-wise adaptive moments to achieve better convergence in large-batch settings. In effect, it extends LARS with Adam's element-wise adaptive learning rates, which makes it suitable for non-convolutional networks as well.
- Convergence Analysis: The paper provides a convergence analysis for both LARS and LAMB in nonconvex settings, giving theoretical justification for their performance. The analysis shows convergence to a stationary point and quantifies the benefit of layerwise adaptation in large-batch settings, with better scaling behavior than standard SGD.
- Empirical Results: Empirical evaluations show that LAMB outperforms existing large-batch optimizers on BERT and ResNet-50 training. Notably, for BERT pre-training, LAMB reduces training time from 3 days to 76 minutes using batch sizes of up to 64K, without any performance degradation.
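For reference, the general layerwise adaptation rule that both LARS and LAMB instantiate can be written (up to details such as weight decay and bias correction) as

```latex
x_{t+1}^{(i)} = x_t^{(i)} - \eta_t \,
  \frac{\phi\!\left(\lVert x_t^{(i)} \rVert\right)}{\lVert u_t^{(i)} \rVert} \, u_t^{(i)}
```

where x_t^{(i)} are the parameters of layer i, u_t^{(i)} is the update direction produced by a base optimizer (SGD with momentum for LARS, Adam for LAMB), and φ is a scaling function, typically the identity or a clipped identity. The norm ratio keeps each layer's step commensurate with the scale of its parameters, which is what allows the learning rate to be raised aggressively as the batch size grows.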
Detailed Algorithmic Insight
LARS Algorithm
LARS adapts the learning rate of each layer by scaling it with the ratio of the layer's parameter norm to the norm of its gradient (or momentum) update. This keeps each step proportionate to the size of the parameters, which works particularly well for convolutional networks. However, the paper observes that LARS performs inconsistently beyond convolutional networks, notably on attention-based models such as BERT.
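A minimal NumPy sketch of one LARS step for a single layer is shown below. The function and variable names are illustrative rather than taken from the paper's code, phi is taken as the identity, and a small eps guards against division by zero:

```python
import numpy as np

def lars_update(x, g, m, lr, beta1=0.9, weight_decay=0.01, eps=1e-8):
    """One LARS step for one layer's parameters x (illustrative sketch)."""
    # Momentum on the weight-decayed gradient.
    m = beta1 * m + (1 - beta1) * (g + weight_decay * x)
    # Layerwise trust ratio: phi(||x||) / ||m||, with phi = identity.
    trust_ratio = np.linalg.norm(x) / (np.linalg.norm(m) + eps)
    x = x - lr * trust_ratio * m
    return x, m
```

The same update is applied independently to every layer (each weight matrix and bias vector) of the network.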
LAMB Algorithm
LAMB extends LARS with element-wise adaptive moments in the style of Adam. The algorithm maintains first- and second-moment estimates of the gradients and normalizes the update by the square root of the second-moment estimate. Layerwise adaptation is then applied by scaling the learning rate with the ratio of each layer's parameter norm to the norm of its Adam-normalized update. This two-fold adaptivity (element-wise and layerwise) lets LAMB perform robustly across different architectures, including transformers.
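The sketch below extends the LARS snippet above into a LAMB-style step, again for a single layer and with illustrative names. It follows the structure described in the paper (Adam-style moments, bias correction, weight decay added to the normalized update, and a layerwise trust ratio), but it is a simplified sketch rather than a reference implementation:

```python
import numpy as np

def lamb_update(x, g, m, v, t, lr, beta1=0.9, beta2=0.999,
                weight_decay=0.01, eps=1e-6):
    """One LAMB step for one layer's parameters x at step t >= 1 (illustrative sketch)."""
    # Element-wise first and second moments, as in Adam.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Adam-normalized direction plus weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x
    # Layerwise trust ratio: phi(||x||) / ||update||, with phi = identity.
    trust_ratio = np.linalg.norm(x) / (np.linalg.norm(update) + eps)
    x = x - lr * trust_ratio * update
    return x, m, v
```

Because the trust ratio is computed from norms over the entire layer while the Adam normalization acts per coordinate, the two adaptations operate at different granularities; this is the two-fold adaptivity referred to above.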
Practical Implications
The implications of the proposed LAMB optimizer are substantial:
- Efficient Training: By enabling stable training with very large batch sizes, LAMB dramatically reduces wall-clock time. This is crucial for deploying large-scale models in production environments where training time directly impacts model readiness.
- Minimal Hyperparameter Tuning: Unlike many optimizers that require extensive hyperparameter tuning when scaling batch sizes, LAMB requires minimal or no additional tuning. This simplifies the training process and reduces the computational overhead associated with hyperparameter optimization.
- Broad Applicability: LAMB’s adaptability makes it suitable for a wide range of deep learning models, from image classification networks like ResNet to complex attention models like BERT, signifying its versatility.
Future Directions
The success of LAMB opens several avenues for future research:
- Further Theoretical Analysis: While the convergence analysis provided is robust, further explorations into the theoretical underpinnings of mixed adaptive methods could yield deeper insights and potentially new optimizations.
- Extended Applicability: As the framework of neural networks evolves, especially with the advent of new architectures like vision transformers and large-scale generative models, adapting and validating LAMB in these contexts would further establish its utility.
- Hardware Optimization: Given that LAMB demonstrates substantial scaling efficiency on TPUs, investigating its performance and tuning on other accelerators, such as GPUs and specialized deep learning chips, could broaden its applicability across computational environments.
In summary, the work of Yang You et al. on LAMB marks a significant advance in large-batch optimization, with direct impact on the practical training of large neural networks. The convergence analysis and empirical validation together position LAMB as an effective tool for accelerating model training while maintaining stability and accuracy across diverse architectures.