Don't Decay the Learning Rate, Increase the Batch Size (1711.00489v2)

Published 1 Nov 2017 in cs.LG, cs.CV, cs.DC, and stat.ML

Abstract: It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train ResNet-50 on ImageNet to $76.1\%$ validation accuracy in under 30 minutes.

Authors (4)
  1. Samuel L. Smith (27 papers)
  2. Pieter-Jan Kindermans (19 papers)
  3. Chris Ying (6 papers)
  4. Quoc V. Le (128 papers)
Citations (946)

Summary

An Analytical Insight into "Don't Decay the Learning Rate, Increase the Batch Size"

The paper "Don't Decay the Learning Rate, Increase the Batch Size" authored by Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le presents a compelling exploration of optimization strategies in deep learning. Specifically, the research challenges the conventional wisdom of decaying the learning rate and instead advocates for increasing the batch size during training. This essay provides an exhaustive overview of the key contributions, empirical validations, and implications elucidated within the paper.

Key Contributions

  1. Equivalence of Decaying Learning Rate and Increasing Batch Size: The paper's central claim is that increasing the batch size during training can replicate the effect of decaying the learning rate. This claim is substantiated through experiments demonstrating nearly identical learning curves for the two procedures across several optimizers, including SGD, SGD with momentum, Nesterov momentum, and Adam.
  2. Reduction in Parameter Updates: Empirical evidence shows that the proposed method—where batch size is increased according to a predefined schedule—leads to fewer parameter updates while maintaining comparable test accuracies. This efficiency is further amplified by aggressive scaling of the batch size in proportion to the learning rate, which significantly reduces training time.
  3. Hyper-Parameter-free Repurposing of Training Schedules: The authors propose that existing learning rate decay schedules can be converted to batch size increasing schedules without any additional hyper-parameter tuning. This practical recipe simplifies the adoption of large batch training and leverages existing, empirically proven learning rate schedules (a minimal sketch of such a conversion follows this list).
  4. Application to Real-World Tasks: The practicality of the proposed methodology is validated by training ResNet-50 on ImageNet to achieve 76.1% validation accuracy within 30 minutes, rivalling state-of-the-art training times achieved with traditional methods.
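The conversion described in item 3 amounts to replacing each learning-rate decay step with a proportional batch-size increase, capped at the largest batch the hardware can handle. The sketch below is a minimal illustration under assumed values (step schedule, decay factor, and cap are all hypothetical), not the authors' released code.

```python
# Minimal sketch: convert a step learning-rate decay schedule into an
# equivalent batch-size increase schedule (illustrative, not the paper's code).

def convert_schedule(base_lr, base_batch, decay_epochs, decay_factor, max_batch):
    """At each decay point, grow the batch size by 1/decay_factor instead of
    shrinking the learning rate; once the batch size would exceed max_batch,
    fall back to decaying the learning rate as usual."""
    lr, batch = base_lr, base_batch
    schedule = [(0, lr, batch)]
    for epoch in decay_epochs:
        proposed = int(batch / decay_factor)   # e.g. decay_factor=0.2 -> 5x batch
        if proposed <= max_batch:
            batch = proposed                   # keep lr fixed, raise batch size
        else:
            lr *= decay_factor                 # cap reached: decay lr instead
        schedule.append((epoch, lr, batch))
    return schedule

# Example: CIFAR-10-style schedule with decays at epochs 60/120/160,
# factor 0.2, initial batch 128, cap 5120 (all illustrative values).
for epoch, lr, batch in convert_schedule(0.1, 128, [60, 120, 160], 0.2, 5120):
    print(f"from epoch {epoch:3d}: lr={lr:.4f}, batch={batch}")
```

Because each batch-size increase keeps the ratio of learning rate to batch size on the same trajectory as the original decay schedule, no additional hyper-parameter tuning is required.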

Detailed Empirical Analysis

The experimental section is meticulously designed to validate the theoretical claims. The core results are:

  • Wide ResNet on CIFAR-10: Experiments with a "16-4" Wide ResNet on CIFAR-10 show that increasing-batch-size schedules yield training-set cross-entropy and test-accuracy curves indistinguishable from those obtained by decaying the learning rate. Notably, increasing the batch size reduces the number of parameter updates required by up to a factor of three (a rough update-count comparison follows this list).
  • Inception-ResNet-V2 on ImageNet: Further experiments on the more complex ImageNet dataset using the Inception-ResNet-V2 architecture demonstrate the viability of increasing batch size training with large initial learning rates, achieving 77.5% validation accuracy in under 2500 parameter updates—significantly fewer than the 14000 updates required using decaying learning rates.
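The reduction in parameter updates follows from simple bookkeeping: each epoch requires roughly N/B updates, so every training phase run at a larger batch size contributes proportionally fewer updates. A rough tally under the illustrative CIFAR-10-style schedule from the earlier sketch (assumed numbers, not figures from the paper):

```python
# Rough update-count comparison over a fixed number of epochs
# (illustrative numbers, not the paper's exact figures).
N = 50_000                                              # CIFAR-10 training-set size
phases = [(0, 60), (60, 120), (120, 160), (160, 200)]   # epoch ranges

decay_lr_batches = [128, 128, 128, 128]     # constant batch size, lr decays
grow_batch_sizes = [128, 640, 3200, 3200]   # batch size grows at each decay point

def total_updates(batch_sizes):
    return sum((end - start) * N // b
               for (start, end), b in zip(phases, batch_sizes))

print("decay lr:      ", total_updates(decay_lr_batches), "updates")   # ~78k
print("increase batch:", total_updates(grow_batch_sizes), "updates")   # ~29k
```

Under these assumed numbers the increasing-batch schedule needs roughly a third as many updates, consistent with the reduction reported for the Wide ResNet experiments.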

Theoretical Foundations and Implications

The theoretical underpinning for these findings is rooted in the interpretation of SGD as integrating a stochastic differential equation. The noise scale, defined as $g = \epsilon \left(\frac{N}{B} - 1\right)$, where $\epsilon$ is the learning rate, $N$ is the training set size, and $B$ is the batch size, plays a critical role. By increasing $B$, the noise in the gradient updates can be controlled in the same way that decaying $\epsilon$ controls it.
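To make the equivalence concrete, here is a small numerical check of the noise scale under assumed CIFAR-10-like values (not figures taken from the paper): decaying $\epsilon$ by a factor of five and multiplying $B$ by five yield nearly the same $g$ whenever $N \gg B$.

```python
# Noise scale g = eps * (N/B - 1): decaying eps or growing B has nearly the
# same effect on g when N >> B (illustrative values).
def noise_scale(eps, N, B):
    return eps * (N / B - 1)

N = 50_000
print(noise_scale(0.10, N, 128))   # ~38.96  (initial phase)
print(noise_scale(0.02, N, 128))   # ~ 7.79  (learning rate decayed 5x)
print(noise_scale(0.10, N, 640))   # ~ 7.71  (batch size increased 5x instead)
```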

Implications:

  1. Efficient Use of Computational Resources: The reduced number of parameter updates highlights the potential for significant computational savings and shorter training times. This is particularly beneficial when leveraging parallel computing frameworks, such as GPUs and TPUs, which are optimized for large batch operations.
  2. Enhanced Scalability: The ability to scale batch sizes dynamically while following existing training schedules simplifies the training of large-scale models, making the approach broadly applicable across various deep learning tasks and architectures.
  3. Future Directions: There is potential for further exploration into the upper bounds of batch size scalability and the effects of larger momentum coefficients. Future research could also investigate the integration of this strategy with other optimization techniques, such as adaptive learning rates and second-order methods.

Conclusion

The research presented in "Don't Decay the Learning Rate, Increase the Batch Size" offers a careful examination of an alternative to the traditional learning rate decay strategy in deep learning training. By demonstrating the effectiveness and efficiency of increasing the batch size, the authors provide a significant contribution to the field, with promising implications for both theoretical optimization and practical implementation. As hardware capabilities continue to evolve, the findings of this paper offer a forward-looking perspective on reducing model training times and leveraging parallelism in large-scale neural network training.
