Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
The paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" by Priya Goyal et al. examines the challenges and solutions associated with scaling stochastic gradient descent (SGD) to large minibatches for training deep neural networks efficiently. Focusing on the ImageNet dataset, the authors demonstrate that with appropriate strategies, it is possible to augment the minibatch size significantly without sacrificing generalization performance.
Key Contributions
The paper's central contribution is empirical evidence showing that large minibatches, up to 8192 images, can be used to train deep models without loss in accuracy, provided that certain techniques are employed:
- Linear Learning Rate Scaling: The authors adopt a linear scaling rule for the learning rate: when the minibatch size is multiplied by a factor of k, the learning rate is also multiplied by k (this rule and the warmup below are illustrated in the schedule sketch after this list).
- Warmup Strategy: To address optimization difficulties early in training with large minibatches, a gradual warmup of the learning rate is proposed: training starts at the small-minibatch learning rate and ramps linearly up to the scaled target over the first few epochs (five epochs in the paper's ImageNet experiments).
- Implementation and Optimization Tips: The paper offers practical guidance for implementing large minibatch training correctly, covering weight decay, momentum correction, per-worker loss normalization and gradient aggregation, and data shuffling (the momentum correction and aggregation points are sketched in the second example below).
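As a concrete illustration of the first two rules, here is a minimal sketch of the resulting learning-rate schedule in plain Python. The reference rate of 0.1 for a 256-image minibatch, the 5-epoch warmup, and the decay steps at epochs 30, 60, and 80 follow the paper's ResNet-50 recipe; the function itself is illustrative rather than the paper's code.

```python
def lr_at(epoch, it, iters_per_epoch, minibatch_size,
          base_lr=0.1, base_batch=256, warmup_epochs=5,
          milestones=(30, 60, 80), gamma=0.1):
    """Learning rate for a single SGD iteration.

    base_lr is the reference rate for base_batch images; the linear
    scaling rule multiplies it by k = minibatch_size / base_batch.
    During the first warmup_epochs the rate ramps linearly from
    base_lr up to the scaled target, after which the usual stepwise
    decay (divide by 10 at each milestone epoch) applies.
    """
    k = minibatch_size / base_batch
    target_lr = base_lr * k                       # linear scaling rule

    if epoch < warmup_epochs:                     # gradual warmup
        step = epoch * iters_per_epoch + it
        total = warmup_epochs * iters_per_epoch
        return base_lr + (target_lr - base_lr) * step / total

    # stepwise decay after warmup
    drops = sum(1 for m in milestones if epoch >= m)
    return target_lr * (gamma ** drops)
```

For a minibatch of 8192 images (k = 32), this ramps the rate from 0.1 to 3.2 over the first five epochs and then divides it by 10 at epochs 30, 60, and 80, matching the schedule reported in the paper.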
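Two of the implementation points can also be made concrete. The paper's momentum correction accounts for the fact that when the learning rate is folded into the momentum buffer, a change of learning rate between iterations (as happens throughout warmup) requires rescaling the buffer by the ratio of new to old rates; and gradient aggregation is a cross-worker sum that is only correct if each worker normalizes its loss by the total minibatch size kn rather than its local size n. The sketch below expresses both in PyTorch-style Python; the paper's implementation is in Caffe2, and the function names, buffer bookkeeping, and defaults (momentum 0.9, weight decay 0.0001, as in the paper's ResNet-50 setup) are illustrative.

```python
import torch
import torch.distributed as dist


def allreduce_gradients(model):
    """Sum per-worker gradients across all workers.

    Assumes each worker normalized its loss by the total minibatch size
    k*n (not its local size n), so a plain sum over workers yields the
    gradient of the full large minibatch.
    """
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)


def sgd_step(params, momentum_bufs, lr, prev_lr,
             momentum=0.9, weight_decay=1e-4):
    """One momentum-SGD step with the learning rate folded into the buffer.

    Because the buffer stores lr * velocity, a change from prev_lr to lr
    rescales the momentum term by lr / prev_lr ("momentum correction").
    Weight decay is applied to the gradient, not folded into the loss.
    """
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        g = p.grad + weight_decay * p.detach()
        buf = momentum_bufs.get(i, torch.zeros_like(p))
        buf = momentum * (lr / prev_lr) * buf + lr * g   # momentum correction
        momentum_bufs[i] = buf
        with torch.no_grad():
            p -= buf
```

In a training loop, lr would come from the schedule above and prev_lr would be the rate used on the previous iteration, so the correction factor is 1 whenever the rate is constant.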
Results
The experimental results demonstrate the feasibility of the approach. Using the Caffe2 framework together with the techniques above, the authors train a ResNet-50 model on ImageNet:
- Minibatch Size: The model is trained with a minibatch size of 8192 images.
- Training Time: The training is accomplished within 1 hour using 256 GPUs.
- Accuracy: The resulting model achieves a top-1 validation error of 23.74% ± 0.09%, matching the small-minibatch (256 images) baseline of 23.60% ± 0.12%.
Implications and Future Directions
The implications of this work extend to both practical applications and theoretical research in AI:
- Practical Implications: The methods described enable training of large-scale models on extensive datasets with dramatically reduced wall-clock time. This efficiency opens the door to industrial applications that require frequent retraining on fresh data and lets researchers shorten the iteration cycle when exploring new architectures.
- Generalization to Other Domains: The techniques are shown to generalize beyond image classification. For instance, the warmup strategy and linear scaling rule are applied successfully to training Mask R-CNN models for object detection and instance segmentation on COCO.
- Exploration of Limits: Future work can probe the boundaries of minibatch scaling. In particular, the upper limit for stable, accurate training warrants further investigation, since the paper observes accuracy degradation once the minibatch grows beyond 8192 images.
In conclusion, the paper makes a significant contribution to the field by not only providing a practical solution for efficient large-scale training but also setting a benchmark for future endeavors in distributed deep learning. The methodologies proposed have shown robustness across various tasks and models, indicating their broader applicability and potential to advance the state of the art in AI research.