Revisiting Small Batch Training for Deep Neural Networks
This paper examines the efficacy of small batch sizes in the training of deep neural networks, challenging recent trends that favor significantly larger mini-batch sizes. The primary focus is an empirical analysis of how different batch sizes impact training stability and generalization performance across several standard datasets, including CIFAR-10, CIFAR-100, and ImageNet.
Theoretical Considerations
The paper revisits the assumptions about learning rate scaling and training duration that are commonly applied when implementing mini-batch stochastic gradient descent (SGD). A key observation is that, under the linear learning-rate scaling rule (learning rate proportional to batch size), the mean weight update per training example stays constant, but the variance of each weight update grows linearly with the mini-batch size. Maintaining stable training as the batch size increases therefore demands careful management of this growing variance.
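This scaling behavior can be checked numerically on a toy 1-D problem. The sketch below is illustrative only: the gradient distribution, base learning rate, and trial count are arbitrary assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_update_stats(batch_size, base_lr=0.1, n_trials=20_000):
    """Empirical mean and variance of one SGD weight update on a toy
    1-D problem, under the linear scaling rule (lr = base_lr * m)."""
    lr = base_lr * batch_size
    # per-example gradients: mean 1.0, unit variance (stand-in values)
    grads = rng.normal(loc=1.0, scale=1.0, size=(n_trials, batch_size))
    updates = -lr * grads.mean(axis=1)  # one mini-batch SGD step per row
    return updates.mean(), updates.var()

mean1, var1 = sgd_update_stats(1)
mean32, var32 = sgd_update_stats(32)
# the mean update per *example* is constant (mean32 / 32 ≈ mean1),
# while the update variance grows roughly linearly: var32 ≈ 32 * var1
```

With linear scaling the variance of a single update is (base_lr · m)² · σ²/m = base_lr² · σ² · m, which is the linear growth the paper highlights.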
Experimental Findings
The paper provides a comprehensive set of experimental results using a range of neural network architectures, including AlexNet and ResNet. These experiments demonstrate that small batch sizes, specifically between 2 and 32, offer superior generalization performance compared to much larger batches. The authors attribute this to more frequent weight updates based on more up-to-date gradient information, which promotes stable and reliable convergence.
Notably, larger batch sizes tend to restrict the range of learning rates that yield stable training, while smaller batches admit a broader range of learning rates with robust convergence. This is consistent with the view that, under linear scaling, larger batches increase the variance of the weight updates and thereby threaten stability.
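The narrowing of the stable learning-rate range can be illustrated on a noisy 1-D quadratic, where SGD with effective learning rate above 2/curvature diverges. Everything below (the quadratic, the noise model, the thresholds) is a toy construction, not the paper's experimental setup.

```python
import numpy as np

def converges(base_lr, batch_size, steps=200, seed=0):
    """Run SGD on a noisy 1-D quadratic (curvature 1) with the linear
    scaling rule, and report whether the iterate stays bounded."""
    rng = np.random.default_rng(seed)
    lr = base_lr * batch_size                          # linear scaling rule
    w = 1.0
    for _ in range(steps):
        grad = w + rng.normal() / np.sqrt(batch_size)  # noisy gradient
        w -= lr * grad
        if abs(w) > 1e6:                               # diverged
            return False
    return True

# with batch size 1, a base lr of 0.1 is comfortably stable; with batch
# size 32 the same base lr gives an effective lr of 3.2 > 2 and diverges
```

In per-example terms, the stable window of base learning rates shrinks by a factor of the batch size, mirroring the narrower stable range the paper reports for large batches.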
Batch Normalization and Warm-Up Effects
The paper also addresses the interaction between batch normalization (BN) and mini-batch size. Even for larger networks and datasets, such as ImageNet, BN computed over small mini-batch statistics continues to achieve the best performance. An important nuance is that the batch size used to compute BN statistics can profitably be smaller than the batch size used for the SGD weight updates.
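Decoupling the BN batch from the SGD batch amounts to normalizing each small sub-batch with its own statistics, a technique sometimes called ghost batch normalization. A minimal numpy sketch (2-D activations, no learned scale/shift, hypothetical function name):

```python
import numpy as np

def ghost_batch_norm(x, ghost_size, eps=1e-5):
    """Normalize each 'ghost' sub-batch with its own mean/variance, so
    BN sees a smaller effective batch than the SGD update does.
    x: (batch, features); batch must be divisible by ghost_size."""
    b, f = x.shape
    assert b % ghost_size == 0
    chunks = x.reshape(b // ghost_size, ghost_size, f)
    mean = chunks.mean(axis=1, keepdims=True)
    var = chunks.var(axis=1, keepdims=True)
    return ((chunks - mean) / np.sqrt(var + eps)).reshape(b, f)

out = ghost_batch_norm(np.random.default_rng(0).normal(size=(32, 4)), 8)
```

Here the SGD update would still use all 32 examples, while each group of 8 is normalized independently.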
In an attempt to mitigate the negative effects of large batch sizes, the authors evaluate the gradual warm-up strategy. While this approach improves large-batch performance to some extent, it does not fully close the gap with the performance obtained from smaller batches.
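Gradual warm-up ramps the learning rate from a small initial value up to the target value over the first part of training. A minimal schedule sketch (the step counts and rates are placeholders, not the paper's settings):

```python
def warmup_lr(step, warmup_steps, base_lr, target_lr):
    """Linearly ramp the learning rate from base_lr to target_lr over
    warmup_steps, then hold it at target_lr (gradual warm-up)."""
    if step >= warmup_steps:
        return target_lr
    frac = step / warmup_steps
    return base_lr + frac * (target_lr - base_lr)

# e.g. warmup_lr(0, 100, 0.01, 0.1) -> 0.01, rising to 0.1 by step 100
```

The idea is to keep early updates small while the network is far from a good region, delaying the large effective learning rate implied by linear scaling.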
Implications and Future Directions
The implications of this paper are multifaceted. Practically, training deep networks with smaller batch sizes can potentially lead to better generalization on unseen data, while also enabling easier tuning of hyperparameters like learning rates. Theoretically, it calls into question the hardware-focused drive for larger batches and highlights a bias towards incorporating more contemporary architectural assumptions that may not leverage small batch benefits.
The paper opens avenues for future research into architectures and optimization strategies that remain stable across batch sizes. Additionally, alternative normalization techniques, such as Group Normalization, could be better suited to small-batch regimes and warrant further exploration.
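Group Normalization is attractive here because its statistics are computed per sample over channel groups, making it independent of the batch size. A bare-bones numpy sketch, omitting the learned scale and shift parameters of the full method:

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group Normalization: per-sample statistics over channel groups,
    so the result does not depend on the batch size.
    x: (batch, channels, height, width)."""
    b, c, h, w = x.shape
    assert c % num_groups == 0
    g = x.reshape(b, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(b, c, h, w)

gn = group_norm(np.random.default_rng(1).normal(size=(2, 8, 4, 4)), 2)
```

Because each sample is normalized independently, a batch of one behaves identically to a batch of a thousand, sidestepping the BN-versus-batch-size interaction discussed above.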
In summary, the paper provides a thorough and measured examination of mini-batch size considerations, contributing valuable insights to the ongoing discourse on training techniques in deep learning.