On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (1609.04836v2)

Published 15 Sep 2016 in cs.LG and math.OC

Abstract: The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

The paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima," by Nitish Shirish Keskar et al., addresses a critical issue in deep learning optimization—how large-batch methods tend to yield poorer generalization compared to small-batch methods. Despite achieving similar training accuracy, models trained with large-batch methods often underperform on generalization metrics such as testing accuracy. The authors investigate the reasons behind this generalization gap and provide a quantitative analysis using both theoretical insights and extensive empirical evidence.

Key Observations

The authors observe that large-batch methods tend to converge towards sharp minimizers of the training loss function, while small-batch methods converge to flatter minimizers. This difference matters because sharp minimizers, characterized by a significant number of large positive eigenvalues of the Hessian of the loss function, are closely associated with poorer generalization. In contrast, flat minimizers, whose Hessian eigenvalues are small, are more likely to generalize well to unseen data.
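
For intuition, the dominant Hessian eigenvalue at a trained solution can be estimated without forming the Hessian explicitly, using power iteration on Hessian-vector products. The sketch below is an illustrative PyTorch utility, not code from the paper; it assumes a scalar `loss` already computed from model parameters that require gradients.

```python
import torch

def largest_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the top Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        # Normalize the current direction.
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grad . v) w.r.t. params.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v approximates the dominant eigenvalue.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig
```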

The research employs a variety of neural network architectures and datasets to substantiate these claims, focusing on common multi-class classification tasks. Across the different network configurations, large-batch methods produced markedly sharper minimizers, whereas small-batch methods maintained relatively low sharpness values.

Experimental Insights

The empirical section of the paper examines a range of batch sizes across six network configurations. The authors compute sharpness with a heuristic metric: they explore a small neighborhood around the solution and measure the largest relative increase in the loss function within it. The results consistently show that large-batch training leads to higher sharpness values, which in turn correlate with higher generalization error.
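
The paper normalizes the largest loss increase found inside a small box around the solution and reports it as a percentage, solving the inner maximization with an optimizer. The sketch below is a crude random-sampling stand-in for a metric of that kind; the `loss_at` callable and the box whose half-width scales with the weight magnitudes are illustrative assumptions.

```python
import numpy as np

def sharpness_estimate(loss_at, x, eps=1e-3, n_samples=100, rng=None):
    """Rough sharpness estimate: the largest relative loss increase over
    random perturbations inside an eps-box around the solution x.
    (The paper maximizes over the box with an optimizer; random
    sampling here is only a simplified stand-in.)"""
    rng = np.random.default_rng() if rng is None else rng
    base = loss_at(x)
    worst = base
    for _ in range(n_samples):
        # Perturbation y with |y_i| <= eps * (|x_i| + 1).
        y = rng.uniform(-1.0, 1.0, size=x.shape) * eps * (np.abs(x) + 1.0)
        worst = max(worst, loss_at(x + y))
    # Relative increase, scaled to a percentage.
    return 100.0 * (worst - base) / (1.0 + base)
```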

The parametric plots presented in the paper are particularly illustrative, showing the behavior of the loss function and classification accuracy along line segments connecting small-batch and large-batch solutions. These plots reveal that large-batch minimizers are indeed sharp, supporting the hypothesis that sharp minima harm generalization.
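
As a rough illustration, such a parametric plot can be produced by evaluating the loss at interpolated (and slightly extrapolated) combinations of the two solutions. The flattened weight vectors `x_small_batch` and `x_large_batch` and the `loss_at` callable below are hypothetical; classification accuracy can be traced along the same segment.

```python
import numpy as np
import matplotlib.pyplot as plt

def parametric_plot(loss_at, x_small_batch, x_large_batch, alphas=None):
    """Plot the loss along theta(alpha) = alpha * x_LB + (1 - alpha) * x_SB,
    the line through a small-batch and a large-batch solution."""
    alphas = np.linspace(-1.0, 2.0, 25) if alphas is None else alphas
    losses = [loss_at(a * x_large_batch + (1.0 - a) * x_small_batch)
              for a in alphas]
    plt.plot(alphas, losses)
    plt.xlabel("alpha")
    plt.ylabel("loss")
    plt.show()
```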

Potential Remedies and Prospects for Future Research

The authors explore several strategies to mitigate the generalization gap observed with large-batch training:

  1. Data Augmentation: By applying extensive data augmentation, the authors were able to improve the testing accuracy of large-batch methods. However, the resulting minimizers remained sharp, so the generalization gap was narrowed but not closed.
  2. Conservative Training: Refining each iterate with a proximal regularization term also showed promise. While this strategy likewise improved testing accuracy, it did not resolve the problem of sharp minimizers.
  3. Dynamic Sampling: Gradually increasing the batch size during training, starting from a small batch, showed some potential. This approach benefits from the exploratory behavior of small-batch noise in the early epochs, which can guide the optimization toward flatter regions before large batches take over; a sketch of such a schedule follows this list.
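
A minimal sketch of such a batch-size schedule is given below; the geometric growth rule, batch sizes, and learning rate are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def train_with_dynamic_sampling(model, dataset, loss_fn, epochs=30,
                                start_batch=64, end_batch=4096):
    """Sketch of dynamic sampling: start with a small batch size and grow
    it over the epochs, so the noisier early steps can explore before
    large batches take over."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(epochs):
        # Grow the batch size geometrically from start_batch to end_batch.
        frac = epoch / max(1, epochs - 1)
        batch_size = int(start_batch * (end_batch / start_batch) ** frac)
        loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                             shuffle=True)
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
```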

Conclusion and Future Work

The implications of this research are twofold. Practically, it highlights the limitations of large-batch training, which is especially pertinent as distributed training of large-scale deep learning models becomes increasingly common. Theoretically, it opens several avenues for future research: the authors raise open questions such as whether large-batch methods can be proven to converge to sharp minima, how dense sharp minimizers are relative to flat ones, and which architectural innovations might mitigate these issues.

Addressing these limitations could lead to more efficient and generalizable models, significantly impacting the training paradigms for large-scale neural networks. The exploration of robust optimization techniques, advanced data augmentation, and perhaps novel regularization methods could further enhance the viability of large-batch training, making it an area ripe for continued investigation.

Authors (5)
  1. Nitish Shirish Keskar (30 papers)
  2. Dheevatsa Mudigere (35 papers)
  3. Jorge Nocedal (21 papers)
  4. Mikhail Smelyanskiy (15 papers)
  5. Ping Tak Peter Tang (16 papers)
Citations (2,777)