
Averaging Weights Leads to Wider Optima and Better Generalization (1803.05407v3)

Published 14 Mar 2018 in cs.LG, cs.AI, cs.CV, and stat.ML

Abstract: Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.

Averaging Weights Leads to Wider Optima and Better Generalization

The paper "Averaging Weights Leads to Wider Optima and Better Generalization" addresses significant aspects of training deep neural networks (DNNs), proposing the Stochastic Weight Averaging (SWA) method. This method capitalizes on the geometrical properties of the loss surfaces to improve generalization, suggesting that simpler averaging of neural network weights can lead to better performance and more robust models.

Key Contributions and Findings

  1. Stochastic Weight Averaging (SWA): SWA involves averaging weights from multiple points along the training trajectory of Stochastic Gradient Descent (SGD) with cyclical or constant learning rates. This approach finds flatter regions in the loss surface, leading to better generalization in comparison to conventional methods.
  2. Improvement in Generalization: The paper demonstrates that SWA achieves notable improvements in test accuracy over conventional SGD across a variety of state-of-the-art neural network architectures, including residual networks, PyramidNets, DenseNets, and Shake-Shake networks on datasets like CIFAR-10, CIFAR-100, and ImageNet.
  3. Relationship to Fast Geometric Ensembling (FGE): The authors show that SWA approximates the benefits of FGE without the added computational burden of ensembling, by interpreting the ensemble as an averaging of networks in weight space rather than model space.
  4. Flatness of Solutions: The proposed SWA method finds solutions that are wider and flatter compared to those found by SGD. This flatness is theorized to be critically related to better generalization performance, aligning with arguments by Keskar et al. (2017) and Hochreiter & Schmidhuber (1997).
  5. Minimal Computational Overhead: One of the strengths of SWA is its simplicity and minimal computational overhead. Compared to conventional training techniques, SWA requires only maintaining a simple running average of the weights over epochs (a minimal sketch of this procedure follows the list below).

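The core procedure can be sketched in a few lines. The following is a rough, hypothetical PyTorch-style sketch (function and variable names are illustrative, not the authors' reference implementation); it runs ordinary SGD and, after a chosen epoch, folds the current weights into a running average:

```python
import copy
import torch

def train_with_swa(model, optimizer, loader, loss_fn,
                   epochs=150, swa_start=125, swa_freq=1, swa_lr=0.05):
    """Minimal SWA sketch: train with SGD as usual, then average the weights
    collected at the end of each epoch once `swa_start` is reached."""
    swa_model = copy.deepcopy(model)   # holds the running average w_SWA
    n_averaged = 0

    for epoch in range(epochs):
        if epoch >= swa_start:
            # switch to a constant (or cyclical) learning rate for the SWA phase
            for group in optimizer.param_groups:
                group["lr"] = swa_lr

        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

        if epoch >= swa_start and (epoch - swa_start) % swa_freq == 0:
            # running average: w_SWA <- (w_SWA * n + w) / (n + 1)
            with torch.no_grad():
                for p_swa, p in zip(swa_model.parameters(), model.parameters()):
                    p_swa.mul_(n_averaged / (n_averaged + 1)).add_(p / (n_averaged + 1))
            n_averaged += 1

    # BatchNorm running statistics should be recomputed for the averaged weights
    return swa_model
```

The only extra state is a single copy of the model's weights and a counter, which is why the overhead over plain SGD is negligible.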
Numerical Results and Significance

The authors present strong numerical results:

  • On CIFAR-100, SWA improves test accuracy by more than 1.3% for several architectures, including Preactivation ResNet-164 and Wide ResNet-28-10.
  • On ImageNet, SWA achieves up to a 0.8% improvement in top-1 accuracy for ResNet-50 and DenseNet-161 with just 10 additional epochs of training.

These results underscore the practical applicability of SWA across different architectures and datasets, suggesting that it can serve as a powerful, architecture-agnostic tool for deep learning practitioners.

Practical and Theoretical Implications

Practically, SWA's ease of implementation and compatibility with existing frameworks (like PyTorch) make it an attractive addition to the deep learning toolbox. The method's ability to improve generalization with negligible additional computational costs can have significant impacts in both research and deployment scenarios where resource efficiency is crucial.
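As a concrete illustration of this ease of use, recent PyTorch releases ship SWA utilities in torch.optim.swa_utils. The sketch below assumes that module together with a pre-existing model, train_loader, and loss_fn; it is one reasonable way to wire SWA into a standard training loop, not the paper's own code:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# model, train_loader, and loss_fn are assumed to be defined elsewhere
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

swa_model = AveragedModel(model)               # keeps the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # constant LR for the SWA phase
swa_start = 75

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)  # fold current weights into the average
        swa_scheduler.step()
    else:
        scheduler.step()

# BatchNorm statistics must be recomputed for the averaged weights before evaluation
update_bn(train_loader, swa_model)
```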

Theoretically, the paper enriches our understanding of the geometry of DNN loss landscapes. By showing that SGD typically converges to sharp optima while SWA navigates towards flatter regions, the authors contribute to the ongoing discourse on the relationship between flatness of minima and generalization. This insight could guide the future development of optimization algorithms that are even more robust against overfitting.

Future Directions

The paper opens several avenues for future research:

  • Convergence Analysis: A deeper exploration into the convergence properties of SWA, particularly in non-convex settings.
  • Large Batch Training: Investigating the effects of SWA on generalization with larger batch sizes, potentially enabling more efficient training regimes.
  • Integration with Bayesian Methods: Combining SWA with Bayesian neural networks and stochastic MCMC approaches to leverage posterior density exploration.
  • Optimizing Learning Rate Schedules: Fine-tuning cyclic and constant learning rate schedules to further enhance the convergence rates and robustness of SWA.

The introduction of SWA marks a significant advancement in training neural networks, offering a straightforward yet powerful modification that can be seamlessly integrated into existing training pipelines. By continuing to refine and understand the properties of SWA, researchers can unlock further potential in the performance and reliability of DNNs.

Authors (5)
  1. Pavel Izmailov (26 papers)
  2. Dmitrii Podoprikhin (2 papers)
  3. Timur Garipov (13 papers)
  4. Dmitry Vetrov (84 papers)
  5. Andrew Gordon Wilson (133 papers)
Citations (1,520)