A Simple Baseline for Bayesian Uncertainty in Deep Learning
The paper presents SWA-Gaussian (SWAG), an approximate Bayesian inference technique for deep learning. Building on Stochastic Weight Averaging (SWA), which computes a first-moment (mean) estimate of the weights along the SGD trajectory, SWAG additionally fits a low-rank plus diagonal covariance to those iterates, yielding a Gaussian approximation to the posterior over network weights. The framework is designed to provide scalable, well-calibrated uncertainty estimates across a range of neural network architectures and datasets.
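To make the construction concrete, the following is a minimal NumPy sketch of the moment accumulation SWAG performs along the SGD trajectory: a running mean, a running second moment for the diagonal covariance, and the most recent deviation vectors for the low-rank term. The class and variable names (SWAGCollector, max_rank, dev_cols) are illustrative and not taken from the paper's released code.

```python
import numpy as np

class SWAGCollector:
    """Running first/second moments and a low-rank deviation matrix
    over SGD iterates (an illustrative sketch, not the paper's code)."""

    def __init__(self, dim, max_rank=20):
        self.n = 0
        self.mean = np.zeros(dim)       # running mean of weights (the SWA mean)
        self.sq_mean = np.zeros(dim)    # running mean of squared weights
        self.dev_cols = []              # deviations theta_i - mean_i; last K kept
        self.max_rank = max_rank

    def collect(self, theta):
        """Call once per checkpoint (e.g. per epoch) with flattened weights."""
        self.n += 1
        self.mean += (theta - self.mean) / self.n
        self.sq_mean += (theta ** 2 - self.sq_mean) / self.n
        self.dev_cols.append(theta - self.mean)
        if len(self.dev_cols) > self.max_rank:
            self.dev_cols.pop(0)        # keep only the most recent K columns

    def diag_var(self):
        # Diagonal covariance: E[theta^2] - E[theta]^2, clamped at zero.
        return np.clip(self.sq_mean - self.mean ** 2, 0.0, None)
```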
Bayesian methods, while theoretically appealing, have struggled to scale to contemporary deep learning models; SWAG offers a more practical route to uncertainty quantification by leveraging the SGD training trajectory itself. Traditional Bayesian approaches often fall short because of sensitivity to hyperparameters and the computational cost of large-scale models and datasets. SWAG sidesteps these issues by approximating the posterior only within the subspace spanned by the SGD iterates, balancing expressiveness against cost.
Key Contributions
- SWAG Framework: SWAG adapts SWA to produce uncertainty estimates by fitting a Gaussian whose mean and covariance are computed from the SGD training trajectory (a sampling and model-averaging sketch follows this list). The approach builds on the stationary-distribution view of SGD iterates, even though the usual assumptions of independent gradient noise and a locally quadratic loss may not fully hold in deep learning settings.
- Empirical Performance: Across CIFAR-10, CIFAR-100, and ImageNet datasets, SWAG consistently outperforms or matches state-of-the-art uncertainty estimation techniques like MC-Dropout, K-FAC Laplace, SGLD, and temperature scaling. Notably, it shows significant improvements in tasks requiring out-of-sample detection and transfer learning.
- Theoretical and Practical Analysis: The paper explores the geometric properties of the posterior distribution over network parameters, observing that the SGD trajectory indeed captures crucial elements of the loss landscape's geometry. This insight substantiates the use of SWAG in encapsulating the posterior with a Gaussian distribution.
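Continuing the hypothetical SWAGCollector sketch above, the code below shows how one might draw weight samples from the fitted Gaussian and average the resulting predictive distributions (Bayesian model averaging). The scaling factors follow the low-rank plus diagonal sampling rule described in the paper; predict_fn is an assumed helper that loads the flattened weights into the network and returns softmax probabilities.

```python
import numpy as np

def sample_swag(collector, rng=None):
    """Draw one weight sample:
    theta = mean + sqrt(diag) * z1 / sqrt(2) + D @ z2 / sqrt(2 * (K - 1)).
    Assumes at least two iterates have been collected (K >= 2)."""
    rng = rng or np.random.default_rng()
    diag = collector.diag_var()
    D = np.stack(collector.dev_cols, axis=1)          # d x K deviation matrix
    K = D.shape[1]
    z1 = rng.standard_normal(diag.shape[0])
    z2 = rng.standard_normal(K)
    return (collector.mean
            + np.sqrt(diag) * z1 / np.sqrt(2.0)
            + D @ z2 / np.sqrt(2.0 * (K - 1)))

def swag_predict(collector, predict_fn, x, n_samples=30):
    """Bayesian model averaging: average predictive distributions over
    weight samples. predict_fn(theta, x) -> class probabilities (assumed)."""
    probs = [predict_fn(sample_swag(collector), x) for _ in range(n_samples)]
    return np.mean(probs, axis=0)
```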
Empirical Evaluation
The empirical analysis underscores SWAG's efficacy across multiple dimensions:
- Negative Log Likelihood (NLL): SWAG consistently achieves lower NLL than competing approaches, indicating better predictive performance and uncertainty estimation. For instance, on CIFAR-100 with PreResNet-164, SWAG records an NLL of 0.6595, notably better than the other techniques compared.
- Expected Calibration Error (ECE): SWAG is well calibrated, often matching or outperforming methods designed specifically for calibration, such as temperature scaling. The advantage is most pronounced in challenging settings like transfer learning from CIFAR-10 to STL-10, where SWAG remains robust without relying on a validation set for tuning. (Both metrics are illustrated in the sketch after this list.)
- Out-of-Domain Detection: In tasks involving detection of out-of-domain samples, SWAG performs admirably, highlighting its potential to handle uncertainty in real-world deployment scenarios, which are critical in applications such as autonomous driving and medical diagnostics.
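For reference, NLL and ECE can be computed from the averaged predictive probabilities as in the following sketch. The ECE implementation uses equal-width confidence bins; the bin count of 20 is an assumption for illustration, not a value taken from the paper.

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Mean NLL of the predictive distribution on held-out data."""
    eps = 1e-12
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def expected_calibration_error(probs, labels, n_bins=20):
    """ECE: bin predictions by confidence, compare accuracy to confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```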
Limitations and Future Directions
While SWAG presents compelling advancements, several theoretical and practical aspects warrant further exploration:
- Learning Rate Influence: The quality of the covariance approximation depends on the constant learning rate used while collecting SGD iterates. The paper notes that the theoretically optimal learning rate does not align well with what works empirically for deep networks, so selecting it on a validation set or tuning it for a specific application could improve performance.
- Rank and Sample Dependency: The paper evaluates different ranks for the low-rank approximation and the number of samples used for model averaging. While a rank of 20 and 30 samples generally suffice, more comprehensive tuning could optimize results further, particularly in diverse architectures and larger datasets.
Conclusion
SWAG offers a methodologically sound and practically scalable approach to Bayesian uncertainty representation in deep learning. By embedding uncertainty estimation directly into the training process with minimal overhead and robust empirical validation, SWAG stands out as a valuable tool for researchers and practitioners aiming to harness Bayesian methods in large-scale deep learning applications. Future research may further refine SWAG by addressing its dependence on specific hyperparameters and extending its applicability across different neural network paradigms and complex real-world tasks.