- The paper shows that Bayesian marginalization improves neural network accuracy by integrating predictions from multiple high-performing model basins.
- The authors present deep ensembles as a practical approximation to Bayesian model averaging and propose MultiSWAG, which marginalizes over multiple basins of attraction and is shown empirically to mitigate double descent.
- It highlights the role of neural network priors and inductive biases, urging further research to refine Bayesian methods in deep learning.
Bayesian Deep Learning and a Probabilistic Perspective of Generalization
The paper "Bayesian Deep Learning and a Probabilistic Perspective of Generalization" by Andrew Gordon Wilson and Pavel Izmailov explores critical aspects of Bayesian inference in the context of deep learning. The authors present a thorough examination of how Bayesian methods can enhance the understanding and generalization capabilities of neural networks. Their approach provides a probabilistic perspective that challenges conventional views, emphasizing Bayesian marginalization over classical approaches focused on optimization.
Key Contributions
The central premise of the paper is that Bayesian marginalization, rather than optimization, is key to improving accuracy and calibration in deep neural networks. The authors argue that the posterior over a network's weights typically contains many different high-performing solutions, and that marginalizing over these solutions, rather than committing to a single one, yields better predictions. This view moves beyond measuring model capacity by parameter count and instead emphasizes the support of the model and its inductive biases.
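Concretely, the Bayesian model average the authors advocate is the predictive distribution obtained by integrating over the posterior on the weights (standard notation, with $w$ the network weights and $\mathcal{D}$ the training data):

$$
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw
$$

The methods discussed below, deep ensembles and MultiSWAG, can be read as different Monte Carlo approximations to this integral.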
Deep Ensembles as Bayesian Approximations
A significant contribution of the paper is the interpretation of deep ensembles as a practical approximation of Bayesian model averaging (BMA). The authors argue that deep ensembles, though often viewed as a non-Bayesian alternative, inherently perform a form of Bayesian marginalization by capturing multiple basins of attraction in the loss landscape. This can yield a better approximation of the Bayesian predictive distribution than methods that marginalize within a single basin.
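As an illustration (not the authors' code), a deep ensemble can be read as a crude Monte Carlo approximation of the integral above: train several independently initialized networks and average their predictive distributions. The helper names below (`make_model`, `train_loader`, `train_one_model`) are placeholders for a standard PyTorch setup.

```python
# Minimal sketch of deep ensembles as an approximate Bayesian model average.
# Independently initialized and trained networks land in different basins of
# attraction; averaging their softmax outputs approximates marginalization.
import torch
import torch.nn.functional as F

def train_one_model(make_model, train_loader, epochs=10, lr=0.1, device="cpu"):
    model = make_model().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model

def deep_ensemble_predict(models, x):
    # Average predictive distributions, not logits: this is the Monte Carlo
    # analogue of integrating over a (multimodal) posterior.
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)

# Usage: ensemble = [train_one_model(make_model, train_loader) for _ in range(5)]
#        p = deep_ensemble_predict(ensemble, x_test)
```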
Building on this, the authors introduce MultiSWAG, which forms a Gaussian mixture posterior approximation from multiple independently trained SWAG (Gaussian) approximations. By marginalizing both within and across basins of attraction, MultiSWAG captures the multimodal structure of the posterior and improves predictive accuracy and calibration.
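A minimal sketch of the prediction step is shown below, assuming each basin has already been summarized by a diagonal Gaussian over the flattened weights; the actual SWAG approximation also includes a low-rank covariance term, and `model` and `basins` here are illustrative inputs rather than the paper's implementation.

```python
# Illustrative MultiSWAG-style prediction: a mixture of Gaussians, one
# (diagonal, for brevity) Gaussian per independently trained basin.
import torch
import torch.nn.functional as F

def set_flat_weights(model, flat):
    # Copy a flat weight vector back into the model's parameters.
    offset = 0
    for p in model.parameters():
        n = p.numel()
        p.data.copy_(flat[offset:offset + n].view_as(p))
        offset += n

def multiswag_predict(model, basins, x, samples_per_basin=20):
    probs = []
    with torch.no_grad():
        for mean, var in basins:                     # one (mean, var) per basin
            for _ in range(samples_per_basin):
                w = mean + var.sqrt() * torch.randn_like(mean)  # sample weights
                set_flat_weights(model, w)           # load the sample into the model
                probs.append(F.softmax(model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)            # Bayesian model average
```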
Generalization and Double Descent
The paper further investigates generalization from a probabilistic standpoint, highlighting the inadequacy of one-dimensional measures such as parameter counting. Instead, it emphasizes the roles of model support and inductive biases in determining generalization. A notable discussion centers on double descent, the phenomenon in which test error first falls, then rises as models approach the interpolation threshold, and then falls again as flexibility increases further. The authors show empirically that Bayesian model averaging, particularly through MultiSWAG, can mitigate double descent, yielding test performance that improves monotonically with model size.
Neural Network Priors
The paper also examines neural network priors, in particular how a prior over weights translates into a distribution over functions. It argues, with supporting analysis and experiments, that even vague Gaussian priors over the weights induce useful correlation structure over the function values at different inputs, and that these induced correlations reflect reasonable inductive biases for common learning tasks.
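To make this concrete, one can sample functions from a small network under an isotropic Gaussian prior on the weights and inspect the induced correlations between function values at different inputs. The sketch below uses an illustrative one-dimensional MLP and prior scale `alpha`, not the image architectures studied in the paper.

```python
# Sketch: the distribution over functions induced by a vague Gaussian prior
# p(w) = N(0, alpha^2 I) on the weights of a small MLP. We draw prior samples
# and estimate the correlation of function values across inputs.
import torch
import torch.nn as nn

def sample_prior_functions(xs, alpha=1.0, hidden=100, n_samples=500):
    outputs = []
    for _ in range(n_samples):
        net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        with torch.no_grad():
            for p in net.parameters():
                p.normal_(0.0, alpha)       # draw weights from the Gaussian prior
            outputs.append(net(xs).squeeze(-1))
    return torch.stack(outputs)             # shape: (n_samples, n_inputs)

xs = torch.linspace(-3, 3, 50).unsqueeze(-1)
f = sample_prior_functions(xs)
corr = torch.corrcoef(f.T)                  # input-input correlation under the prior
```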
Practical and Theoretical Implications
From a practical standpoint, the research indicates that Bayesian methods such as deep ensembles and the proposed MultiSWAG can deliver meaningful gains in accuracy and calibration at a computational cost comparable to training a standard ensemble. Theoretically, it reinforces the view that a Bayesian framework can explain how highly flexible modern neural networks nonetheless generalize well.
Future Directions
The paper suggests that future research should further refine the approximation of BMA in deep learning, exploring more sophisticated methods for navigating complex posterior landscapes. Additionally, exploring diverse priors and continuing to integrate Bayesian ideas with machine learning architectures remain critical areas for development.
Conclusion
Wilson and Izmailov's work presents a compelling narrative on the substantial role of Bayesian inference in understanding and improving generalization in deep learning. By addressing the intricacies of Bayesian marginalization and offering pragmatic solutions like MultiSWAG, the paper sets a new direction for both theoretical exploration and practical application in machine learning.