- The paper scrutinizes Bayesian posteriors in DNNs, showing that the untempered Bayes posterior predictive underperforms simple SGD point estimates.
- It leverages SG-MCMC techniques and diagnostic tools to evaluate inference accuracy while addressing biases from mini-batch noise and sampling methods.
- The study highlights the 'cold posterior effect' and suggests that rethinking prior selection could enhance predictive performance in deep learning.
Evaluating the Bayes Posterior in Deep Neural Networks
The paper "How Good is the Bayes Posterior in Deep Neural Networks Really?" rigorously questions the conventional understanding of Bayesian inference within deep neural networks by empirically examining the efficacy of Bayes posteriors compared to other estimation techniques such as Stochastic Gradient Descent (SGD). It critically challenges the grounds on which Bayesian neural networks (BNNs) have been justified, especially regarding posterior predictive performance.
Key Insights and Methodology
The authors use Markov chain Monte Carlo (MCMC) techniques, specifically stochastic-gradient MCMC (SG-MCMC), to sample from the Bayes posterior of deep neural networks. Doing so reveals a striking observation: the Bayes posterior predictive underperforms simpler point estimates obtained via SGD. The paper highlights an intriguing phenomenon, termed the "cold posterior effect": lowering the posterior's temperature, thereby deviating from strict Bayesian inference, significantly improves predictive accuracy.
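Concretely, the tempered posterior at temperature T can be written as below (a minimal restatement of the standard formulation; T = 1 recovers the usual Bayes posterior, while T < 1 yields a "cold" posterior):

```latex
U(\theta) = -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) - \log p(\theta),
\qquad
p_T(\theta \mid D) \propto \exp\!\left(-U(\theta)/T\right)
```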
To dissect this phenomenon, the paper examines several candidate explanations:
- Inference Accuracy: It investigates whether the sampling scheme, SG-MCMC, could yield biased inference due to gradient noise or discretization error. Diagnostics based on kinetic and configurational temperature estimates indicate that SG-MCMC simulates the posterior dynamics without significant bias (a sketch of one such diagnostic appears after this list).
- Posterior Bias: A key hypothesis is that the absence of a Metropolis accept-reject step (present in HMC but omitted in SG-MCMC) introduces bias. Comparisons between HMC and SG-MCMC show comparable performance, ruling out this hypothesis.
- Mini-batch Noise: Noise in mini-batch gradient estimates was considered as a source of inaccurate sampling, but batch-size ablations show that the cold posterior effect persists across a wide range of batch sizes.
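To make the temperature parameter and the diagnostic concrete, here is a minimal Python sketch of a tempered SGLD update and a configurational-temperature check. The paper itself uses a more elaborate SG-MCMC scheme (with momentum, preconditioning, and layer-wise diagnostics); the function names and arguments below are illustrative assumptions, not the authors' code.

```python
import torch

def sgld_step(params, grad_U, lr, temperature=1.0):
    """One SGLD update targeting the tempered posterior exp(-U(theta)/T).

    grad_U: mini-batch estimates of the gradient of the posterior energy
        U(theta) = -sum_i log p(y_i | x_i, theta) - log p(theta),
        rescaled to the full dataset size.
    temperature=1.0 samples the standard Bayes posterior; temperature < 1
    gives a "cold" posterior (the regime where the paper observes better
    predictive performance).
    """
    with torch.no_grad():
        for p, g in zip(params, grad_U):
            noise = torch.randn_like(p) * (2.0 * lr * temperature) ** 0.5
            p.add_(-lr * g + noise)

def configurational_temperature(params, grad_U):
    """One standard configurational-temperature estimator, <theta, grad U> / d.

    If the sampler accurately simulates the intended (tempered) posterior,
    this estimate should match the target temperature; systematic deviations
    point to discretization or mini-batch-noise bias.
    """
    inner = sum((p.detach() * g).sum() for p, g in zip(params, grad_U))
    dim = sum(p.numel() for p in params)
    return (inner / dim).item()
```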
Prior and Likelihood Considerations
The analysis also extends to questioning the role of the prior distributions used. Standard Gaussian priors may inadvertently concentrate probability mass on undesirable functions. Experiments that vary the prior variance and examine the prior's influence on the predictive output suggest that the commonly assumed standard normal prior is inadequate.
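As an illustration of what "adjusting the prior variance" amounts to in practice, the sketch below adds the energy of an isotropic Gaussian prior N(0, sigma^2 I) to the posterior energy; sigma is an illustrative knob here, not the paper's exact experimental grid.

```python
def gaussian_prior_energy(params, sigma=1.0):
    """Negative log density (up to a constant) of an isotropic Gaussian
    prior N(0, sigma^2 I) over all parameter tensors.

    This term is added to the negative log-likelihood when forming the
    posterior energy U(theta). sigma = 1 is the standard normal prior the
    paper questions; smaller sigma concentrates prior mass near zero.
    """
    return sum((p ** 2).sum() for p in params) / (2.0 * sigma ** 2)
```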
Furthermore, the paper asks whether modern deep learning practices and their effect on the likelihood function, such as batch normalization, dropout, and similar techniques, could contribute to the observed deviations from the Bayes posterior.
Implications and Future Directions
The results carry significant implications on both theoretical and practical fronts. Theoretically, they suggest a need to rethink the role of Bayesian inference in deep learning frameworks relative to traditional probabilistic models.
Practically, the exploration of "cold posteriors" opens avenues to improve predictive accuracy further without strictly adhering to Bayesian formulations. It suggests that more informed prior selection and potentially revised likelihood formulations could bridge the gap currently observed in Bayesian deep learning.
This paper is pivotal in that it prompts a reexamination of the assumptions built into Bayesian neural networks and calls for a more nuanced understanding and application of Bayesian principles in high-capacity models such as deep neural networks. Further inquiry may focus on constructing more expressive priors, or on hybrid paradigms that retain the desirable properties of Bayesian inference while ensuring robust predictive performance.