
Can a Confident Prior Replace a Cold Posterior? (2403.01272v1)

Published 2 Mar 2024 in cs.LG and stat.ML

Abstract: Benchmark datasets used for image classification tend to have very low levels of label noise. When Bayesian neural networks are trained on these datasets, they often underfit, misrepresenting the aleatoric uncertainty of the data. A common solution is to cool the posterior, which improves fit to the training data but is challenging to interpret from a Bayesian perspective. We explore whether posterior tempering can be replaced by a confidence-inducing prior distribution. First, we introduce a "DirClip" prior that is practical to sample and nearly matches the performance of a cold posterior. Second, we introduce a "confidence prior" that directly approximates a cold likelihood in the limit of decreasing temperature but cannot be easily sampled. Lastly, we provide several general insights into confidence-inducing priors, such as when they might diverge and how fine-tuning can mitigate numerical instability.

References (27)
  1. Cold posteriors and aleatoric uncertainty. arXiv preprint arXiv:2008.00029, 2020.
  2. Aitchison, L. A statistical theory of cold posteriors in deep neural networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Rd138pWXMvG.
  3. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  4. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, 1998.
  5. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691. PMLR, 2014.
  6. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395, 2020.
  7. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
  8. Bayesian neural network priors revisited. In Third Symposium on Advances in Approximate Bayesian Inference, 2021. URL https://openreview.net/forum?id=xaqKWHcoOGP.
  9. Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems, 31, 2018.
  10. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  11. Dangers of Bayesian model averaging under covariate shift. Advances in Neural Information Processing Systems, 34:3309–3322, 2021a.
  12. What are Bayesian neural network posteriors really like? In International Conference on Machine Learning, pp. 4629–4640. PMLR, 2021b.
  13. On uncertainty, tempering, and data augmentation in Bayesian classification. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=pBJe5yu41Pq.
  14. Learning multiple layers of features from tiny images. 2009.
  15. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
  16. MacKay, D. J. C. Bayesian methods for adaptive models. California Institute of Technology, 1992.
  17. Data augmentation in Bayesian neural networks and the cold posterior effect. In Uncertainty in Artificial Intelligence, pp. 1434–1444. PMLR, 2022.
  18. Neal, R. M. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
  19. Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
  20. Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect. Advances in Neural Information Processing Systems, 34:12738–12748, 2021.
  21. Should we learn most likely functions or parameters? In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=9EndFTDiqh.
  22. Do Bayesian neural networks need to be fully stochastic? In Ruiz, F., Dy, J., and van de Meent, J.-W. (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 7694–7722. PMLR, 2023. URL https://proceedings.mlr.press/v206/sharma23a.html.
  23. Tu, K. Modified Dirichlet distribution: Allowing negative parameters to induce stronger sparsity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1986–1991, 2016.
  24. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.
  25. How good is the Bayes posterior in deep neural networks really? In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10248–10259. PMLR, 2020. URL https://proceedings.mlr.press/v119/wenzel20a.html.
  26. Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 33:4697–4708, 2020.
  27. If there is no underfitting, there is no cold posterior effect. arXiv preprint arXiv:2310.01189, 2023.

Summary

  • The paper introduces the DirClip prior, which bounds the prior density to yield a valid posterior and effectively replicate cold posterior performance.
  • The study demonstrates that confidence-inducing priors can control aleatoric uncertainty in BNNs, matching cold-posterior accuracy on datasets like CIFAR-10.
  • The research bridges theoretical insights and practical results, highlighting the potential and challenges of replacing cold posteriors with confidence priors.

Exploring Confidence-Inducing Priors as an Alternative to Cold Posteriors in Bayesian Neural Networks

Introduction to Confidence-Inducing Priors

Tempering the posterior of Bayesian neural networks (BNNs) to improve fit to training data has been a significant focus in the field. Traditionally, when BNNs are trained on datasets with low label noise, the posterior is cooled to mitigate underfitting. However, this practice has faced scrutiny for deviating from the Bayes posterior: the tempered softmax likelihood no longer defines a valid distribution over classes, raising the need for an alternative that stays within the Bayesian framework. This investigation proposes confidence-inducing prior distributions as a potential replacement for posterior tempering, expanding upon the work of Kapoor et al. (2022) on using a Dirichlet prior to control aleatoric uncertainty in BNNs.
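
To fix notation for what follows, a cold posterior raises the entire Bayes posterior to a power 1/T with T < 1 (Wenzel et al., 2020):

$$
p_T(\theta \mid \mathcal{D}) \;\propto\; \bigl[\, p(\mathcal{D} \mid \theta)\, p(\theta) \,\bigr]^{1/T}, \qquad T < 1.
$$

Because the tempered softmax likelihood $p(y \mid x, \theta)^{1/T}$ no longer sums to one over classes, the cold posterior is difficult to interpret as exact Bayesian inference; this is the gap that confidence-inducing priors aim to close.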

DirClip Prior: Bridging the Gap

The DirClip prior is a novel modification of the traditional Dirichlet prior, aimed at addressing the numerical instability caused by the latter's unbounded density. By clipping (bounding) the prior density, a valid posterior distribution is obtained while the model's level of aleatoric uncertainty remains under control. This strategy nearly matches the performance of cold posteriors without deviating from the Bayesian perspective, suggesting that the DirClip prior can serve as a viable, more interpretable alternative to cooling the posterior.
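
To make this concrete, here is a minimal JAX sketch of a DirClip-style log-prior over softmax outputs. The function name, default values, and the exact clipping rule (a single upper bound on the log-density) are illustrative assumptions, not the paper's precise definition:

```python
import jax
import jax.numpy as jnp

def dirclip_log_prior(logits, alpha=0.5, clip_value=50.0):
    """Clipped Dirichlet-style log-density over p = softmax(logits).

    Up to an additive constant, a Dirichlet(alpha) density over the class
    probabilities contributes (alpha - 1) * sum_c log p_c. For alpha < 1
    (including negative alpha, cf. Tu, 2016) this diverges to +inf as any
    p_c -> 0, so we bound it from above to keep the density finite.
    """
    log_probs = jax.nn.log_softmax(logits)                    # (..., C)
    log_density = (alpha - 1.0) * jnp.sum(log_probs, axis=-1)
    return jnp.minimum(log_density, clip_value)               # bounded above

# Confident logits now receive large but finite prior log-density.
print(dirclip_log_prior(jnp.array([[12.0, -6.0, -6.0]])))
```

Bounding the log-density from above keeps the prior density finite near the simplex vertices, where the unclipped Dirichlet density and its gradients blow up; this is the numerical instability that, per the abstract, fine-tuning can also help mitigate.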

Beyond the DirClip Prior: Confidence as Prior

While the DirClip prior offers a practical solution, the paper also conceptualizes a "confidence prior" that directly enforces low aleatoric uncertainty and approximates a cold likelihood in the limit of decreasing temperature. Although the confidence prior cannot be easily sampled, owing to the many local maxima of its density, it enriches the discussion of cold posteriors by suggesting that they approximate inference under a valid prior distribution.
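
One way to see the claimed limit (a sketch; the paper's exact construction may differ) is to factor the cold likelihood into the ordinary likelihood times a label-dependent confidence term, then drop the label dependence:

$$
p(y \mid x, \theta)^{1/T}
= p(y \mid x, \theta)\,\underbrace{p(y \mid x, \theta)^{\frac{1-T}{T}}}_{\text{confidence term}}
\;\approx\; p(y \mid x, \theta)\,\Bigl[\max_{c}\, p(c \mid x, \theta)\Bigr]^{\frac{1-T}{T}}.
$$

The approximation is tight whenever the model's most confident class matches the observed label, and the bracketed factor no longer depends on $y$, so it can be read as a prior that rewards confident predictions; as $T \to 0$ its exponent grows, reproducing the cold-likelihood behaviour.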

Experimental Insights and Implications

The comparative analysis of the DirClip and confidence priors against traditional cold posteriors offers a critical examination of how aleatoric uncertainty can be tuned in BNNs. Specifically, the DirClip prior achieves accuracy comparable to cold posteriors on benchmark datasets like CIFAR-10, highlighting its potential for practical application. The confidence prior, despite its practical limitations, provides a valuable theoretical lens through which the effectiveness of cold posteriors can be understood.
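
To illustrate what "practical to sample" could look like, the following hypothetical sketch plugs the DirClip-style log-prior from above into a Langevin sampler (in the spirit of SGLD, Welling & Teh, 2011) for a toy softmax-regression model. The paper's experiments use neural networks on CIFAR-10 rather than this toy setup, and every name and hyperparameter below is illustrative:

```python
import jax
import jax.numpy as jnp

def log_posterior(w, x, y, alpha=0.5, clip_value=50.0):
    # Toy softmax regression: logits = x @ w. The prior term reuses the
    # illustrative DirClip-style clipping from the earlier sketch.
    log_probs = jax.nn.log_softmax(x @ w)
    log_lik = jnp.sum(jnp.take_along_axis(log_probs, y[:, None], axis=1))
    log_prior = jnp.sum(
        jnp.minimum((alpha - 1.0) * jnp.sum(log_probs, axis=1), clip_value))
    return log_lik + log_prior

@jax.jit
def langevin_step(key, w, x, y, step_size=1e-4):
    # One full-batch Langevin update: gradient ascent on the log-posterior
    # plus Gaussian noise with variance matched to the step size. SGLD
    # would use minibatch gradients rescaled by n_data / batch_size.
    grad = jax.grad(log_posterior)(w, x, y)
    noise = jax.random.normal(key, w.shape)
    return w + 0.5 * step_size * grad + jnp.sqrt(step_size) * noise

# Smoke test on random data: 32 points, 5 features, 3 classes.
key = jax.random.PRNGKey(0)
kx, ky, kchain = jax.random.split(key, 3)
x = jax.random.normal(kx, (32, 5))
y = jax.random.randint(ky, (32,), 0, 3)
w = jnp.zeros((5, 3))
for _ in range(100):
    kchain, sub = jax.random.split(kchain)
    w = langevin_step(sub, w, x, y)
```

Because the clipped log-prior is bounded above, its gradient stays finite everywhere, which is what makes gradient-based samplers like this stable where the unclipped Dirichlet prior would diverge.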

The distinction between prior confidence and posterior confidence is elucidated through experiments, revealing that high confidence in prior samples does not directly translate to high posterior confidence. This observation underscores the complexity of the relationship between prior distributions and their influence on the posterior in BNNs.

Concluding Perspectives

By addressing the limitations associated with cold posteriors and proposing viable alternatives within a Bayesian framework, this research contributes significantly to the ongoing discourse on improving the fit of BNNs to training data. The introduction of the DirClip and confidence priors not only provides practical tools for controlling aleatoric uncertainty but also stimulates further investigation into the theoretical underpinnings of posterior tempering.

As the field of Bayesian deep learning progresses, the insights generated from exploring confidence-inducing priors can guide the development of more interpretable and theoretically grounded methods for uncertainty estimation and model fitting. Future endeavours may benefit from refining these approaches to enhance their practical applicability and from a deeper understanding of the nuanced dynamics between prior and posterior distributions in BNNs.

Acknowledgements and Impact

The substantial computational resources used for this investigation raise important considerations regarding the sustainability of research practices in machine learning. Moving forward, striving for efficient computational methods without compromising the rigor of statistical analysis will be crucial in balancing innovation with environmental responsibility.
