Can a Confident Prior Replace a Cold Posterior? (2403.01272v1)
Abstract: Benchmark datasets used for image classification tend to have very low levels of label noise. When Bayesian neural networks are trained on these datasets, they often underfit, misrepresenting the aleatoric uncertainty of the data. A common solution is to cool the posterior, which improves fit to the training data but is challenging to interpret from a Bayesian perspective. We explore whether posterior tempering can be replaced by a confidence-inducing prior distribution. First, we introduce a "DirClip" prior that is practical to sample and nearly matches the performance of a cold posterior. Second, we introduce a "confidence prior" that directly approximates a cold likelihood in the limit of decreasing temperature but cannot be easily sampled. Lastly, we provide several general insights into confidence-inducing priors, such as when they might diverge and how fine-tuning can mitigate numerical instability.
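For context, the "cold posterior" and "posterior tempering" referenced in the abstract follow the standard formulation from the cold-posterior literature; the sketch below states it in the usual notation. The symbols T, p(θ), and p(D|θ) are standard background assumptions, not definitions taken from this abstract.

```latex
% Tempered ("cold") posterior with temperature T < 1: a standard
% formulation from the cold-posterior literature, not a definition
% quoted from this paper.
p_T(\theta \mid \mathcal{D}) \;\propto\;
    \bigl[\, p(\mathcal{D} \mid \theta)\, p(\theta) \,\bigr]^{1/T},
    \qquad T < 1.
% The question the paper poses is whether the sharpening effect of the
% exponent 1/T can instead be obtained at T = 1 by replacing p(\theta)
% with a confidence-inducing prior (e.g., the "DirClip" prior above).
```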