On Uncertainty Quantification for Near-Bayes Optimal Algorithms

Published 28 Mar 2024 in stat.ML and cs.LG | arXiv:2403.19381v2

Abstract: Bayesian modelling allows for the quantification of predictive uncertainty, which is crucial in safety-critical applications. Yet for many ML algorithms, it is difficult to construct or implement their Bayesian counterparts. In this work we present a promising approach to address this challenge, based on the hypothesis that commonly used ML algorithms are efficient across a wide variety of tasks and may thus be near Bayes-optimal with respect to an unknown task distribution. We prove that it is possible to recover the Bayesian posterior defined by the task distribution, which is unknown but optimal in this setting, by building a martingale posterior using the algorithm. We further propose a practical uncertainty quantification method that applies to general ML algorithms. Experiments based on a variety of non-NN and NN algorithms demonstrate the efficacy of our method.
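The martingale posterior construction the abstract refers to is also known as predictive resampling: starting from the observed data, one repeatedly samples a hypothetical future observation from the algorithm's current one-step-ahead predictive distribution, refits the algorithm on the augmented data, and treats the terminal fit as one posterior draw; the spread across draws quantifies epistemic uncertainty. Below is a minimal sketch of this idea using a toy 1-D ridge regression as the base algorithm. The learner, the choice of covariate pool, and all constants are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def fit_ridge(x, y, lam=1e-2):
    # Toy base algorithm: 1-D ridge regression through the origin.
    # Returns the fitted slope and a residual noise scale for the predictive.
    w = (x @ y) / (x @ x + lam)
    sigma = np.std(y - w * x) + 1e-6
    return w, sigma

def martingale_posterior(x, y, x_pool, n_forward=100, n_draws=30, seed=0):
    # Illustrative sketch of predictive resampling, not the paper's method:
    # repeatedly extend the data with samples from the algorithm's own
    # predictive, refit, and record the terminal fit as one posterior draw.
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        xs, ys = x.copy(), y.copy()
        for _ in range(n_forward):
            w, sigma = fit_ridge(xs, ys)              # refit on current data
            x_new = rng.choice(x_pool)                # pick a future covariate
            y_new = w * x_new + sigma * rng.normal()  # y ~ current predictive
            xs = np.append(xs, x_new)
            ys = np.append(ys, y_new)
        draws.append(fit_ridge(xs, ys)[0])            # terminal estimate = one draw
    return np.array(draws)                            # spread ~ epistemic uncertainty

# Tiny demo on data generated from y = 2x + noise.
rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = 2.0 * x + 0.3 * rng.normal(size=20)
ws = martingale_posterior(x, y, x_pool=x)
print(f"slope: {ws.mean():.2f} +/- {ws.std():.2f}")
```

Note that in this construction all randomness comes from resampling future observations, so no explicit prior is written down; the base algorithm's predictive rule plays that role, which is what lets the approach wrap general ML algorithms.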
