Wasserstein Gradient Boosting: A Framework for Distribution-Valued Supervised Learning (2405.09536v2)
Abstract: Gradient boosting is a sequential ensemble method that fits a new weak learner to pseudo-residuals at each iteration. We propose Wasserstein gradient boosting, a novel extension of gradient boosting that fits a new weak learner to alternative pseudo-residuals: the Wasserstein gradients of loss functionals of the probability distributions assigned to each input. It addresses distribution-valued supervised learning, where the output value of the training dataset is a probability distribution for each input. In classification and regression, a model typically returns, for each input, a point estimate of a parameter of the noise distribution specified for the response variable, such as the class-probability parameter of a categorical distribution specified for a response label. The main application of Wasserstein gradient boosting in this paper is tree-based evidential learning, which returns a distributional estimate of the response parameter for each input. We empirically demonstrate the superior performance of the probabilistic predictions produced by Wasserstein gradient boosting in comparison with existing uncertainty quantification methods.
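Since the abstract describes the boosting loop only in prose, the following minimal sketch may help fix ideas. It is not the paper's algorithm: the particle representation of each per-input distribution, the one-dimensional particles, the `target_score` helper, and the omission of any diffusion or repulsion term in the Wasserstein gradient are all simplifying assumptions made for illustration.

```python
# Sketch: boost regression trees against Wasserstein-gradient pseudo-residuals.
# Each training input carries a cloud of particles approximating its output
# distribution; every round fits one weak learner per particle coordinate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def wgb_sketch(X, target_score, n_particles=10, n_rounds=50, lr=0.1, seed=0):
    """X: (n, d) inputs; target_score(x, theta): score d/d theta log p_x(theta)
    of the per-input target distribution (assumed available in this sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    particles = rng.normal(size=(n, n_particles))  # 1-D particle cloud per input
    trees = []
    for _ in range(n_rounds):
        # Pseudo-residuals: for a potential-energy loss functional, the
        # Wasserstein gradient evaluated at a particle is the target score
        # there (the entropy/repulsion term is dropped for brevity).
        residuals = np.array([[target_score(X[i], particles[i, j])
                               for j in range(n_particles)] for i in range(n)])
        round_trees = []
        for j in range(n_particles):
            tree = DecisionTreeRegressor(max_depth=3)
            tree.fit(X, residuals[:, j])             # weak learner on residuals
            particles[:, j] += lr * tree.predict(X)  # push particles along the fit
            round_trees.append(tree)
        trees.append(round_trees)
    return particles, trees


# Hypothetical usage: each input's target is N(sum(x), 1), so the score at a
# particle theta is sum(x) - theta; the particles drift toward that target.
if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(200, 3))
    particles, trees = wgb_sketch(X, lambda x, th: x.sum() - th)
```

The design choice worth noting is that, unlike standard gradient boosting, the pseudo-residuals here are indexed by particle as well as by input, so the ensemble maps each input to an updated set of particles rather than to a single point prediction.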