Wasserstein Gradient Boosting: A Framework for Distribution-Valued Supervised Learning (2405.09536v2)

Published 15 May 2024 in stat.ME, cs.LG, and stat.ML

Abstract: Gradient boosting is a sequential ensemble method that fits a new weak learner to pseudo residuals at each iteration. We propose Wasserstein gradient boosting, a novel extension of gradient boosting that fits a new weak learner to alternative pseudo residuals that are Wasserstein gradients of loss functionals of probability distributions assigned at each input. It solves distribution-valued supervised learning, where the output values of the training dataset are probability distributions for each input. In classification and regression, a model typically returns, for each input, a point estimate of a parameter of a noise distribution specified for a response variable, such as the class probability parameter of a categorical distribution specified for a response label. A main application of Wasserstein gradient boosting in this paper is tree-based evidential learning, which returns a distributional estimate of the response parameter for each input. We empirically demonstrate the superior performance of the probabilistic prediction by Wasserstein gradient boosting in comparison with existing uncertainty quantification methods.

Summary

  • The paper introduces WGBoost, a novel extension of gradient boosting that uses Wasserstein gradients to predict full probability distributions.
  • It demonstrates strong performance in conditional density estimation, probabilistic regression benchmarks (negative log likelihood and RMSE), and out-of-distribution detection.
  • The approach enhances uncertainty quantification, offering more reliable predictions for applications in high-stakes domains such as medical diagnostics and autonomous driving.

Introducing Wasserstein Gradient Boosting: Enhancing Predictive Uncertainty in Gradient Boosting

Gradient boosting is a popular machine learning method, especially effective on tabular data. However, traditional gradient boosting techniques typically produce point predictions or probabilistic classifications, with less attention paid to capturing predictive uncertainty. Such uncertainty matters in fields like medical diagnostics and autonomous driving, where knowing how confident a prediction is can make a substantial difference.

What is Wasserstein Gradient Boosting?

The paper presents a new technique called Wasserstein Gradient Boosting (WGBoost). It is an extension of gradient boosting that fits new base learners (typically decision trees) to the Wasserstein gradient of a loss functional defined over probability distributions. Simply put, WGBoost approximates and predicts an entire probability distribution for each input rather than a single point estimate.

This approach addresses distribution-valued supervised learning, including "posterior regression": the training output assigned to each input is itself a probability distribution, such as a posterior over a parameter of the response's noise distribution.
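
To make a distribution-valued target concrete, here is a minimal sketch under an assumed construction (not necessarily the one used in the paper): for each training input, the "label" is a conjugate Gaussian posterior over the response mean given that input's observed response, and what a Wasserstein-gradient step ultimately consumes is that posterior's score function. The name `posterior_target` and the prior and noise settings are illustrative choices.

```python
# Illustrative construction of a distribution-valued training target
# (an assumption for exposition, not necessarily the paper's setup):
# a conjugate Normal posterior over the response mean given one observation.

def posterior_target(y_i, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    # Standard conjugate update for a Gaussian mean with known noise variance.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + y_i / noise_var)
    # Return the score function d/dz log N(z | post_mean, post_var),
    # the quantity consumed by the training sketch further below.
    return lambda z: -(z - post_mean) / post_var
```

In classification, the analogous target would be a posterior over the class probability parameter mentioned in the abstract.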

Key Highlights

General Methodology

WGBoost builds on gradient boosting by:

  1. Introducing a loss functional that measures the divergence between a predicted distribution and a target distribution.
  2. Training base learners to approximate the steepest descent direction (Wasserstein gradient) of this functional.

The algorithm outputs a set of particles at each input that approximates the target distribution there. This makes it well suited to applications that require reliable predictive uncertainty; a minimal, hedged sketch of the training loop follows.
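
As a rough illustration of how such a loop could look in code, the sketch below assumes that each input carries a small set of one-dimensional particles, that the Wasserstein gradient of a KL-type loss is approximated by an SVGD-style update direction built from each input's target score function (for example the one returned by `posterior_target` above), and that one shallow regression tree is fitted per particle per round. None of these choices should be read as the paper's exact algorithm.

```python
# Minimal WGBoost-style training loop (illustrative sketch, not the paper's
# exact algorithm). Each row of `particles` approximates the target
# distribution at the corresponding input; `scores[i]` is the score function
# of that target.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def svgd_direction(z, score, h=1.0):
    # SVGD-style estimate of a steepest-descent direction for 1-D particles z.
    diff = z[:, None] - z[None, :]              # diff[i, j] = z_i - z_j
    k = np.exp(-0.5 * diff**2 / h**2)           # RBF kernel matrix
    attraction = k @ score(z)                   # kernel-smoothed target score
    repulsion = (diff / h**2 * k).sum(axis=1)   # keeps particles spread out
    return (attraction + repulsion) / len(z)

def fit_wgboost(X, scores, n_rounds=50, n_particles=10, lr=0.1, max_depth=2):
    # X: (n, d) feature array; scores: list of n per-input score functions.
    rng = np.random.default_rng(0)
    n = len(X)
    particles = rng.standard_normal((n, n_particles))    # initial outputs
    ensemble = []
    for _ in range(n_rounds):
        # Pseudo-residuals: estimated Wasserstein-gradient step per particle.
        residuals = np.stack([svgd_direction(particles[i], scores[i])
                              for i in range(n)])         # shape (n, K)
        round_trees = []
        for j in range(n_particles):
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals[:, j])                  # tree as weak learner
            particles[:, j] += lr * tree.predict(X)       # boosting update
            round_trees.append(tree)
        ensemble.append(round_trees)
    return ensemble, particles
```

At prediction time, a new input would be passed through every stored tree in the same order, accumulating the learning-rate-scaled outputs to produce its particle set.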

Numerical Results

The paper provides comprehensive empirical results:

  1. Conditional Density Estimation: WGBoost captures the variability in the data, even when the underlying conditional distributions have complex shapes.
  2. Probabilistic Regression Benchmarking: WGBoost often matches or exceeds the performance of other state-of-the-art methods across a variety of datasets, particularly in terms of negative log likelihood (NLL) and root mean square error (RMSE); a hedged sketch of how an NLL can be computed from particle outputs appears after this list.
  3. Classification and Out-of-Distribution (OOD) Detection: WGBoost demonstrates strong classification accuracy while also excelling in OOD detection, a critical capability for identifying when an input sample markedly deviates from the training data.
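
For intuition on how a particle output can be turned into the reported NLL, one plausible recipe (an assumption for illustration, not a description of the paper's exact evaluation) treats each particle as a candidate value of the response mean under a Gaussian noise model with an assumed standard deviation and scores the resulting uniform mixture:

```python
# Hedged sketch: negative log likelihood of a particle-based prediction,
# assuming each particle is a sample of the response mean and the noise
# model is Gaussian with an assumed standard deviation.
import numpy as np
from scipy.stats import norm

def predictive_nll(y_true, particles, noise_std=1.0):
    # y_true: shape (n,); particles: shape (n, K) per-input particle outputs.
    dens = norm.pdf(y_true[:, None], loc=particles, scale=noise_std).mean(axis=1)
    return -np.log(dens + 1e-12).mean()
```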

Practical and Theoretical Implications

Practical Implications

The key benefit of WGBoost is its ability to provide a distributional prediction rather than a single point estimate. This improvement offers:

  • Enhanced robustness: Predictions account for a full distribution of plausible outputs rather than a single value, yielding more reliable downstream decisions.
  • Improved uncertainty estimates: Beneficial for fields where understanding the confidence of predictions is critical (e.g., medical applications).

Theoretical Implications

WGBoost extends gradient boosting by incorporating Wasserstein gradients, opening new avenues for bringing the mathematical machinery of optimal transport into machine learning. This can inspire further research in:

  • Advanced loss functionals: Tailoring them to specific applications.
  • Cross-disciplinary applications: Using WGBoost in fields like computational finance, climate modeling, and more where uncertainty quantification is vital.

Future Developments

Moving forward, WGBoost could see enhancements such as:

  • Hybrid models that integrate other machine learning paradigms.
  • Further scalability improvements to handle larger datasets seamlessly.
  • Expansions in automated machine learning (AutoML) frameworks to leverage WGBoost without manual tuning.

Overall, the paper presents a compelling case for the adoption of Wasserstein Gradient Boosting, highlighting its strengths and potential for future enhancements in predictive modeling. Whether in academia or industry, WGBoost offers a promising pathway to more reliable and interpretable machine learning models.