Are Ensembles Getting Better all the Time? (2311.17885v2)
Abstract: Ensemble methods combine the predictions of several base models. We study whether including more models always improves their average performance. This question depends on the kind of ensemble considered, as well as the predictive metric chosen. We focus on situations where all members of the ensemble are a priori expected to perform equally well, which is the case for several popular methods such as random forests or deep ensembles. In this setting, we show that ensembles are getting better all the time if, and only if, the considered loss function is convex. More precisely, in that case, the average loss of the ensemble is a decreasing function of the number of models. When the loss function is nonconvex, we show a series of results that can be summarised as: ensembles of good models keep getting better, and ensembles of bad models keep getting worse. To this end, we prove a new result on the monotonicity of tail probabilities that may be of independent interest. We illustrate our results on a medical prediction problem (diagnosing melanomas using neural nets) and a "wisdom of crowds" experiment (guessing the ratings of upcoming movies).
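
As a quick numerical illustration of the convex-loss claim (a hedged sketch, not the paper's proof or exact setting), the following Python snippet simulates i.i.d. base predictions around a fixed target and checks that the expected squared-error loss of the averaging ensemble does not increase with the number of members. The Gaussian noise model, noise scale, and simulation sizes are illustrative assumptions, not taken from the paper.

```python
# Hedged Monte Carlo sketch: with identically distributed base predictions and a
# convex loss (here, squared error), the average loss of the n-member averaging
# ensemble should be non-increasing in n.
import numpy as np

rng = np.random.default_rng(0)
truth = 1.0                        # quantity to predict (illustrative)
n_models, n_trials = 10, 200_000   # hypothetical simulation sizes

# Each base model's prediction: truth plus identically distributed noise.
preds = truth + rng.normal(scale=1.0, size=(n_trials, n_models))

for n in range(1, n_models + 1):
    ensemble = preds[:, :n].mean(axis=1)      # n-member averaging ensemble
    loss = np.mean((ensemble - truth) ** 2)   # convex (squared-error) loss
    print(f"n={n:2d}  mean squared loss ~ {loss:.4f}")
```

With a nonconvex metric instead (for example, a 0-1 loss on thresholded predictions), the same experiment need not be monotone, which is consistent with the abstract's summary that ensembles of good models keep getting better while ensembles of bad models keep getting worse.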