Extrapolated cross-validation for randomized ensembles (2302.13511v3)
Abstract: Ensemble methods such as bagging and random forests are ubiquitous in various fields, from finance to genomics. Despite their prevalence, the question of how to efficiently tune ensemble parameters has received relatively little attention. This paper introduces a cross-validation method, ECV (Extrapolated Cross-Validation), for tuning the ensemble and subsample sizes in randomized ensembles. Our method builds on two primary ingredients: initial risk estimators for small ensemble sizes based on out-of-bag errors, and a novel risk extrapolation technique that leverages the structure of the prediction risk decomposition. By establishing uniform consistency of our risk extrapolation technique over ensemble and subsample sizes, we show that ECV yields $\delta$-optimal (with respect to the oracle-tuned risk) ensembles for squared prediction risk. Our theory accommodates general ensemble predictors, requires only mild moment assumptions, and allows for high-dimensional regimes where the feature dimension grows with the sample size. As a practical case study, we employ ECV to predict surface protein abundances from gene expressions in single-cell multiomics using random forests. Compared with sample-split cross-validation and $K$-fold cross-validation, ECV achieves higher accuracy while avoiding sample splitting, and its computational cost is considerably lower owing to the risk extrapolation technique. Additional numerical results validate the finite-sample accuracy of ECV for several common ensemble predictors under a computational constraint on the maximum ensemble size.
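To make the extrapolation step concrete, below is a minimal Python sketch of the risk-extrapolation idea, not the paper's implementation. It assumes the squared prediction risk of a size-$M$ ensemble decomposes as $R_M = a + b/M$, so that out-of-bag (OOB) risk estimates at $M = 1$ and $M = 2$ pin down the entire risk curve. The base learner (`DecisionTreeRegressor`), the pilot ensemble size `n_pilot`, and the name `ecv_risk_curve` are illustrative choices, not names from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ecv_risk_curve(X, y, subsample_size, n_pilot=20, seed=0):
    """Extrapolate the squared OOB risk of a bagged tree ensemble to any
    ensemble size M, assuming the decomposition R_M = a + b / M."""
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_pred = np.full((n_pilot, n), np.nan)  # row b: OOB predictions of tree b

    # Fit a small pilot ensemble on random subsamples (without replacement).
    for b in range(n_pilot):
        in_bag = rng.choice(n, size=subsample_size, replace=False)
        oob = np.setdiff1d(np.arange(n), in_bag)
        tree = DecisionTreeRegressor(random_state=seed + b)
        tree.fit(X[in_bag], y[in_bag])
        oob_pred[b, oob] = tree.predict(X[oob])

    # R_1: mean OOB squared error of the individual (size-1) predictors.
    r1 = np.nanmean((oob_pred - y) ** 2)

    # R_2: mean OOB squared error over all size-2 sub-ensembles. Averaging
    # two rows leaves NaN wherever either tree saw the point in-bag, and
    # nanmean skips those entries.
    pair_errs = [
        np.nanmean((0.5 * (oob_pred[i] + oob_pred[j]) - y) ** 2)
        for i in range(n_pilot)
        for j in range(i + 1, n_pilot)
    ]
    r2 = float(np.mean(pair_errs))

    # Solving R_1 = a + b and R_2 = a + b/2 gives a = 2*R_2 - R_1 and
    # b = 2*(R_1 - R_2), hence the extrapolated risk curve below.
    return lambda M: (2 * r2 - r1) + 2 * (r1 - r2) / M
```

Given a tolerance $\delta$ and a budget on the maximum ensemble size, one would then pick the smallest $M$ with `risk(M) - risk(float('inf')) <= delta`, and tune the subsample size by repeating the pilot fit over a grid of `subsample_size` values. This is where the computational savings come from: no ensemble larger than the small pilot ensemble needs to be trained during tuning.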