
Extrapolated cross-validation for randomized ensembles (2302.13511v3)

Published 27 Feb 2023 in stat.ME and stat.ML

Abstract: Ensemble methods such as bagging and random forests are ubiquitous in various fields, from finance to genomics. Despite their prevalence, the question of the efficient tuning of ensemble parameters has received relatively little attention. This paper introduces a cross-validation method, ECV (Extrapolated Cross-Validation), for tuning the ensemble and subsample sizes in randomized ensembles. Our method builds on two primary ingredients: initial estimators for small ensemble sizes using out-of-bag errors and a novel risk extrapolation technique that leverages the structure of prediction risk decomposition. By establishing uniform consistency of our risk extrapolation technique over ensemble and subsample sizes, we show that ECV yields $\delta$-optimal (with respect to the oracle-tuned risk) ensembles for squared prediction risk. Our theory accommodates general ensemble predictors, requires only mild moment assumptions, and allows for high-dimensional regimes where the feature dimension grows with the sample size. As a practical case study, we employ ECV to predict surface protein abundances from gene expressions in single-cell multiomics using random forests. Compared to sample-split cross-validation and $K$-fold cross-validation, ECV achieves higher accuracy while avoiding sample splitting. At the same time, its computational cost is considerably lower owing to the use of the risk extrapolation technique. Additional numerical results validate the finite-sample accuracy of ECV for several common ensemble predictors under a computational constraint on the maximum ensemble size.
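The extrapolation step rests on the fact that, for squared loss, the prediction risk of an $M$-ensemble of conditionally i.i.d. base predictors is linear in $1/M$, i.e. $R_M = R_\infty + (R_1 - R_\infty)/M$, so out-of-bag estimates of $R_1$ and $R_2$ from a small pilot ensemble determine the entire risk curve. The sketch below illustrates this idea for a subsampled tree ensemble; it is a minimal illustration under these assumptions, not the authors' implementation, and the helper names, the subsampling scheme, and the pairwise out-of-bag estimate of $R_2$ are choices made for the example.

```python
# Minimal sketch of the risk-extrapolation idea behind ECV (illustrative, not the paper's code).
# Assumes the 1/M decomposition R_M = R_inf + (R_1 - R_inf)/M for squared prediction risk.
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_subsampled_trees(X, y, n_trees=10, subsample=0.5):
    """Fit a small pilot ensemble of trees on random subsamples; record out-of-bag masks."""
    n = len(y)
    k = int(subsample * n)
    trees, oob_masks = [], []
    for _ in range(n_trees):
        idx = rng.choice(n, size=k, replace=False)
        mask = np.ones(n, dtype=bool)
        mask[idx] = False                      # True where the point is out-of-bag for this tree
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        oob_masks.append(mask)
    return trees, oob_masks

def ecv_risk_estimates(trees, oob_masks, X, y):
    """Out-of-bag estimates of the one-tree risk R_1 and the two-tree risk R_2."""
    # R_1: squared error of each single tree on its own out-of-bag points, averaged over trees
    r1 = np.mean([np.mean((y[m] - t.predict(X[m])) ** 2)
                  for t, m in zip(trees, oob_masks)])
    # R_2: squared error of averaged tree pairs on points out-of-bag for both trees
    pair_errs = []
    for (t_a, m_a), (t_b, m_b) in combinations(list(zip(trees, oob_masks)), 2):
        both = m_a & m_b
        if both.any():
            pred = 0.5 * (t_a.predict(X[both]) + t_b.predict(X[both]))
            pair_errs.append(np.mean((y[both] - pred) ** 2))
    return r1, np.mean(pair_errs)

def extrapolate_risk(r1, r2, M):
    """Extrapolate the squared risk to ensemble size M via R_M = R_inf + (R_1 - R_inf)/M,
    with the plateau estimated as R_inf = 2*R_2 - R_1."""
    r_inf = 2.0 * r2 - r1
    return r_inf + (r1 - r_inf) / M

# Toy usage: extrapolate the risk curve from a 10-tree pilot ensemble.
X = rng.normal(size=(500, 20))
y = X[:, 0] + rng.normal(size=500)
trees, masks = fit_subsampled_trees(X, y, n_trees=10)
r1, r2 = ecv_risk_estimates(trees, masks, X, y)
print({M: round(extrapolate_risk(r1, r2, M), 3) for M in (1, 2, 10, 50, 100)})
```

ECV also tunes the subsample size; the sketch fixes the subsample fraction and extrapolates only over the ensemble size $M$, for instance selecting the smallest $M$ whose extrapolated risk falls within a tolerance $\delta$ of the estimated plateau $\hat R_\infty$.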
