Prediction Error Estimation in Random Forests (2309.00736v4)
Abstract: In this paper, error estimates of classification Random Forests are quantitatively assessed. Building on the theoretical framework of Bates et al. (2023), the true error rate and the expected error rate are investigated theoretically and empirically for a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimates of prediction error are closer on average to the true error rate than to the average prediction error. This is the opposite of the findings of Bates et al. (2023), which are given for logistic regression. We further show that our result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.
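The two estimation targets contrasted in the abstract can be written out explicitly. The block below is a sketch in the style of Bates et al. (2023); the symbols ($D$, $\hat f_D$, $\mathrm{Err}_{XY}$, $\mathrm{Err}$, $\widehat{\mathrm{Err}}$) and the use of squared distance to express "closer on average" are assumptions made for illustration, since the abstract fixes neither the notation nor the distance measure.

```latex
% Assumed notation: D = {(x_i, y_i)}_{i=1}^n is the training set,
% \hat f_D is the Random Forest fitted on D, and (X_0, Y_0) is an
% independent test point drawn from the same distribution.

% True error rate: misclassification probability of the fitted forest,
% conditional on the training set actually observed.
\mathrm{Err}_{XY} = \Pr\bigl( Y_0 \neq \hat f_D(X_0) \,\big|\, D \bigr)

% Expected (average) error rate: the true error rate averaged over
% all training sets of size n.
\mathrm{Err} = \mathbb{E}_D\bigl[ \mathrm{Err}_{XY} \bigr]

% One possible reading of the abstract's claim, for an estimate
% \widehat{\mathrm{Err}} from cross-validation, out-of-bag samples,
% or a held-out split (the squared-error criterion is an assumption):
\mathbb{E}\bigl[ (\widehat{\mathrm{Err}} - \mathrm{Err}_{XY})^2 \bigr]
  \;<\;
\mathbb{E}\bigl[ (\widehat{\mathrm{Err}} - \mathrm{Err})^2 \bigr]
```

Under this reading, Bates et al. (2023) report the reverse inequality for logistic regression, which is the contrast the abstract draws.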
- Cross-validation: what does it estimate and how well does it do it? Journal of the American Statistical Association.
- Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.
- Bylander, T. (2002). Estimating generalization error on two-class datasets using out-of-bag estimates. Machine Learning, 48(1–3):287–297.
- Faraway, J. J. (2014). Does data splitting improve prediction? Statistics and Computing, 26(1–2):49–60.
- Random forests: some methodological insights.
- An application of random forests to a genome-wide association dataset: Methodological considerations & new findings. BMC Genetics, 11:49.
- Random forests for genetic association studies. Statistical Applications in Genetics and Molecular Biology, 10(1).
- The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.
- An Introduction to Statistical Learning: with Applications in R. Springer.
- On the overestimation of random forest’s out-of-bag error. PLOS ONE, 13(8):1–31.
- Kaggle (2017). The state of data science & machine learning.
- Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research, 17(26):1–41.
- Mitchell, M. (2011). Bias of the random forest out-of-bag (OOB) error for certain input parameters. Open Journal of Statistics, 1:205–211.
- Confidence intervals for the generalisation error of random forests.
- Yousef, W. A. (2019). A leisurely look at versions and variants of the cross validation estimator.
- Out-of-bag estimation of the optimal hyperparameter in subbag ensemble method. Communications in Statistics - Simulation and Computation, 39(10):1877–1892.