Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests (2402.12668v2)
Abstract: We study the often overlooked phenomenon, first noted by Breiman (2001), that random forests appear to reduce bias compared to bagging. Motivated by Mentch and Zhou (2020), who argue that random forests reduce effective degrees of freedom and only outperform bagging ensembles in low signal-to-noise ratio (SNR) settings, we explore how random forests can uncover patterns in the data that bagging misses. We demonstrate empirically that, in the presence of such patterns, random forests reduce bias along with variance and increasingly outperform bagging ensembles as the SNR grows. Our observations offer insight into the real-world success of random forests across a range of SNRs and sharpen the understanding of how random forests differ from bagging ensembles with respect to the randomization injected into each split. Our investigations also yield practical insight into the importance of tuning $mtry$ in random forests.
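The comparison at the heart of the abstract can be reproduced in miniature: a random forest whose $mtry$ equals the number of features is exactly a bagging ensemble of trees, so varying $mtry$ isolates the effect of per-split randomization. Below is a minimal sketch, not the paper's experiment, that Monte Carlo estimates squared bias and variance on Friedman's (1991) first regression function using scikit-learn, where `max_features` plays the role of $mtry$; the sample sizes, noise level, and hyperparameter choices are illustrative assumptions.

```python
# Hedged sketch (not the paper's setup): estimate bias^2 and variance of
# bagging (max_features=1.0, i.e. mtry = p) versus a random forest
# (max_features < 1.0) by refitting on fresh noisy training sets and
# comparing mean predictions to the noiseless truth at fixed test points.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def f(X):
    # Friedman #1 regression function (Friedman, 1991); needs p >= 5.
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2 + 10 * X[:, 3] + 5 * X[:, 4])

rng = np.random.default_rng(0)
p, n_train, n_test, sigma, n_reps = 10, 200, 500, 1.0, 30  # illustrative
X_test = rng.uniform(size=(n_test, p))
truth = f(X_test)  # noiseless regression surface at the test points

for max_features in (1.0, 0.3):  # 1.0 ~ bagging; 0.3 ~ random forest
    preds = np.empty((n_reps, n_test))
    for r in range(n_reps):
        # Fresh training set each replication: same surface, new noise.
        X_tr = rng.uniform(size=(n_train, p))
        y_tr = f(X_tr) + sigma * rng.standard_normal(n_train)
        model = RandomForestRegressor(
            n_estimators=100, max_features=max_features, random_state=r
        ).fit(X_tr, y_tr)
        preds[r] = model.predict(X_test)
    bias2 = np.mean((preds.mean(axis=0) - truth) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"max_features={max_features}: bias^2={bias2:.3f}, var={var:.3f}")
```

Rerunning the sketch with a larger noise standard deviation `sigma` lowers the SNR, which is the axis along which the paper contrasts the two ensembles.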
- Predicting IPO initial returns using random forest. Borsa Istanbul Review, 20(1):13–23, 2020.
- Automated trading with performance weighted random forests and seasonality. Expert Systems with Applications, 41(8):3651–3661, 2014.
- Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
- Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
- Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and regression trees. Routledge, 2017.
- Peter Bühlmann and Bin Yu. Analyzing bagging. The Annals of Statistics, 30(4):927–961, 2002.
- Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161–168, 2006.
- Fault diagnosis in spur gears based on genetic algorithm and random forest. Mechanical Systems and Signal Processing, 70:87–103, 2016.
- Ramón Díaz-Uriarte and Sara Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7:1–13, 2006.
- Pedro Domingos. A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning, pages 231–238. Morgan Kaufmann, 2000.
- Pedro M Domingos. Why does bagging work? A Bayesian account and its implications. In KDD, pages 155–158. Citeseer, 1997.
- Bradley Efron and Robert Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, pages 54–75, 1986.
- Jerome H Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.
- Jerome H Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55–77, 1997.
- Yves Grandvalet. Bagging equalizes influence. Machine Learning, 55:251–270, 2004.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
- Trevor Hastie, Robert Tibshirani, and Ryan J Tibshirani. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017.
- Financial fraud detection model: Based on random forest. International Journal of Economics and Finance, 7(7), 2015.
- Classification of intraday S&P500 returns with a random forest. International Journal of Forecasting, 35(1):390–407, 2019.
- Rahul Mazumder. Discussion of “Best subset, forward stepwise or lasso? Analysis and recommendations based on extensive comparisons”. Statistical Science, 35(4), 2020.
- Lucas Mentch and Siyu Zhou. Randomization as regularization: A degrees of freedom explanation for random forest success. The Journal of Machine Learning Research, 21(1):6918–6953, 2020.
- Smartphone-based real-time classification of noise signals using subband features and random forest classifier. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2204–2208. IEEE, 2016.
- Abraham J Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the success of AdaBoost and random forests as interpolating classifiers. The Journal of Machine Learning Research, 18(1):1558–1590, 2017.
- The research progress and prospect of data mining methods on corrosion prediction of oil and gas pipelines. Engineering Failure Analysis, 144:106951, 2023.