Heterogeneous Random Forest
Abstract: Random forest (RF) is one of the most widely used machine learning methods for classification problems. Its effectiveness hinges on two key factors: the accuracy of the individual trees and the diversity among them. In this study, we introduce heterogeneous RF (HRF), a novel approach designed to enhance tree diversity in a meaningful way. This diversification is achieved by deliberately introducing heterogeneity during tree construction. Specifically, features used for splitting near the root nodes of previous trees are assigned lower weights when the feature subspace of a subsequent tree is constructed. As a result, features that dominated earlier trees are less likely to be selected again, yielding a more diverse set of splitting features across the ensemble. Simulation studies confirm that HRF mitigates the feature-selection bias of the trees within the ensemble, increases ensemble diversity, and performs especially well on datasets with few noise features. To compare HRF against other widely adopted ensemble methods, we evaluated it on 52 datasets, comprising both real-world and synthetic data. HRF achieved higher accuracy than the competing ensemble methods on the majority of datasets.
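The core idea described above — down-weighting features that split near the root of earlier trees when sampling the feature subspace for the next tree — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the specific decay-based weight update, the function names (`hrf_fit`, `hrf_predict`, `node_depths`), and the `decay` parameter are all assumptions introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def node_depths(tree):
    """Depth of every node in a fitted sklearn tree (root = 0)."""
    depths = np.zeros(tree.node_count, dtype=int)
    stack = [(0, 0)]
    while stack:
        node, d = stack.pop()
        depths[node] = d
        left, right = tree.children_left[node], tree.children_right[node]
        if left != -1:  # -1 marks a leaf in children_left/children_right
            stack.append((left, d + 1))
            stack.append((right, d + 1))
    return depths

def hrf_fit(X, y, n_trees=10, mtry=None, decay=0.5):
    """Simplified HRF-style sketch: features that split close to the root of
    earlier trees receive lower sampling weights for later trees."""
    n, p = X.shape
    mtry = mtry or max(1, int(np.sqrt(p)))
    weights = np.ones(p)  # sampling weight per original feature
    forest = []
    for _ in range(n_trees):
        prob = weights / weights.sum()
        feats = rng.choice(p, size=mtry, replace=False, p=prob)
        boot = rng.integers(0, n, size=n)  # bootstrap sample, as in standard RF
        tree = DecisionTreeClassifier(random_state=0).fit(X[boot][:, feats], y[boot])
        forest.append((tree, feats))
        # Penalize features by split depth: near-root splits (small depth)
        # shrink the weight strongly, deep splits only mildly (assumed rule).
        depths = node_depths(tree.tree_)
        for node, f in enumerate(tree.tree_.feature):
            if f >= 0:  # internal node; leaves are marked with -2
                weights[feats[f]] *= 1 - decay ** (depths[node] + 1)
    return forest

def hrf_predict(forest, X):
    """Majority vote over the trees, each seeing only its own feature subset."""
    votes = np.stack([t.predict(X[:, f]) for t, f in forest])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

# Tiny usage example on synthetic data (training-set accuracy only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = hrf_fit(X, y, n_trees=5)
preds = hrf_predict(forest, X)
```

Because each tree multiplies the weight of its dominant (near-root) features by a factor well below one, those features become unlikely draws for the next tree's subspace, which is the mechanism the abstract credits for the increased ensemble diversity.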