- The paper introduces guided RRF, enhancing traditional RRF by using preliminary RF importance scores to improve selection accuracy in gene expression data.
- It addresses an RRF limitation by deriving an upper bound on the number of distinct Gini information gain values at a node, explaining misselection in high-dimensional datasets when nodes contain few instances.
- Experimental results show GRRF often produces compact feature subsets with competitive classification performance and lower computational costs compared to alternatives.
An Exploration of Gene Selection Using Guided Regularized Random Forest
The paper "Gene Selection With Guided Regularized Random Forest" by Houtao Deng and George Runger introduces an enhancement to the regularized random forest (RRF) called the guided RRF (GRRF). The objective of this paper is to address feature selection challenges in high-dimensional datasets, specifically gene expression data, by improving upon existing methodologies.
Core Contributions
- Problem Identification with RRF: The authors identify a key limitation in RRF regarding feature evaluation at tree nodes with a small number of instances. Specifically, they derive an upper bound on the number of distinct Gini information gain values at a node, showing that when the number of features far exceeds the number of instances, many features tie on information gain, so a less relevant feature may be wrongly selected over a relevant one.
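To make the tie phenomenon concrete, here is a small, hypothetical sketch (not from the paper): at a node with only a few instances, the number of achievable Gini gain values is tiny, so when hundreds of features are scored at such a node, many unrelated features tie.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_gini_gain(x, y):
    """Best Gini gain over all threshold splits of one feature."""
    parent, n, best = gini(y), len(y), 0.0
    for t in np.unique(x)[:-1]:          # one threshold per gap between distinct values
        left, right = y[x <= t], y[x > t]
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        best = max(best, gain)
    return best

rng = np.random.default_rng(0)
n_instances, n_features = 8, 200          # few instances, many features
X = rng.integers(0, 2, size=(n_instances, n_features))
y = rng.integers(0, 2, size=n_instances)

# Round away float noise, then count distinct gain values across all features.
gains = np.round([best_gini_gain(X[:, j], y) for j in range(n_features)], 12)
print(f"{len(set(gains))} distinct gain values among {n_features} features")
```

With only 8 instances, the split statistics can take only a handful of configurations, so far fewer than 200 distinct gain values appear; any selection rule that breaks such ties arbitrarily risks picking an irrelevant feature.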
- Guided RRF (GRRF) Proposal: The proposed GRRF method uses importance scores from a preliminary RF to guide the feature selection process of RRF. Features with high preliminary importance are penalized less when competing to enter the selected set, which reduces the misselection risk at nodes with few instances. This mitigates the issue above by weighing global feature importance alongside the local node-level gain.
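A minimal sketch of the guided penalty, assuming the coefficient form described in the paper, λ_i = (1 − γ)·λ0 + γ·imp_i with importances normalized to a maximum of 1 and λ0 = 1 in GRRF (function and variable names here are illustrative, not the RRF package's API):

```python
def guided_regularized_gain(gain, idx, selected, norm_imp, gamma=0.5, lam0=1.0):
    """Penalized Gini gain used when comparing candidate split features.

    gain     -- raw Gini gain of feature `idx` at the current node
    selected -- indices of features already used somewhere in the forest
    norm_imp -- preliminary-RF importance scores, normalized so max == 1
    gamma    -- guidance strength (GRRF); gamma = 0 recovers plain RRF with lam0
    """
    if idx in selected:
        return gain                      # already-selected features are never penalized
    lam_i = (1.0 - gamma) * lam0 + gamma * norm_imp[idx]
    return lam_i * gain                  # new features must overcome the penalty

# Toy usage: feature 0 is already selected; feature 1 has high preliminary
# importance, feature 2 low, so feature 2's gain is shrunk far more.
norm_imp = {0: 1.0, 1: 0.9, 2: 0.1}
selected = {0}
for idx in (0, 1, 2):
    print(idx, guided_regularized_gain(0.30, idx, selected, norm_imp, gamma=0.8))
```

A new feature is added to the selected set only when its penalized gain beats the best gain available from features already in the set, so low-importance newcomers rarely enter at sparse nodes.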
- Empirical Evaluation: The authors evaluated GRRF against RRF, varSelRF, and LASSO logistic regression on ten distinct gene expression datasets. GRRF demonstrated robust performance, showing more resilience to parameter changes and achieving competitive accuracy with fewer computational resources compared to varSelRF. Specifically, GRRF often selected smaller feature subsets with accuracy either matching or surpassing alternatives.
Key Findings
- Feature Reduction and Accuracy:
Experimental results indicate that both RRF and GRRF substantially reduce the number of features while maintaining classification accuracy. This was particularly evident when comparing the subset selected by RRF under minimal regularization to the full feature set: an RF trained on the former even outperformed one trained on all features.
- Comparison Against Existing Methods:
Compared to varSelRF and LASSO, GRRF was better at selecting relevant, non-redundant feature subsets. It was also computationally efficient, requiring significantly less processing time than varSelRF because it builds only two ensembles (a preliminary RF and the guided RRF) rather than many iteratively constructed forests.
- Robustness to Model Parameters:
The paper provided insights into the parameter sensitivity of GRRF and RRF, noting a consistent trend: stronger regularization (a smaller λ in RRF, or a larger γ in GRRF) led to smaller feature subsets. The computational experiments also argue for evaluating the selected subsets with a strong classifier such as RF, rather than a weaker one, to fully expose differences in feature selection performance.
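The direction of this sensitivity is visible in the penalty coefficient itself (a sketch under the same assumptions as above, λ0 = 1 and importances normalized to a maximum of 1): raising γ lowers the coefficient for every feature whose preliminary importance is below the maximum, tightening entry for new features and shrinking the selected subset.

```python
def penalty_coeff(gamma, norm_imp, lam0=1.0):
    """Coefficient applied to a not-yet-selected feature's Gini gain."""
    return (1.0 - gamma) * lam0 + gamma * norm_imp

# Sweep gamma for features of high, middling, and low preliminary importance.
for gamma in (0.0, 0.3, 0.6, 0.9):
    coeffs = {imp: round(penalty_coeff(gamma, imp), 3) for imp in (1.0, 0.5, 0.1)}
    print(f"gamma={gamma}: {coeffs}")
```

At γ = 0 every feature keeps coefficient 1 (no guidance); as γ grows, only features near the top of the preliminary importance ranking retain coefficients close to 1, which is why larger γ yields more compact subsets.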
Implications and Future Research Directions
The development of GRRF carries significant implications for fields that handle high-dimensional datasets such as genomics, where feature interpretability and computational efficiency are paramount. By integrating preliminary RF-derived importance scores, GRRF represents an adaptive model that addresses both dimensionality reduction and classification accuracy.
Future developments could focus on extending GRRF to accommodate multi-class problems more gracefully and integrating domain-specific knowledge to aid feature pre-selection. Additionally, exploration into adaptive mechanisms for automatic parameter tuning based on data characteristics could enhance the model's applicability and ease of deployment in varied analytic contexts.
In conclusion, this paper contributes a meaningful advancement to gene expression analysis and broader feature selection methods, providing an approach that balances accuracy, computational efficiency, and feature interpretability. Such innovations open pathways for integrating machine learning methodologies into complex biological investigations, reinforcing the synergy between computational advancements and domain-specific applications.