- The paper demonstrates that simple univariate filters, notably the Student's t-test, match or exceed more complex methods in both accuracy and stability of gene signature selection.
- It compares filter, wrapper, and embedded methods across breast cancer gene expression datasets, reporting performance metrics such as AUC and the robustness of the selected gene lists.
- The findings advocate for prioritizing interpretable and stable feature selection techniques over complex, computationally intensive approaches in biomarker discovery.
The Influence of Feature Selection Methods on Molecular Signatures
Feature selection in high-dimensional data, particularly in the context of biomarker discovery from gene expression datasets, is a pressing challenge within computational biology and bioinformatics. The paper under review presents an empirical analysis of 32 feature selection methods concerning their effects on the accuracy, stability, and interpretability of molecular signatures.
Overview of Methods
The approaches for feature selection in this paper are divided into three broad categories: filter methods, wrapper methods, and embedded methods. These methods were applied to four public gene expression datasets for breast cancer prognosis, emphasizing the statistical and biological importance of selecting relevant features from a plethora of variables.
- Filter Methods: These methods score features independently of any classifier, as a pre-processing step. The paper examines classical univariate filters with scoring functions such as the Student's t-test, the Wilcoxon rank-sum test, the Bhattacharyya distance, and relative entropy.
- Wrapper Methods: Using a learning machine as a black box, wrapper methods score candidate feature subsets by their predictive power. SVM recursive feature elimination (RFE) and greedy forward selection (GFS) are scrutinized, in particular for their computational feasibility in high dimensions.
- Embedded Methods: These are learning algorithms that inherently perform feature selection during training, with Lasso regression and the Elastic Net assessed as representative methods that couple variable selection with predictive model fitting.
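The three categories can be sketched side by side. The following is a minimal illustration, not the paper's protocol: it uses synthetic data in place of a real expression matrix, scikit-learn and SciPy as an assumed toolkit, and an arbitrary signature size of 20.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# Synthetic stand-in for an expression matrix: 100 samples x 1000 genes,
# with the first 10 genes shifted upward in the positive class.
X = rng.normal(size=(100, 1000))
y = rng.integers(0, 2, size=100)
X[y == 1, :10] += 1.0

# Filter: rank genes by absolute t-statistic, keep the top k.
t_stats, _ = ttest_ind(X[y == 1], X[y == 0], axis=0)
filter_top = np.argsort(-np.abs(t_stats))[:20]

# Wrapper (SVM-RFE): repeatedly refit a linear SVM and drop the
# lowest-weight half of the remaining genes until 20 are left.
rfe = RFE(LinearSVC(dual=False), n_features_to_select=20, step=0.5)
rfe.fit(X, y)
wrapper_top = np.flatnonzero(rfe.support_)

# Embedded (Lasso): the L1 penalty drives most coefficients to zero,
# so selection happens during training itself.
lasso = Lasso(alpha=0.05).fit(X, y)
embedded_top = np.flatnonzero(lasso.coef_)

print(len(filter_top), len(wrapper_top), len(embedded_top))
```

Note how the filter step never touches a classifier, while the wrapper refits one many times; this is the computational trade-off the paper weighs in high dimensions.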
Empirical Results
The analysis reveals that the choice of selection method significantly affects the accuracy, stability, and interpretability of the resulting signatures. Intriguingly, simple univariate filters often outperform more intricate methods, with the Student's t-test emerging as a leading candidate for feature selection. Ensemble feature selection approaches, despite their popularity, did not consistently improve results, offering no substantial advantage over the simpler base methods.
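One common form of ensemble feature selection aggregates rankings over bootstrap resamples. A minimal sketch of that idea, assuming a t-test base ranker and mean-rank aggregation (the toy data and parameters are illustrative, not the paper's):

```python
import numpy as np
from scipy.stats import ttest_ind

def ensemble_ttest_ranking(X, y, n_boot=50, seed=0):
    """Order genes by their mean t-test rank across bootstrap
    resamples (one common ensemble scheme; details are illustrative)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rank_sum = np.zeros(p)
    used = 0
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)
        if len(np.unique(y[idx])) < 2:   # skip single-class resamples
            continue
        t, _ = ttest_ind(X[idx][y[idx] == 1], X[idx][y[idx] == 0], axis=0)
        rank_sum += np.argsort(np.argsort(-np.abs(t)))  # rank 0 = best
        used += 1
    return np.argsort(rank_sum / used)  # genes ordered by mean rank

# Toy data: the first 10 of 500 genes carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 500))
y = rng.integers(0, 2, size=80)
X[y == 1, :10] += 1.0
top20 = ensemble_ttest_ranking(X, y)[:20]
```

The paper's finding is precisely that this extra aggregation machinery did not reliably beat a single t-test ranking.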
Accuracy and Stability
The predictive accuracy, evaluated through the area under the ROC curve (AUC), is broadly comparable across methods for large signatures, with the Student's t-test among the strongest performers. Stability analysis of the selected gene lists shows that filter methods produce more consistent outputs than wrapper and embedded approaches under the three stress tests considered (soft perturbation, hard perturbation, and between-dataset comparison), supporting their use in scenarios that require robust gene lists.
Interpretability
Functional enrichment analysis conducted after feature selection examined the biological interpretability of the gene signatures, primarily through Gene Ontology (GO) terms. Here too, filter methods were the most consistent at identifying relevant biological processes, although ensemble methods offered marginal improvements on some stability metrics.
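The standard statistical core of such an enrichment analysis is a one-sided hypergeometric test for over-representation of a GO term's genes within a signature. A self-contained sketch, with entirely illustrative gene sets and counts:

```python
from scipy.stats import hypergeom

def enrichment_pvalue(signature, go_term_genes, background_size):
    """P(at least the observed number of GO-term genes appear in the
    signature), under random draws from the background."""
    k = len(signature & go_term_genes)  # hits in the signature
    return hypergeom.sf(k - 1, background_size,
                        len(go_term_genes), len(signature))

# Toy example: a 200-gene GO term in a 20,000-gene background,
# with 8 of its genes landing in a 50-gene signature.
sig = set(range(50))
term = set(range(42, 242))  # overlaps sig in 8 genes (42..49)
p = enrichment_pvalue(sig, term, 20000)
print(p)  # far below any usual significance threshold
```

In practice this test is run over many GO terms with multiple-testing correction; the point here is only the counting argument behind a single term's p-value.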
Implications and Conclusions
The findings support cautious optimism about simple, well-established statistical tests for feature selection, and cast doubt on more sophisticated, computationally intensive approaches that lack clear empirical benefits. Practically, this suggests that researchers should prioritize stability and interpretability, particularly in genomic settings where sample sizes remain small. Theoretically, this work invites further investigation into how to balance the complexity of feature selection algorithms against the reproducibility and biological significance of the resulting biomarkers.
Future Directions
The paper identifies a need for further exploration of ensemble methods, potentially involving novel strategies for aggregating feature lists. As methodologies in machine learning continue to evolve, integrating these insights could further refine biomarker discovery. Larger and more diverse datasets may also allow future research to discern differences among feature selection methods conclusively, ultimately contributing to higher predictive validity and translational impact in clinical settings.