Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining (1711.08477v2)

Published 22 Nov 2017 in cs.LG

Abstract: Modern biomedical data mining requires feature selection methods that can (1) be applied to large scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) be computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We apply a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.

Citations (196)

Summary

The paper introduces a comprehensive simulation study using the ReBATE software to benchmark various Relief-Based Algorithms (RBAs) for feature selection in diverse bioinformatics datasets.
Findings indicate RBAs effectively identify features in complex, noisy data with interactions, showing MultiSURF is generally robust while ReliefF performance depends heavily on parameter tuning.
The study suggests RBAs are robust for handling complex interactions and genetic heterogeneity in bioinformatics data, providing a foundation for future algorithmic advancements.

Evaluating the Efficacy of Relief-Based Feature Selection Methods in Bioinformatics

The paper "Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining" provides an in-depth examination of Relief-Based Algorithms (RBAs) for feature selection in bioinformatics. The authors detail a comprehensive simulation study that evaluates the performance of various RBAs across diverse problem domains. The work responds to the growing need for adaptable, efficient, and robust feature selection methods in biomedical data mining, where data complexity, scale, and noise present significant challenges.

RBAs, notably recognized for their ability to capture feature dependencies and interactions, are benchmarked against standard feature selection methods like chi-square tests, ANOVA, mutual information, and wrapper methods. The paper focuses on the strengths and limitations of these algorithms when applied to various bioinformatics data types, including genetic variants and gene expression data.
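For orientation, a univariate filter of the kind listed above scores each feature against the class in isolation. The sketch below is a minimal pure-Python illustration, not the paper's benchmark code. The XOR toy data and the `chi_square` helper are assumptions made for the example.

```python
from collections import Counter
from itertools import product

def chi_square(feature, labels):
    """Pearson chi-square statistic for one discrete feature against class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            expected = f_count * c_count / n
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat

# Hypothetical toy data: class = a XOR b (a pure 2-way interaction), c is noise.
rows = list(product([0, 1], repeat=3)) * 10
labels = [a ^ b for a, b, _ in rows]

scores = {
    "a": chi_square([r[0] for r in rows], labels),
    "b": chi_square([r[1] for r in rows], labels),
    "c": chi_square([r[2] for r in rows], labels),
    "a,b jointly": chi_square([(r[0], r[1]) for r in rows], labels),
}
print(scores)
```

On this balanced toy dataset each single feature scores 0.0, the same as the noise feature, while the joint pair scores 80.0. A filter that only ever examines one feature at a time cannot separate `a` and `b` from `c` here.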

The authors introduce the open-source software ReBATE, which implements core RBAs (ReliefF, SURF, SURF*, MultiSURF*, and the newly proposed MultiSURF algorithm), making them more accessible and applicable to real-world bioinformatics tasks. The simulation study incorporates 2280 datasets encompassing a variety of problem complexities, such as epistasis, genetic heterogeneity, continuous and mixed data types, and varying feature space sizes.

Key findings highlight that while standard myopic feature selection methods, which evaluate each feature in isolation, struggle with complex feature interactions, RBAs are effective at identifying relevant features in noisy data with complex interactions. MultiSURF*, for instance, excels in detecting 2-way interactions, although its 'far' scoring (which also scores instances distant from the target, not only near neighbors) can impede performance in scenarios dominated by main effects. In contrast, MultiSURF demonstrates robust and consistent performance across diverse dataset configurations without the need for parameter tuning, making it a reliable choice for general application.
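The near/far distinction can be sketched as follows. This assumes the neighbor rule usually described for the MultiSURF family: for each target instance, instances closer than the mean distance minus half a standard deviation count as "near", and those farther than the mean plus half a standard deviation count as "far". The function name `near_and_far`, the Manhattan distance, and the use of a population standard deviation are illustrative assumptions, not details taken from ReBATE.

```python
from statistics import mean, pstdev

def manhattan(x, y):
    """Sum of absolute per-feature differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def near_and_far(data, target_index):
    """Split the other instances into 'near' and 'far' sets for one target.

    The threshold is specific to each target: it is derived from the
    distances between that target and every other instance.
    """
    target = data[target_index]
    others = [i for i in range(len(data)) if i != target_index]
    dists = {i: manhattan(target, data[i]) for i in others}
    centre = mean(dists.values())
    dead_band = pstdev(dists.values()) / 2  # assumption: population std dev
    near = [i for i in others if dists[i] < centre - dead_band]
    far = [i for i in others if dists[i] > centre + dead_band]
    return near, far

# Hypothetical toy instances: two clusters in a 2-feature space.
data = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
near, far = near_and_far(data, 0)
print(near, far)
```

Under this reading, a MultiSURF-style scorer would update feature weights using only `near`, while a MultiSURF*-style scorer would additionally use `far` with inverted updates. Because the threshold comes from the data, the user never has to choose a neighbor count.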

The results also show that ReliefF's performance depends heavily on its parameter settings, particularly the number of nearest neighbors used during feature scoring. This makes careful parameter selection important when applying ReliefF, especially on high-dimensional datasets and datasets with complex interactions.
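To make the role of the neighbor count concrete, here is a deliberately simplified ReliefF-style scorer. It is a sketch under stated assumptions (two classes, discrete features, no missing data, every instance used as a target, ties broken by index), not the ReBATE implementation. The names `relief_f` and `diff` and the XOR toy data are hypothetical.

```python
from itertools import product

def diff(x, y):
    """Per-feature difference for discrete values: 0 if equal, 1 otherwise."""
    return [0 if a == b else 1 for a, b in zip(x, y)]

def relief_f(rows, labels, k):
    """Score features using the k nearest hits and k nearest misses of each instance."""
    n, n_features = len(rows), len(rows[0])
    weights = [0.0] * n_features
    for i in range(n):
        # Rank the other instances by Hamming distance (stable sort: lower index wins ties).
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sum(diff(rows[i], rows[j])))
        hits = [j for j in ranked if labels[j] == labels[i]][:k]
        misses = [j for j in ranked if labels[j] != labels[i]][:k]
        for j in hits:    # differing from a same-class neighbor lowers the score
            for f, d in enumerate(diff(rows[i], rows[j])):
                weights[f] -= d / (n * k)
        for j in misses:  # differing from an other-class neighbor raises the score
            for f, d in enumerate(diff(rows[i], rows[j])):
                weights[f] += d / (n * k)
    return weights

# Hypothetical toy data: class = a XOR b, third feature is noise.
rows = list(product([0, 1], repeat=3))
labels = [a ^ b for a, b, _ in rows]

print(relief_f(rows, labels, k=1))  # [0.5, 0.5, -1.0]
print(relief_f(rows, labels, k=2))  # [0.0, 0.0, -0.5]
```

With `k=1` the two interacting features clearly outrank the noise feature. With `k=2` their margin over the noise feature shrinks, because the second-nearest hit on this eight-row dataset already differs in both interacting features. The dataset is far too small to be representative, but it shows the mechanism by which the neighbor count changes the ranking.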

The implications for bioinformatics are practical. RBAs offer a robust approach to feature selection when associations are driven by complex interactions or heterogeneity, both of which are common in genetic data analysis. In particular, they perform well at detecting patterns of genetic heterogeneity, which matters when subsetting populations or identifying group-specific effects.

Moreover, the findings suggest potential avenues for further improvement of RBAs, such as optimizing strategies for detecting higher-order interactions and refining methods for handling mixed-type datasets. Future research can build on these insights to refine iterative RBAs and explore novel distance-weighted scoring techniques, raising the potential for even more adaptive and high-performing feature selection algorithms in bioinformatics.

Overall, the authors deliver a thorough and insightful analysis of RBAs in bioinformatics, providing practical guidance and a solid foundation for ongoing research and application in feature selection methodologies. The paper not only benchmarks existing methods but also paves the way for further algorithmic advancements tailored to the nuanced challenges of modern biomedical data mining.
