Survey on Feature Selection (1510.02892v1)

Published 10 Oct 2015 in cs.LG

Abstract: Feature selection plays an important role in the data mining process. It is needed to deal with the excessive number of features, which can become a computational burden on the learning algorithms. It is also necessary, even when computational resources are not scarce, since it improves the accuracy of the machine learning tasks, as we will see in the upcoming sections. In this review, we discuss the different feature selection approaches, and the relation between them and the various machine learning algorithms.

Citations (224)

Summary

  • The paper systematically categorizes and evaluates feature selection methods for both supervised and unsupervised learning models.
  • It demonstrates how reducing high-dimensional data improves computational efficiency and model accuracy by addressing the bias-variance trade-off.
  • The analysis highlights the trade-offs between filter and wrapper approaches and recommends future research to integrate unsupervised feature selection.

Survey on Feature Selection: An Analytical Overview

The paper "Survey on Feature Selection" by Tarek Amr and Beatriz de La Iglesia provides a comprehensive analysis of feature selection techniques and their role in machine learning. The investigation is at the core of enhancing computational efficiency and improving the performance of machine learning models by mitigating the burden induced by high-dimensional datasets. The paper systematically categorizes and evaluates feature selection methodologies with a focus on their application in both supervised and unsupervised learning contexts.

Introduction to Feature Selection

Feature selection addresses the challenges posed by excessive and irrelevant features in computational tasks. It is crucial not only for reducing computational overhead but also for improving model accuracy and interpretability. The review underscores that effective feature selection balances bias against variance, yielding models that are both accurate and generalizable; this bias-variance trade-off is central to determining the optimal feature subset for a given learning task.

The Feature Selection Process

The process of feature selection is typically segmented into three primary steps: subset search, subset evaluation, and a stopping criterion. The search step generates candidate feature subsets, which are then evaluated for relevance and utility. Search algorithms range from simple forward selection to more complex methods such as genetic algorithms, which do not rely on traditional monotonicity assumptions.
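
To make these three steps concrete, here is a minimal sketch of a forward search whose evaluation step is the cross-validated accuracy of a classifier and whose stopping criterion is a fixed subset size. The dataset, the k-nearest-neighbours estimator, and the subset size are illustrative assumptions, not details taken from the paper.

```python
# Minimal search-evaluate-stop loop: greedy forward selection that adds the
# feature giving the largest gain in cross-validated accuracy, stopping once
# a fixed number of features has been chosen. All parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, n_redundant=5, random_state=0)

def forward_selection(X, y, estimator, n_select=5, cv=5):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:                      # stopping criterion
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}                    # evaluation of each candidate subset
        best = max(scores, key=scores.get)               # search step: keep the best candidate
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(X, y, KNeighborsClassifier(n_neighbors=5)))
```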

Forward selection and backward elimination are the core greedy algorithms: the former incrementally adds features, while the latter progressively removes them. Because these greedy searches can become trapped in local optima, alternative methods such as genetic algorithms have been developed, using crossover and mutation to explore the feature space more broadly.
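
A genetic search over subsets can be sketched in the same spirit: feature subsets are encoded as bit strings, and crossover and mutation generate new candidates that are scored by cross-validated accuracy. The population size, mutation rate, and fitness function below are assumptions chosen for illustration, not settings reported in the survey.

```python
# Toy genetic-algorithm search over feature subsets (all parameters illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, n_redundant=5, random_state=0)

def fitness(mask):
    # Fitness of a bit-string subset = cross-validated accuracy of a classifier.
    if not mask.any():
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

pop = rng.random((12, X.shape[1])) < 0.5              # random bit-string population
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-6:]]            # keep the fittest half
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, X.shape[1])             # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05          # mutation flips a few bits
        children.append(np.where(flip, ~child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```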

Filters and Wrappers: Approaches to Evaluation

Feature selection approaches are categorized broadly into filters and wrappers based on their evaluation strategy. Filters rank features or subsets independently of any learning model, employing statistical measures such as mutual information and chi-square tests. Wrappers, in contrast, evaluate subsets based on the performance of a particular learning algorithm, using cross-validation to assess predictive accuracy. While filters are computationally efficient, they do not account for interactions between the selected features and the learning model; capturing those interactions is the defining strength of the wrapper approach, albeit at a greater computational cost.
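
The distinction can be sketched with scikit-learn: a filter ranks features with a model-independent statistic (mutual information here), while a wrapper such as sequential forward selection scores candidate subsets with the cross-validated accuracy of the downstream classifier itself. The dataset, the Naive Bayes classifier, and the subset size are assumptions made for illustration.

```python
# Filter vs. wrapper evaluation, sketched with scikit-learn (dataset, classifier,
# and subset size are illustrative assumptions, not taken from the paper).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, n_redundant=5, random_state=0)
clf = GaussianNB()

# Filter: rank features by mutual information, independently of any classifier.
filter_mask = SelectKBest(mutual_info_classif, k=5).fit(X, y).get_support()

# Wrapper: search subsets using the classifier's own cross-validated accuracy.
wrapper_mask = SequentialFeatureSelector(clf, n_features_to_select=5,
                                         direction="forward", cv=5).fit(X, y).get_support()

for name, mask in [("filter", filter_mask), ("wrapper", wrapper_mask)]:
    acc = cross_val_score(clf, X[:, mask], y, cv=5).mean()
    print(f"{name:7s} subset CV accuracy: {acc:.3f}")
```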

Integration with Learning Algorithms

The selection of features is intricately tied to the learning algorithm in use. Instance-based and probabilistic models such as k-nearest neighbors and Naive Bayes exhibit varying degrees of sensitivity to the chosen features. Some models, such as decision trees, perform implicit feature selection during training, yet they still benefit from preprocessing that refines the input space. The survey illustrates this empirically, showing how dataset characteristics and feature distributions affect which selection strategy works best and how much it helps.
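
As a rough illustration of these differences, the sketch below buries a handful of informative features among many irrelevant ones and compares how a distance-based k-nearest-neighbours classifier and a decision tree respond, with and without a simple mutual-information filter. The dataset and hyperparameters are assumed purely for the example.

```python
# Illustrative sketch: distance-based k-NN tends to suffer more from irrelevant
# features than a decision tree, which performs implicit selection while splitting.
# Dataset and hyperparameters are assumed for illustration only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# 5 informative features buried among 45 irrelevant ones.
X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)
X_sel = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

for name, clf in [("k-NN", KNeighborsClassifier()),
                  ("decision tree", DecisionTreeClassifier(random_state=0))]:
    raw = cross_val_score(clf, X, y, cv=5).mean()
    sel = cross_val_score(clf, X_sel, y, cv=5).mean()
    print(f"{name:13s}  all features: {raw:.3f}   after filter: {sel:.3f}")
```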

Conclusion and Future Directions

The paper concludes by acknowledging the trade-offs inherent in feature selection methodologies: accuracy versus computational cost, and individual versus subset evaluation. The authors suggest that future work should integrate the strengths of filter methods into subset selection and extend their applicability to unsupervised learning. The survey thus elucidates the intricacies of feature selection while serving as a guide for researchers choosing methodologies suited to particular datasets and learning paradigms, steering future work toward more efficient and effective models.