- The paper comprehensively surveys and categorizes key feature selection methodologies, highlighting strengths and limitations.
- It compares similarity-based, information-theoretical, sparse learning, and statistical methods for effective dimensionality reduction.
- The study emphasizes practical implications for scalable analytics in big data, supporting improved model performance.
Feature Selection: A Data Perspective
The paper "Feature Selection: A Data Perspective" offers a comprehensive survey of feature selection methodologies, particularly in the context of big data challenges. Feature selection, a crucial preprocessing technique, aims to enhance model simplicity, comprehensibility, performance and data interpretability by selecting a subset of relevant features from a possibly high-dimensional dataset.
Overview of Feature Selection
Feature selection reduces dimensionality while preserving the semantics of the original features, unlike feature extraction, which projects the data into a new space where that meaning is lost. High dimensionality brings increased storage and computational costs and a greater risk of overfitting; feature selection mitigates these problems by discarding irrelevant and redundant features.
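As a minimal illustration of this distinction (using scikit-learn and the Iris data purely as an example, not anything from the paper): selection keeps a subset of the original columns and their meaning, while an extraction method such as PCA builds new, mixed dimensions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the original columns, so the retained
# dimensions still mean "sepal length", "petal width", etc.
selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: PCA builds 2 new dimensions as linear combinations
# of all inputs, so the original feature semantics are lost.
extracted = PCA(n_components=2).fit_transform(X)

print(selected.shape, extracted.shape)  # both (150, 2), but only the first keeps semantics
```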
Methodological Categorization
Feature selection is categorized methodologically into four primary classes:
- Similarity-Based Methods: Approaches such as the Laplacian Score and SPEC preserve the data's intrinsic structure by using an affinity matrix to evaluate feature importance (a Laplacian-Score-style sketch appears after this list). While effective at capturing local structure, they may overlook feature redundancy.
- Information-Theoretical Methods: Methods such as MRMR and CMIM use information-theoretic measures, typically mutual information between features and the target, to maximize relevance while minimizing redundancy (see the greedy selection sketch after this list). However, they primarily cater to discrete data and are predominantly supervised.
- Sparse Learning Methods: Using sparse regularization terms (e.g., the L1 penalty of LASSO), these methods embed feature selection within the learning algorithm itself, benefiting model interpretability and performance (see the combined sketch after this list). Despite their robustness, they are often computationally intensive.
- Statistical Methods: Simple and computationally efficient, these methods (e.g., the chi-square test and the Gini index) score features individually with statistical measures (also in the combined sketch below), often neglecting correlations among features.
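To make the similarity-based idea concrete, here is a minimal NumPy/scikit-learn sketch in the spirit of the Laplacian Score. The RBF-weighted kNN affinity, neighborhood size, and gamma are illustrative assumptions rather than settings from the paper; lower scores indicate features that better preserve local structure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, n_neighbors=5, gamma=1.0):
    """Score each feature by how well it preserves the local structure
    encoded in an affinity matrix (lower score = better feature)."""
    # kNN connectivity mask combined with an RBF affinity (illustrative choice)
    knn = kneighbors_graph(X, n_neighbors, include_self=True).toarray()
    knn = np.maximum(knn, knn.T)                    # symmetrize the graph
    W = rbf_kernel(X, gamma=gamma) * knn            # affinity matrix
    d = W.sum(axis=1)                               # degree vector
    L = np.diag(d) - W                              # graph Laplacian
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        f_tilde = f - (f @ d) / d.sum()             # remove degree-weighted mean
        scores[r] = (f_tilde @ L @ f_tilde) / ((d * f_tilde) @ f_tilde)
    return scores

X, _ = load_iris(return_X_y=True)                   # labels unused: the criterion is unsupervised
print(laplacian_score(X))
```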
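Similarly, the relevance-versus-redundancy trade-off can be sketched as a greedy selector in the spirit of MRMR. This is only an approximation: scikit-learn's continuous mutual-information estimators stand in for the discrete mutual information the original method assumes, and the dataset and number of selected features are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k):
    """Greedily pick k features, each time maximizing relevance to y
    minus the mean mutual information with features already selected."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in remaining:
            # Redundancy: average MI between candidate j and already-chosen features
            redundancy = np.mean(
                [mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                 for s in selected]) if selected else 0.0
            if relevance[j] - redundancy > best_score:
                best, best_score = j, relevance[j] - redundancy
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_breast_cancer(return_X_y=True)
print(mrmr_select(X, y, k=5))   # indices of the 5 selected features
```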
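Finally, the statistical and sparse-learning families can be contrasted in a few lines of scikit-learn; the dataset, the number of retained features, and the LASSO penalty are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SelectFromModel, chi2
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Statistical filter: rank each feature independently with a chi-square test
# (chi2 requires non-negative inputs, hence the min-max scaling).
chi2_picks = SelectKBest(chi2, k=10).fit(MinMaxScaler().fit_transform(X), y)
print("chi-square picks:", np.flatnonzero(chi2_picks.get_support()))

# Embedded sparse-learning selection: an L1-penalized model drives the
# coefficients of unhelpful features to exactly zero; keep the rest.
X_std = StandardScaler().fit_transform(X)
lasso_picks = SelectFromModel(Lasso(alpha=0.05)).fit(X_std, y)
print("LASSO picks:", np.flatnonzero(lasso_picks.get_support()))
```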
The paper further details feature selection strategies that consider structured features (group, tree, graph) and heterogeneous data, including linked, multi-source, and multi-view datasets.
Challenges and Future Directions
Big data introduces specific challenges, chief among them the need for scalable algorithms that handle vast datasets efficiently. The stability of feature selection results under data perturbations and the choice of model parameters remain open problems. The authors also call for more research into unsupervised feature selection, given the practical difficulty of acquiring labeled data, and into strategies for automatically determining how many features to select without heavy reliance on heuristic search.
Practical Implications
Practically, integrating feature selection in data mining tasks leads to enhanced model generalization, reduced overfitting, and improved performance across diverse applications, including bioinformatics, text mining, and social media analysis. The paper's accompanying open-source repository, scikit-feature, facilitates research and application of these techniques, serving as a benchmark and educational resource.
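As a hedged sketch of that integration (shown here with plain scikit-learn rather than the scikit-feature repository, whose API this summary does not cover), a selector can be placed inside a cross-validated pipeline so it is re-fit on each training fold and never sees test data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Selection is a pipeline stage, so each CV fold selects features from its
# own training split only, avoiding information leakage into the test split.
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_classif, k=10),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```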
Conclusion
This survey systematically explores the landscape of feature selection from a data-driven perspective, aligning methodologies with emerging data challenges in the era of big data. By categorizing algorithms beyond traditional views, it sheds light on the adaptability and efficiency that future feature selection research will require, which is critical for continued advances in AI and data science.