- The paper presents a wrapper-based feature selection method, PPFS, which uses a new Predictive Permutation Independence test to assess feature relevance.
- It employs a two-phase strategy with a growth phase for candidate selection followed by a shrink phase to filter out false positives.
- Empirical results demonstrate that PPFS selects nearly 50% fewer features than competitors while maintaining superior predictive performance.
Overview of "PPFS: Predictive Permutation Feature Selection"
The paper "PPFS: Predictive Permutation Feature Selection" introduces a novel wrapper-based method for feature selection, leveraging the concept of Markov Blanket (MB) and the introduction of a new Conditional Independence (CI) test termed Predictive Permutation Independence (PPI). The authors aim to address limitations in existing MB-based feature selection methods by creating a versatile technique applicable to both classification and regression tasks across datasets with diverse feature types.
Methodological Contributions
The key innovation in this research is the Predictive Permutation Independence test, which assesses feature importance using supervised learning models. Under the knockoff framework, the test measures the association between a feature and the target by comparing predictive performance before and after the feature is replaced with a permuted copy. Because any sufficiently expressive learner, such as a Gradient Boosting Machine (GBM), can drive the test, PPFS can exploit algorithmic advances that traditional filter-based MB methods cannot.
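As a concrete illustration, below is a minimal sketch of a PPI-style test, assuming a scikit-learn GBM, an in-sample log-loss comparison, and a paired t-test on per-sample losses; the paper's exact loss, train/test protocol, and test statistic may differ.

```python
# Minimal sketch of a PPI-style test: train a GBM, then compare per-sample
# log-loss before and after replacing one feature with a permuted copy.
# Assumes y holds integer class labels 0..K-1; all other choices (model,
# loss, paired t-test) are illustrative, not the paper's exact recipe.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.ensemble import GradientBoostingClassifier

def ppi_test(X, y, j, n_permutations=30, seed=0):
    """Return a p-value for the predictive association of feature j with y."""
    rng = np.random.default_rng(seed)
    model = GradientBoostingClassifier().fit(X, y)

    eps = 1e-12  # guard against log(0)
    loss_orig = -np.log(model.predict_proba(X)[np.arange(len(y)), y] + eps)

    perm_losses = []
    for _ in range(n_permutations):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # knockoff-like copy
        proba = model.predict_proba(X_perm)[np.arange(len(y)), y]
        perm_losses.append(-np.log(proba + eps))
    loss_perm = np.mean(perm_losses, axis=0)

    # One-sided paired test: permuting feature j should increase the loss
    # only if j carries predictive information about y.
    _, p_value = ttest_rel(loss_perm, loss_orig, alternative="greater")
    return p_value
```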
PPFS employs a two-phase approach, sketched in code after this list:
- Growth Phase: builds the initial candidate Markov Blanket by testing each feature's marginal dependence on the target, under minimal assumptions.
- Shrink Phase: filters the candidate set to eliminate false positives by testing each candidate for conditional independence given the remaining candidates.
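A compact sketch of the two phases, reusing the hypothetical `ppi_test` above, might look as follows; the shrink phase here approximates conditional independence by permuting one candidate while the model conditions on the rest, which simplifies the paper's procedure.

```python
# Sketch of the growth and shrink phases, reusing the hypothetical ppi_test.
def ppfs_select(X, y, alpha=0.05):
    # Growth phase: keep every feature that is marginally dependent on y.
    candidates = [j for j in range(X.shape[1])
                  if ppi_test(X[:, [j]], y, j=0) < alpha]

    # Shrink phase: drop candidates that look independent of y once the
    # model can condition on the remaining candidates (false positives).
    markov_blanket = list(candidates)
    for j in candidates:
        rest = [k for k in markov_blanket if k != j]
        if rest and ppi_test(X[:, rest + [j]], y, j=len(rest)) >= alpha:
            markov_blanket.remove(j)
    return markov_blanket
```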
An additional methodological enhancement is a novel Markov Blanket aggregation step aimed at overcoming sample inefficiency. By splitting the sample, running selection on each split, and aggregating the results, the approach addresses both the need for large sample sizes and scenarios where datasets do not satisfy the faithfulness assumption; a sketch follows below.
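One plausible reading of this step, with the splitting scheme and majority-vote threshold as assumptions rather than the paper's exact rule, is the following.

```python
# Hypothetical sample-split-and-aggregate step: run the selector on several
# disjoint subsamples and keep features chosen on a majority of them.
from collections import Counter
from sklearn.model_selection import KFold

def aggregate_markov_blanket(X, y, n_splits=5, vote_threshold=0.5):
    counts = Counter()
    for _, idx in KFold(n_splits=n_splits, shuffle=True,
                        random_state=0).split(X):
        counts.update(ppfs_select(X[idx], y[idx]))  # select on each split
    return [j for j, c in counts.items() if c / n_splits >= vote_threshold]
```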
Empirical Evaluation
The empirical evaluation showcases PPFS's superiority over state-of-the-art MB discovery algorithms such as Mixed-MB and SGAI, as well as prominent wrapper methods like stepwise selection and permutation feature importance. Across diverse datasets from the UCI Machine Learning Repository and the NIPS 2003 feature selection challenge, PPFS consistently outperforms other methods in terms of reducing prediction error and improving classification accuracy with fewer features.
PPFS is particularly strong on high-dimensional datasets, where it scales well and selects approximately 50% fewer features than competing methods while still achieving state-of-the-art predictive performance. These results reinforce PPFS as an efficient feature selector that maintains, and can even improve, model performance in complex data scenarios.
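To make the reported trade-off concrete, here is a hypothetical end-to-end check in the same spirit: compare cross-validated accuracy on all features against the selected subset. The dataset and model are illustrative, not those used in the paper.

```python
# Hypothetical comparison: accuracy with all features vs. the selected subset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
selected = aggregate_markov_blanket(X, y)

full = cross_val_score(GradientBoostingClassifier(), X, y, cv=5).mean()
sub = cross_val_score(GradientBoostingClassifier(), X[:, selected], y,
                      cv=5).mean()
print(f"kept {len(selected)}/{X.shape[1]} features; "
      f"accuracy {full:.3f} -> {sub:.3f}")
```

Note that selecting features on the full dataset before cross-validating leaks information; a rigorous comparison would nest the selection inside each fold.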
Theoretical Implications and Future Research
The paper supports PPFS's theoretical soundness with a sketch of a correctness proof under the faithfulness assumption in Bayesian networks. The authors argue that the PPI test captures dependencies that other CI tests might overlook, thereby ensuring a more complete feature selection.
While this work provides substantial practical contributions to feature subset selection, several avenues for future exploration are indicated:
- Extended Application: Applying the PPFS framework to other domains beyond standard machine learning tasks could amplify its utility, potentially informing strategies in domains like computational biology or social network analysis where mixed-type datasets and non-linear interactions prevail.
- Robustness Studies: Investigating the robustness of PPFS in the presence of noisy or incomplete data would provide further insights into its adaptability in real-world applications.
- Efficiency Optimization: Enhancing the computational efficiency of PPFS, particularly when dealing with massive datasets, can further boost its appeal for industrial applications, where computational resources may constrain model development cycles.
Conclusion
The Predictive Permutation Feature Selection approach is a significant step forward in feature selection, combining a new statistical test with practical performance benefits. This positions PPFS as a competitive methodology that strikes a better balance between computational cost and predictive accuracy than existing paradigms. With ongoing development, PPFS holds promise for automated feature engineering, optimization of learning pipelines, and improved interpretability across AI-driven fields.