
Variable Importance in High-Dimensional Settings Requires Grouping (2312.10858v1)

Published 18 Dec 2023 in cs.LG

Abstract: Explaining the decision process of machine learning algorithms is nowadays crucial for both model performance enhancement and human comprehension. This can be achieved by assessing the variable importance of single variables, even for high-capacity non-linear methods, e.g. Deep Neural Networks (DNNs). While only removal-based approaches, such as Permutation Importance (PI), can bring statistical validity, they return misleading results when variables are correlated. Conditional Permutation Importance (CPI) bypasses PI's limitations in such cases. However, in high-dimensional settings, where high correlations between the variables cancel their conditional importance, the use of CPI as well as other methods leads to unreliable results, besides prohibitive computation costs. Grouping variables statistically via clustering or some prior knowledge gains some power back and leads to better interpretations. In this work, we introduce BCPI (Block-Based Conditional Permutation Importance), a new generic framework for variable importance computation with statistical guarantees handling both single and group cases. Furthermore, as handling groups with high cardinality (such as a set of observations of a given modality) is both time-consuming and resource-intensive, we also introduce a new stacking approach extending the DNN architecture with sub-linear layers adapted to the group structure. We show that the ensuing approach extended with stacking controls the type-I error even with highly-correlated groups and shows top accuracy across benchmarks. Furthermore, we perform a real-world data analysis in a large-scale medical dataset where we aim to show the consistency between our results and the literature for biomarker prediction.

Citations (2)

Summary

  • The paper presents BCPI, a framework that evaluates grouped variable importance to address misleading outcomes in high-dimensional settings.
  • It employs an internal stacking technique within deep neural networks to reduce computation time while boosting interpretability.
  • Extensive benchmarks on synthetic and medical datasets confirm that BCPI reliably identifies key variable groups with statistical guarantees.

Introduction

The concept of variable importance is a critical component in understanding machine learning models. It sheds light on which variables play pivotal roles in predictions, which is particularly relevant for high-capacity non-linear models such as Deep Neural Networks (DNNs). However, standard tools for variable importance can yield misleading outcomes when variables are highly correlated. This challenge is intensified in high-dimensional settings, where strong correlations between variables cancel their conditional importance and the computational cost of existing methods becomes prohibitive.
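To make the baseline concrete, the sketch below shows plain Permutation Importance on a toy linear problem: a feature's importance is the average increase in loss when its column is shuffled, breaking its link to the target. This is a minimal illustration using a least-squares model, not the paper's DNN-based pipeline; all variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends on x0 (strongly) and x1 (weakly); x2 is pure noise.
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)

# Least-squares linear model as the predictor.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
baseline_mse = np.mean((X @ beta - y) ** 2)

def permutation_importance(X, y, beta, baseline, j, n_repeats=20):
    """PI of feature j: mean loss increase when column j is shuffled."""
    scores = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        scores.append(np.mean((Xp @ beta - y) ** 2) - baseline)
    return float(np.mean(scores))

pi = [permutation_importance(X, y, beta, baseline_mse, j) for j in range(3)]
```

On independent features this ranking is reliable; the failure mode the paper targets appears once features are correlated, where shuffling a single column produces unrealistic inputs and misleading scores.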

Group-Based Variable Importance

A proposed solution to address these limitations is to assess groups of variables rather than individual ones. This approach has a two-fold benefit: it provides better interpretability and reduces computational time. Determining the significance of groups rather than individual variables can offer insight into the collective impact of related features on model predictions. Nonetheless, previously established group-based methods have often overlooked the crucial aspect of statistical guarantees, specifically type-I error control, which refers to the rate of incorrectly identifying irrelevant variables as relevant.
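A minimal sketch of the grouped idea: instead of shuffling one column at a time, all columns in a group are permuted with a single shared row permutation, preserving the within-group correlation structure while severing the group's link to the target. This is an illustrative simplification with a linear model, not the paper's exact BCPI procedure; the group layout is assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two highly correlated features forming one group, plus an independent one.
n = 500
g = rng.normal(size=n)
X = np.column_stack([
    g + 0.05 * rng.normal(size=n),
    g + 0.05 * rng.normal(size=n),
    rng.normal(size=n),
])
y = X[:, 0] + X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
baseline_mse = np.mean((X @ beta - y) ** 2)

def group_importance(X, y, beta, baseline, cols, n_repeats=20):
    """Permute all columns in `cols` with one shared row permutation."""
    scores = []
    for _ in range(n_repeats):
        Xp = X.copy()
        idx = rng.permutation(len(X))
        Xp[:, cols] = X[idx][:, cols]  # joint shuffle keeps within-group structure
        scores.append(np.mean((Xp @ beta - y) ** 2) - baseline)
    return float(np.mean(scores))

imp_group = group_importance(X, y, beta, baseline_mse, [0, 1])
imp_single = group_importance(X, y, beta, baseline_mse, [2])
```

Here the correlated pair receives a large joint score even though either member alone would look weak conditionally, which is the power gain grouping is meant to recover.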

The BCPI Framework

In response to this issue, the paper presents Block-Based Conditional Permutation Importance (BCPI), a novel framework for evaluating the significance of both single variables and groups of variables, with statistical guarantees. BCPI incorporates an internal stacking technique within the DNN, which linearly projects each group into a lower-dimensional representation and thereby significantly reduces computation time. Extensive benchmarking on synthetic and real-world medical datasets demonstrates the efficacy of BCPI, which achieves high prediction accuracy while reliably identifying predictively important variable groups.
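The internal-stacking idea can be sketched as follows: each group's raw columns pass through their own small linear projection, and the concatenated latent blocks feed the shared network. In the paper these projections are learned jointly with the DNN; here the weights are random and the group names, sizes, and latent dimension are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed group structure: 50 "imaging" columns and 10 "socio" columns.
groups = {"imaging": slice(0, 50), "socio": slice(50, 60)}
latent_dim = 4  # illustrative sub-linear width per group

n = 200
X = rng.normal(size=(n, 60))

def stack_groups(X, groups, latent_dim, rng):
    """Project each group's columns to latent_dim and concatenate.

    Random weights stand in for the linear layers that the DNN would
    learn; the point is the shape reduction, not the fitted values.
    """
    blocks = []
    for name, cols in groups.items():
        Xg = X[:, cols]
        W = rng.normal(size=(Xg.shape[1], latent_dim))
        blocks.append(Xg @ W)  # (n, latent_dim) block per group
    return np.concatenate(blocks, axis=1)

Z = stack_groups(X, groups, latent_dim, rng)
# Z has one latent block per group: 2 groups * 4 dims = 8 columns.
```

Permuting a group now only requires permuting its small latent block, which is what makes conditional permutation affordable for high-cardinality groups.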

Practical Significance and Future Prospects

By employing BCPI, researchers can build robust prediction models and gain dependable insights into the influential variables. In the context of the UK Biobank dataset used for age prediction, BCPI confirmed the importance of brain imaging and socio-demographic information groups. While BCPI shows promise, it is important to note its dependence on a DNN as the base estimator. The use of a simpler base learner, like a Random Forest, might be advantageous when training data is sparse to avoid overfitting. Future research could explore the impact of non-linear projections in internal stacking and the implications of missing or low values on accuracy. Additionally, while predefined groups were used in the UK Biobank case, the exploration of statistically defined groups through clustering could be a compelling avenue for research.