Impossibility Theorems for Feature Attribution (2212.11870v3)

Published 22 Dec 2022 in cs.LG and cs.AI

Abstract: Despite a sea of interpretability methods that can produce plausible explanations, the field has also empirically seen many failure cases of such methods. In light of these results, it remains unclear for practitioners how to use these methods and choose between them in a principled way. In this paper, we show that for moderately rich model classes (easily satisfied by neural networks), any feature attribution method that is complete and linear -- for example, Integrated Gradients and SHAP -- can provably fail to improve on random guessing for inferring model behaviour. Our results apply to common end-tasks such as characterizing local model behaviour, identifying spurious features, and algorithmic recourse. One takeaway from our work is the importance of concretely defining end-tasks: once such an end-task is defined, a simple and direct approach of repeated model evaluations can outperform many other complex feature attribution methods.

Citations (42)

Summary

  • The paper establishes that complete, linear feature attribution methods often perform no better than random guessing in counterfactual analysis.
  • It rigorously formalizes the limitations of popular techniques such as Integrated Gradients and SHAP through impossibility theorems supported by empirical data.
  • The study proposes a brute-force querying approach as a reliable, though computationally intensive, alternative, urging the development of more robust interpretability methods.

An Overview of "Impossibility Theorems for Feature Attribution"

The paper "Impossibility Theorems for Feature Attribution" by Blair Bilodeau, Natasha Jaques, Pang Wei Koh, and Been Kim addresses the challenges and limitations of feature attribution methods in machine learning. These methods are crucial for interpreting the behavior of complex models like neural networks, particularly in making sense of decisions or predictions. However, this research presents theoretical results showing inherent limitations in popular attribution techniques such as Integrated Gradients and SHAP.

Key Findings and Theoretical Implications

The main result establishes that, for moderately rich model classes (a condition easily satisfied by neural networks), any feature attribution method satisfying completeness and linearity, including Integrated Gradients and SHAP, can provably do no better than random guessing when used to infer certain aspects of model behavior. This applies to end-tasks such as algorithmic recourse, i.e., determining how a feature must change to achieve a desired model prediction, and identifying spurious features that influence predictions even though they should not.
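
To make the two axioms concrete, the following is a minimal, illustrative sketch of Integrated Gradients; the model, inputs, and finite-difference gradients are placeholders, not code from the paper. Completeness requires the attributions to sum to f(x) - f(baseline), which the final check verifies; linearity holds because the attribution of a weighted sum of models is the corresponding weighted sum of their attributions.

    # Illustrative sketch of Integrated Gradients (IG) and the completeness axiom.
    # The model and inputs are placeholders, not taken from the paper.
    import numpy as np

    def f(x):
        # A small nonlinear stand-in "model": logistic of a quadratic score.
        return 1.0 / (1.0 + np.exp(-(2.0 * x[0] - 1.5 * x[1] ** 2 + 0.5)))

    def grad_f(x, eps=1e-5):
        # Central finite differences keep the sketch model-agnostic.
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    def integrated_gradients(x, baseline, steps=200):
        # Midpoint Riemann-sum approximation of the IG path integral.
        alphas = (np.arange(steps) + 0.5) / steps
        total = np.zeros_like(x)
        for a in alphas:
            total += grad_f(baseline + a * (x - baseline))
        return (x - baseline) * total / steps

    x = np.array([1.0, -0.5])
    baseline = np.zeros_like(x)
    attr = integrated_gradients(x, baseline)

    # Completeness: the attributions sum (approximately) to f(x) - f(baseline).
    print(attr, attr.sum(), f(x) - f(baseline))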

The authors formalize counterfactual model behavior, the quantity at the heart of many attribution end-tasks, and present impossibility theorems showing that these methods cannot reliably infer how the model's output changes under such counterfactuals. The paper highlights two major flaws in complete and linear methods:

  1. Sensitivity to how the model behaves under a baseline distribution that may be far from the test data, so attributions can be driven by regions of input space that are irrelevant to the instance being explained.
  2. The completeness criterion forces attributions to sum to the difference between the model's output at the input and at the baseline, a constraint that need not align with answering counterfactual questions about individual features (see the sketch below).
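
As an illustration of the second point (a toy construction of ours, not an example from the paper): two models can receive identical, completeness-respecting attributions at an input yet respond very differently to the same local change in a feature.

    # Toy illustration (not from the paper): identical IG attributions at x = 1,
    # very different behavior under a small perturbation of that feature.
    import numpy as np

    def f(x):
        return x                           # a linear model in one feature

    def g(x):
        return x + np.sin(2 * np.pi * x)   # agrees with f at x = 0 and x = 1

    def ig_1d(model, x, baseline=0.0, steps=2000, eps=1e-6):
        # Integrated Gradients for one scalar feature (midpoint Riemann sum,
        # finite-difference gradients).
        alphas = (np.arange(steps) + 0.5) / steps
        grads = [(model(baseline + a * (x - baseline) + eps)
                  - model(baseline + a * (x - baseline) - eps)) / (2 * eps)
                 for a in alphas]
        return (x - baseline) * np.mean(grads)

    x, delta = 1.0, 0.01
    print(ig_1d(f, x), ig_1d(g, x))                    # both approx. 1.0
    print(f(x + delta) - f(x), g(x + delta) - g(x))    # 0.01 vs approx. 0.073

Both models receive the same attribution for the feature, so the attribution alone cannot tell a practitioner whether a small change to that feature will barely move the output or move it several times as much.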

Theoretical Results and Empirical Validation

The paper goes beyond theoretical demonstration by empirically validating these claims with neural networks trained on standard datasets. The experiments show that these attribution methods often perform on par with random guessing on real data across a variety of tasks.

Additionally, the authors propose a simple brute-force approach, based on repeated model evaluations, as a provably reliable though computationally expensive way to infer model behavior. This baseline, while theoretically sound, highlights the need for further work on feature attribution techniques that are both computationally efficient and theoretically robust.
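
A minimal sketch of this repeated-evaluation idea follows; the perturbation scheme and the stand-in predict function are placeholders rather than the paper's exact procedure. To answer a concrete counterfactual question such as "does perturbing feature j change the prediction?", one simply queries the model on perturbed copies of the input and averages the outcomes.

    # Sketch of answering a counterfactual question by repeated model evaluation.
    # The perturbation scheme and the stand-in `predict` function are placeholders.
    import numpy as np

    def flip_probability(predict, x, j, n_samples=1000, scale=1.0, rng=None):
        # Estimate how often perturbing feature j changes the predicted label.
        rng = np.random.default_rng() if rng is None else rng
        base_label = predict(x)
        flips = 0
        for _ in range(n_samples):
            x_pert = x.copy()
            x_pert[j] += rng.normal(scale=scale)   # perturb only feature j
            flips += int(predict(x_pert) != base_label)
        return flips / n_samples

    # Example usage with a stand-in classifier (a thresholded linear score).
    w = np.array([1.0, -2.0, 0.5])
    predict = lambda x: int(x @ w > 0.0)
    x = np.array([0.2, 0.1, -0.3])
    print([flip_probability(predict, x, j, n_samples=2000) for j in range(3)])

Each additional query costs one forward pass, which is why this direct approach is reliable but computationally expensive.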

Implications for AI Practitioners and Future Research

The findings in this paper caution practitioners against over-reliance on popular attribution methods for high-stakes decision-making, especially with complex models. For AI researchers, the results underline the need for attribution methods that achieve stronger guarantees; there is a clear call to refine existing methods or develop new ones underpinned by sound theory. In applications that demand high reliability, methods that are task-specific and leverage model-specific insights may be necessary.

Moreover, the paper's exploration of the sample complexity of reliable inference via brute-force querying paves the way for investigations into more efficient algorithmic solutions. Optimization techniques and formal decision-theoretic frameworks could further improve the efficacy of interpretability methods in machine learning.
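
For intuition about what such a sample-complexity statement looks like, a generic Hoeffding-style bound (not the paper's specific result) says that estimating a probability such as the flip rate above to within ε with confidence 1 − δ needs on the order of ln(2/δ) / (2ε²) model evaluations.

    # Generic Hoeffding-style sample size (illustrative; not the paper's bound):
    # n >= ln(2/delta) / (2 * eps**2) queries suffice to estimate a probability
    # to within eps with probability at least 1 - delta.
    import math

    def n_queries(eps, delta):
        return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

    print(n_queries(0.05, 0.01))   # about 1,060 model evaluations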

In conclusion, this paper makes substantial contributions to the understanding of the limitations of feature attribution methods in machine learning. It challenges the community to rethink the design and application of interpretability tools, ensuring they are robust enough for practical use while guiding theoretical advances in this critical area of AI research.