The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations (1907.09294v1)

Published 22 Jul 2019 in cs.LG, cs.AI, and stat.ML

Abstract: Post-hoc interpretability approaches have been proven to be powerful tools to generate explanations for the predictions made by a trained black-box model. However, they create the risk of having explanations that are a result of some artifacts learned by the model instead of actual knowledge from the data. This paper focuses on the case of counterfactual explanations and asks whether the generated instances can be justified, i.e. continuously connected to some ground-truth data. We evaluate the risk of generating unjustified counterfactual examples by investigating the local neighborhoods of instances whose predictions are to be explained and show that this risk is quite high for several datasets. Furthermore, we show that most state of the art approaches do not differentiate justified from unjustified counterfactual examples, leading to less useful explanations.

Analyzing the Risks and Implications of Unjustified Post-Hoc Counterfactual Explanations

The paper "The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations" by Laugel et al. presents a critical evaluation of post-hoc interpretability methods, focusing on counterfactual explanations. Interpretability is a crucial concern in machine learning, primarily because black-box models are increasingly deployed in critical decision-making systems. The paper scrutinizes the reliability of post-hoc methods, which are widely used to generate explanations for model predictions, and highlights the pitfall of producing spurious, or unjustified, explanations that are disconnected from the original training data.

The core analysis revolves around counterfactual explanations, i.e. instances generated to show the minimal perturbation required to change the predicted class of an observation, and their fidelity to the actual data distribution. The authors argue that some generated counterfactuals are not justifiable, meaning they cannot be connected to the training data via a continuous path. They propose a formal criterion for justified counterfactuals: a counterfactual must be linked to an existing training instance by a path that does not cross the decision boundary, ensuring it is grounded in the data rather than in artifacts of the model's peculiarities.
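
To make this criterion concrete, the following minimal sketch approximates the continuous-path requirement with an epsilon-chain check over a neighborhood graph. The function name is_justified, the epsilon parameter, and the use of scikit-learn's radius graph are illustrative assumptions rather than the authors' reference implementation, and connectivity is only verified at observed points, not along a truly continuous path.

```python
# Minimal sketch of an epsilon-chain justification check (illustrative only).
# The helper name, the epsilon parameter, and the radius-graph approximation
# are assumptions; connectivity is checked at observed points rather than
# along a continuous path as in the paper's formal definition.
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import radius_neighbors_graph


def is_justified(counterfactual, X_train, y_pred_train, clf, epsilon):
    """Return True if the counterfactual is linked to at least one training
    instance of the same predicted class through an epsilon-neighborhood graph."""
    cf_class = clf.predict(counterfactual.reshape(1, -1))[0]
    # Candidate anchors: training points the model assigns to the same class.
    anchors = X_train[y_pred_train == cf_class]
    if len(anchors) == 0:
        return False
    # Epsilon-neighborhood graph over the counterfactual and its candidate anchors.
    points = np.vstack([counterfactual.reshape(1, -1), anchors])
    graph = radius_neighbors_graph(points, radius=epsilon, mode="connectivity")
    _, labels = connected_components(graph, directed=False)
    # Justified if some anchor shares a connected component with the
    # counterfactual (index 0 is the counterfactual itself).
    return bool(np.any(labels[1:] == labels[0]))
```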

To evaluate the prevalence of unjustified counterfactuals, the authors introduce the Local Risk Assessment (LRA) procedure, a statistical test that examines the local neighborhood of an instance whose prediction is to be explained and detects counterfactual explanations that are artifacts of the classifier rather than valid generalizations of the data. Their findings across multiple datasets reveal a significant risk of generating unjustified counterfactuals (up to 81% for the Recidivism dataset they examined), which poses a substantial challenge to the reliability of post-hoc methods used in practice today.
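
The sketch below captures the spirit of such a local risk assessment rather than the paper's exact LRA procedure: it samples the neighborhood of the instance, keeps the samples the model assigns to another class, and estimates the fraction that fail the justification check. The sampling scheme, the radius parameter, and the reuse of the hypothetical is_justified helper from the previous sketch are assumptions for illustration.

```python
# Rough sketch in the spirit of a local risk assessment (not the exact LRA).
# It reuses the hypothetical is_justified helper from the previous sketch;
# the sampling scheme and radius are illustrative assumptions.
import numpy as np


def local_risk(x, X_train, y_pred_train, clf, radius, epsilon, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    x_class = clf.predict(x.reshape(1, -1))[0]
    # Sample uniformly inside a ball of the given radius around x.
    directions = rng.normal(size=(n_samples, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n_samples, 1)) ** (1.0 / d)
    samples = x + radii * directions
    # Candidate counterfactuals: samples the model assigns to another class.
    candidates = samples[clf.predict(samples) != x_class]
    if len(candidates) == 0:
        return 0.0
    unjustified = sum(
        not is_justified(c, X_train, y_pred_train, clf, epsilon) for c in candidates
    )
    # Estimated risk: fraction of local counterfactual candidates that are unjustified.
    return unjustified / len(candidates)
```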

In addition to LRA, the paper proposes a Vulnerability Evaluation (VE) procedure to assess how susceptible common counterfactual generation methods are to this problem. Three methods are evaluated: HCLS, Growing Spheres, and LORE. The authors find that in contexts with high local complexity or noise, these methods often produce unjustified counterfactual explanations; Growing Spheres and HCLS in particular frequently returned unjustified explanations in risk-prone regions.
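
For context, the following condensed sketch shows the layer-wise generate-and-search idea behind a Growing Spheres-style method: sample spherical layers of increasing radius around the instance until points of another class appear, then return the closest one. The step size, per-layer sample count, and omission of the original algorithm's feature-sparsity step are simplifications, so this is not the implementation evaluated in the paper.

```python
# Condensed sketch of a Growing Spheres-style layer search (simplified: the
# feature-sparsity post-processing of the original algorithm is omitted, and
# the step size and per-layer sample count are illustrative choices).
import numpy as np


def growing_spheres_sketch(x, clf, step=0.1, n_per_layer=200, max_radius=10.0, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    x_class = clf.predict(x.reshape(1, -1))[0]
    low = 0.0
    while low < max_radius:
        high = low + step
        # Sample points in the spherical layer between radii low and high.
        directions = rng.normal(size=(n_per_layer, d))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(low, high, size=(n_per_layer, 1))
        layer = x + radii * directions
        # Keep the points the model assigns to a different class than x.
        enemies = layer[clf.predict(layer) != x_class]
        if len(enemies) > 0:
            # Return the closest counterfactual found in this layer.
            return enemies[np.argmin(np.linalg.norm(enemies - x, axis=1))]
        low = high
    return None  # no counterfactual found within max_radius
```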

The paper’s outcomes have profound implications for researchers and practitioners who rely on interpretability methods, particularly in sensitive domains such as finance or healthcare, where unjustified explanations can propagate incorrect conclusions and skew decision-making. The authors advocate making greater use of methods that incorporate ground-truth data during post-hoc analysis to ensure the validity of the explanations provided.

This research exposes the possibility that explanations meant to provide transparency can instead obscure or distort the underlying reality when they stem from artifacts of the trained model. This demands a reevaluation of how interpretability techniques are designed and used. Future research should address this gap, possibly through hybrid approaches that integrate post-hoc explanations with inherently transparent models or with validated grounding in the training data.

Overall, the paper offers a nuanced critique of post-hoc interpretability approaches, highlighting significant vulnerabilities. It contributes to the discourse on trustworthy AI by underscoring the need for caution in relying on post-hoc explanations and provides empirical evidence urging the community to develop more robust and faithful interpretability tools. As AI systems continue to permeate various aspects of life and society, the fidelity of interpretability methods remains an urgent concern warranting both theoretical and practical advancements.

Authors (5)
  1. Thibault Laugel (18 papers)
  2. Marie-Jeanne Lesot (22 papers)
  3. Christophe Marsala (9 papers)
  4. Xavier Renard (14 papers)
  5. Marcin Detyniecki (41 papers)
Citations (181)