Summary of "Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods"
The paper "Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods" by Slack et al. presents a rigorous examination of the vulnerabilities inherent to popular post hoc explanation techniques, specifically LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations). These explanatory methods are widely used in deploying machine learning models to ensure their interpretability, particularly in high-stakes domains such as healthcare and criminal justice. This paper demonstrates how these methods, which rely on input perturbations, are susceptible to adversarial attacks that can misleadingly mask the inherent biases of machine learning models.
Key Contributions
The key contributions of this paper are as follows:
- Adversarial Scaffolding Technique: The researchers propose a novel scaffolding technique for crafting biased classifiers whose post hoc explanations can be made to show arbitrary, attacker-chosen features. The technique constructs a scaffolded classifier whose predictions on the actual data distribution remain biased, while its behavior on perturbed data points (the ones used to generate explanations) appears unbiased.
- Experimental Validation: Through extensive evaluations on multiple real-world datasets (COMPAS, Communities and Crime, and German Credit), the paper shows that biased classifiers built with this framework can fool both LIME and SHAP into generating innocuous explanations.
- Perturbation Detection and OOD Classifiers: The approach trains a classifier to detect the out-of-distribution (OOD) instances produced by input perturbations. This lets the scaffolded adversarial classifier behave differently on real-world versus perturbed data points, thereby misleading the explanation techniques (see the sketch after this list).
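The scaffold itself reduces to a routing rule. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the class name, the scikit-learn-style predict/predict_proba interface, and the convention that the OOD detector outputs 1 for in-distribution points are all hypothetical.

```python
import numpy as np

class ScaffoldedClassifier:
    """Routes real-looking inputs to a biased model and detected perturbations
    to an innocuous model, so perturbation-based explainers only ever observe
    the innocuous behavior."""

    def __init__(self, biased_model, innocuous_model, ood_detector):
        self.biased_model = biased_model        # e.g., predicts from a sensitive feature
        self.innocuous_model = innocuous_model  # e.g., predicts from an uncorrelated feature
        self.ood_detector = ood_detector        # assumed to output 1 for in-distribution points

    def predict(self, X):
        X = np.asarray(X)
        is_real = self.ood_detector.predict(X).astype(bool)
        return np.where(is_real,
                        self.biased_model.predict(X),
                        self.innocuous_model.predict(X))

    def predict_proba(self, X):
        # LIME and KernelSHAP typically query predict_proba, so the same
        # routing has to be applied here.
        X = np.asarray(X)
        is_real = self.ood_detector.predict(X).astype(bool)
        return np.where(is_real[:, None],
                        self.biased_model.predict_proba(X),
                        self.innocuous_model.predict_proba(X))
```

Because LIME and KernelSHAP estimate explanations almost entirely from perturbed queries, the resulting explanations reflect the innocuous model, while real inputs are still scored by the biased one.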
Numerical Results and Analysis
The paper presents strong numerical results to support its claims:
- Effectiveness in Explanation Masking: For the COMPAS dataset, when using a single uncorrelated feature in the scaffolding, the attacked classifier successfully made the true bias feature (race) disappear from the top-3 important features in 100% of the test instances according to LIME and 84% according to SHAP. Similar effectiveness was observed in other datasets, with variations in the degree of success based on the features and classifiers used.
- Perturbation Detection Accuracy: The paper analyzes how attack success depends on the quality of the OOD classifier, showing that the attack becomes more effective as the OOD classifier's F1 score increases. Once the detector reaches reasonable accuracy, both LIME and SHAP are highly vulnerable to the attack.
- Robustness to Hyperparameters: The adversarial classifiers remain effective under various hyperparameter configurations for LIME (e.g., kernel width) and SHAP (e.g., background distribution), as illustrated in the sketch after this list.
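For context on what these hyperparameters are, the following sketch (assuming the standard lime and shap Python packages; the synthetic data and the LogisticRegression stand-in for the adversarial classifier are placeholders) shows where LIME's kernel width and SHAP's background distribution enter the explainer calls that the robustness experiments vary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
import lime.lime_tabular
import shap

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] > 0).astype(int)
feature_names = [f"f{i}" for i in range(4)]

# Stand-in for the adversarial classifier; any model exposing predict_proba works here.
clf = LogisticRegression().fit(X_train, y_train)

# LIME: the kernel width is an explicit constructor argument.
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    kernel_width=0.75 * np.sqrt(X_train.shape[1]),  # the value varied in the robustness checks
)
lime_exp = lime_explainer.explain_instance(X_train[0], clf.predict_proba, num_features=3)
print(lime_exp.as_list())

# KernelSHAP: the background (reference) distribution is the second argument.
background = shap.kmeans(X_train, 10)
shap_explainer = shap.KernelExplainer(clf.predict_proba, background)
print(shap_explainer.shap_values(X_train[:1]))
```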
Implications and Future Directions
The implications of this research are profound for both theoretical machine learning and practical applications:
- Trust in Explanatory Techniques: By demonstrating the ease with which LIME and SHAP can be fooled, the paper calls into question the reliability of these explanation techniques in high-stakes applications. This has significant consequences for regulatory frameworks and the deployment of ML models in sensitive areas.
- Development of Robust Explanation Methods: The findings suggest an urgent need for developing more robust post hoc explanation techniques that can withstand adversarial attacks. Ensuring that explanations remain faithful to the underlying model behavior, even under adversarial conditions, should be a cornerstone of future research.
- Enhanced Bias Detection: This work indicates that current bias detection mechanisms relying on explanation techniques might be insufficient. Future research should focus on integrating more rigorous, possibly multi-faceted approaches to detect and mitigate biases in ML models.
Conclusion
In summary, "Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods" provides a thorough investigation into the vulnerabilities of widely used local explanation techniques. By crafting adversarial classifiers that mask biases from these techniques, the paper highlights substantial gaps in current methodologies. The results prompt a reassessment of the reliability of post hoc explanatory tools and point towards the necessity for developing more resilient and robust methods to ensure interpretability and fairness in machine learning.