Attribution-based Explanations that Provide Recourse Cannot be Robust (2205.15834v3)
Abstract: Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to obtain actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decision are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ being explained should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets of multiple attributions, and we provide sufficient conditions for specific classes of continuous functions to be recourse sensitive. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of $x$, by providing an exact characterization of the functions $f$ to which impossibility applies.
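Since the abstract states robustness and recourse sensitivity only informally, the following LaTeX sketch spells out one natural reading of the two properties; the notation ($\phi$ for the attribution map, $L$ for a Lipschitz constant, $u$ for the user's utility, $\delta$ for a recourse action) is assumed here for illustration and is not fixed by the abstract itself.

```latex
% Illustrative reading only: phi, L, u and delta are assumed notation,
% not the paper's formal definitions.
\begin{align*}
  &\text{Attribution method: } \phi(f, x) \in \mathbb{R}^d
     \quad \text{(one importance weight per input feature).}\\
  &\text{Robustness: } \|\phi(f, x) - \phi(f, x')\| \le L\,\|x - x'\|
     \quad \text{for nearby inputs } x, x'.\\
  &\text{Recourse sensitivity: there exists a limited change } \delta
     \text{ along the highlighted features with}\\
  &\qquad u\bigl(f(x + \delta)\bigr) > u\bigl(f(x)\bigr),
     \quad \text{i.e.\ the attribution points to a decision change the user values.}
\end{align*}
```

Under this reading, the impossibility result says that no single map $\phi$ can satisfy both conditions for every $f$ and every $x$.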
- Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, NeurIPS, 2018.
- Towards the unification and robustness of perturbation and gradient based explanations. In International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research. PMLR, 2021.
- On the robustness of interpretability methods. In Proceedings of the 2018 Workshop on Human Interpretability in Machine Learning. ICML, 2018.
- Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 2020.
- Geometrically enriched latent spaces. In International Conference on Artificial Intelligence and Statistics, AISTATS, Proceedings of Machine Learning Research. PMLR, 2021.
- Set-valued analysis. Springer Science & Business Media, 2009.
- Claude Berge. Topological Spaces: including a treatment of multi-valued functions, vector spaces, and convexity. Courier Corporation, 1997.
- Impossibility theorems for feature attribution. arXiv preprint arXiv:2212.11870, 2022.
- Consistent counterfactuals for deep models. In International Conference on Learning Representations, ICLR, 2022.
- Post-hoc explanations fail to achieve their purpose in adversarial contexts. In Conference on Fairness, Accountability, and Transparency, FAccT. ACM, 2022.
- Multi-objective counterfactual explanations. In International Conference on Parallel Problem Solving from Nature. Springer, 2020.
- Opportunities and challenges in Explainable Artificial Intelligence (XAI): A survey. arXiv preprint arXiv:2006.11371, 2020.
- Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, NeurIPS, 2018.
- Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, NeurIPS, 2019.
- Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
- Paul Erdös. Some remarks on the measurability of certain sets. Bulletin of the American Mathematical Society, 1945.
- The robustness of counterfactual explanations over time. IEEE Access, 2022.
- Generalizing the Poincaré–Miranda theorem: the avoiding cones condition. Annali di Matematica Pura ed Applicata, 2016.
- Counterfactual evaluation for explainable AI. arXiv preprint arXiv:2109.01962, 2021.
- Interpretation of neural networks is fragile. In Conference on Artificial Intelligence, AAAI. AAAI Press, 2019.
- A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 2018.
- Robust counterfactual explanations for neural networks with probabilistic guarantees. In International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research. PMLR, 2023.
- A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, NeurIPS, 2019.
- Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL. Association for Computational Linguistics, 2020.
- Diagnosing AI explanation methods with folk concepts of behavior. In Conference on Fairness, Accountability, and Transparency, FAccT. ACM, 2023.
- Towards realistic individual recourse and actionable explanations in black-box decision making systems. arXiv preprint arXiv:1907.09615, 2019.
- Model-agnostic counterfactual explanations for consequential decisions. In International Conference on Artificial Intelligence and Statistics, AISTATS, Proceedings of Machine Learning Research. PMLR, 2020.
- A survey of algorithmic recourse: contrastive explanations and consequential recommendations. ACM Computing Surveys (CSUR), 2021.
- If only we had better counterfactual explanations. In International Joint Conference on Artificial Intelligence, IJCAI, 2021.
- The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Lecture Notes in Computer Science. Springer, 2019.
- The dangers of post-hoc interpretability: unjustified counterfactual explanations. In International Joint Conference on Artificial Intelligence, IJCAI, 2019.
- Towards falsifiable interpretability research. arXiv preprint arXiv:2010.12016, 2020.
- Explainable AI: A review of machine learning interpretability methods. Entropy, 2020.
- Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 2018.
- A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, NeurIPS, 2017.
- Bert Mendelson. Introduction to topology. Courier Corporation, 1990.
- Christoph Molnar. Interpretable Machine Learning. 2nd edition, 2022.
- Explaining machine learning classifiers through diverse counterfactual explanations. In Conference on Fairness, Accountability, and Transparency, FAccT. ACM, 2020.
- On the trade-off between actionable explanations and the right to be forgotten. In International Conference on Learning Representations, ICLR, 2022.
- FACE: Feasible and actionable counterfactual explanations. In Conference on AI, Ethics, and Society, AIES. AAAI, ACM, 2020.
- "Why should I trust you?" Explaining the predictions of any classifier. In International Conference on Knowledge Discovery and Data Mining, SIGKDD. ACM, 2016.
- Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019.
- Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Lecture Notes in Computer Science. Springer Nature, 2019.
- Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations, ICLR, Workshop Track Proceedings, 2014.
- Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Conference on AI, Ethics, and Society, AIES. AAAI, ACM, 2020.
- SmoothGrad: removing noise by adding noise. In Proceedings of the 2017 Workshop on Visualization for Deep Learning. ICML, 2017.
- Axiomatic attribution for deep networks. In International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research. PMLR, 2017.
- Towards robust and reliable algorithmic recourse. In Advances in Neural Information Processing Systems, NeurIPS, 2021.
- Actionable recourse in linear classification. In Conference on Fairness, Accountability, and Transparency, FAccT. ACM, 2019.
- scikit-image: image processing in Python. PeerJ, 2014.
- Kush R. Varshney. Trustworthy Machine Learning. Independently Published, Chappaqua, NY, USA, 2022.
- Counterfactual explanations for machine learning: A review. arXiv preprint arXiv:2010.10596, 2020.
- Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 2017.
- Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 2021.