Faithfulness Tests for Natural Language Explanations (2305.18029v2)
Abstract: Explanations of neural models aim to reveal a model's decision-making process for its predictions. However, recent work shows that current explanation methods, such as saliency maps or counterfactuals, can be misleading, as they are prone to presenting reasons that are unfaithful to the model's inner workings. This work explores the challenging question of evaluating the faithfulness of natural language explanations (NLEs). To this end, we present two tests. First, we propose a counterfactual input editor for inserting reasons that lead to counterfactual predictions but are not reflected in the NLEs. Second, we reconstruct inputs from the reasons stated in the generated NLEs and check how often they lead to the same predictions. Our tests can evaluate emerging NLE models, providing a fundamental tool in the development of faithful NLEs.
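The two tests can be summarized procedurally. The sketch below is a minimal illustration only: the `predict_and_explain`-style model interface, the front-of-input insertion strategy, the candidate-word list, and the `extract_reasons` helper are all assumptions introduced here, not names or procedures from the paper.

```python
# Minimal sketch of the two faithfulness tests described in the abstract.
# The model interface, insertion strategy, and reason extraction are
# placeholder assumptions for illustration.

from typing import Callable, Iterable, Tuple

Model = Callable[[str], Tuple[str, str]]  # input text -> (predicted label, NLE)


def counterfactual_insertion_test(model: Model, text: str,
                                  candidate_words: Iterable[str]) -> float:
    """Insert candidate words and count insertions that flip the prediction
    but are not mentioned in the new NLE (i.e., unfaithful explanations)."""
    orig_label, _ = model(text)
    flipped, unfaithful = 0, 0
    for word in candidate_words:
        edited = f"{word} {text}"          # naive insertion at the front of the input
        new_label, new_nle = model(edited)
        if new_label != orig_label:        # the insertion caused a counterfactual prediction
            flipped += 1
            if word.lower() not in new_nle.lower():
                unfaithful += 1            # the NLE does not reflect the inserted reason
    return unfaithful / flipped if flipped else 0.0


def input_reconstruction_test(model: Model, text: str,
                              extract_reasons: Callable[[str], str]) -> bool:
    """Rebuild an input from the reasons stated in the NLE and check whether
    the model keeps its original prediction on the reconstruction."""
    label, nle = model(text)
    reconstructed = extract_reasons(nle)   # e.g., keep only the phrases the NLE cites
    new_label, _ = model(reconstructed)
    return new_label == label              # True indicates a consistent (faithful) case
```

In the paper, the inserted reasons come from a learned counterfactual editor and the reconstruction is built from the reasons the NLE states; both steps are deliberately reduced to simple placeholders in this sketch.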
Authors: Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, Isabelle Augenstein