Explanation Regularisation through the Lens of Attributions
Abstract: Explanation regularisation (ER) has been introduced as a way to guide text classifiers to form their predictions relying on input tokens that humans consider plausible. This is achieved by introducing an auxiliary explanation loss that measures how well the output of an input attribution technique for the model agrees with human-annotated rationales. The guidance appears to benefit performance in out-of-domain (OOD) settings, presumably due to an increased reliance on "plausible" tokens. However, previous work has under-explored the impact of guidance on that reliance, particularly when reliance is measured using attribution techniques different from those used to guide the model. In this work, we seek to close this gap, and also explore the relationship between reliance on plausible features and OOD performance. We find that the connection between ER and the ability of a classifier to rely on plausible features has been overstated and that a stronger reliance on plausible tokens does not seem to be the cause of OOD improvements.
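To make the auxiliary explanation loss concrete, the following is a minimal sketch of an ER-style objective. It is an illustration under stated assumptions, not the formulation used in the paper: the abstract does not specify the form of the loss, so the sketch assumes a binary cross-entropy between per-token attribution scores (already scaled to [0, 1]) and a binary human rationale mask, added to the task loss with a hypothetical weight `lam`. The function names are likewise illustrative.

```python
import math


def cross_entropy(probs, label):
    """Task loss: negative log-likelihood of the gold label."""
    return -math.log(probs[label])


def explanation_loss(attributions, rationale_mask, eps=1e-12):
    """Assumed explanation loss: mean binary cross-entropy between
    per-token attribution scores in [0, 1] and a binary rationale mask
    (1 = token annotated as plausible by a human)."""
    total = 0.0
    for a, r in zip(attributions, rationale_mask):
        a = min(max(a, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(r * math.log(a) + (1 - r) * math.log(1.0 - a))
    return total / len(attributions)


def er_loss(probs, label, attributions, rationale_mask, lam=1.0):
    """Joint objective: task loss plus lam times the explanation loss.
    lam is a hypothetical trade-off weight."""
    return cross_entropy(probs, label) + lam * explanation_loss(
        attributions, rationale_mask
    )


# Toy example: three tokens, the first and third marked as rationale.
mask = [1, 0, 1]
aligned = [0.9, 0.1, 0.8]      # attributions that agree with the mask
misaligned = [0.1, 0.9, 0.2]   # attributions that disagree with the mask
probs = [0.2, 0.8]             # predicted class probabilities, gold label 1

loss_aligned = er_loss(probs, 1, aligned, mask)
loss_misaligned = er_loss(probs, 1, misaligned, mask)
```

In this toy setting the joint loss is lower when the attributions agree with the rationale mask, which is the pressure ER applies during training. Note that the attribution scores here are fixed numbers; in an actual ER setup they would come from a differentiable attribution technique applied to the model, and the choice of that technique is exactly what the abstract identifies as under-explored.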