Attention Mechanisms Don't Learn Additive Models: Rethinking Feature Importance for Transformers (2405.13536v1)
Abstract: We address the critical challenge of applying feature attribution methods to the transformer architecture, which dominates current applications in natural language processing and beyond. Traditional attribution methods in explainable AI (XAI) explicitly or implicitly rely on linear or additive surrogate models to quantify the impact of input features on a model's output. In this work, we formally prove an alarming incompatibility: transformers are structurally incapable of aligning with popular surrogate models for feature attribution, undermining the grounding of these conventional explanation methodologies. To address this discrepancy, we introduce the Softmax-Linked Additive Log-Odds Model (SLALOM), a novel surrogate model specifically designed to align with the transformer framework. Unlike existing methods, SLALOM can deliver a range of faithful and insightful explanations across both synthetic and real-world datasets. Showing that diverse explanations computed from SLALOM outperform common surrogate explanations on different tasks, we highlight the need for task-specific feature attributions rather than a one-size-fits-all approach.
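For orientation, the additive surrogate form underlying conventional attribution methods writes the explained output as a constant plus a sum of per-feature contributions. The abstract does not state SLALOM's functional form, so the sketch below is only an assumption based on the model's name (softmax-linked, additive, log-odds), intended to contrast the two shapes rather than to reproduce the paper's definition:

\[
g_{\text{additive}}(x) = \phi_0 + \sum_{i=1}^{n} \phi_i x_i
\qquad\text{vs.}\qquad
g_{\text{softmax-linked}}(x) = \sum_{i=1}^{n} \frac{e^{s(x_i)}}{\sum_{j=1}^{n} e^{s(x_j)}}\, v(x_i),
\]

where \(s(\cdot)\) and \(v(\cdot)\) would denote per-token importance and value scores in log-odds space; both are hypothetical placeholders, not notation taken from the paper.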