Attention Mechanisms Don't Learn Additive Models: Rethinking Feature Importance for Transformers (2405.13536v2)

Published 22 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: We address the critical challenge of applying feature attribution methods to the transformer architecture, which dominates current applications in natural language processing and beyond. Traditional attribution methods to explainable AI (XAI) explicitly or implicitly rely on linear or additive surrogate models to quantify the impact of input features on a model's output. In this work, we formally prove an alarming incompatibility: transformers are structurally incapable of representing linear or additive surrogate models used for feature attribution, undermining the grounding of these conventional explanation methodologies. To address this discrepancy, we introduce the Softmax-Linked Additive Log Odds Model (SLALOM), a novel surrogate model specifically designed to align with the transformer framework. SLALOM demonstrates the capacity to deliver a range of insightful explanations with both synthetic and real-world datasets. We highlight SLALOM's unique efficiency-quality curve by showing that SLALOM can produce explanations with substantially higher fidelity than competing surrogate models or provide explanations of comparable quality at a fraction of their computational costs. We release code for SLALOM as an open-source project online at https://github.com/tleemann/slalom_explanations.


Summary

  • The paper demonstrates that transformer attention mechanisms inherently cannot represent additive models, challenging current feature attribution practices.
  • It introduces SLALOM, a novel surrogate model that effectively captures token-level interactions and non-linearities in transformer architectures.
  • Empirical results on synthetic and real-world datasets validate SLALOM's ability to recover accurate parameter mappings and logit scores.

Insights on Feature Attribution for Transformers: Analyzing and Addressing Limitations

The paper "Attention Mechanisms Don’t Learn Additive Models: Rethinking Feature Importance for Transformers" by Leemann et al. addresses a central challenge for feature attribution methods applied to transformer architectures. Because transformers dominate natural language processing and many adjacent applications, interpretation methods need to be grounded in, and consistent with, their structural properties.

Key Findings and Contributions

The core finding of this paper is that transformers are structurally incompatible with the linear or additive surrogate models that traditionally underpin feature attribution methods. The authors prove, both theoretically and empirically, that transformers cannot represent additive models, including generalized additive models (GAMs) and linear models, because of the structure introduced by the attention mechanism. This casts doubt on the faithfulness of existing explanation practices in interpretability-critical domains, such as judicial or medical settings, where LLMs are increasingly deployed.

In response to these challenges, the authors propose the Softmax-Linked Additive Log Odds Model (SLALOM), a surrogate model designed to match the functional form of transformer architectures. Unlike conventional surrogates, SLALOM describes each token with a two-dimensional representation: a token value, capturing its independent contribution to the output, and a token importance, capturing how strongly it is weighted relative to the other tokens in the sequence. By accommodating these non-linearities and interactions, SLALOM can represent behaviors that purely additive surrogates cannot.
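To make this concrete, the following sketch implements the softmax-linked functional form as described above: each token carries a value and an importance, and the sequence-level log-odds is a softmax-weighted average of the values. The parameter names and toy vocabulary are illustrative assumptions, not taken from the released SLALOM code.

```python
import numpy as np

def slalom_log_odds(tokens, values, importances):
    """Softmax-linked additive log-odds score for a token sequence.

    values[t] is the token-value parameter (independent contribution);
    importances[t] is the token-importance parameter. The output is a
    softmax-weighted average of the values of the tokens that are present.
    """
    v = np.array([values[t] for t in tokens], dtype=float)
    s = np.array([importances[t] for t in tokens], dtype=float)
    w = np.exp(s - s.max())          # numerically stable softmax weights
    w /= w.sum()
    return float(w @ v)              # log-odds of the positive class

# Toy vocabulary (hypothetical numbers): a strong positive word, a filler word, a negative word.
values      = {"great": 3.0, "movie": 0.1, "boring": -2.5}
importances = {"great": 2.0, "movie": -1.0, "boring": 1.5}

print(slalom_log_odds(["great", "movie"], values, importances))   # strongly positive
print(slalom_log_odds(["boring", "movie"], values, importances))  # strongly negative
```

Because the weights are renormalized per sequence, a token's effective contribution depends on which other tokens accompany it, which is exactly the kind of behavior an additive surrogate cannot express.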

Empirical validation on synthetic and real-world datasets shows that SLALOM delivers explanations with substantially higher fidelity than competing surrogate models, or explanations of comparable quality at a fraction of their computational cost. The model is robust across diverse tasks, underlining that feature attributions should be tailored to the task at hand rather than applied as a one-size-fits-all procedure.

Theoretical and Empirical Analysis

The paper shows how common transformer architectures fail to represent GAMs and linear models. The argument centers on the attention mechanism: softmax normalization ties each token's weight to the entire token sequence, so the contribution of one token cannot stay independent of the others, which is precisely what an additive function requires.
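The toy example below illustrates this coupling in isolation. It is a deliberate caricature, not a transformer: a single softmax-pooled scorer with made-up token parameters. The marginal effect of adding the same token differs depending on the rest of the sequence, so no additive model of the form g(x) = g_1(x_1) + ... + g_n(x_n) can reproduce it.

```python
import numpy as np

# Hypothetical per-token parameters for a softmax-pooled scorer.
value = {"great": 3.0, "movie": 0.1, "masterpiece": 4.0}
score = {"great": 2.0, "movie": -1.0, "masterpiece": 3.0}   # pre-softmax scores

def pooled_output(tokens):
    s = np.array([score[t] for t in tokens])
    v = np.array([value[t] for t in tokens])
    w = np.exp(s - s.max())
    w /= w.sum()                      # softmax over the whole sequence
    return float(w @ v)

# The marginal effect of "great" even changes sign depending on its context:
print(pooled_output(["great", "movie"]) - pooled_output(["movie"]))              # ~ +2.76
print(pooled_output(["great", "masterpiece"]) - pooled_output(["masterpiece"]))  # ~ -0.27
```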

Corroborating the theory, experiments show that transformers, regardless of depth, fail to capture the linear relationships in synthetic datasets generated from linear log-odds models. Fully connected networks trained on the same data, by contrast, recover these linear relationships successfully.
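For readers who want to reproduce this kind of setup, the snippet below generates a dataset of the sort described: each vocabulary token has a ground-truth weight, and labels are drawn from a Bernoulli distribution whose log-odds is the sum of the weights of the tokens present. The vocabulary size, sequence length, and weight distribution are arbitrary assumptions; the paper's generator may differ in these details.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, seq_len, n_samples = 100, 10, 5000
token_weights = rng.normal(0.0, 1.0, size=vocab_size)  # ground-truth linear log-odds weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each sample is a random token-id sequence; its label follows a Bernoulli
# whose log-odds is the *sum* of the weights of its tokens (purely additive).
X = rng.integers(0, vocab_size, size=(n_samples, seq_len))
log_odds = token_weights[X].sum(axis=1)
y = rng.binomial(1, sigmoid(log_odds))

print(X.shape, y.mean())
```

A model that truly learned this data-generating process would assign each token a context-independent effect on the log-odds; the paper's point is that transformer classifiers trained on such data do not.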

Conversely, the research shows that SLALOM can efficiently recover and represent transformer outputs. Experiments indicate that SLALOM recovers the true parameter mappings of transformers trained on SLALOM-generated data, and that its fitted scores closely approximate the ground-truth logits.
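Given the functional form, one plausible and cheap way to fit SLALOM to a black-box scorer is to query it on very short sequences: single-token queries pin down the values, and pairing each token with a fixed reference token pins down the importances up to a shared offset. This is a sketch consistent with the model definition as summarized here, not necessarily the estimation procedure used by the authors or their released code.

```python
import numpy as np

def fit_slalom(score_fn, vocab, ref_token):
    """Fit SLALOM value/importance parameters from short-sequence queries.

    Assumes score_fn(tokens) returns log-odds, that single-token sequences are
    valid inputs, and that every token's value differs from the reference's.
    """
    values = {t: score_fn([t]) for t in vocab}    # softmax over a single token is 1
    v_ref = values[ref_token]
    importances = {ref_token: 0.0}                # anchor the reference importance at 0
    for t in vocab:
        if t == ref_token:
            continue
        f_pair = score_fn([t, ref_token])
        # f_pair = alpha * v_t + (1 - alpha) * v_ref, where alpha is t's softmax weight.
        alpha = np.clip((f_pair - v_ref) / (values[t] - v_ref), 1e-6, 1 - 1e-6)
        importances[t] = float(np.log(alpha / (1.0 - alpha)))
    return values, importances

# Toy check: recover the parameters of a known SLALOM-style scorer.
true_v = {"a": 1.0, "b": -2.0, "c": 0.5}
true_s = {"a": 0.0, "b": 1.5, "c": -0.5}

def toy_scorer(tokens):
    v = np.array([true_v[t] for t in tokens])
    s = np.array([true_s[t] for t in tokens])
    w = np.exp(s - s.max())
    w /= w.sum()
    return float(w @ v)

vals, imps = fit_slalom(toy_scorer, ["a", "b", "c"], ref_token="a")
print(vals)  # matches true_v
print(imps)  # matches true_s up to the anchor s_a = 0
```

When the scorer is a real transformer rather than an exact SLALOM model, such point estimates only approximate local behavior; the repository linked in the abstract should be treated as the authoritative implementation.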

Implications and Future Directions

The implications of this work are multi-faceted. By illuminating the structural limitations of transformers in learning additive models, the paper signals a crucial oversight in existing XAI practices. The findings urge a re-evaluation of current interpretability methodologies, particularly for applications in high-stakes domains where model transparency is paramount.

Practically, integrating SLALOM into interpretability pipelines may enable more nuanced and reliable feature attributions. Researchers and practitioners are encouraged to explore task-specific attributions, which the paper argues offer greater explanatory power than a single attribution scheme applied uniformly across the diverse capabilities of LLMs.

From a theoretical perspective, the paper opens avenues for exploring alternative surrogate models capable of capturing the complex token interactions that transformers induce. Refining SLALOM, and pursuing analogous surrogates tailored to other architectures, could further advance XAI frameworks.

In conclusion, Leemann et al.'s work prompts a rethinking of how feature attribution is understood and applied to transformer models, directing the community toward more accurate and insightful interpretability methods.
