Feature Interactions Reveal Linguistic Structure in Language Models (2306.12181v1)

Published 21 Jun 2023 in cs.CL

Abstract: We study feature interactions in the context of feature attribution methods for post-hoc interpretability. In interpretability research, getting to grips with feature interactions is increasingly recognised as an important challenge, because interacting features are key to the success of neural networks. Feature interactions allow a model to build up hierarchical representations of its input, and might provide an ideal starting point for investigating linguistic structure in LLMs. However, uncovering the exact role that these interactions play is difficult, and a diverse range of interaction attribution methods has been proposed. In this paper, we focus on the question of which of these methods most faithfully reflects the inner workings of the target models. We work out a grey-box methodology, in which we train models to perfection on a formal language classification task, using probabilistic context-free grammars (PCFGs). We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model. Based on these findings we extend our evaluation to a case study on LLMs, providing novel insights into the linguistic structure that these models have acquired.
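To give a concrete sense of what "feature interaction attribution" means here, the sketch below computes the simplest such quantity: a pairwise interaction score obtained as a second-order occlusion difference on a model's scalar output. This is a generic illustration of the concept, not the paper's specific methods; the function names, the `<pad>` baseline token, and the toy bracket-matching model are all invented for the example.

```python
# Minimal sketch of pairwise feature interaction via a second-order
# occlusion difference. All names here are illustrative assumptions.
from itertools import combinations

def pairwise_interaction(f, x, baseline, i, j):
    """Interaction of input positions i and j for a scalar model f.

    Computes f(x) - f(x without i) - f(x without j) + f(x without both):
    the discrete mixed difference, which is zero exactly when positions
    i and j contribute additively (i.e. do not interact).
    """
    def ablate(indices):
        # Replace the ablated positions with a baseline token.
        return [baseline if k in indices else v for k, v in enumerate(x)]
    return f(x) - f(ablate({i})) - f(ablate({j})) + f(ablate({i, j}))

# Toy "model": fires only when an opening and a closing bracket
# co-occur, mimicking a grammatical rule a classifier trained on a
# PCFG-generated formal language might acquire.
def toy_model(tokens):
    return 1.0 if "(" in tokens and ")" in tokens else 0.0

x = ["(", "a", ")"]
scores = {(i, j): pairwise_interaction(toy_model, x, "<pad>", i, j)
          for i, j in combinations(range(len(x)), 2)}
# The bracket pair (positions 0 and 2) interacts; pairs involving
# position 1 ("a") score zero.
```

A faithful interaction method applied to such a model should assign high scores to exactly the token pairs that the grammar forces to co-occur, which is the intuition behind evaluating attribution methods on formal languages with known rules.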
