Explainability for Large Language Models: A Survey (2309.01029v3)

Published 2 Sep 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have demonstrated impressive capabilities in natural language processing. However, their internal mechanisms are still unclear and this lack of transparency poses unwanted risks for downstream applications. Therefore, understanding and explaining these models is crucial for elucidating their behaviors, limitations, and social impacts. In this paper, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining Transformer-based LLMs. We categorize techniques based on the training paradigms of LLMs: traditional fine-tuning-based paradigm and prompting-based paradigm. For each paradigm, we summarize the goals and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. We also discuss metrics for evaluating generated explanations, and discuss how explanations can be leveraged to debug models and improve performance. Lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of LLMs in comparison to conventional machine learning models.

Explainability for LLMs: A Survey

The paper "Explainability for LLMs: A Survey" provides a structured taxonomy and overview of techniques for explaining Transformer-based LLMs. This is undertaken in light of the fact that while LLMs such as BERT, GPT-3, and GPT-4 have demonstrated outstanding capabilities in diverse natural language processing tasks, the complexity of their inner workings continues to pose potential risks when deployed in downstream applications. The opaque nature of LLMs necessitates critical approaches to interpretability, as elucidated in the paper.

Taxonomy and Training Paradigms

The authors categorize explainability strategies based on two principal LLM training paradigms: the traditional fine-tuning-based paradigm and the prompting-based paradigm. In the fine-tuning paradigm, models are pre-trained on a broad corpus and then fine-tuned for specific tasks, while the prompting paradigm leverages pre-trained models to generate predictions through contextual prompts without additional downstream training.
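To ground the distinction, here is a minimal sketch (assuming the Hugging Face `transformers` library; the model names are illustrative choices, not ones prescribed by the paper) that solves the same sentiment task once with a task-specifically fine-tuned classifier and once with a generative model driven purely by a prompt:

```python
from transformers import pipeline

# Fine-tuning paradigm: an encoder pre-trained on a broad corpus and then
# fine-tuned for a specific downstream task (here, sentiment classification).
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")
print(clf("The movie was surprisingly good."))

# Prompting paradigm: a generative model is asked to solve the same task via an
# in-context prompt, with no task-specific parameter updates (output quality
# depends heavily on the model; a small model like GPT-2 is only illustrative).
gen = pipeline("text-generation", model="gpt2")
prompt = "Review: The movie was surprisingly good.\nSentiment (positive or negative):"
print(gen(prompt, max_new_tokens=2)[0]["generated_text"])
```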

Local and Global Explanation Techniques

For each paradigm, the survey reviews methods for local (instance-specific) and global explanations. Local explanation techniques include feature attribution methods (both perturbation-based and gradient-based), attention visualization and analysis, and example-based explanations such as adversarial examples. Global explanations, in contrast, aim to uncover broader behaviors of LLMs through probing methods, neuron activation analysis, and concept-based explanations; these techniques help identify the linguistic properties and knowledge encoded within the models.
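To make the local feature-attribution family concrete, below is a minimal gradient × input saliency sketch (a simple gradient-based baseline, not any single method surveyed in the paper), assuming PyTorch, the Hugging Face `transformers` library, and an off-the-shelf sentiment classifier:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def gradient_x_input(text: str):
    """Return (token, score) pairs; larger |score| = larger estimated influence."""
    enc = tokenizer(text, return_tensors="pt")
    # Detach the embedding lookup so gradients accumulate on a leaf tensor.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    pred = logits.argmax(dim=-1).item()
    logits[0, pred].backward()
    scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)  # gradient x input, per token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, scores.tolist()))

print(gradient_x_input("The plot was dull but the acting was superb."))
```

Tokens with large-magnitude scores are the ones the prediction is most sensitive to; perturbation-based alternatives instead occlude tokens and measure the change in the output.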

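On the global side, probing methods typically attach a lightweight classifier to frozen hidden states to test whether a linguistic property is decodable from them. A minimal sketch, assuming `bert-base-uncased`, scikit-learn, and a tiny hypothetical probing task (past- vs. present-tense sentences):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Hypothetical probing task: is the sentence in the past tense?
sentences = ["She walked home.", "He runs fast.",
             "They visited Rome.", "I eat apples."]
labels = [1, 0, 1, 0]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

with torch.no_grad():
    batch = tok(sentences, return_tensors="pt", padding=True)
    # Frozen [CLS] representations from the final layer; real probes sweep all layers.
    features = encoder(**batch).last_hidden_state[:, 0, :].numpy()

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```

In practice, probes are trained and evaluated on much larger held-out datasets, swept across layers, and compared against control tasks to rule out the probe itself doing the work.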
Usage and Future Directions

The paper also considers how explanations can aid in debugging and improving model performance. Explanation-based debugging helps identify biases such as over-reliance on spurious correlations, while explanation-based model improvement techniques can contribute to better robustness and generalization in model predictions. Furthermore, the paper explores the impact of model explainability on responsible AI practices, emphasizing the need for techniques that align with ethical guidelines and ensure reliability and transparency in model outputs.

Evaluation Challenges

Evaluating LLM explainability remains a formidable challenge. The paper discusses methods for assessing the faithfulness and plausibility of explanations but acknowledges that establishing universally accepted ground truths for evaluation is difficult. Without standardized criteria, the authors note, comparing the effectiveness of different explainability techniques is often problematic.
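As one concrete example of a faithfulness criterion, the sketch below computes a comprehensiveness-style score: it masks the k most-attributed tokens and measures the drop in the predicted-class probability, with a larger drop taken as evidence of a more faithful attribution. It reuses the `model`, `tokenizer`, and `gradient_x_input` helper from the earlier attribution sketch; masking tokens as a stand-in for removing them is an assumption of this illustration.

```python
import torch

def comprehensiveness(text: str, k: int = 3) -> float:
    """Drop in predicted-class probability after masking the k most-attributed tokens."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        probs = model(**enc).logits.softmax(dim=-1)
    pred = probs.argmax(dim=-1).item()

    # Rank tokens by the magnitude of the attribution scores from the earlier sketch.
    scores = [abs(s) for _, s in gradient_x_input(text)]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

    masked_ids = enc["input_ids"].clone()
    masked_ids[0, top] = tokenizer.mask_token_id  # assumption: mask rather than delete
    with torch.no_grad():
        masked_probs = model(input_ids=masked_ids,
                             attention_mask=enc["attention_mask"]).logits.softmax(dim=-1)
    return (probs[0, pred] - masked_probs[0, pred]).item()

print(comprehensiveness("The plot was dull but the acting was superb."))
```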

Implications for AI Research

The survey's treatment of LLM explainability carries both practical and theoretical implications for AI research. The proliferation of these models in sensitive domains such as healthcare, finance, and law underscores the urgency of developing robust explainability frameworks. Moreover, as LLMs increasingly shape content generation, their outputs must be intelligible and aligned with ethical and societal values.

By synthesizing current explainability techniques with potential future directions, the paper not only offers a comprehensive guide to contemporary approaches but also highlights open research challenges in areas such as attention redundancy, shortcut learning, and the emergent capabilities of LLMs. These insights are pivotal for steering future research toward genuinely interpretable AI systems.

Authors (9)
  1. Haiyan Zhao
  2. Hanjie Chen
  3. Fan Yang
  4. Ninghao Liu
  5. Huiqi Deng
  6. Hengyi Cai
  7. Shuaiqiang Wang
  8. Dawei Yin
  9. Mengnan Du
Citations (274)