
Rethinking Interpretability in the Era of Large Language Models (2402.01761v1)

Published 30 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Interpretable machine learning has exploded as an area of interest over the last decade, sparked by the rise of increasingly large datasets and deep neural networks. Simultaneously, LLMs have demonstrated remarkable capabilities across a wide array of tasks, offering a chance to rethink opportunities in interpretable machine learning. Notably, the capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human. However, these new capabilities raise new challenges, such as hallucinated explanations and immense computational costs. In this position paper, we start by reviewing existing methods to evaluate the emerging field of LLM interpretation (both interpreting LLMs and using LLMs for explanation). We contend that, despite their limitations, LLMs hold the opportunity to redefine interpretability with a more ambitious scope across many applications, including in auditing LLMs themselves. We highlight two emerging research priorities for LLM interpretation: using LLMs to directly analyze new datasets and to generate interactive explanations.

Introduction to Interpretability and LLMs

Interpretable machine learning has become integral to building effective and trustworthy AI systems. With the emergence of LLMs, there is now an unprecedented opportunity to reshape the field of interpretability. Trained on expansive datasets with very large neural networks, LLMs outperform traditional methods on complex tasks and can produce natural language explanations that communicate intricate data patterns to users. These advances, however, bring their own concerns, such as the generation of incorrect or baseless explanations (hallucination) and substantial computational costs.

Rethinking Interpretation Methods

The paper examines the dual role of LLMs: they are both objects of interpretation and tools for generating explanations of other systems. Traditional techniques, such as feature-importance methods, offer insight into individual model predictions, but they fall short when it comes to evaluating complex (and often opaque) LLM behaviors. Crucially, soliciting natural language explanations directly from LLMs opens the door to user-friendly interpretations free of technical jargon. To leverage this capability, however, one must confront new issues, such as verifying the validity of LLM explanations and managing the prohibitive size of state-of-the-art models.
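
As a concrete illustration of this elicitation approach, the following is a minimal sketch, not code from the paper: it asks a model for a prediction and a one-sentence self-explanation in a single prompt. The `query_llm` helper, the sentiment task, and the response format are all assumptions standing in for whatever client and application are actually used.

```python
# Minimal sketch, not from the paper: elicit a prediction and a natural-language
# explanation in a single prompt. `query_llm` is a hypothetical placeholder for a
# real chat-completion client.

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; replace with a real client."""
    raise NotImplementedError

def explain_prediction(review: str) -> dict:
    """Ask the model to classify a review and explain its own answer."""
    prompt = (
        "Classify the sentiment of the review below as positive or negative, then "
        "explain in one sentence which phrases most influenced your answer.\n\n"
        f"Review: {review}\n\n"
        "Respond in the format:\n"
        "Label: <positive|negative>\n"
        "Explanation: <one sentence>"
    )
    response = query_llm(prompt)
    label_part, _, explanation = response.partition("Explanation:")
    return {
        "label": label_part.replace("Label:", "").strip(),
        "explanation": explanation.strip(),
    }
```

Because such self-explanations can sound plausible while being unfaithful to the model's actual computation, validating them remains one of the open problems the paper highlights.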

Challenges and New Research Avenues

The authors underscore the need for effective defenses against hallucinated explanations, which can mislead users and erode trust in AI systems. They also emphasize the importance of accessible and efficient interpretability methods for LLMs that have grown beyond the reach of conventional analysis techniques. The paper organizes research into explaining a single output from an LLM (local explanation) and understanding the LLM as a whole (global or mechanistic explanation). Notably, modern LLMs can integrate explanation directly into the generation process, for example through chain-of-thought prompting, which can yield more faithful and accurate reasoning. Another focal area is dataset explanation, in which LLMs help analyze and elucidate patterns within datasets, potentially transforming areas such as scientific discovery and data analysis (a rough sketch of this idea follows below).
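
As a rough sketch of dataset explanation, the snippet below asks an LLM to propose a natural-language hypothesis about how two groups of text samples differ. The `query_llm` placeholder, the sampling scheme, and the prompt wording are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch of dataset explanation, under stated assumptions: `query_llm`
# is a hypothetical LLM client, and the prompt and sampling choices are ours.

import random

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; replace with a real client."""
    raise NotImplementedError

def describe_difference(group_a: list[str], group_b: list[str], n: int = 8) -> str:
    """Ask an LLM to hypothesize, in plain language, how two groups of texts differ."""
    sample_a = "\n".join(f"- {s}" for s in random.sample(group_a, min(n, len(group_a))))
    sample_b = "\n".join(f"- {s}" for s in random.sample(group_b, min(n, len(group_b))))
    prompt = (
        "Group A samples:\n" + sample_a + "\n\n"
        "Group B samples:\n" + sample_b + "\n\n"
        "In one sentence, state a pattern that distinguishes Group A from Group B."
    )
    return query_llm(prompt)
```

A hypothesis produced this way still needs to be verified against the full dataset, which ties directly into the reliability concerns discussed above.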

Future Priorities and Conclusion

The paper concludes by spotlighting priorities for advancing interpretability research: bolstering explanation reliability, fostering dataset explanation for genuine knowledge discovery, and developing interactive explanations tailored to specific user needs. The future trajectory of LLMs in interpretability hinges on addressing these challenges; strategic emphasis in these areas could accelerate progress toward reliable, user-oriented explanations. As the complexity of available information grows, so does the value of LLMs in translating that complexity into comprehensible insights, promising a new chapter in the synergy between AI and human understanding.

Overall, the paper argues not for incremental improvements but for a paradigm shift in how we conceptualize and leverage interpretability in the age of LLMs, with broad implications for the AI industry and numerous high-stakes domains.

Authors (5)
  1. Chandan Singh (42 papers)
  2. Jeevana Priya Inala (18 papers)
  3. Michel Galley (50 papers)
  4. Rich Caruana (42 papers)
  5. Jianfeng Gao (344 papers)
Citations (37)