Rigorously Assessing Natural Language Explanations of Neurons
Abstract: Natural language is an appealing medium for explaining how LLMs process and store information, but evaluating the faithfulness of such explanations is challenging. To help address this, we develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. In the observational mode, we evaluate claims that a neuron $a$ activates on all and only input strings that refer to a concept picked out by the proposed explanation $E$. In the intervention mode, we construe $E$ as a claim that the neuron $a$ is a causal mediator of the concept denoted by $E$. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons of Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy. We close the paper by critically assessing whether natural language is a good choice for explanations and whether neurons are the best level of analysis.
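For concreteness, here is a minimal sketch, not taken from the paper, of how the observational mode can be operationalized: the explanation $E$ is read as the claim that neuron $a$ fires on all and only strings that refer to the concept $E$ picks out, and the two corresponding error rates are measured. The function name, the `neuron_activation` callable, the firing threshold, and the toy data are all hypothetical placeholders.

```python
# Minimal, hypothetical sketch (not the paper's code) of the observational-mode test:
# check how often each half of the "all and only" claim fails for a given neuron.

from typing import Callable, Iterable, Tuple


def observational_error_rates(
    neuron_activation: Callable[[str], float],  # activation of neuron a on a string
    concept_positive: Iterable[str],            # strings that do refer to the concept in E
    concept_negative: Iterable[str],            # strings that do not
    threshold: float = 0.0,                     # activation level counted as "firing"
) -> Tuple[float, float]:
    """Return (miss_rate, false_alarm_rate) for the claim that neuron a
    activates on all and only concept-positive strings."""
    pos = list(concept_positive)
    neg = list(concept_negative)
    # "all": the neuron should fire on every concept-positive string
    misses = sum(neuron_activation(s) <= threshold for s in pos)
    # "only": the neuron should stay silent on every concept-negative string
    false_alarms = sum(neuron_activation(s) > threshold for s in neg)
    return misses / max(len(pos), 1), false_alarms / max(len(neg), 1)


if __name__ == "__main__":
    # Toy usage: a fake "neuron" that fires on strings containing "dog",
    # evaluated against the explanation "this neuron represents dogs".
    fake_neuron = lambda s: 1.0 if "dog" in s.lower() else 0.0
    miss, fa = observational_error_rates(
        fake_neuron,
        concept_positive=["The dog barked.", "A puppy is a young dog."],
        concept_negative=["The cat slept.", "Stocks fell sharply."],
    )
    print(f"miss rate = {miss:.2f}, false-alarm rate = {fa:.2f}")
```

The intervention mode is stricter: it requires editing the activation of neuron $a$ during a forward pass and checking whether model behavior that depends on the concept changes accordingly, which requires access to model internals and is not sketched here.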
References

- CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior. In Advances in Neural Information Processing Systems.
- Omer Antverg and Yonatan Belinkov. 2022. On the pitfalls of analyzing individual neurons in language models. In International Conference on Learning Representations.
- Faithfulness tests for natural language explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 283–294, Toronto, Canada. Association for Computational Linguistics.
- Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
- Steven Bills et al. 2023. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
- Neural network attributions: A causal perspective. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 981–990. PMLR.
- Towards automated circuit discovery for mechanistic interpretability.
- e-SNLI-VE-2.0: Corrected visual-textual entailment with natural language explanations. CoRR, abs/2004.03744.
- Toy models of superposition. Transformer Circuits Thread.
- CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2):333–386.
- Causal abstractions of neural networks. In Advances in Neural Information Processing Systems.
- Causal abstraction for faithful model interpretation. Ms., Stanford University.
- Inducing causal structure for interpretable neural networks. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7324–7338. PMLR.
- Finding alignments between interpretable causal variables and distributed neural representations. Ms., Stanford University.
- Dissecting recall of factual associations in auto-regressive language models.
- Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Generating visual explanations. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 3–19. Springer.
- Natural language descriptions of deep features. In International Conference on Learning Representations.
- Inducing character-level structure in subword-based language models with type-level interchange intervention training. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12163–12180, Toronto, Canada. Association for Computational Linguistics.
- Ayush Kaushal and Kyle Mahowald. 2022. What do tokens know about their characters and how do they know it? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2487–2507, Seattle, United States. Association for Computational Linguistics.
- Explaining chest x-ray pathologies in natural language. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2022 - 25th International Conference, Singapore, September 18-22, 2022, Proceedings, Part V, volume 13435 of Lecture Notes in Computer Science, pages 701–713. Springer.
- Textual explanations for self-driving vehicles. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, volume 11206 of Lecture Notes in Computer Science, pages 577–593. Springer.
- Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 158–167. Association for Computational Linguistics.
- Disentangling visual and written concepts in CLIP. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16389–16398.
- Parallel Distributed Processing. Volume 2: Psychological and Biological Models. MIT Press, Cambridge, MA.
- Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems.
- Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations.
- Language models implement simple word2vec-style vector arithmetic.
- Jesse Mu and Jacob Andreas. 2020. Compositional explanations of neurons. In Advances in Neural Information Processing Systems, volume 33, pages 17153–17163. Curran Associates, Inc.
- In-context learning and induction heads. Transformer Circuits Thread. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
- Barbara H. Partee. 1995. Lexical semantics and compositionality. In Lila R. Gleitman and Mark Liberman, editors, Invitation to Cognitive Science, volume 1, pages 311–360. MIT Press, Cambridge, MA.
- Judea Pearl. 2014. Interpretation and identification of causal mediation. Psychological Methods, 19.
- Christopher Potts and Roger Levy. 2015. Negotiating lexical uncertainty and speaker expertise with disjunction. In Proceedings of the 41st Annual Meeting of the Berkeley Linguistics Society, pages 417–445, Berkeley, CA. Berkeley Linguistics Society.
- Language models are unsupervised multitask learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
- Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA.
- Explaining black box text modules in natural language with language models.
- Paul Smolensky. 1988. On the proper treatment of connectionism. Behavioral and Brain Sciences, 11(1):1–23.
- Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR.
- Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc.
- Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations.
- BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.
- Sarah Wiegreffe and Ana Marasovic. 2021. Teach me to explain: A review of datasets for explainable natural language processing. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).