CausalGym: Benchmarking causal interpretability methods on linguistic tasks (2402.12560v1)
Abstract: Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioral measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behavior. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behavior. To illustrate how CausalGym can be used, we study the pythia models (14M--6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler--gap dependencies. Our analysis shows that the mechanism implementing both of these tasks is learned in discrete stages, not gradually.
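The abstract's core operation, testing whether an interpretability method "causally affects model behavior", amounts to an interchange intervention: swap a candidate feature representation from a source input into a base input's hidden state and check whether the model's output flips. The sketch below illustrates the idea on raw activation vectors using a diff-in-means probe direction; it is a minimal illustration under assumed NumPy arrays of hidden states, not CausalGym's implementation (actual DAS learns the intervention subspace by gradient descent, and the swap happens inside a running transformer).

```python
import numpy as np

def diff_in_means_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit vector separating two activation classes (e.g. singular vs. plural
    subjects), computed as the difference of class means."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def interchange_intervention(base_act: np.ndarray,
                             source_act: np.ndarray,
                             direction: np.ndarray) -> np.ndarray:
    """Replace the component of `base_act` along `direction` with the
    corresponding component of `source_act`, leaving the orthogonal
    complement untouched. If `direction` causally encodes the feature,
    downstream behavior on the base input should shift toward the
    source input's label."""
    b_proj = base_act @ direction
    s_proj = source_act @ direction
    return base_act + (s_proj - b_proj) * direction
```

Measuring how often such a swap flips the model's next-token preference on minimal pairs (while leaving unrelated behavior intact) is the kind of causal efficacy score the benchmark reports.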
- Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France.
- Naturalistic causal probing for morpho-syntax. Transactions of the Association for Computational Linguistics, 11:384–403.
- Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
- Nora Belrose. 2023. Diff-in-means concept editing is worst-case optimal. EleutherAI Blog.
- LEACE: Perfect linear concept erasure in closed form. arXiv:2306.03819.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, ICML 2023, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430, Honolulu, Hawaii, USA. PMLR.
- Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
- Causal scrubbing: A method for rigorously testing interpretability hypotheses. In Alignment Forum.
- Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLMs. arXiv:2309.07311.
- Identifying and adapting transformer-components responsible for gender bias in an English language model. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 379–394, Singapore. Association for Computational Linguistics.
- SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
- Toy models of superposition. Transformer Circuits Thread.
- A mathematical framework for transformer circuits. Transformer Circuits Thread.
- Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139, Berlin, Germany. Association for Computational Linguistics.
- Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics.
- SyntaxGym: An online platform for targeted evaluation of language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 70–76, Online. Association for Computational Linguistics.
- Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 9574–9586. Curran Associates, Inc.
- Causal abstraction for faithful model interpretation. arXiv:2301.04709.
- Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning, ICML 2022, volume 162 of Proceedings of Machine Learning Research, pages 7324–7338, Baltimore, Maryland, USA. PMLR.
- Finding alignments between interpretable causal variables and distributed neural representations. arXiv:2303.02536.
- Localizing model behavior with path patching. arXiv:2304.05969.
- Adam Goodkind and Klinton Bicknell. 2018. Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018), pages 10–18, Salt Lake City, Utah. Association for Computational Linguistics.
- A geometric notion of causal probing. arXiv:2307.15054.
- Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.
- When language models fall in love: Animacy processing in transformer language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12120–12135, Singapore. Association for Computational Linguistics.
- Sophie Hao and Tal Linzen. 2023. Verb conjugation in transformers is determined by linear encodings of subject number. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4531–4539, Singapore. Association for Computational Linguistics.
- John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.
- A systematic assessment of syntactic generalization in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1725–1744, Online. Association for Computational Linguistics.
- Language models align with human judgments on key grammatical constructions. arXiv:2402.01676.
- Mission: Impossible language models. arXiv:2401.06416.
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA.
- Probing for the usage of grammatical number. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8818–8831, Dublin, Ireland. Association for Computational Linguistics.
- Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems, volume 36.
- Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
- Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv:2310.06824.
- Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.
- Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, volume 35, pages 17359–17372. Curran Associates, Inc.
- Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.
- Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 16–30, Singapore. Association for Computational Linguistics.
- Chris Olah. 2022. Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits Thread.
- The linear representation hypothesis and the geometry of large language models. arXiv:2311.03658.
- Judea Pearl. 2009. Causality: Models, Reasoning, and Inference, 2nd edition. Cambridge University Press.
- Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830.
- Large-scale evidence for logarithmic effects of word predictability on reading time. Proceedings of the National Academy of Sciences. To appear.
- Nathaniel J. Smith and Roger Levy. 2013. The effect of word predictability on reading time is logarithmic. Cognition, 128(3):302–319.
- Linear representations of sentiment in large language models. arXiv:2310.15154.
- Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.
- Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc.
- Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda.
- Alex Warstadt and Samuel R. Bowman. 2022. What artificial neural networks can tell us about human language acquisition. Algebraic Structures in Natural Language, pages 17–60.
- BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.
- Language model quality correlates with psychometric predictive power in multiple languages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7503–7511, Singapore. Association for Computational Linguistics.
- Using computational models to test syntactic learnability. Linguistic Inquiry, pages 1–44.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- pyvene: A library for understanding and improving PyTorch models via interventions. Under review.
- Interpretability at scale: Identifying causal mechanisms in Alpaca. In Advances in Neural Information Processing Systems, volume 36.
- Causal interventions expose implicit situation models for commonsense language understanding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13265–13293, Toronto, Canada. Association for Computational Linguistics.