Circuit Component Reuse Across Tasks in Transformer Language Models (2310.08744v3)

Published 12 Oct 2023 in cs.CL and cs.LG

Abstract: Recent work in mechanistic interpretability has shown that behaviors in LLMs can be successfully reverse-engineered through circuit analysis. A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and 1.) show that it reproduces on a larger GPT-2 model, and 2.) that it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the process underlying both tasks is functionally very similar, and contains about a 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the Colored Objects task and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain LLMs' behavior in terms of a relatively small number of interpretable task-general algorithmic building blocks and computational components.
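The intervention the abstract describes is the kind of experiment that TransformerLens (reference 16) makes easy to run. As a rough, minimal sketch of what "adjusting" attention heads via forward hooks can look like, the snippet below forces a few hypothetical middle-layer heads to put all of their final-position attention on a chosen token; the layer/head indices, prompt, and target position are illustrative placeholders, not the four heads or values identified in the paper.

```python
# Minimal sketch (not the paper's actual experiment): forcing a few
# attention heads to attend to a chosen position with TransformerLens
# forward hooks. Layer/head indices, prompt, and target position are
# hypothetical placeholders, not the four heads the paper modifies.
from functools import partial

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

prompt = "Q: On the desk, I see an orange pen. What color is the pen? A:"
tokens = model.to_tokens(prompt)

# Hypothetical middle-layer (layer, head) pairs standing in for the
# circuit heads a real analysis would identify.
heads_to_patch = [(7, 3), (8, 6)]
target_pos = 8  # illustrative key position to redirect attention onto

def force_attention(pattern, hook, head, pos):
    # pattern: [batch, n_heads, query_pos, key_pos]. Redirect all of the
    # final query position's attention for this head onto `pos`.
    pattern[:, head, -1, :] = 0.0
    pattern[:, head, -1, pos] = 1.0
    return pattern

with torch.no_grad():
    logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[
            (f"blocks.{layer}.attn.hook_pattern",
             partial(force_attention, head=head, pos=target_pos))
            for layer, head in heads_to_patch
        ],
    )

pred = logits[0, -1].argmax().item()
print("Predicted next token:", model.to_string(pred))
```

The sketch only shows the hook mechanics; in the paper's actual intervention the adjustment targets four specific middle-layer heads so that downstream heads respond as their IOI-circuit roles predict.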

References (25)
  1. Towards automated circuit discovery for mechanistic interpretability, 2023.
  2. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
  3. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  5484–5495, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.446. URL https://aclanthology.org/2021.emnlp-main.446.
  4. Dissecting recall of factual associations in auto-regressive language models, 2023.
  5. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
  6. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
  7. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, 2023.
  8. Generative models as a complex systems science: How can we make sense of large language model behavior? preprint, 2023.
  9. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj.
  10. What changed? investigating debiasing methods using causal mediation analysis. In Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pp.  255–265, Seattle, Washington, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.gebnlp-1.26. URL https://aclanthology.org/2022.gebnlp-1.26.
  11. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  7057–7075, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.574. URL https://aclanthology.org/2020.emnlp-main.574.
  12. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla, 2023.
  13. Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in neural information processing systems, 31, 2018.
  14. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, 2022.
  15. Language models implement simple word2vec-style vector arithmetic, 2023.
  16. TransformerLens, 2022. URL https://github.com/neelnanda-io/TransformerLens.
  17. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, September 2022. URL https://openreview.net/forum?id=9XFSbDPmdW.
  18. In-context learning and induction heads, 2022.
  19. J. Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 2001.
  20. Language models are unsupervised multitask learners, 2019.
  21. Investigating transferability in pretrained language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  1393–1401, 2020.
  22. Jesse Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  37–42, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-3007. URL https://aclanthology.org/P19-3007.
  23. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In Advances in Neural Information Processing Systems, volume 33, pp.  12388–12401. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html.
  24. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  5797–5808, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1580. URL https://aclanthology.org/P19-1580.
  25. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. In The Eleventh International Conference on Learning Representations, September 2022. URL https://openreview.net/forum?id=NpsVSN6o4ul.