How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability (2405.04156v1)
Abstract: Transformer-based large language models (LLMs) are often treated as black boxes because of their vast number of parameters and complex internal interactions, which raises serious safety concerns. Mechanistic Interpretability (MI) aims to reverse-engineer neural network behaviors in terms of human-understandable components. In this work, we focus on understanding how GPT-2 Small performs the task of predicting three-letter acronyms. Previous work in the MI field has focused on tasks that predict a single token; to the best of our knowledge, this is the first work to mechanistically analyze a behavior involving the prediction of multiple consecutive tokens. We discover that the prediction is performed by a circuit of 8 attention heads (~5% of the total heads), which we classify into three groups according to their role. We also demonstrate that these heads concentrate the acronym-prediction functionality. In addition, we mechanistically interpret the most relevant heads of the circuit and find that they rely on positional information propagated via the causal mask mechanism. We expect this work to lay the foundation for understanding more complex behaviors involving multiple-token predictions.
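As a concrete illustration of the kind of analysis described above, the following is a minimal sketch, using the TransformerLens library, of how one might elicit GPT-2 Small's acronym prediction and then zero-ablate a single attention head to test its causal role. The prompt and the chosen head (layer 10, head 7) are illustrative placeholders, not the prompts or circuit heads identified in the paper.

```python
# Minimal sketch (not the authors' released code): probe GPT-2 Small's
# acronym prediction and zero-ablate one attention head with TransformerLens.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small: 12 layers x 12 heads

# Hypothetical acronym prompt; the model should predict "C" after "(".
prompt = "The Central Processing Unit ("
tokens = model.to_tokens(prompt)
logits = model(tokens)  # shape: [batch, seq, d_vocab]

top_token = logits[0, -1].argmax()
print("clean prediction:", model.to_string(top_token))

# Zero-ablate the output of one head and re-check the prediction.
# Layer 10, head 7 is an arbitrary placeholder, not a head from the paper.
LAYER, HEAD = 10, 7

def ablate_head(z, hook):
    # z has shape [batch, seq, n_heads, d_head]; zero out one head's output.
    z[:, :, HEAD, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), ablate_head)],
)
print("ablated prediction:", model.to_string(ablated_logits[0, -1].argmax()))
```

Comparing the clean and ablated predictions over a dataset of acronym prompts is, in spirit, how one measures whether a head belongs to the circuit; the paper's actual methodology should be consulted for the precise metrics and ablation scheme.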