What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
Abstract: In-context learning is a powerful emergent ability in transformer models. Prior work in mechanistic interpretability has identified a circuit element that may be critical for in-context learning -- the induction head (IH), which performs a match-and-copy operation. During training of large transformers on natural language data, IHs emerge around the same time as a notable phase change in the loss. Despite the robust evidence for IHs and this interesting coincidence with the phase change, relatively little is known about the diversity and emergence dynamics of IHs. Why is there more than one IH, and how do they depend on each other? Why do IHs appear suddenly, and what subcircuits enable them to emerge? We answer these questions by studying IH emergence dynamics in a controlled setting, training on synthetic data. In doing so, we develop and share a novel optogenetics-inspired causal framework for modifying activations throughout training. Using this framework, we delineate the diverse and additive nature of IHs. By clamping subsets of activations throughout training, we then identify three underlying subcircuits that interact to drive IH formation, yielding the phase change. These subcircuits also shed light on data-dependent properties of formation, such as phase-change timing, demonstrating the value of a deeper understanding of the subcircuits that need to "go right" for an induction head.
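The match-and-copy operation that the abstract attributes to induction heads can be illustrated with a toy sketch. This is not the paper's implementation; the function name and example sequence are illustrative. A real IH realizes this behavior softly via attention, but the algorithmic idea is: match the current token against earlier context, then copy forward the token that followed the match.

```python
def induction_head_prediction(tokens):
    """Toy match-and-copy: find the most recent earlier occurrence of
    the current (last) token, and predict the token that followed it.
    Returns None when the current token has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions for a matching token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copy: the token that followed the earlier match.
            return tokens[i + 1]
    return None

# On a repeated sequence [A B C D A B C ...], the head completes the pattern:
seq = ["A", "B", "C", "D", "A", "B", "C"]
print(induction_head_prediction(seq))  # -> "D"
```

In a transformer this corresponds to a previous-token head composing with an attention head that attends from the current token to positions just after earlier occurrences of that token; the sketch collapses both steps into one hard lookup.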