
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions (2402.15055v2)

Published 23 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Understanding the inner workings of LLMs is crucial for advancing their theoretical foundations and real-world applications. While the attention mechanism and multi-layer perceptrons (MLPs) have been studied independently, their interactions remain largely unexplored. This study investigates how attention heads and next-token neurons interact in LLMs to predict new words. We propose a methodology to identify next-token neurons, find prompts that highly activate them, and determine the upstream attention heads responsible. We then generate and evaluate explanations for the activity of these attention heads in an automated manner. Our findings reveal that some attention heads recognize specific contexts relevant to predicting a token and activate a downstream token-predicting neuron accordingly. This mechanism provides a deeper understanding of how attention heads work with MLP neurons to perform next-token prediction. Our approach offers a foundation for further research into the intricate workings of LLMs and their impact on text generation and understanding.
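
The identification step described in the abstract can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' code: it assumes a GPT-2-style model loaded through the Hugging Face transformers library and follows the common vocabulary-projection reading of "next-token neurons", projecting each MLP neuron's output weights through the unembedding matrix and flagging neurons whose projection concentrates on a single token. The layer index and the rel_margin threshold are arbitrary choices for illustration.

```python
# Minimal sketch (assumptions noted above): flag candidate next-token neurons
# in GPT-2 by projecting MLP output weights into vocabulary space.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
W_U = model.lm_head.weight  # (vocab_size, d_model) unembedding matrix

def candidate_next_token_neurons(layer: int, rel_margin: float = 0.5):
    """Return (neuron_index, top_token) pairs whose vocabulary projection is
    sharply peaked on one token. `rel_margin` is an illustrative threshold,
    not a value taken from the paper."""
    with torch.no_grad():
        # GPT-2 stores the MLP output projection as a Conv1D with weight of
        # shape (d_mlp, d_model), so row i is neuron i's write-direction
        # into the residual stream.
        W_out = model.transformer.h[layer].mlp.c_proj.weight  # (d_mlp, d_model)
        logits = W_out @ W_U.T                                # (d_mlp, vocab_size)
        top2 = logits.topk(2, dim=-1)
        gap = top2.values[:, 0] - top2.values[:, 1]
        peaked = gap > rel_margin * logits.std(dim=-1)
        return [(i, tokenizer.decode(int(top2.indices[i, 0])))
                for i in torch.nonzero(peaked).flatten().tolist()]

# Example: inspect a later layer, where token-promoting neurons tend to sit.
print(candidate_next_token_neurons(layer=10)[:5])
```

The remaining steps the abstract describes (finding maximally activating prompts and attributing the activation to upstream attention heads) would build on such a candidate list, for example by caching activations over a corpus and ablating or attributing individual heads, but those details are specific to the paper's methodology and are not reproduced here.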

Authors (3)
  1. Clement Neo (9 papers)
  2. Shay B. Cohen (78 papers)
  3. Fazl Barez (42 papers)
Citations (3)