Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models (2403.19521v4)

Published 28 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In this paper, we delve into several mechanisms employed by Transformer-based LLMs for factual recall tasks. We outline a pipeline consisting of three major steps: (1) Given a prompt such as "The capital of France is," task-specific attention heads extract the topic token, "France," from the context and pass it to the subsequent MLPs. (2) As the attention heads' outputs are aggregated with equal weight and added to the residual stream, the subsequent MLP acts as an "activation," which either erases or amplifies the information originating from individual heads. As a result, the topic token "France" stands out in the residual stream. (3) A deep MLP takes "France" and generates a component that redirects the residual stream towards the direction of the correct answer, i.e., "Paris." This procedure is akin to applying an implicit function such as get_capital($X$), where the argument $X$ is the topic-token information passed by the attention heads. To support this quantitative and qualitative analysis of MLPs, we propose a novel analytic method that decomposes MLP outputs into components understandable by humans. Additionally, we observe a universal anti-overconfidence mechanism in the final layer of models, which suppresses correct predictions; we mitigate this suppression by leveraging our interpretation to improve factual-recall confidence. These interpretations are evaluated across diverse tasks spanning various domains of factual knowledge, using LLMs ranging from the GPT-2 family and 1.3B OPT up to 7B Llama-2, in both zero- and few-shot setups.
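The three-step pipeline in the abstract can be illustrated with a toy sketch. Everything below is an assumption for illustration only (random toy "embeddings," a projection as a stand-in for the MLP's gating, and a rank-one linear map as a stand-in for the deep MLP's implicit get_capital function); it is not the paper's actual method or any real model's weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy residual-stream width

# Toy "embeddings" for a tiny vocabulary (assumed, random).
vocab = ["France", "Paris", "noise"]
E = {t: rng.standard_normal(d) for t in vocab}

# (1) Task-specific heads write the topic token into the residual stream,
#     summed with equal weight alongside outputs of irrelevant heads.
head_outputs = [E["France"], 0.3 * E["noise"], 0.2 * rng.standard_normal(d)]
residual = np.sum(head_outputs, axis=0)

# (2) The MLP acts as an "activation": amplify the component aligned with
#     the topic direction and erase the rest (a crude gating stand-in).
topic_dir = E["France"] / np.linalg.norm(E["France"])
residual = np.dot(residual, topic_dir) * topic_dir

# (3) A deep MLP applies an implicit get_capital(X): here a rank-one
#     linear map constructed to send the France direction to Paris.
W = np.outer(E["Paris"], topic_dir) / np.dot(E["France"], topic_dir)
out = W @ residual

# Read out in vocabulary space (logit-lens style): dot against embeddings.
logits = {t: float(np.dot(out, E[t])) for t in vocab}
best = max(logits, key=logits.get)
print(best)  # the topic's mapped answer should dominate
```

The point of the sketch is only the shape of the computation: head outputs are summed, gated toward the topic, then linearly redirected toward the answer direction, which a vocabulary-space readout then surfaces.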

Authors (8)
  1. Ang Lv
  2. Kaiyi Zhang
  3. Yuhan Chen
  4. Yulong Wang
  5. Lifeng Liu
  6. Ji-Rong Wen
  7. Jian Xie
  8. Rui Yan