Neuron-Level Knowledge Attribution in Large Language Models (2312.12141v4)

Published 19 Dec 2023 in cs.CL and cs.LG

Abstract: Identifying important neurons for final predictions is essential for understanding the mechanisms of LLMs. Due to computational constraints, current attribution techniques struggle to operate at neuron level. In this paper, we propose a static method for pinpointing significant neurons. Compared to seven other methods, our approach demonstrates superior performance across three metrics. Additionally, since most static methods typically only identify "value neurons" directly contributing to the final prediction, we propose a method for identifying "query neurons" which activate these "value neurons". Finally, we apply our methods to analyze six types of knowledge across both attention and feed-forward network (FFN) layers. Our method and analysis are helpful for understanding the mechanisms of knowledge storage and set the stage for future research in knowledge editing. The code is available on https://github.com/zepingyu0512/neuron-attribution.

Introduction

Transformer-based models have drastically advanced performance across AI tasks. Despite this success, how these models arrive at their predictions often remains opaque, which hampers both further improvement and trust. Existing interpretability approaches struggle with the increasingly complex structures underlying these models, leaving open pressing questions about which parameters matter and where knowledge is located within the network's architecture.

Unveiling the Mysteries of Transformers

The key to understanding transformers is the residual stream, the pathway along which the outputs of successive layers accumulate. By tracing this stream, the paper shows that each layer's output is added directly into it, and that this sum determines the final prediction probabilities: a token's probability increases when its before-softmax value (logit) is large.
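To make the addition picture concrete, here is a minimal NumPy sketch under assumed shapes and names (layer normalization omitted; this is illustrative, not the paper's released code): because the final hidden state is a sum of per-layer outputs, each token's before-softmax logit decomposes into additive per-layer contributions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, n_layers = 16, 100, 4

W_U = rng.normal(size=(d_model, vocab_size))          # unembedding matrix (assumed name)
layer_outputs = rng.normal(size=(n_layers, d_model))  # what each layer writes into the stream
x0 = rng.normal(size=d_model)                         # embedding at the final position

residual = x0 + layer_outputs.sum(axis=0)             # the stream accumulates by direct addition
logits = residual @ W_U                               # before-softmax values
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Because the stream is a sum, a layer's effect on a token's logit is simply
# its own output projected onto that token's unembedding column.
token = int(np.argmax(logits))
per_layer_logit = layer_outputs @ W_U[:, token]       # shape: (n_layers,)
print(probs[token], per_layer_logit)
```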

Assigning Contributions and Probing Layers

To pinpoint influential parameters, the paper uses log probability increase as the metric for quantifying a layer's contribution to a prediction. With this metric, it measures how each attention and feed-forward network (FFN) layer supports the predicted word. By further analyzing inner products, it traces how the outputs of preceding layers activate neurons in subsequent FFN layers.
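The following is a hedged sketch of that metric, assuming a distribution can be read from any intermediate residual state through the unembedding (layer normalization and other details omitted; the function names are illustrative, not the authors' code):

```python
import numpy as np

def log_prob(residual, W_U, token):
    """Log probability of `token`, read off a residual state via the unembedding."""
    logits = residual @ W_U
    logits = logits - logits.max()                 # numerical stability
    return logits[token] - np.log(np.exp(logits).sum())

def layer_log_prob_increase(residual_before, layer_output, W_U, token):
    """Contribution score: log p(token) after adding the layer's output, minus before."""
    after = log_prob(residual_before + layer_output, W_U, token)
    before = log_prob(residual_before, W_U, token)
    return after - before
```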

Empirical Findings and Methodological Innovations

Empirical analyses on sampled cases indicate that every layer plays a role in next-word prediction, with knowledge distributed across both attention and FFN layers. No single layer or module monopolizes importance; several contribute jointly to each prediction. Case studies reinforce these findings, showing that the features most important for a prediction can reside in both attention and FFN subvalues. Finally, the paper contributes a technique for quantifying how preceding layers influence neurons in upper FFN layers, sketched below.
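A rough illustration of the two quantities involved, with assumed FFN shapes and hypothetical helper names (not the released implementation): an FFN subvalue is a neuron's activation coefficient times its value vector, and the inner product between a preceding layer's output and that neuron's key vector indicates how strongly the earlier layer drives the neuron.

```python
import numpy as np

def ffn_subvalue(coefficient, value_vector):
    """Neuron i's activation coefficient times its value vector (row i of the
    FFN output matrix): what the neuron writes into the residual stream."""
    return coefficient * value_vector

def drive_from_previous_layer(prev_layer_output, key_vector):
    """Inner product between an earlier layer's residual update and neuron i's
    key vector (column i of the FFN input matrix): how strongly that update
    pushes the neuron's pre-activation, before the nonlinearity."""
    return float(prev_layer_output @ key_vector)
```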

Roadmap to Interpretability

The code is released on GitHub, allowing others to reproduce and apply these interpretability methods. This transparency is expected to support further work on interpreting transformer-based models and on knowledge editing.

Authors
  1. Zeping Yu
  2. Sophia Ananiadou