- The paper identifies that only 1-5% of GPT-2 neurons are universal, exhibiting consistent activation patterns across models trained from different random seeds.
- Researchers analyzed pairwise activation correlations and weight patterns to classify neurons into distinct families with specific computational roles.
- Findings indicate that universal neurons, such as those that modulate the entropy of the next-token prediction, are disproportionately interpretable and offer concrete entry points for understanding how LLMs work.
Introduction
LLMs such as GPT-2 are increasingly being integrated into applications with widespread impact, so understanding their internal mechanisms is crucial. The field of mechanistic interpretability often treats the feature as its basic unit of analysis. This paper explores the hypothesis that certain 'universal neurons' (neurons with highly similar functions across independently trained GPT-2 models) are more likely to be interpretable and to play well-defined roles in model behavior.
Universality of Neurons
The paper examines GPT-2 models trained from different random seeds and computes pairwise correlations between neuron activations over a large corpus. The key finding is that only 1-5% of neurons are 'universal', meaning they have a highly correlated counterpart in the other seeds, which suggests a small core of components that training reliably rediscovers. These universal neurons fall into distinct families whose characteristics depend on their depth in the network, which in turn correlates with the type of computation they perform.
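To make the correlation analysis concrete, here is a minimal sketch (not the authors' code) of how one might compute, for every neuron in one model, its best Pearson correlation with any neuron in a second model trained from a different seed. The function name, array shapes, the hypothetical `collect_mlp_activations` helper, and the 0.5 threshold in the usage comment are illustrative assumptions.

```python
import numpy as np

def max_cross_model_correlation(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """For each neuron in model A, return its highest Pearson correlation
    with any neuron in model B, computed over the same tokens.

    acts_a: (n_tokens, n_neurons_a) activations recorded from model A
    acts_b: (n_tokens, n_neurons_b) activations recorded from model B
    """
    # Standardize each neuron's activation series (zero mean, unit variance).
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    # Pearson correlations between all cross-model neuron pairs: (n_neurons_a, n_neurons_b).
    corr = za.T @ zb / acts_a.shape[0]
    return corr.max(axis=1)

# Illustrative usage (collect_mlp_activations is a hypothetical helper that runs
# a model over a corpus and stacks per-token MLP neuron activations):
# best_corr = max_cross_model_correlation(collect_mlp_activations(model_a, corpus),
#                                         collect_mlp_activations(model_b, corpus))
# universal_mask = best_corr > 0.5   # threshold chosen for illustration only
```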
Functional Roles of Neurons
Beyond activation patterns, the paper examines neurons through their weights, revealing distinct and consistent patterns that correspond to functional roles. Specific findings include neurons that modulate the entropy of the next-token prediction (effectively raising or lowering the model's confidence) and neurons that increase or decrease attention to particular tokens. Intriguingly, some neurons occur in antipodal pairs whose output weight vectors point in nearly opposite directions, suggesting a form of redundant, ensemble-like computation within the architecture.
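As one illustration of a weight-based analysis, the sketch below searches for antipodal pairs by checking which neurons' output weight vectors are nearly antiparallel. The weight-matrix layout and the cosine-similarity cut-off are assumptions made for this example, not values taken from the paper.

```python
import numpy as np

def find_antipodal_pairs(w_out: np.ndarray, cutoff: float = -0.95):
    """Return neuron pairs whose output weight vectors are nearly antiparallel.

    w_out: (n_neurons, d_model) matrix; row i is neuron i's output weight vector
           (the direction it writes into the residual stream).
    cutoff: cosine-similarity threshold; -0.95 is an illustrative choice.
    """
    # Normalize rows so dot products become cosine similarities.
    unit = w_out / (np.linalg.norm(w_out, axis=1, keepdims=True) + 1e-8)
    cos = unit @ unit.T
    # Keep each unordered pair once (upper triangle, excluding the diagonal).
    idx = np.argwhere(np.triu(cos < cutoff, k=1))
    return [(int(i), int(j), float(cos[i, j])) for i, j in idx]
```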
Discussion and Correlation to Similar Works
The work situates itself within an extensive literature on universality in neural networks, which provides a backdrop for how it extends our understanding of similarity across trained models. In contrast to earlier studies that mostly measured representational similarity at the level of whole layers, this paper focuses on whether individual neurons recur with the same function across training runs. The result is a short list of neurons that are unusually interpretable and therefore useful starting points for opening the black box of LLMs.
Conclusion
The researchers argue that universality can help identify interpretable and mechanistically significant components within LLMs. However, since only a small fraction of neurons are shown to be universal, future interpretability studies may need to focus on subsystems or circuits rather than on individual neurons. Despite being limited to relatively small models, this research outlines strategies for analyzing larger and more complex LLMs and provides a framework for using universality to enhance interpretability.