- The paper identifies that only 1-5% of GPT-2 neurons are universal, exhibiting consistent activation patterns across models trained from different random seeds.
- Researchers analyzed pairwise activation correlations and weight patterns to classify neurons into distinct families with specific computational roles.
- Findings indicate that universal neurons, such as those that modulate the entropy of the next-token prediction, are disproportionately interpretable and offer concrete entry points for understanding how LLMs work.
Introduction
LLMs such as GPT-2 are increasingly being integrated into applications with widespread impact, so understanding their internal mechanisms is crucial. The field of mechanistic interpretability often treats the feature as its basic unit of analysis. This paper explores the hypothesis that certain 'universal neurons' (neurons with highly similar functions across independently trained GPT-2 models) are more likely to be interpretable and to play well-defined roles in model behavior.
Universality of Neurons
The paper examines GPT-2 models trained from different random seeds and computes pairwise correlations between neuron activations over a large corpus. The key finding is that only 1-5% of neurons are 'universal', meaning they have a highly correlated counterpart in the other seeds, which suggests a small core of components that training reliably rediscovers. These universal neurons fall into distinct families whose characteristics depend on their depth in the network, which in turn correlates with the type of computation they perform.
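To make the correlation analysis concrete, here is a minimal sketch (not the authors' code) of how one might compute, for every neuron in one model, its best Pearson correlation with any neuron in a second model trained from a different seed. The function name, array shapes, the hypothetical `collect_mlp_activations` helper, and the 0.5 threshold in the usage comment are illustrative assumptions.

```python
import numpy as np

def max_cross_model_correlation(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """For each neuron in model A, return its highest Pearson correlation
    with any neuron in model B, computed over the same tokens.

    acts_a: (n_tokens, n_neurons_a) activations recorded from model A
    acts_b: (n_tokens, n_neurons_b) activations recorded from model B
    """
    # Standardize each neuron's activation series (zero mean, unit variance).
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    # Pearson correlations between all cross-model neuron pairs: (n_neurons_a, n_neurons_b).
    corr = za.T @ zb / acts_a.shape[0]
    return corr.max(axis=1)

# Illustrative usage (collect_mlp_activations is a hypothetical helper that runs
# a model over a corpus and stacks per-token MLP neuron activations):
# best_corr = max_cross_model_correlation(collect_mlp_activations(model_a, corpus),
#                                         collect_mlp_activations(model_b, corpus))
# universal_mask = best_corr > 0.5   # threshold chosen for illustration only
```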
Functional Roles of Neurons
Beyond activation patterns, the paper examines neurons through their weights, revealing distinct and consistent patterns that correspond to functional roles. Specific findings include neurons that modulate the entropy of the next-token prediction (effectively raising or lowering the model's confidence) and neurons that increase or decrease attention to particular tokens. Intriguingly, some neurons occur in antipodal pairs whose output weight vectors point in nearly opposite directions, suggesting a form of redundant, ensemble-like computation within the architecture.
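As one illustration of a weight-based analysis, the sketch below searches for antipodal pairs by checking which neurons' output weight vectors are nearly antiparallel. The weight-matrix layout and the cosine-similarity cut-off are assumptions made for this example, not values taken from the paper.

```python
import numpy as np

def find_antipodal_pairs(w_out: np.ndarray, cutoff: float = -0.95):
    """Return neuron pairs whose output weight vectors are nearly antiparallel.

    w_out: (n_neurons, d_model) matrix; row i is neuron i's output weight vector
           (the direction it writes into the residual stream).
    cutoff: cosine-similarity threshold; -0.95 is an illustrative choice.
    """
    # Normalize rows so dot products become cosine similarities.
    unit = w_out / (np.linalg.norm(w_out, axis=1, keepdims=True) + 1e-8)
    cos = unit @ unit.T
    # Keep each unordered pair once (upper triangle, excluding the diagonal).
    idx = np.argwhere(np.triu(cos < cutoff, k=1))
    return [(int(i), int(j), float(cos[i, j])) for i, j in idx]
```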
Discussion and Correlation to Similar Works
The work situates itself within an extensive literature on universality in neural networks, which provides a backdrop for how it extends our understanding of similarity across trained models. In contrast to earlier studies that mostly measured representational similarity at the level of whole layers, this paper focuses on whether individual neurons recur with the same function across training runs. The result is a short list of neurons that are unusually interpretable and therefore useful starting points for opening the black box of LLMs.
Conclusion
The researchers argue that universality can help identify interpretable and mechanistically significant components within LLMs. However, since only a small fraction of neurons are shown to be universal, future interpretability studies may need to focus on subsystems or circuits rather than on individual neurons. Despite being limited to relatively small models, this research outlines strategies for analyzing larger and more complex LLMs and provides a framework for using universality to enhance interpretability.