The paper "Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability" aims to elucidate how LLMs, such as GPT-2-XL, internalize and demonstrate language competences at the neuron level, leveraging psycholinguistic paradigms. This interpretability paper is significant due to the burgeoning capabilities of LLMs in linguistic tasks, necessitating a deeper understanding of their internal workings.
To probe these mechanisms, the authors designed three distinct tasks:
- Sound-Shape Association - This task examines the model's ability to link the phonetic properties of words to visual shapes, mirroring the human tendency to intuitively associate certain sounds with specific shapes (e.g., "bouba" with rounded shapes and "kiki" with spiky shapes); a minimal probing sketch follows this list.
- Sound-Gender Association - This task investigates the model's capacity to associate sounds with gendered perceptions, reflecting sociolinguistic patterns in which certain phonetic elements are perceived as more feminine or masculine.
- Implicit Causality - This task assesses the model's sensitivity to the causal biases carried by verbs, i.e., which referent a verb implicates as the cause of an event (e.g., in "Sally frightened Mary because she ...", the pronoun is biased toward Sally), an ability crucial for coherence and inference in language processing.
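To make the probing setup concrete, the sketch below shows one way such a psycholinguistic item could be presented to GPT-2-XL: the model scores two candidate completions of a sound-shape prompt, and the preferred completion is the one with the higher log-probability. The prompt wording, candidate phrasings, and scoring details are illustrative assumptions, not the paper's exact stimuli or protocol.

```python
# Minimal probing sketch (hypothetical prompt and candidates, not the paper's
# exact stimuli): score two completions of a bouba/kiki item with GPT-2-XL and
# compare their log-probabilities.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # The logits at position i predict the token at position i + 1, so the
    # completion tokens are scored by the preceding positions.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = 'The made-up word "bouba" most likely refers to a shape that is'
for candidate in (" rounded", " spiky"):
    print(candidate, completion_logprob(prompt, candidate))
```

The same scoring scheme extends naturally to the sound-gender and implicit-causality items by swapping in the corresponding prompts and candidate continuations.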
The experiments revealed that GPT-2-XL's performance varied markedly across the three tasks. Specifically:
- Sound-Shape Association: GPT-2-XL struggled with this task, indicating a lack of clear neuron-level representation for this form of cognitive association.
- Sound-Gender Association and Implicit Causality: The model showcased human-like abilities, suggesting more robust neuron-level coding for these linguistic competences.
A critical methodological component of the paper is targeted neuron ablation and activation manipulation, used to uncover the roles of specific neurons in particular linguistic capabilities (see the ablation sketch after this list). The results reveal:
- Neuron Specificity: When GPT-2-XL demonstrated a particular linguistic ability, that ability could be traced to the activation of specific neurons, revealing a discernible link between linguistic competence and neuron-level activity.
- Absence of Ability: Conversely, the absence of a linguistic competence in the model correlated with the lack of specialized neuron activation.
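As a rough illustration of what targeted ablation involves, the sketch below zeroes a handful of MLP units in one GPT-2-XL block via a forward hook and compares the model's next-token predictions before and after. The layer and neuron indices are hypothetical placeholders, and the hook-based mechanics are a generic approximation rather than the paper's exact identification or manipulation procedure; scaling the same activations up instead of zeroing them would correspond to the activation-enhancement side of the manipulation.

```python
# Ablation sketch (hypothetical layer/neuron indices; a generic hook-based
# approximation of targeted neuron ablation): zero selected MLP units in one
# GPT-2-XL block and compare next-token predictions with the baseline.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

LAYER = 24              # placeholder block index (GPT-2-XL has 48 blocks)
NEURONS = [1337, 2048]  # placeholder neuron indices within the MLP

def ablate_hook(module, inputs, output):
    # Zero the chosen units of the MLP's first projection (c_fc). Since
    # GELU(0) = 0, this also silences the corresponding post-activation
    # neurons; scaling instead of zeroing would enhance them.
    output[..., NEURONS] = 0.0
    return output

handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(ablate_hook)

inputs = tokenizer('The made-up name "Kiki" sounds like it belongs to a',
                   return_tensors="pt")
with torch.no_grad():
    ablated = model(**inputs).logits[0, -1]

handle.remove()  # restore the unmodified model
with torch.no_grad():
    baseline = model(**inputs).logits[0, -1]

# Inspect how the top next-token predictions shift under ablation.
print(tokenizer.convert_ids_to_tokens(torch.topk(baseline, 5).indices.tolist()))
print(tokenizer.convert_ids_to_tokens(torch.topk(ablated, 5).indices.tolist()))
```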
This approach provided unique insights into model interpretability by identifying "language competence neurons." It marked the first use of psycholinguistic experiments to probe neuron-level representations in transformer-based LLMs, introducing a nuanced framework for understanding the internal mechanisms driving language abilities. The findings advance LLM interpretability by clarifying how human-like language competences are represented and exercised at the neuron level.