Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals (2402.11655v2)

Published 18 Feb 2024 in cs.CL

Abstract: Interpretability research aims to bridge the gap between empirical success and our scientific understanding of the inner workings of LLMs. However, most existing research focuses on analyzing a single mechanism, such as how models copy or recall factual knowledge. In this work, we propose a formulation of competition of mechanisms, which focuses on the interplay of multiple mechanisms instead of individual mechanisms and traces how one of them becomes dominant in the final prediction. We uncover how and where mechanisms compete within LLMs using two interpretability methods: logit inspection and attention modification. Our findings show traces of the mechanisms and their competition across various model components and reveal attention positions that effectively control the strength of certain mechanisms. Code: https://github.com/francescortu/comp-mech. Data: https://huggingface.co/datasets/francescortu/comp-mech.

Authors (6)

Francesco Ortu
Zhijing Jin
Diego Doimo
Mrinmaya Sachan
Alberto Cazzaniga
Bernhard Schölkopf

Summary

The paper introduces the 'competition of mechanisms' framework to trace how LLMs prioritize factual and counterfactual information.
It employs logit inspection and attention modification on models like GPT-2 and Pythia-6.9B to uncover layer-wise roles and specialized attention heads.
Findings reveal that attention blocks mainly promote counterfactual predictions while larger models favor factual recall when semantic similarity increases.

Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

The paper "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" investigates how multiple mechanisms inside LLMs interact, and which of them prevails, when a model processes factual and counterfactual information. Most existing interpretability studies focus on a single mechanism, such as knowledge recall or token copying. This work instead introduces a framework called the "competition of mechanisms" to trace how several underlying mechanisms interact and how one of them comes to determine the model's final prediction.

Methodology

The paper employs two primary interpretability methods: logit inspection and attention modification. Logit inspection projects the internal state of the residual stream into vocabulary space using the unembedding matrix, so that the logits of specific tokens can be examined at each layer of the model. Attention modification deliberately alters attention weights at specific positions of the attention matrices and measures how the model's prediction changes.
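To make the logit inspection step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy dimensions, the random values, the token ids `FACTUAL_ID` and `COUNTERFACTUAL_ID`, and the helper names are all hypothetical, and normalizing the residual vector before projecting is an assumption about how such a projection is commonly done rather than something the summary states.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def project_to_vocab(residual, unembed):
    """Project a residual-stream vector (d_model) into vocabulary space.

    unembed has shape (vocab_size, d_model); the result holds one logit per token.
    The normalization step is an assumption, not taken from the source.
    """
    normed = layer_norm(residual)
    return [sum(w * h for w, h in zip(row, normed)) for row in unembed]

# Toy sizes and random values: hypothetical stand-ins for a real model.
random.seed(0)
d_model, vocab_size, n_layers = 8, 5, 4
unembed = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
residuals = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_layers)]

FACTUAL_ID, COUNTERFACTUAL_ID = 1, 3  # hypothetical token ids

# Track the logits of the two competing tokens layer by layer.
trace = []
for layer, h in enumerate(residuals):
    logits = project_to_vocab(h, unembed)
    trace.append((logits[FACTUAL_ID], logits[COUNTERFACTUAL_ID]))
    print(f"layer {layer}: factual={logits[FACTUAL_ID]:+.3f} "
          f"counterfactual={logits[COUNTERFACTUAL_ID]:+.3f}")
```

With a real model, `residuals` would hold the residual stream at a chosen sequence position after each layer, and comparing the two logits across layers shows where one token overtakes the other.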

These methods are applied to autoregressive LLMs, specifically GPT-2 and Pythia-6.9B, using a dataset in which each prompt contains a counterfactual statement that conflicts with a factual attribute the model is expected to know. By analyzing the logits and attention patterns, the authors identify where and how the factual or the counterfactual token becomes dominant, and trace the contributions of different model components such as attention blocks and MLP layers.
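As an illustration of what such a conflicting prompt could look like, the sketch below builds one example. The template, the `ConflictExample` class, and the `build_conflict_prompt` helper are hypothetical and chosen only for clarity; the exact format used in the released dataset may differ.

```python
from dataclasses import dataclass

@dataclass
class ConflictExample:
    prompt: str          # text given to the model
    factual: str         # attribute the model would recall from its weights
    counterfactual: str  # attribute stated in the context

def build_conflict_prompt(base, factual, counterfactual):
    """State a counterfactual attribute, then repeat the base prompt for completion.

    Hypothetical template: "<base> <counterfactual>. <base>"
    """
    statement = f"{base} {counterfactual}."
    return ConflictExample(
        prompt=f"{statement} {base}",
        factual=factual,
        counterfactual=counterfactual,
    )

example = build_conflict_prompt("iPhone was developed by", "Apple", "Google")
print(example.prompt)
# prints: iPhone was developed by Google. iPhone was developed by
```

A model that continues this prompt with "Google" is copying the counterfactual from context, while one that continues with "Apple" is relying on factual recall; the two outcomes correspond to the two competing mechanisms.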

Key Findings

Layer-Wise Mechanism Dynamics: The paper finds that, in the initial layers of GPT-2, factual knowledge is predominantly encoded in the subject position, while counterfactual information is primarily stored in the attribute position. As information propagates through the model, the attention blocks play a significant role in transferring this information to the final sequence position where it influences the prediction.
Component Contributions: Attention blocks contribute substantially to promoting counterfactual predictions, while MLPs contribute to a lesser extent. Only in the final layer do the attention blocks slightly favor factual knowledge; overall, the attention heads read mostly from the attribute position to influence the model's output.
Role of Specific Attention Heads: Certain attention heads were found to be highly specialized, promoting either the factual or the counterfactual token. These heads attend strongly to the attribute position. Increasing the attention scores of these specialized heads raised the models' rates of predicting the factual token.
Impact of Semantic Similarity: The paper observes that the competition between mechanisms intensifies with increased semantic similarity between the factual and counterfactual attributes. Larger models exhibit stronger reliance on factual recall in such scenarios, suggesting an enhanced capacity to store and retrieve factual information as model size grows.
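The attention modification described under "Role of Specific Attention Heads" can be sketched as scaling a single entry of one head's attention matrix. This is a minimal illustration, not the authors' code: the `scale_attention` helper, the toy 4-token attention pattern, the position indices, and the factor `alpha` are hypothetical, and whether the modified row is renormalized afterwards is left as an explicit option because the summary does not specify it.

```python
def scale_attention(attn, query_pos, key_pos, alpha, renormalize=False):
    """Return a copy of attn with attn[query_pos][key_pos] multiplied by alpha.

    attn is a square list of lists where row i holds the attention weights of
    query position i over all key positions. Renormalizing the modified row so
    that it sums to 1 is optional (an assumption, not taken from the source).
    """
    modified = [row[:] for row in attn]
    modified[query_pos][key_pos] *= alpha
    if renormalize:
        total = sum(modified[query_pos])
        modified[query_pos] = [w / total for w in modified[query_pos]]
    return modified

# Toy causal attention pattern for one head over 4 tokens (hypothetical values).
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.1, 0.2, 0.3, 0.4],
]
LAST_POS, ATTRIBUTE_POS = 3, 2  # hypothetical positions in the prompt

# Strengthen how much the last token reads from the attribute position.
boosted = scale_attention(attn, LAST_POS, ATTRIBUTE_POS, alpha=2.0)
boosted_norm = scale_attention(attn, LAST_POS, ATTRIBUTE_POS, alpha=2.0, renormalize=True)
print(boosted[LAST_POS])
print(boosted_norm[LAST_POS])
```

In an actual experiment the modified weights would be written back into the chosen head during the forward pass, and the rate at which the model predicts the factual token would be measured with and without the change.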

Implications and Future Work

This work underscores the nuanced interactions that arise within LLMs when mechanisms compete, and emphasizes that understanding these dynamics matters for improving both the interpretability and the reliability of LLMs. The findings have practical implications for model accuracy, particularly in scenarios where factual correctness is essential.

Future developments could build on this framework to explore larger and more complex models, extending the analyses to a wider variety of datasets and linguistic structures. Understanding the variability in prompt structures and exploring additional mechanisms could lead to more sophisticated tuning of attention mechanisms, potentially enabling better control over factual and counterfactual predictions in LLM applications.

In summary, this paper advances our understanding of the internal mechanics of LLMs by presenting an approach for interpreting how these models arbitrate between factual recall and counterfactual adaptation. The insights provide a foundation for improving current models and for addressing open challenges in model interpretability and reliability.
