
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching (2311.17030v2)

Published 28 Nov 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention makes the model's output behave as if the value of a feature was changed, this effect may be achieved by activating a dormant parallel pathway leveraging another subspace that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. However, this does not imply that activation patching of subspaces is intrinsically unfit for interpretability. To contextualize our findings, we also show what a success case looks like in a task (indirect object identification) where prior manual circuit analysis informs an understanding of the location of a feature. We explore the additional evidence needed to argue that a patched subspace is faithful.

Citations (14)

Summary

  • The paper reveals that subspace activation patching can produce an interpretability illusion by activating dormant pathways.
  • It uses experiments on tasks like indirect object identification and factual recall to demonstrate misleading attributions in model outputs.
  • The study urges validating subspace interventions with comprehensive circuit analyses to improve the reliability of interpretability claims.

The Illusion of Interpretability in Subspace Activation Patching

The paper "Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching" critically examines the use of subspace interventions in mechanistic interpretability, particularly focusing on a phenomenon termed as interpretability illusion. This work contributes to the field by demonstrating that while subspace activation patching may appear to provide insights into model behavior, it can lead to misleading attributions due to dormant pathways in the model's architecture.

Key Insights and Methodology

Mechanistic interpretability seeks to explain model behaviors by attributing them to specific, interpretable features. The paper argues that recent subspace interventions, such as activation patching, can conflate two distinct goals: manipulating model behavior and localizing the feature responsible for it. In subspace activation patching, the component of an activation lying in a chosen subspace is replaced with the corresponding component from a counterfactual run; the model's output may then behave as if the feature's value had changed, even though the effect is achieved by triggering an unrelated, normally dormant pathway rather than by manipulating the subspace that actually encodes the feature.
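
For concreteness, here is a minimal sketch of one-dimensional subspace activation patching, assuming PyTorch tensors for a clean and a corrupted (counterfactual) activation and a direction `v`; the names are illustrative, not taken from the paper:

```python
import torch

def patch_subspace(clean_act: torch.Tensor,
                   corrupt_act: torch.Tensor,
                   v: torch.Tensor) -> torch.Tensor:
    """Replace the component of `clean_act` along direction `v`
    with the corresponding component of `corrupt_act`.

    clean_act, corrupt_act: activation vectors of shape (d,)
    v: direction of shape (d,) spanning the 1-D subspace
    """
    v = v / v.norm()                     # work with a unit-norm direction
    clean_coord = clean_act @ v          # coordinate of the clean run along v
    corrupt_coord = corrupt_act @ v      # coordinate of the corrupted run along v
    # keep everything orthogonal to v, swap in the corrupted coordinate along v
    return clean_act + (corrupt_coord - clean_coord) * v
```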

The authors employ a series of experiments and theoretical analyses to demonstrate this illusion. Key examples include a distilled mathematical example and studies on established tasks such as indirect object identification (IOI) and factual recall in LLMs. The analysis shows that an illusory patching direction decomposes into two components:

  1. Causally disconnected component: this part of the direction correlates with the feature across inputs but has no causal effect on the output, for example because it lies in the kernel (nullspace) of the weights that read the patched activation.
  2. Dormant component: this part is causally connected to the output but is essentially constant on the data distribution; patching can activate it, changing the output even though it never carried the feature in normal operation (a toy illustration of this decomposition is sketched below).
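
As a toy illustration in the spirit of the paper's distilled example (a simplified sketch, not the paper's exact construction), consider a linear readout `w` over a 3-dimensional activation. Axis 0 is the feature direction the model actually uses, axis 1 tracks the feature but is ignored by `w` (disconnected), and axis 2 is read by `w` but is always zero on the data distribution (dormant). Patching along a mixture of axes 1 and 2 flips the output as if the feature had changed, even though the direction is orthogonal to the feature axis:

```python
import torch

# Toy linear "model": output = w @ activation
w = torch.tensor([1.0, 0.0, 2.0])

def model(x: torch.Tensor) -> torch.Tensor:
    return w @ x

# On the data distribution the activation is [f, f, 0] for a feature f in {0, 1}:
#   axis 0 carries the feature and is read by w   -> the faithful feature direction
#   axis 1 tracks the feature but w[1] = 0        -> causally disconnected
#   axis 2 is read by w but is always 0 on-data   -> dormant
def act(f: float) -> torch.Tensor:
    return torch.tensor([f, f, 0.0])

def patch(clean: torch.Tensor, corrupt: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    v = v / v.norm()
    return clean + ((corrupt - clean) @ v) * v

x_clean, x_corrupt = act(1.0), act(0.0)

# Patch along an "illusory" direction mixing the disconnected and dormant axes
v_illusory = torch.tensor([0.0, 1.0, 1.0])

print(model(x_clean))                                 # 1.0 on the clean run
print(model(patch(x_clean, x_corrupt, v_illusory)))   # 0.0: behaves as if f flipped
# yet v_illusory is orthogonal to axis 0, the direction the model actually uses
```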

Empirical Observations

The research validates this illusion empirically on the IOI task in GPT-2 Small. Patching certain MLP-layer subspaces produced output changes that seemed to confirm the feature had been localized there. However, decomposing those subspaces revealed that the components causally responsible for the behavior on normal runs were absent, indicating that the apparent localization was illusory.
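
One way to probe for a disconnected component (an illustrative diagnostic sketched here under the assumption of a single downstream linear map, not the paper's exact analysis) is to split a candidate patch direction into the part that downstream weights can actually read and the residual in their kernel, then measure the effect of patching each part separately:

```python
import torch

def split_by_downstream_weights(v: torch.Tensor, W_down: torch.Tensor):
    """Split a patch direction into a component the downstream weights can
    read (row space of W_down) and a component they ignore (kernel of W_down).

    v:       (d,)        candidate patch direction
    W_down:  (d_out, d)  weights consuming the patched activation
                         (assumed full row rank for this sketch)
    """
    # orthonormal basis for the row space via reduced QR on W_down^T
    Q, _ = torch.linalg.qr(W_down.T)      # Q: (d, min(d, d_out))
    v_read = Q @ (Q.T @ v)                # projection onto the row space
    v_kernel = v - v_read                 # causally disconnected part
    return v_read, v_kernel
```

If most of the patching effect survives when only `v_kernel` is removed, the direction's correlation with the feature is doing little causal work and the dormant-pathway explanation becomes more plausible.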

Similarly, in factual recall tasks, subspace interventions altered model predictions in ways that falsely suggested the fact had been localized, again by activating dormant pathways. The paper further connects this illusion to rank-1 fact editing, providing a mechanistic explanation for earlier observations that editing success does not reliably indicate where a fact is actually stored.
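
For context, rank-1 fact editing methods (e.g., ROME-style edits) modify a single MLP weight matrix with an outer-product update. The sketch below is a schematic, simplified version (the actual methods use a more involved, covariance-weighted update, and the variable names here are illustrative):

```python
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v_new: torch.Tensor) -> torch.Tensor:
    """Schematic rank-1 weight edit.

    W:      (d_out, d_in) MLP weight treated as a linear associative memory
    k:      (d_in,)  key vector representing the subject
    v_new:  (d_out,) value the edited model should produce for that key
    """
    v_old = W @ k
    # add an outer-product correction so that (W + delta) @ k == v_new,
    # while keys orthogonal to k are left unchanged
    delta = torch.outer(v_new - v_old, k) / (k @ k)
    return W + delta
```

The paper's point is that the success of such an edit at a given layer is not, by itself, evidence that the fact was represented there.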

Implications for AI Development

This research has significant implications for refining interpretability methodology. The findings warn against drawing premature conclusions about feature localization from subspace patching alone. The authors recommend corroborating subspace findings with detailed circuit analyses and validating them across multiple experimental paradigms to avoid misinterpretation.

Moreover, the work suggests considerations for future model design and evaluation. By understanding where such illusions are likely to occur (e.g., in the middle layers of the network), researchers can better target transparency and robustness interventions.

Concluding Thoughts

This paper underscores the need for cautious analysis within AI interpretability research. The interpretability illusion identified here is a reminder that correlational evidence about features can diverge from causal structure, and it calls for more rigorous approaches to dissecting neural network behaviors. The insights emphasize the ongoing need for tools and methodologies that accurately reflect the mechanisms driving deep learning architectures.
