A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
Introduction
The paper "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders" proposes an exploration into Sparse Autoencoders (SAEs) within the context of LLMs and how these models manage to decompose dense activations into human-interpretable latents. This research poses two critical questions: the extent to which SAEs can extract monosemantic and interpretable latents from LLM activations, and how alterations in sparsity or size of the SAE impact monosemanticity and interpretability. The paper introduces the notion of "feature absorption," a problematic form of feature-splitting where interpretably-aligned features fail to fire under certain conditions.
SAE Performance on First Letter Identification
The paper uses a first-letter identification task to scrutinize the interpretability of SAE latents. The researchers train logistic regression (LR) probes to establish a baseline and compare them with SAE latents aligned to the same task. Evaluating precision and recall shows that SAEs generally underperform the linear probes. The results, detailed in Figure 1, indicate that varying the sparsity (L0) and width (number of latents) of the SAEs does not substantially close this gap: the best SAE configurations achieve high precision with low recall, or high recall with low precision, but fail to balance both.
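The sketch below illustrates the shape of this evaluation on synthetic data: a logistic regression probe trained on token activations versus a single SAE latent used as a thresholded classifier, both scored with precision and recall. The variable names, data, and threshold are placeholders, not the paper's actual setup.

```python
# Sketch: logistic regression probe vs. a single SAE latent on "starts with s".
# Synthetic data; `acts`, `latent_acts`, and the 0 threshold are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n_tokens, d_model = 2000, 64
acts = rng.normal(size=(n_tokens, d_model))             # residual-stream activations
true_dir = rng.normal(size=d_model)                     # ground-truth feature direction
starts_with_s = (acts @ true_dir > 0).astype(int)       # 1 if token starts with "s"
# A noisy SAE latent that only partially tracks the feature.
latent_acts = np.maximum(acts @ true_dir + rng.normal(scale=2.0, size=n_tokens), 0)

# Baseline: supervised linear probe for the feature.
probe = LogisticRegression(max_iter=1000).fit(acts, starts_with_s)
probe_pred = probe.predict(acts)

# SAE latent as a classifier: if it fires above threshold, predict "starts with s".
latent_pred = (latent_acts > 0).astype(int)

for name, pred in [("LR probe", probe_pred), ("SAE latent", latent_pred)]:
    p = precision_score(starts_with_s, pred, zero_division=0)
    r = recall_score(starts_with_s, pred, zero_division=0)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```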
Feature Absorption: Concept and Case Study
A significant finding of this paper is "feature absorption," defined as cases where an SAE latent seems monosemantic but fails to activate where expected, with a token-aligned latent absorbing the feature instead. A detailed case study of the latent corresponding to the "starts with S" feature in a specific SAE configuration provides compelling evidence: the latent typically activates on tokens like "sample" or "stone" but fails on "short," where a token-aligned latent fires in its place. Ablation studies, visualized in Figure 2, corroborate the substantial causal impact of these absorbing latents on model behavior, showing that on such tokens they, rather than the expected latent, carry the feature's contribution.
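A rough sketch of such an ablation, assuming the SAE is represented by its latent activations and decoder matrix: zero out one latent, re-decode, and measure the change in the reconstruction's projection onto a feature probe direction as a stand-in for the effect on the model's first-letter behavior. Names, dimensions, and the chosen latent index are illustrative.

```python
# Sketch: ablate one SAE latent and measure its contribution along a probe
# direction. `f`, `W_dec`, `probe_w`, and latent index 123 are illustrative only.
import torch

def ablate_latent(f: torch.Tensor, W_dec: torch.Tensor, b_dec: torch.Tensor,
                  latent_idx: int) -> torch.Tensor:
    """Reconstruct activations with a single SAE latent zeroed out."""
    f_abl = f.clone()
    f_abl[..., latent_idx] = 0.0
    return f_abl @ W_dec + b_dec

d_model, d_sae = 64, 512
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)
f = torch.relu(torch.randn(8, d_sae))       # SAE latent activations on "short" tokens
probe_w = torch.randn(d_model)              # "starts with s" probe direction

full = f @ W_dec + b_dec
ablated = ablate_latent(f, W_dec, b_dec, latent_idx=123)
# Effect of the (suspected absorbing) latent: shift in projection onto the probe.
print(((full - ablated) @ probe_w).mean().item())
```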
Quantifying Feature Splitting and Absorption
The researchers conducted comprehensive probing to quantify the prevalence of feature splitting and feature absorption across different SAE configurations. Feature splitting is identified through k-sparse probing, revealing that wider and sparser SAEs exhibit higher rates of feature splitting. Figure 3 illustrates that while sparser SAEs tend to decompose general features into more specific ones, they do not inherently improve interpretability, because the resulting splits do not align consistently with the intended feature.
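A simplified sketch of k-sparse probing on synthetic latents: rank latents by class-mean difference, keep the top k, fit a logistic regression probe on them, and track F1 as k grows. A large jump from k=1 to larger k is the kind of signature that suggests a feature has split across several latents. The selection heuristic and data here are illustrative assumptions.

```python
# Sketch: k-sparse probing over SAE latents. Selection by class-mean difference
# is a simplification; latents and labels are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n, d_sae = 4000, 512
latents = np.maximum(rng.normal(size=(n, d_sae)), 0)   # SAE latent activations
labels = rng.integers(0, 2, size=n)                    # "starts with s" labels

def k_sparse_probe_f1(latents: np.ndarray, labels: np.ndarray, k: int) -> float:
    # Rank latents by class-mean difference, keep the top k, fit a probe on them.
    diff = latents[labels == 1].mean(0) - latents[labels == 0].mean(0)
    top_k = np.argsort(-np.abs(diff))[:k]
    probe = LogisticRegression(max_iter=1000).fit(latents[:, top_k], labels)
    return f1_score(labels, probe.predict(latents[:, top_k]))

# If F1 jumps sharply between k=1 and larger k, the feature is likely split.
for k in (1, 2, 4, 8):
    print(f"k={k}: F1={k_sparse_probe_f1(latents, labels, k):.3f}")
```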
Feature absorption is measured by identifying failure cases where the intended latent does not activate but a token-aligned latent does. The phenomenon increases with SAE sparsity and width, suggesting a trade-off between sparsity and reliable feature alignment. Figures 4a and 4b show that feature absorption is widespread, calling into question the reliability of SAEs for precise interpretability tasks.
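The sketch below captures one plausible way to flag absorption cases, assuming access to SAE latent activations, the decoder matrix, and a probe direction: find tokens where the feature is truly present, the main latents are silent, and some other active latent has a decoder direction well aligned with the probe. The cosine threshold and variable names are assumptions for illustration, not the paper's exact criteria.

```python
# Sketch: flag likely absorption cases. A feature-positive token where no "main"
# latent fires, but some other active latent has a decoder direction aligned with
# the probe. The 0.5 cosine threshold and all inputs are illustrative assumptions.
import numpy as np

def find_absorption(f, W_dec, probe_dir, labels, main_latents, cos_thresh=0.5):
    """Return (token_idx, latent_idx) pairs that look like absorption."""
    W_unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    cos = W_unit @ (probe_dir / np.linalg.norm(probe_dir))   # decoder/probe alignment
    cases = []
    for t in np.where(labels == 1)[0]:                       # feature truly present
        if f[t, main_latents].max() > 0:                     # main latents fired: fine
            continue
        for j in np.where(f[t] > 0)[0]:                      # other active latents
            if cos[j] > cos_thresh:                          # aligned latent absorbed it
                cases.append((t, j))
    return cases

rng = np.random.default_rng(0)
n, d_sae, d_model = 100, 64, 16
f = np.maximum(rng.normal(size=(n, d_sae)), 0)
W_dec = rng.normal(size=(d_sae, d_model))
probe_dir = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n)
print(len(find_absorption(f, W_dec, probe_dir, labels, main_latents=[0, 1])))
```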
Implications and Future Directions
The implications of these findings are significant for AI interpretability and safety-critical applications. Feature absorption poses a serious challenge for methods that rely on circuit analysis or on combining sparse features. If it remains unaddressed, the reliability of SAE latents as indicators of internal model behavior, such as detecting bias or deceptive behavior, is compromised.
Future work should further validate these results across different model architectures and tasks beyond first-letter identification. Meta-SAEs, which may decompose absorbing latents into their constituent features, are a promising avenue, as are alternative methods such as attribution dictionary learning. Establishing a standardized framework for evaluating and mitigating phenomena like feature absorption is crucial for advancing interpretability methodologies for LLMs.
Conclusion
This paper contributes notably to understanding the limitations and intricacies of using SAEs for interpretability in LLMs. By highlighting feature absorption and its impact on our ability to understand model behavior, it points to critical areas for future research. As LLMs are applied to increasingly complex tasks, robust methods for understanding and interpreting these models will be indispensable.