A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
Introduction
The paper "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders" proposes an exploration into Sparse Autoencoders (SAEs) within the context of LLMs and how these models manage to decompose dense activations into human-interpretable latents. This research poses two critical questions: the extent to which SAEs can extract monosemantic and interpretable latents from LLM activations, and how alterations in sparsity or size of the SAE impact monosemanticity and interpretability. The paper introduces the notion of "feature absorption," a problematic form of feature-splitting where interpretably-aligned features fail to fire under certain conditions.
SAE Performance on First Letter Identification
The paper uses a first-letter identification task to scrutinize the interpretability of SAE latents. The researchers train logistic regression (LR) probes to establish a baseline and compare them with SAE latents aligned to the same task. Evaluating precision and recall shows that SAEs generally underperform the linear probes. The results, detailed in Figure 1, indicate that varying the sparsity (L0) and width (number of latents) of the SAEs does not substantially close this gap: the best SAE configurations achieve high precision with low recall, or high recall with low precision, but fail to balance both.
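The sketch below illustrates the shape of this evaluation on synthetic data: a logistic regression probe trained on token activations versus a single SAE latent used as a thresholded classifier, both scored with precision and recall. The variable names, data, and threshold are placeholders, not the paper's actual setup.

```python
# Sketch: logistic regression probe vs. a single SAE latent on "starts with s".
# Synthetic data; `acts`, `latent_acts`, and the 0 threshold are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n_tokens, d_model = 2000, 64
acts = rng.normal(size=(n_tokens, d_model))             # residual-stream activations
true_dir = rng.normal(size=d_model)                     # ground-truth feature direction
starts_with_s = (acts @ true_dir > 0).astype(int)       # 1 if token starts with "s"
# A noisy SAE latent that only partially tracks the feature.
latent_acts = np.maximum(acts @ true_dir + rng.normal(scale=2.0, size=n_tokens), 0)

# Baseline: supervised linear probe for the feature.
probe = LogisticRegression(max_iter=1000).fit(acts, starts_with_s)
probe_pred = probe.predict(acts)

# SAE latent as a classifier: if it fires above threshold, predict "starts with s".
latent_pred = (latent_acts > 0).astype(int)

for name, pred in [("LR probe", probe_pred), ("SAE latent", latent_pred)]:
    p = precision_score(starts_with_s, pred, zero_division=0)
    r = recall_score(starts_with_s, pred, zero_division=0)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```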
Feature Absorption: Concept and Case Study
A significant finding of this paper is "feature absorption," defined as cases where an SAE latent seems monosemantic but fails to activate where expected, with a token-aligned latent absorbing the feature instead. A detailed case study of the latent corresponding to the "starts with S" feature in a specific SAE configuration provides compelling evidence: the latent typically activates on tokens like "sample" or "stone" but fails on "short," where a token-aligned latent fires in its place. Ablation studies, visualized in Figure 2, corroborate the substantial causal impact of these absorbing latents on model behavior, showing that on such tokens they, rather than the expected latent, carry the feature's contribution.
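A rough sketch of such an ablation, assuming the SAE is represented by its latent activations and decoder matrix: zero out one latent, re-decode, and measure the change in the reconstruction's projection onto a feature probe direction as a stand-in for the effect on the model's first-letter behavior. Names, dimensions, and the chosen latent index are illustrative.

```python
# Sketch: ablate one SAE latent and measure its contribution along a probe
# direction. `f`, `W_dec`, `probe_w`, and latent index 123 are illustrative only.
import torch

def ablate_latent(f: torch.Tensor, W_dec: torch.Tensor, b_dec: torch.Tensor,
                  latent_idx: int) -> torch.Tensor:
    """Reconstruct activations with a single SAE latent zeroed out."""
    f_abl = f.clone()
    f_abl[..., latent_idx] = 0.0
    return f_abl @ W_dec + b_dec

d_model, d_sae = 64, 512
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)
f = torch.relu(torch.randn(8, d_sae))       # SAE latent activations on "short" tokens
probe_w = torch.randn(d_model)              # "starts with s" probe direction

full = f @ W_dec + b_dec
ablated = ablate_latent(f, W_dec, b_dec, latent_idx=123)
# Effect of the (suspected absorbing) latent: shift in projection onto the probe.
print(((full - ablated) @ probe_w).mean().item())
```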
Quantifying Feature Splitting and Absorption
The researchers conducted comprehensive probing to quantify the prevalence of feature splitting and feature absorption across different SAE configurations. Feature splitting is identified through k-sparse probing, revealing that wider and sparser SAEs exhibit higher rates of feature splitting. Figure 3 illustrates that while sparser SAEs tend to decompose general features into more specific ones, they do not inherently improve interpretability, because the resulting splits do not align consistently with the intended feature.
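A simplified sketch of k-sparse probing on synthetic latents: rank latents by class-mean difference, keep the top k, fit a logistic regression probe on them, and track F1 as k grows. A large jump from k=1 to larger k is the kind of signature that suggests a feature has split across several latents. The selection heuristic and data here are illustrative assumptions.

```python
# Sketch: k-sparse probing over SAE latents. Selection by class-mean difference
# is a simplification; latents and labels are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n, d_sae = 4000, 512
latents = np.maximum(rng.normal(size=(n, d_sae)), 0)   # SAE latent activations
labels = rng.integers(0, 2, size=n)                    # "starts with s" labels

def k_sparse_probe_f1(latents: np.ndarray, labels: np.ndarray, k: int) -> float:
    # Rank latents by class-mean difference, keep the top k, fit a probe on them.
    diff = latents[labels == 1].mean(0) - latents[labels == 0].mean(0)
    top_k = np.argsort(-np.abs(diff))[:k]
    probe = LogisticRegression(max_iter=1000).fit(latents[:, top_k], labels)
    return f1_score(labels, probe.predict(latents[:, top_k]))

# If F1 jumps sharply between k=1 and larger k, the feature is likely split.
for k in (1, 2, 4, 8):
    print(f"k={k}: F1={k_sparse_probe_f1(latents, labels, k):.3f}")
```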
Feature absorption is measured by identifying failure cases where the intended latent does not activate but a token-aligned latent does. The phenomenon increases with SAE sparsity and width, suggesting a trade-off between sparsity and reliable feature alignment. Figures 4a and 4b show that feature absorption is widespread, calling into question the reliability of SAEs for precise interpretability tasks.
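The sketch below captures one plausible way to flag absorption cases, assuming access to SAE latent activations, the decoder matrix, and a probe direction: find tokens where the feature is truly present, the main latents are silent, and some other active latent has a decoder direction well aligned with the probe. The cosine threshold and variable names are assumptions for illustration, not the paper's exact criteria.

```python
# Sketch: flag likely absorption cases. A feature-positive token where no "main"
# latent fires, but some other active latent has a decoder direction aligned with
# the probe. The 0.5 cosine threshold and all inputs are illustrative assumptions.
import numpy as np

def find_absorption(f, W_dec, probe_dir, labels, main_latents, cos_thresh=0.5):
    """Return (token_idx, latent_idx) pairs that look like absorption."""
    W_unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    cos = W_unit @ (probe_dir / np.linalg.norm(probe_dir))   # decoder/probe alignment
    cases = []
    for t in np.where(labels == 1)[0]:                       # feature truly present
        if f[t, main_latents].max() > 0:                     # main latents fired: fine
            continue
        for j in np.where(f[t] > 0)[0]:                      # other active latents
            if cos[j] > cos_thresh:                          # aligned latent absorbed it
                cases.append((t, j))
    return cases

rng = np.random.default_rng(0)
n, d_sae, d_model = 100, 64, 16
f = np.maximum(rng.normal(size=(n, d_sae)), 0)
W_dec = rng.normal(size=(d_sae, d_model))
probe_dir = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n)
print(len(find_absorption(f, W_dec, probe_dir, labels, main_latents=[0, 1])))
```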
Implications and Future Directions
The implications of these findings are significant for AI interpretability and safety-critical applications. Feature absorption poses a serious challenge for methods that rely on circuit analysis or on combining sparse features. If it remains unaddressed, the reliability of SAE latents as indicators of internal model behavior, such as detecting bias or deceptive behavior, is compromised.
Future work should further validate these results across different model architectures and tasks beyond first-letter identification. Meta-SAEs, which may decompose absorbing latents into their constituent features, are a promising avenue, as are alternative methods such as attribution dictionary learning. Establishing a standardized framework for evaluating and mitigating phenomena like feature absorption is crucial for advancing interpretability methodologies for LLMs.
Conclusion
This paper contributes notably to understanding the limitations and intricacies of using SAEs for interpretability in LLMs. By highlighting feature absorption and its impact on our ability to understand model behavior, it points to critical areas for future research. As LLMs are applied to increasingly complex tasks, robust methods for understanding and interpreting these models will be indispensable.