
Sparse Autoencoders Do Not Find Canonical Units of Analysis (2502.04878v1)

Published 7 Feb 2025 in cs.LG and cs.AI

Abstract: A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a canonical set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: novel latents, which improve performance when added to the smaller SAE, indicating they capture novel information, and reconstruction latents, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel features indicates incompleteness of smaller SAEs. Using meta-SAEs -- SAEs trained on the decoder matrix of another SAE -- we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g., a latent representing "Einstein" decomposes into "scientist", "Germany", and "famous person". Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/

Summary

  • The paper finds that increasing SAE dictionary size uncovers novel, compositional features rather than a complete set of atomic units.
  • It introduces SAE stitching and meta-SAEs to systematically compare SAEs of different sizes and decompose their decoder directions.
  • Empirical results show that even at larger SAE widths, SAEs produce hierarchical, overlapping representations, calling the existence of canonical features into question.

Sparse Autoencoders Do Not Find Canonical Units of Analysis

The paper systematically investigates the capacity of sparse autoencoders (SAEs) to discover a unique, complete, and atomic set of features—termed “canonical units of analysis”—from neural network activations, particularly in the context of mechanistic interpretability for LLMs. The authors introduce two new analytical tools (SAE stitching and meta-SAEs) and present empirical findings that directly challenge key hypotheses in current interpretability research regarding the canonicality and atomicity of SAE-learned features.

Overview of Contributions

The paper centers on two core questions:

  1. Completeness: Do SAEs, when scaled up, recover all of the relevant features present in the model's activations?
  2. Atomicity: Are the features discovered by SAEs irreducible, or do they decompose into more fundamental units?

To probe these questions, the authors develop:

  • SAE Stitching: A method for systematically swapping or inserting latents between SAEs of differing dictionary sizes, which enables the identification of “novel” (previously missing) versus “reconstruction” (redundant or overlapping) features.
  • Meta-SAEs: A secondary SAE trained on the decoder directions of another SAE, which examines whether seemingly atomic latents from a large SAE can themselves be decomposed.

Technical Summary

Sparse Autoencoders in LLMs:

SAEs decompose high-dimensional activation vectors into sparse combinations of learned directions—latents—intended to correspond to interpretable “features.” The expectation has been that increasing SAE dictionary size allows the model to approach a set of canonical features, effectively mapping the model’s computation into a human-understandable basis.
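For concreteness, a minimal SAE of the kind described here can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it assumes a plain ReLU encoder with an L1 sparsity penalty, and the layer sizes are arbitrary (the paper's SAEs use sparsity mechanisms such as BatchTopK).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: activations -> sparse latent codes -> reconstruction."""

    def __init__(self, d_model: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        # Decoder: each latent contributes one d_model-dimensional direction.
        self.decoder = nn.Linear(dict_size, d_model)
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor):
        z = self.activation(self.encoder(x))   # sparse latent activations
        x_hat = self.decoder(z)                 # reconstruction of the activations
        return x_hat, z

def loss_fn(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse codes.
    mse = (x - x_hat).pow(2).mean()
    sparsity = z.abs().mean()
    return mse + l1_coeff * sparsity

sae = SparseAutoencoder(d_model=768, dict_size=6144)   # e.g. a GPT-2 Small residual stream
x = torch.randn(32, 768)                                # a batch of activation vectors
x_hat, z = sae(x)
loss = loss_fn(x, x_hat, z)
loss.backward()
```

The learned decoder directions (one per latent) are the objects that both SAE stitching and meta-SAEs operate on.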

Empirical Methods:

  • Numerous SAEs are trained on GPT-2 Small and Gemma 2B models, systematically varying dictionary sizes from hundreds to nearly 100,000 dimensions.
  • SAE stitching utilizes cosine similarity between decoder directions to classify latents as either “novel” (low similarity to any latent in a smaller SAE) or “reconstruction” (high similarity, suggesting redundancy); see the sketch after this list.
  • Meta-SAEs are trained with the decoder directions from a large SAE as input data, with the meta-latents corresponding to decompositions of these directions.
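A rough sketch of the novel-versus-reconstruction classification step referenced above is given below. It is a simplified heuristic under stated assumptions: both SAEs are assumed to expose decoder weight matrices of shape (n_latents, d_model), the 0.7 similarity threshold and matrix sizes are purely illustrative, and the paper's stitching procedure additionally measures the effect of inserting or swapping latents on reconstruction error.

```python
import torch
import torch.nn.functional as F

def classify_latents(W_dec_small: torch.Tensor,
                     W_dec_large: torch.Tensor,
                     threshold: float = 0.7):
    """Label each latent of the larger SAE as 'reconstruction' (it closely
    matches some latent of the smaller SAE) or 'novel' (it does not).

    W_dec_small: (n_small, d_model) decoder directions of the small SAE
    W_dec_large: (n_large, d_model) decoder directions of the large SAE
    """
    small = F.normalize(W_dec_small, dim=-1)
    large = F.normalize(W_dec_large, dim=-1)
    cos = large @ small.T                      # (n_large, n_small) cosine similarities
    max_sim, nearest = cos.max(dim=-1)         # best match in the small SAE per large latent
    is_reconstruction = max_sim >= threshold
    return is_reconstruction, nearest, max_sim

# Example with small random stand-ins for the two decoder matrices.
W_small = torch.randn(768, 768)
W_large = torch.randn(6144, 768)
is_recon, nearest, sims = classify_latents(W_small, W_large)
print(f"novel latents: {(~is_recon).sum().item()} / {W_large.shape[0]}")
```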

Key Findings:

  • Incomplete Coverage: Small SAEs miss features, as evidenced by “novel” latents in larger SAEs that, when stitched into smaller SAEs, measurably improve reconstruction.
  • Lack of Atomicity: Many large-SAE latents are shown, via meta-SAEs, to be compositional—often interpretable as combinations (sometimes sparse, sometimes not) of more basic features that can themselves be linked to latents in smaller SAEs.
  • No Canonical Boundary: There is no observed SAE width at which learned features are simultaneously unique, complete, and irreducible. Instead, SAEs at different scales produce hierarchies of features at varying granularity and degrees of compositionality.

Numerical and Evaluation Results

  • Reconstruction Effectiveness: For GPT-2 Small, increasing the SAE dictionary size progressively reduces reconstruction MSE, but additional “novel” features continue to be discovered up to the largest dictionary size tested.
  • Meta-SAE Decomposition: In a 49,152-latent SAE, meta-SAEs with as few as 2,304 meta-latents can explain over 55% of the variance in decoder directions, with individual large-latent directions being interpretable as linear combinations of meta-latents tied to high-level concepts (see the sketch after this list).
  • Automated Interpretability: Meta-latent explanations recover the correct latent semantic explanation 73% of the time in a zero-shot multiple-choice probe using an LLM, demonstrating both interpretability and information overlap with smaller SAEs.
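The variance-explained figure above corresponds to a standard explained-variance computation over the decoder directions. A minimal sketch follows, assuming the meta-SAE's reconstructions of the decoder rows are already available; random stand-ins are used here purely for illustration.

```python
import torch

def variance_explained(W_dec: torch.Tensor, W_dec_hat: torch.Tensor) -> float:
    """Fraction of variance in the decoder directions explained by the
    meta-SAE reconstructions.

    W_dec:     (n_latents, d_model) original decoder directions
    W_dec_hat: (n_latents, d_model) meta-SAE reconstructions of those directions
    """
    residual = (W_dec - W_dec_hat).pow(2).sum()
    total = (W_dec - W_dec.mean(dim=0)).pow(2).sum()
    return 1.0 - (residual / total).item()

# Illustrative call with random stand-ins for real decoder matrices.
W = torch.randn(4096, 768)
W_hat = W + 0.5 * torch.randn_like(W)
print(f"variance explained: {variance_explained(W, W_hat):.2%}")
```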

Practical Implications

For practitioners in interpretability:

  • The notion that simply scaling up SAE dictionary size will eventually produce a unique, irreducible set of semantic features is empirically unsupported.
  • The features found by SAEs are not invariant to dictionary size: larger SAEs not only detect increasingly fine-grained phenomena but also form features that are compositions rather than fundamental atoms of computation.
  • The selection of SAE width (dictionary size) must be made pragmatically, with respect to the interpretability or intervention objective, rather than in search of a “true” basis.

Algorithmic and Implementation Advice:

  • SAE Stitching: When comparing SAEs of different widths, cluster decoder directions by cosine similarity to identify shared vs. novel features. This allows for smooth interpolation between SAE configurations for diverse interpretability experiments.
  • Meta-SAEs: Training a meta-SAE (using, e.g., BatchTopK, JumpReLU, or other recent sparse coding variants) on an existing SAE's decoder directions can reveal compositionality and may assist in quantifying polysemanticity versus monosemanticity in feature sets; see the sketch after this list.
  • Evaluation: Employ linear probes or sparser diagnostic probes (using only one or a few latents) to test for the presence and specificity of known ground-truth features. Results indicate no monotonic relationship between increased SAE size and improved probe accuracy or causal disentanglement of concepts.
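As referenced in the meta-SAE item above, a compact sketch of training a BatchTopK-style meta-SAE on the decoder rows of an existing SAE is shown below. All sizes, the value of k, and the training loop are illustrative assumptions rather than the paper's actual configuration; the decoder matrix is a random stand-in for a trained SAE's decoder.

```python
import torch
import torch.nn as nn

class BatchTopKMetaSAE(nn.Module):
    """Meta-SAE: treats each decoder row of a trained SAE as a data point and
    learns a sparse decomposition of it into meta-latents."""

    def __init__(self, d_model: int, n_meta_latents: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_meta_latents)
        self.decoder = nn.Linear(n_meta_latents, d_model, bias=False)
        self.k = k  # average number of active meta-latents per decoder row

    def forward(self, x: torch.Tensor):
        pre = torch.relu(self.encoder(x))
        # BatchTopK-style sparsity: keep the k * batch_size largest activations
        # across the whole batch, zero out the rest.
        n_keep = self.k * x.shape[0]
        threshold = pre.flatten().topk(n_keep).values.min()
        z = torch.where(pre >= threshold, pre, torch.zeros_like(pre))
        return self.decoder(z), z

# Train on the decoder matrix of a (here randomly initialised) large SAE.
W_dec = torch.randn(49152, 768)                     # stand-in for real decoder directions
meta_sae = BatchTopKMetaSAE(d_model=768, n_meta_latents=2304, k=4)
opt = torch.optim.Adam(meta_sae.parameters(), lr=1e-3)

for step in range(100):                             # illustrative short loop
    batch = W_dec[torch.randint(0, W_dec.shape[0], (1024,))]
    recon, z = meta_sae(batch)
    loss = (batch - recon).pow(2).mean()            # reconstruct decoder rows
    opt.zero_grad()
    loss.backward()
    opt.step()
```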

Resource and Scaling Considerations:

  • Scaling SAEs to very high dictionary sizes significantly increases both compute and storage requirements, yet the marginal returns in unique, interpretable features diminish.
  • Evaluation must account for both reconstruction quality and interpretability—larger SAEs can “hide” compositional features behind sparser, but more polysemantic, latents.

Theoretical and Research Implications

  • The results challenge the assumption that high-sparsity, overcomplete representations are sufficient for capturing “ground truth” semantic features in deep networks.
  • The findings bolster the growing literature suggesting that superposition, feature splitting, and compositionality are not mere artifacts, but fundamental limits of current dictionary learning in deep model activations.
  • It remains open whether alternative methods (e.g., non-linear, structured, or supervised decompositions) can succeed at identifying canonical units, or whether model representations inherently resist such atomic decomposition.

Future Directions

Further research is advised in:

  • Investigating non-linear feature decompositions, potentially incorporating relational or graph-based constraints to encourage atomicity.
  • Benchmarking alternative sparse coding approaches against the compositionality and completeness criteria on a wider set of model architectures and domains.
  • Task-specific dictionary adaptation, where SAE width and sparsity are dynamically tuned to the requirements of the interpretability or control task, rather than seeking universality.

The interactive dashboard and code resources associated with this work provide a valuable foundation for extending these analyses and fostering more rigorous, task-driven approaches to feature extraction in mechanistically interpretable AI.