
Sparse Autoencoders Can Interpret Randomly Initialized Transformers (2501.17727v1)

Published 29 Jan 2025 in cs.LG

Abstract: Sparse autoencoders (SAEs) are an increasingly popular technique for interpreting the internal representations of transformers. In this paper, we apply SAEs to 'interpret' random transformers, i.e., transformers where the parameters are sampled IID from a Gaussian rather than trained on text data. We find that random and trained transformers produce similarly interpretable SAE latents, and we confirm this finding quantitatively using an open-source auto-interpretability pipeline. Further, we find that SAE quality metrics are broadly similar for random and trained transformers. We find that these results hold across model sizes and layers. We discuss a number of interesting questions that this work raises for the use of SAEs and auto-interpretability in the context of mechanistic interpretability.

Summary

  • The paper shows that sparse autoencoders extract similarly interpretable features from both random and trained transformers, suggesting that inherent statistical properties may drive observed interpretability.
  • It employs an open-source evaluation pipeline to compare auto-interpretability metrics consistently across multiple transformer sizes and configurations.
  • The results question the efficacy of current SAE methods in isolating genuinely learned representations, calling for more rigorous mechanistic interpretability techniques.

The paper "Sparse Autoencoders Can Interpret Randomly Initialized Transformers" investigates the applicability of sparse autoencoders (SAEs) to mechanistic interpretability, specifically when they are applied to randomly initialized transformer models. The central question is whether SAEs recover meaningful latent representations from transformers that have never been trained, i.e., whose parameters are sampled independently and identically distributed (IID) from a Gaussian distribution.
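
For concreteness, the sketch below shows one way such a "random transformer" could be constructed by resampling a Pythia checkpoint's parameters IID from a Gaussian; the checkpoint name and standard deviation are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch (illustrative): build a "random transformer" by resampling
# every parameter IID from a Gaussian. The checkpoint name and std are
# assumptions for illustration, not the paper's exact procedure.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")

with torch.no_grad():
    for param in model.parameters():
        param.normal_(mean=0.0, std=0.02)  # overwrite with IID Gaussian samples

model.eval()  # activations now reflect only architecture and input statistics
```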

Key Findings:

1. Similarity in Interpretability:

The paper demonstrates that sparse autoencoders, when applied to randomly initialized transformers, extract latent features that appear similarly interpretable to those obtained from trained transformers. This suggests that SAEs do not inherently differentiate between trained and untrained model parameters in terms of resulting feature interpretability.

2. Evaluation Metrics:

Quantitatively, the auto-interpretability scores for SAE latents, computed using an open-source evaluation pipeline, are comparable between random and trained transformers. Notably, this similarity holds consistently across model sizes and layers.
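
The abstract further notes that standard SAE quality metrics are broadly similar for random and trained transformers. As a rough illustration only, the sketch below computes two commonly used quality metrics, L0 sparsity and fraction of variance explained, for an SAE with an assumed encode/decode interface; this is not the paper's evaluation pipeline.

```python
# Illustrative sketch of two common SAE quality metrics; the encode/decode
# interface is an assumption, not a specific library's API.
import torch

def sae_quality_metrics(sae, activations):
    """activations: (batch, d_model) tensor of transformer activations."""
    latents = sae.encode(activations)              # (batch, d_sae), mostly zeros
    recon = sae.decode(latents)                    # (batch, d_model)
    l0 = (latents != 0).float().sum(dim=-1).mean()  # avg. active latents per input
    fvu = (activations - recon).var(dim=0).sum() / activations.var(dim=0).sum()
    return {"l0": l0.item(), "frac_variance_explained": 1.0 - fvu.item()}
```

In the paper's framing, one would compute such metrics for SAEs trained on activations from both the trained and the randomized model and compare the resulting values.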

3. Implications for Mechanistic Interpretability:

These results call into question the effectiveness of SAEs in capturing the computational mechanisms learned by neural networks. The authors argue that, because SAEs extract similarly interpretable features from untrained models, the observed interpretability may reflect the statistical properties of the input data and model architecture rather than any learned computation.

4. Feature Extraction Consistency:

The paper discusses the long-standing hypothesis that transformers could amplify pre-existing latent structures (potentially present in the input data), regardless of training. The findings raise important questions about the fidelity of SAEs in highlighting genuinely learned features as opposed to inherent statistical structures.

5. Qualitative Differences:

Despite the quantitative similarities in auto-interpretability scores, the authors hypothesize potential qualitative differences in the nature of extracted features between trained and randomized models. They suggest that while both might show similar sparsity, the nature of the processes these features represent could be distinct, warranting further investigation into the abstractness and specificity of discovered features.

Methodology:

  • SAEs are trained on activations from both trained and randomly initialized versions of transformer models from the Pythia suite, ranging in size from millions to billions of parameters.
  • SAEs are configured with high-dimensional (overcomplete) hidden layers and trained to minimize reconstruction error while enforcing sparsity constraints (a minimal sketch follows this list).
  • Various model states are examined: fully-trained, random initialization ("Step-0"), re-randomization with and without embeddings, and a control setup with Gaussian noise inputs.
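
A minimal sketch of an SAE of the kind described above, assuming a ReLU encoder and an L1 sparsity penalty; the paper's exact architecture, expansion factor, and loss details may differ.

```python
# Minimal sparse autoencoder sketch: overcomplete latent layer, reconstruction
# loss plus an L1 penalty that encourages sparse activations. Hyperparameters
# here are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        d_sae = d_model * expansion              # overcomplete latent dimension
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def encode(self, x):
        return torch.relu(self.encoder(x))       # sparse, non-negative latents

    def decode(self, latents):
        return self.decoder(latents)

    def forward(self, x):
        latents = self.encode(x)
        return self.decode(latents), latents

def sae_loss(x, recon, latents, l1_coeff: float = 1e-3):
    mse = (recon - x).pow(2).mean()              # reconstruction error
    l1 = latents.abs().sum(dim=-1).mean()        # sparsity penalty
    return mse + l1_coeff * l1
```

In the setup described here, x would be activations collected at a chosen layer of either the trained model or one of the randomized variants.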

Conclusion:

This research highlights that while SAEs are adept at extracting sparse and interpretable features from neural network activations, their application to untrained models challenges current assumptions about their utility in mechanistic interpretability. Specifically, it suggests the need for the development of more nuanced techniques capable of identifying true learned representations, as current SAE approaches might conflate statistical artifacts with learned model functions. The authors call for further exploration into more rigorous benchmarks and validation methods for interpretability tools in complex neural architectures.
