- The paper demonstrates that sparse autoencoders trained with different random seeds share only ~30% of their features, highlighting significant seed-dependent variability.
- It employs the Hungarian algorithm for optimal feature alignment, revealing a bimodal cosine similarity distribution that separates consistently shared features from low-similarity orphan features.
- The results challenge the assumption of a canonical feature set in neural networks, urging a more nuanced interpretation of model activations.
Sparse Autoencoders Trained on the Same Data Learn Different Features
Sparse Autoencoders (SAEs) have emerged as a pivotal tool for interpreting the activations of large language models (LLMs). This paper examines the behavior of SAEs trained on identical data with different random seeds and considers what its findings imply for our understanding of neural network features. Through this examination, the paper argues that distinct random seeds lead SAEs to uncover disparate sets of features, challenging the expectation that SAEs reveal a canonical set of features used by a model.
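To make the object of study concrete, the sketch below shows a minimal ReLU sparse autoencoder in PyTorch: activations are encoded into a much wider latent space, reconstructed from a learned dictionary, and trained with a reconstruction term plus an L1 sparsity penalty. The class name, dimensions, and loss coefficient are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode an activation vector into a wide, sparse latent
    space, then reconstruct it from a learned dictionary of feature directions."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(z)           # reconstruction of the input
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse latents.
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().sum(dim=-1).mean()

sae = SparseAutoencoder(d_model=4096, n_latents=131_072)
x = torch.randn(8, 4096)                  # stand-in for LLM residual-stream activations
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
```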
Results Overview
The paper presents compelling evidence from experiments using multiple SAE architectures trained on various LLMs and datasets. In particular, an extensive empirical study involving an SAE with 131,000 latents trained on Llama 3 8B demonstrates that only 30% of the features were consistent across different seeds. This finding holds across different layers of three LLMs, underscoring the seed-dependence of features identified by SAEs. The results also vary by activation function: SAEs with ReLU activations exhibited greater robustness across seeds, whereas those employing the state-of-the-art TopK activation were more strongly influenced by initialization.
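The ReLU and TopK variants differ in how the encoder's pre-activations are sparsified, which is what the sketch below illustrates; the value of k and the tensor shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def relu_activation(pre_acts: torch.Tensor) -> torch.Tensor:
    # ReLU SAE: every latent with a positive pre-activation fires.
    return torch.relu(pre_acts)

def topk_activation(pre_acts: torch.Tensor, k: int = 64) -> torch.Tensor:
    # TopK SAE: keep only the k largest pre-activations per input and zero
    # out the rest, so the sparsity level is enforced exactly.
    values, indices = torch.topk(pre_acts, k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    return sparse.scatter(-1, indices, torch.relu(values))

pre_acts = torch.randn(2, 131_072)        # hypothetical encoder pre-activations
z_relu = relu_activation(pre_acts)        # sparsity depends on how many values are positive
z_topk = topk_activation(pre_acts, k=64)  # at most k latents are active per input
```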
Methodological Approach
Because features learned under different random initializations have no inherent correspondence, the authors adopted the Hungarian algorithm to obtain an optimal bijective matching between features. This alignment was essential for quantifying the degree of overlap between distinct SAEs. The analysis revealed a bimodal distribution of cosine similarities for aligned features, distinguishing high-similarity "shared" features from low-similarity "orphan" features.
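A minimal sketch of this alignment step follows, assuming both SAEs have the same number of latents and that features are compared via the cosine similarity of their decoder directions; the 0.7 threshold separating "shared" from "orphan" features is a hypothetical choice, not a value taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features(dec_a: np.ndarray, dec_b: np.ndarray, threshold: float = 0.7):
    """Align two SAEs' decoder matrices (n_latents x d_model) with the
    Hungarian algorithm and split matched pairs into shared vs. orphan."""
    # Cosine similarity between every pair of unit-normalised decoder rows.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sim = a @ b.T

    # Hungarian algorithm: bijective matching that maximises total similarity
    # (linear_sum_assignment minimises cost, hence the negation).
    rows, cols = linear_sum_assignment(-sim)
    matched_sims = sim[rows, cols]

    shared = matched_sims >= threshold     # high-similarity mode of the distribution
    return matched_sims, shared.mean()     # similarities and fraction of shared features

# Hypothetical decoder weights from two independently seeded SAEs.
dec_a = np.random.randn(1024, 4096)
dec_b = np.random.randn(1024, 4096)
sims, shared_fraction = match_features(dec_a, dec_b)
```

Plotting a histogram of the matched similarities is what reveals the bimodal structure described above.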
Implications and Speculation
The implications of this research are manifold. Practically, it suggests caution in interpreting the feature set obtained from a single SAE as exhaustive or definitive, advising instead a more application-specific, pragmatic use of such decompositions. From a theoretical vantage point, these findings contest the presupposition that neural networks admit an objective decomposition into universal feature sets; instead, the features extracted may vary from one training run to the next.
This variability invites further research into the factors influencing feature emergence, including but not limited to data characteristics, model scale, and initialization. The potential convergence of SAE-discovered features and their dependence on specific neural architectures are promising avenues for future exploration. Moreover, understanding such variability could inform improvements in model interpretability and in robustness evaluations of AI systems.
Conclusion
SAEs serve as a lens for interpreting neural activations, but their dependence on random initialization challenges the assumption that the derived features are universally representative. This insight opens theoretical and practical questions about model interpretability, feature interoperability, and the underlying structure of neural activations, framing a pathway for future research to demystify these phenomena.