- The paper demonstrates that sparse autoencoders trained with different random seeds share only ~30% of their features, highlighting significant seed-dependent variability.
- It employs the Hungarian algorithm for optimal feature alignment, revealing a bimodal cosine similarity distribution that separates consistently shared features from low-similarity orphan features.
- The results challenge the assumption of a canonical feature set in neural networks, urging a more nuanced interpretation of model activations.
Sparse Autoencoders Trained on the Same Data Learn Different Features
Sparse Autoencoders (SAEs) have emerged as a pivotal tool for interpreting the activations of large language models (LLMs). This paper examines the behavior of SAEs trained on identical data with different random seeds and considers what its findings imply for our understanding of neural network features. Through this examination, the paper argues that distinct random seeds lead SAEs to uncover disparate sets of features, challenging the expectation that SAEs reveal a canonical set of features used by a model.
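To make the object of study concrete, the sketch below shows a minimal ReLU sparse autoencoder in PyTorch: activations are encoded into a much wider latent space, reconstructed from a learned dictionary, and trained with a reconstruction term plus an L1 sparsity penalty. The class name, dimensions, and loss coefficient are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode an activation vector into a wide, sparse latent
    space, then reconstruct it from a learned dictionary of feature directions."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(z)           # reconstruction of the input
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse latents.
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().sum(dim=-1).mean()

sae = SparseAutoencoder(d_model=4096, n_latents=131_072)
x = torch.randn(8, 4096)                  # stand-in for LLM residual-stream activations
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
```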
Results Overview
The paper presents compelling evidence from experiments using multiple SAE architectures trained on various LLMs and datasets. In particular, an extensive empirical study involving an SAE with 131,000 latents trained on Llama 3 8B demonstrates that only 30% of the features were consistent across different seeds. This finding holds across different layers of three LLMs, underscoring the seed-dependence of features identified by SAEs. The results also vary by activation function: SAEs with ReLU activations exhibited greater robustness across seeds, whereas those employing the state-of-the-art TopK activation were more strongly influenced by initialization.
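The ReLU and TopK variants differ in how the encoder's pre-activations are sparsified, which is what the sketch below illustrates; the value of k and the tensor shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def relu_activation(pre_acts: torch.Tensor) -> torch.Tensor:
    # ReLU SAE: every latent with a positive pre-activation fires.
    return torch.relu(pre_acts)

def topk_activation(pre_acts: torch.Tensor, k: int = 64) -> torch.Tensor:
    # TopK SAE: keep only the k largest pre-activations per input and zero
    # out the rest, so the sparsity level is enforced exactly.
    values, indices = torch.topk(pre_acts, k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    return sparse.scatter(-1, indices, torch.relu(values))

pre_acts = torch.randn(2, 131_072)        # hypothetical encoder pre-activations
z_relu = relu_activation(pre_acts)        # sparsity depends on how many values are positive
z_topk = topk_activation(pre_acts, k=64)  # at most k latents are active per input
```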
Methodological Approach
Because features learned under different random initializations have no inherent correspondence, the authors adopted the Hungarian algorithm to obtain an optimal bijective matching between features. This alignment was essential for quantifying the degree of overlap between distinct SAEs. The analysis revealed a bimodal distribution of cosine similarities for aligned features, distinguishing high-similarity "shared" features from low-similarity "orphan" features.
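A minimal sketch of this alignment step follows, assuming both SAEs have the same number of latents and that features are compared via the cosine similarity of their decoder directions; the 0.7 threshold separating "shared" from "orphan" features is a hypothetical choice, not a value taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features(dec_a: np.ndarray, dec_b: np.ndarray, threshold: float = 0.7):
    """Align two SAEs' decoder matrices (n_latents x d_model) with the
    Hungarian algorithm and split matched pairs into shared vs. orphan."""
    # Cosine similarity between every pair of unit-normalised decoder rows.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sim = a @ b.T

    # Hungarian algorithm: bijective matching that maximises total similarity
    # (linear_sum_assignment minimises cost, hence the negation).
    rows, cols = linear_sum_assignment(-sim)
    matched_sims = sim[rows, cols]

    shared = matched_sims >= threshold     # high-similarity mode of the distribution
    return matched_sims, shared.mean()     # similarities and fraction of shared features

# Hypothetical decoder weights from two independently seeded SAEs.
dec_a = np.random.randn(1024, 4096)
dec_b = np.random.randn(1024, 4096)
sims, shared_fraction = match_features(dec_a, dec_b)
```

Plotting a histogram of the matched similarities is what reveals the bimodal structure described above.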
Implications and Speculation
The implications of this research are manifold. Practically, it suggests caution in interpreting the feature set obtained from a single SAE as exhaustive or definitive, advising instead a more application-specific, pragmatic use of such decompositions. From a theoretical vantage point, these findings contest the presupposition that neural networks admit an objective decomposition into universal feature sets; instead, the features extracted may vary from one training run to the next.
This variability invites further research into the factors influencing feature emergence, including but not limited to data characteristics, model scale, and initialization. The potential convergence of SAE-discovered features and their dependence on specific neural architectures are promising avenues for future exploration. Moreover, understanding such variability could inform improvements in model interpretability and in robustness evaluations of AI systems.
Conclusion
SAEs serve as a lens for interpreting neural activations, but their dependence on random initialization challenges the assumption that the derived features are universally representative. This insight opens theoretical and practical questions about model interpretability, feature interoperability, and the underlying structure of neural activations, framing a pathway for future research to demystify these phenomena.