Scaling and Evaluating Sparse Autoencoders
"Scaling and Evaluating Sparse Autoencoders," by Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, and Jeffrey Wu, presents an approach to training and evaluating extremely wide, sparse autoencoders on large-scale LLM activations. The goal is to extract interpretable features from high-dimensional activation spaces, which is vital for understanding LLM behavior.
Introduction
Sparse autoencoders (SAEs) have demonstrated potential for identifying interpretable features and circuits within LLMs. However, conventional training of such models is hampered by the difficulty of tuning sparsity penalties and by dead latents (latents that stop activating entirely). The authors address these challenges with k-sparse autoencoders built on the TopK activation function, which controls sparsity directly and mitigates dead latents, producing cleaner scaling laws and a better sparsity-reconstruction trade-off.
Methods
The novel methodology introduced in this paper includes key modifications:
- TopK Activation Function: Sparsity is controlled directly by the TopK activation, which retains only the k largest pre-activations for each input, eliminating the need for an L1 penalty. This simplifies tuning and avoids the activation shrinkage that L1 regularization induces.
- Auxiliary Loss (AuxK): To counteract dead latents in large autoencoders, an auxiliary loss models the reconstruction residual using the top-k_aux dead latents, giving otherwise-inactive latents a gradient signal that keeps them alive during training.
- Initialization Strategy: Initializing the encoder to the transpose of the decoder aligns the two from the start and reduces the occurrence of dead latents.
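As a rough illustration of these three ideas together (a minimal NumPy sketch, not the authors' implementation; all dimensions and variable names here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, k = 64, 512, 8

# Decoder rows are unit-norm feature directions; the encoder is
# initialized to the decoder's transpose, as the paper suggests.
W_dec = rng.standard_normal((n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
W_enc = W_dec.T.copy()
b_pre = np.zeros(d_model)

def topk(z, k):
    """Zero all but the k largest entries in each row."""
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

def forward(x):
    pre = (x - b_pre) @ W_enc      # pre-activations, shape (batch, n_latents)
    z = topk(pre, k)               # at most k nonzero latents per input
    x_hat = z @ W_dec + b_pre      # reconstruction
    return pre, z, x_hat

def aux_loss(x, x_hat, pre, dead_mask, k_aux=32):
    """AuxK-style term: model the residual error with the top-k_aux dead latents."""
    e = x - x_hat
    masked = np.where(dead_mask, pre, -np.inf)   # restrict TopK to dead latents
    z_dead = topk(masked, k_aux)
    z_dead = np.where(np.isfinite(z_dead), z_dead, 0.0)
    return ((e - z_dead @ W_dec) ** 2).mean()

x = rng.standard_normal((4, d_model))
pre, z, x_hat = forward(x)
mse = ((x - x_hat) ** 2).mean()
dead_mask = np.zeros(n_latents, dtype=bool)
dead_mask[:64] = True              # pretend the first 64 latents are dead
total = mse + (1.0 / 32) * aux_loss(x, x_hat, pre, dead_mask)
```

Because TopK fixes the number of active latents exactly, the sparsity level is a direct hyperparameter rather than an emergent property of a penalty coefficient.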
Scaling Laws
The authors systematically explore the scaling behavior of sparse autoencoders. The results demonstrate that the reconstruction mean squared error (MSE) follows clean power laws with respect to both compute budget and autoencoder size. The scaling laws presented in the paper are:
- Compute-MSE Frontier: At each compute budget there is an optimal autoencoder size, and the MSE achieved along this compute-optimal frontier follows a clean power law.
- Irreducible Loss: An irreducible loss term improves the fit quality, capturing the hypothesis that not all activation structures are recoverable, possibly due to intrinsic noise in activations.
Furthermore, experiments reveal that the number of tokens required for convergence scales sublinearly with respect to the number of latent dimensions, implying that greater efficiency can be obtained by scaling up the number of latents.
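To make the power-law-plus-irreducible-loss fit concrete, here is a small NumPy sketch using synthetic data with made-up constants (not the paper's measurements). It fits L(C) = L_irr + a·C^(−α) by grid-searching the irreducible term and regressing the remainder in log space:

```python
import numpy as np

# Hypothetical synthetic curve: MSE vs compute following
# L(C) = L_irr + a * C**(-alpha), the functional form described above.
C = np.logspace(15, 20, 12)
L_irr_true, a_true, alpha_true = 0.05, 3.0e3, 0.25
L = L_irr_true + a_true * C ** (-alpha_true)

def fit_power_law(C, L, irr_grid):
    """Grid-search the irreducible loss; fit slope/intercept in log space."""
    best = None
    logC = np.log(C)
    for L_irr in irr_grid:
        resid = L - L_irr
        if np.any(resid <= 0):
            continue                      # this L_irr exceeds observed losses
        y = np.log(resid)
        slope, intercept = np.polyfit(logC, y, 1)
        err = np.mean((y - (slope * logC + intercept)) ** 2)
        if best is None or err < best[0]:
            best = (err, L_irr, np.exp(intercept), -slope)
    _, L_irr, a, alpha = best
    return L_irr, a, alpha

L_irr, a, alpha = fit_power_law(C, L, np.linspace(0.0, 0.2, 201))
```

With the irreducible term included, the residual is exactly log-linear in compute, which is why it improves fit quality when the true loss floor is nonzero.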
Evaluation Metrics
In addition to standard MSE, several novel metrics are introduced to evaluate the feature quality:
- Downstream Loss: This measures how language modeling performance degrades when residual stream activations are replaced with the autoencoder's reconstructions, providing a direct link to practical model behavior.
- Probe Loss: By training logistic probes on the autoencoder latents for various tasks, the authors measure how well the autoencoder captures known features.
- Explainability (N2G Method): This assesses how well simple, interpretable patterns can explain latent activations, balancing both recall and precision.
- Ablation Sparsity: Measures the sparsity of downstream effects when individual latents are ablated, hypothesizing that natural features exert sparse downstream influences.
These metrics collectively show that larger and sparser autoencoders generally produce better quality features across multiple dimensions.
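As an illustration of the probe-loss idea, here is a NumPy sketch with synthetic latents and labels (not the paper's tasks): train a logistic probe on SAE latent codes and report its cross-entropy, where a lower loss indicates the latents expose the feature of interest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: latent codes z and a binary label that is a
# (noisy) linear function of them, so a good probe should do well.
n, d = 512, 64
z = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = (z @ w_true + 0.1 * rng.standard_normal(n) > 0).astype(float)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_probe(z, y, lr=0.5, steps=300):
    """Plain gradient descent on logistic-regression cross-entropy."""
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(z @ w + b)
        g = p - y                        # gradient of cross-entropy wrt logits
        w -= lr * (z.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

w, b = train_probe(z, y)
p = sigmoid(z @ w + b)
probe_loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
```

In the paper's setting the labels come from curated binary tasks and the probe is trained on the autoencoder's latents for real model activations; the sketch only shows the metric's mechanics.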
Practical and Theoretical Implications
The proposed scaling laws and evaluation methods yield significant insights into the training of large, sparse autoencoders. Practically, they can substantially improve the interpretability of LLMs, making them more transparent and trustworthy. Theoretically, the identified scaling laws and evaluation metrics lay a foundation for understanding the dynamics of sparse encoding in high-dimensional spaces.
Future Directions
The research opens several avenues for future investigation:
- Further Optimization Techniques: Investigating advanced learning rate scheduling and additional optimization strategies may provide further improvements in training efficiency.
- Exploring MoE Models: Combining mixture of experts (MoE) with autoencoders could drastically enhance the asymptotic performance, enabling even larger scales.
- Monosemanticity Improvements: Techniques that increase the monosemanticity of features are needed, especially for models as large as GPT-4.
- Broader Evaluation: Expanding the breadth and quality of probe-based tasks will lend more robustness and validity to the probe loss metric.
- Longer Context Lengths: Evaluating the autoencoders on longer contexts could uncover more complex patterns and dependencies.
Conclusion
This paper presents a robust methodology for training extremely large, sparse autoencoders, effectively leveraging the TopK activation function and auxiliary loss techniques. By thoroughly evaluating these models with innovative metrics, the authors establish significant scaling laws and demonstrate practical applications in understanding and interpreting LLMs. The implications of this research extend toward making AI systems more interpretable and efficient, potentially driving future advancements in AI alignment and transparency.