Overview of "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2"
The paper "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" by Tom Lieberum et al. introduces an extensive suite of sparse autoencoders (SAEs) called Gemma Scope, designed to enhance the interpretability and safety research capabilities in neural networks, specifically those in the Gemma 2 models.
Sparse autoencoders (SAEs) are unsupervised techniques for decomposing neural network latent representations into interpretable features, making them a valuable asset in understanding and improving model behavior. However, training these autoencoders at scale is computationally expensive, a challenge this paper addresses by providing a comprehensive dataset and suite of pretrained SAEs for community use.
Key Contributions
- Comprehensive SAE Suite:
- Gemma Scope comprises over 400 SAEs, covering every layer and sublayer of Gemma 2 2B and 9B and select layers of the Gemma 2 27B base model.
- Multiple SAEs with varying sparsity levels were trained per site, resulting in over 2,000 individual autoencoders.
- Implementation and Release:
- The authors release the pretrained weights and accompanying resources on platforms like HuggingFace and Neuronpedia, democratizing access and enabling broader research applications (a minimal loading sketch follows this list).
- Extensive evaluations provide metrics on the quality and performance of these SAEs.
- Engineering Efforts:
- The training process leveraged significant computational resources, including TPUv3 and TPUv5p clusters, and optimized data pipelines to handle large-scale activation storage and high-throughput data loading.
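The released parameters can be pulled directly from HuggingFace. The sketch below is illustrative only: the repository id, file path, and parameter names are assumptions based on the public release and should be checked against the actual model cards.

```python
# Illustrative sketch of downloading one Gemma Scope SAE from HuggingFace.
# The repo_id, filename, and parameter names are assumptions; consult the
# official Gemma Scope model cards for the exact paths.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",                   # assumed repository id
    filename="layer_20/width_16k/average_l0_71/params.npz",   # assumed file path
)
params = np.load(path)
# Typically contains arrays such as W_enc, b_enc, W_dec, b_dec, threshold.
print({name: params[name].shape for name in params.files})
```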
Methodology
Sparse Autoencoders
The core idea behind SAEs is to reconstruct input activations from sparse, non-negative latent vectors. The authors focus on JumpReLU SAEs because of their strong trade-off between sparsity and reconstruction fidelity. The JumpReLU activation zeroes any pre-activation that falls below a learnable, per-latent threshold, so the set of active latents adapts to each input.
The encoder and decoder of an SAE are defined as follows (a minimal sketch of this forward pass follows the list):
- The encoder applies a linear map followed by the JumpReLU non-linearity: f(x) = JumpReLU_θ(W_enc x + b_enc).
- The decoder reconstructs the input from the latent vector with another linear map: x̂ = W_dec f(x) + b_dec.
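To make the forward pass concrete, here is a minimal NumPy sketch of the encoder and decoder described above. All names, shapes, and values are illustrative and are not taken from the paper's code.

```python
# Minimal NumPy sketch of a JumpReLU SAE forward pass (illustrative shapes/values).
import numpy as np

def jumprelu(pre_acts, threshold):
    """Zero out every pre-activation that falls below its per-latent threshold."""
    return pre_acts * (pre_acts > threshold)

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, threshold):
    """Encode an activation vector into sparse latents, then reconstruct it."""
    latents = jumprelu(x @ W_enc + b_enc, threshold)  # f(x): sparse latent vector
    x_hat = latents @ W_dec + b_dec                   # reconstruction of x
    return latents, x_hat

# Tiny example with random weights, just to show the shapes involved.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
x = rng.normal(size=d_model)
W_enc = 0.1 * rng.normal(size=(d_model, d_sae))
W_dec = 0.1 * rng.normal(size=(d_sae, d_model))
b_enc, b_dec = np.zeros(d_sae), np.zeros(d_model)
threshold = np.full(d_sae, 0.05)

latents, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec, threshold)
print("active latents:", int(np.count_nonzero(latents)))
```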
Training Process
Training these SAEs involved:
- Utilizing large activation datasets, computed by running the Gemma 2 models on text drawn from the same distribution as the Gemma 1 pretraining data.
- Optimizing with Adam, and using a straight-through estimator (with a kernel-bandwidth hyperparameter) to pass gradients through the discontinuous JumpReLU threshold; a sketch of the training objective follows this list.
- Applying different sharding strategies to balance computational load and memory usage across accelerators.
- Meeting high data-throughput requirements with a shared server system that distributes stored activations to the SAE training jobs.
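As a rough illustration of the objective being optimized, the sketch below computes the loss values only; the straight-through machinery that makes the L0 term trainable is omitted, and the function name and coefficient are mine rather than the paper's.

```python
# Sketch of the JumpReLU SAE training objective: squared reconstruction error
# plus a sparsity penalty on the number of active latents (the L0 "norm").
# In actual training the L0 term is not differentiable, so gradients for the
# thresholds are obtained with straight-through estimators using a kernel of
# a chosen bandwidth; that machinery is omitted here.
import numpy as np

def sae_loss(x, x_hat, latents, sparsity_coeff=1e-3):
    reconstruction_error = np.sum((x - x_hat) ** 2)
    l0_penalty = np.count_nonzero(latents)
    return reconstruction_error + sparsity_coeff * l0_penalty

# Example with dummy values.
x = np.array([1.0, -0.5, 0.25])
x_hat = np.array([0.9, -0.4, 0.3])
latents = np.array([0.0, 0.7, 0.0, 1.2])
print(sae_loss(x, x_hat, latents))
```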
Evaluation
The paper extensively evaluates the trained SAEs on several aspects:
- Sparsity-Fidelity Trade-Off:
- Measures the trade-off between the sparsity of latent activations and the reconstruction fidelity.
- Fidelity is evaluated using the delta LM loss (the increase in the language model's cross-entropy loss when the SAE reconstruction is spliced into the forward pass) and the fraction of variance unexplained (FVU); a sketch of both metrics follows this list.
- Residual stream SAEs exhibited higher delta losses, reflecting the residual stream's central role in carrying information between layers, which makes reconstruction errors there more disruptive.
- Sequence Position Impact:
- Analyzes reconstruction loss and delta loss across different sequence positions in input data.
- Early tokens generally showed lower reconstruction loss, with slight differences between attention and MLP SAEs.
- SAE Width Effect:
- Investigates how varying the width of SAEs affects their performance, highlighting the 'feature-splitting' phenomenon where wider SAEs tend to decompose features into more specialized representations.
- Practical Utility:
- Examines the interpretability of latent features, assessed through human-rater evaluations and LM-generated explanations.
- Evaluates SAEs trained on the base models for their applicability to instruction-tuned (IT) models, finding that they transfer well to the finetuned setting.
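The two fidelity metrics referenced above can be stated compactly. The sketch below uses my own function names and a batch-of-vectors convention, not the paper's code.

```python
# Sketch of the fidelity metrics: fraction of variance unexplained (FVU) and
# delta LM loss. `acts` and `recon` are arrays of shape (num_tokens, d_model).
import numpy as np

def fraction_of_variance_unexplained(acts, recon):
    """Reconstruction error relative to the variance of the original activations."""
    error = np.mean(np.sum((acts - recon) ** 2, axis=-1))
    variance = np.mean(np.sum((acts - acts.mean(axis=0)) ** 2, axis=-1))
    return error / variance

def delta_lm_loss(loss_with_sae, loss_without_sae):
    """Increase in LM cross-entropy when the SAE reconstruction replaces the
    original activations in the model's forward pass."""
    return loss_with_sae - loss_without_sae
```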
Implications and Future Directions
The release of Gemma Scope significantly lowers the barrier to ambitious safety and interpretability research by providing high-quality, pretrained SAEs that can be directly utilized or further finetuned. The extensive evaluations highlight areas where SAEs perform well, such as detecting meaningful features within model activations, and point to opportunities for improving interpretability techniques.
Future research directions could include:
- Deepening the understanding of SAE latent feature structures and their relationships across layers.
- Enhancing SAE training methodologies to better capture complex model behaviors and interactions.
- Applying these SAEs to practical tasks such as model debugging, adversarial robustness, and safety assurance.
By offering a well-engineered, widely accessible tool, Gemma Scope paves the way for advancements in understanding and managing the complexities of LLMs.