Gemma Scope: Sparse Autoencoder Suite
- Gemma Scope is a suite of sparse autoencoders that enables unsupervised dictionary learning in transformer-based large language models.
- It leverages a JumpReLU architecture with a direct L0 sparsity penalty to reconstruct activations and assess feature stability across training regimes.
- The resource provides over 400 SAE checkpoints, detailed tutorials, and performance metrics for circuit discovery, interpretability, and safety diagnostics.
Gemma Scope refers to a publicly released, comprehensive suite of sparse autoencoders (SAEs) specifically trained on all layers and sub-layers of the Gemma 2 family of LLMs—including the 2B, 9B, and selected 27B parameter variants. The project enables large-scale, unsupervised dictionary learning for mechanistic interpretability and safety research in transformer-based LLMs, providing open access to >400 SAE checkpoints, detailed tutorials, and evaluation metrics. Gemma Scope is centered on a JumpReLU-based SAE architecture and is positioned as an enabling resource for quantitative feature analysis, circuit discovery, and practical safety diagnostics in current-generation open models (Lieberum et al., 2024).
1. JumpReLU Sparse Autoencoder Architecture
Gemma Scope uses JumpReLU SAEs, which learn a sparse, non-negative, high-dimensional code $f(x) \in \mathbb{R}^{M}$ for an activation vector $x \in \mathbb{R}^{d}$ (with $M \gg d$), and reconstruct $x$ from this sparse code via a linear decoder:

$$f(x) = \mathrm{JumpReLU}_{\theta}\!\left(W_{\mathrm{enc}} x + b_{\mathrm{enc}}\right), \qquad \hat{x} = W_{\mathrm{dec}} f(x) + b_{\mathrm{dec}},$$

where $W_{\mathrm{enc}} \in \mathbb{R}^{M \times d}$, $W_{\mathrm{dec}} \in \mathbb{R}^{d \times M}$, and $\mathrm{JumpReLU}_{\theta}$ is the JumpReLU nonlinearity

$$\mathrm{JumpReLU}_{\theta}(z) = z \odot H(z - \theta),$$

with $H$ the Heaviside step function and $\theta > 0$ a vector of learnable per-latent thresholds. Below-threshold values are zeroed, enforcing strict sparsity and non-negativity.
Training minimizes

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_0,$$

where $\lVert f(x) \rVert_0$ counts the nonzero latents and $\lambda$ serves as a sparsity controller. The piecewise-constant effects of JumpReLU and the $L_0$ penalty are handled with a straight-through gradient estimator and careful kernel-density smoothing (bandwidth $\varepsilon$).
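A minimal PyTorch sketch of this architecture and objective is shown below. It is illustrative rather than the released training code: the dimensions, initialization, and sparsity coefficient `lam` are placeholder choices, and, as written, the Heaviside gate and $L_0$ term carry no gradient, which is precisely the problem the straight-through / kernel-density estimators address.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Minimal JumpReLU SAE sketch (illustrative, not the released implementation)."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        # Weights stored for row-vector batches: x @ W_enc, f @ W_dec
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Learnable per-latent thresholds, parameterized in log space to stay positive
        self.log_threshold = nn.Parameter(torch.full((d_sae,), -2.0))

    def encode(self, x):
        pre = x @ self.W_enc + self.b_enc
        theta = self.log_threshold.exp()
        # JumpReLU: keep the pre-activation only where it exceeds the threshold
        return pre * (pre > theta).float()

    def forward(self, x):
        f = self.encode(x)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

def jumprelu_loss(x, x_hat, f, lam=1e-3):
    # Reconstruction error plus an L0 penalty. The L0 term and the threshold
    # receive no gradient as written; training uses straight-through /
    # kernel-density estimators for these piecewise-constant pieces.
    recon = (x - x_hat).pow(2).sum(-1).mean()
    l0 = (f > 0).float().sum(-1).mean()
    return recon + lam * l0
```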
2. Training Methodology and Coverage
Training data comprises model activations sampled from Gemma 2 (pre-trained base and instruction-tuned variants), with activation vectors shuffled and individually normalized. Three distinct "sites" are targeted within each transformer block:
- Attention output (concatenated heads, pre-output and pre-RMSNorm)
- MLP output (post-MLP, post-RMSNorm)
- Residual stream (post-MLP, pre-next block)
SAE widths range from $2^{14}$ (16K) up to $2^{20}$ (1M) latents, and all layers of the 2B and 9B models are covered, with select depths in 27B. Key global optimizations include consistent learning-rate schedules, batch size 4096, normalized decoders, threshold warmup, and large-scale hardware support (TPUv3/v5p with data-parallel sharding and high-throughput activation streaming).
Training targets either raw pre-trained (PT) activations or instruction-tuned (IT) rollouts; the latter verify the transferability of learned features across finetuning regimes. All weights are distributed under CC-BY-4.0 with ready-to-use pipelines and interactive demos (Lieberum et al., 2024).
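To make the site definitions concrete, the sketch below collects activations at the three sites for a single layer using standard TransformerLens hook names; the layer index, prompt, and per-vector normalization are illustrative simplifications of the actual large-scale pipeline.

```python
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("google/gemma-2-2b")
layer = 12
sites = {
    "attn": f"blocks.{layer}.attn.hook_z",       # per-head attention outputs (pre-W_O)
    "mlp": f"blocks.{layer}.hook_mlp_out",       # MLP sublayer output
    "resid": f"blocks.{layer}.hook_resid_post",  # residual stream at the end of the block
}

tokens = model.to_tokens("Sparse autoencoders decompose activations.")
_, cache = model.run_with_cache(tokens, names_filter=list(sites.values()))

acts = {}
for name, hook in sites.items():
    a = cache[hook]
    if name == "attn":
        a = a.flatten(-2)  # concatenate heads into one vector per token
    a = a.reshape(-1, a.shape[-1]).float()
    acts[name] = a / a.norm(dim=-1, keepdim=True)  # per-vector normalization (illustrative)
```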
3. Evaluation: Sparsity–Fidelity Tradeoffs and Performance Metrics
Key performance dimensions include:
- Sparsity: $L_0$ norm, the average number of active latents per input
- Reconstruction fidelity (see the sketch following this list):
  - Delta LM loss: change in next-token cross-entropy when the reconstruction $\hat{x}$ replaces $x$ in the forward pass
  - Fraction of variance unexplained (FVU): $\mathbb{E}\,\lVert x - \hat{x}\rVert_2^2 \,/\, \mathbb{E}\,\lVert x - \mathbb{E}[x]\rVert_2^2$
- Interpretability:
  - Human rating (scale 1–5) of semantic coherence of top-firing features
  - LLM-simulated activation correlation (Pearson $r$)
  - Importance uniformity: effective number of features (Olah et al.)
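Both fidelity metrics can be estimated with a short splicing experiment, as in the sketch below; it assumes a TransformerLens `HookedTransformer` and an `sae` callable that returns reconstructions, with an illustrative hook name and batch.

```python
def delta_lm_loss_and_fvu(model, sae, tokens, hook_name="blocks.20.hook_resid_post"):
    """Hypothetical helper: compute delta LM loss and FVU for one SAE site."""
    # Clean forward pass, caching the target activations
    clean_loss, cache = model.run_with_cache(
        tokens, return_type="loss", names_filter=hook_name
    )
    x = cache[hook_name]

    def splice(acts, hook):
        return sae(acts)  # replace activations with their SAE reconstruction

    # Forward pass with the reconstruction spliced into the residual stream
    spliced_loss = model.run_with_hooks(
        tokens, return_type="loss", fwd_hooks=[(hook_name, splice)]
    )

    x_hat = sae(x)
    # FVU: reconstruction error normalized by variance around the mean activation
    fvu = (x - x_hat).pow(2).sum() / (x - x.mean(dim=(0, 1))).pow(2).sum()
    return (spliced_loss - clean_loss).item(), fvu.item()
```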
On mid-network sites (2B at layer 12, 9B at layer 20) at width 131K and comparable $L_0$:
- Delta LM loss for attention SAEs ≈ $0.015$; MLP ≈ $0.03$; residuals ≈ $0.08$
- FVU ≈ $0.10$–$0.15$
- Interpretability ratings ≈ $3.8$/5 for JumpReLU, matching or exceeding TopK and Gated SAEs
- Simulated-activation Pearson ≈ $0.5$
- Performance is broadly uniform across layers; increasing width strengthens low-frequency feature discovery
Instruction-tuned SAEs reconstruct IT activations as effectively as PT-trained ones, confirming feature-set stability across supervised finetuning. Loss and FVU curves reveal position-sensitivity: first ten tokens are reconstructed least faithfully; subsequent tokens plateau.
4. Usage, Access, and API Integration
Weights, evaluation metrics, and tutorials are released on Hugging Face (gemma-scope), and interactive visualization is available via Neuronpedia.
Example usage for TransformerLens:
```python
import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("google/gemma-2-9b")
# Load a residual-stream SAE checkpoint (path is illustrative)
sae = torch.load("path/to/gemma-scope-9b-pt-res/layer_20/width_131k/sae.pt")
# Splice the SAE reconstruction into the layer-20 residual stream
model.add_hook("blocks.20.hook_resid_post", lambda acts, hook: sae(acts))
```
An analogous pattern for the Hugging Face transformers API (note that `sae_overrides` is an illustrative integration hook, not a standard `from_pretrained` argument; in practice the SAE is applied via forward hooks or external tooling):

```python
from transformers import AutoModelForCausalLM

# Illustrative only: `sae_overrides` stands in for attaching a Gemma Scope SAE
# to the layer-20 residual stream; transformers has no such built-in argument.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    sae_overrides={"residual_20": "google/gemma-scope-9b-pt-res-layer20-131k"},
)
```
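For the released checkpoints themselves, a practical pattern is to download the parameter arrays from the Hugging Face Hub and apply them directly. The sketch below assumes the checkpoint exposes `W_enc`, `b_enc`, `W_dec`, `b_dec`, and `threshold` arrays; the exact subdirectory (including the `average_l0_*` folder) is illustrative and should be read off the repository listing.

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download

# Path within the repo is illustrative; check the repository for the exact layout.
path = hf_hub_download(
    repo_id="google/gemma-scope-9b-pt-res",
    filename="layer_20/width_131k/average_l0_71/params.npz",
)
params = np.load(path)
W_enc = torch.tensor(params["W_enc"])          # [d_model, d_sae]
b_enc = torch.tensor(params["b_enc"])          # [d_sae]
W_dec = torch.tensor(params["W_dec"])          # [d_sae, d_model]
b_dec = torch.tensor(params["b_dec"])          # [d_model]
threshold = torch.tensor(params["threshold"])  # per-latent JumpReLU thresholds

def sae_reconstruct(x):
    pre = x @ W_enc + b_enc
    f = pre * (pre > threshold)   # JumpReLU encoding
    return f @ W_dec + b_dec      # linear decode
```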
5. Implications for Interpretability and Safety Research
Gemma Scope enables:
- Non-interfering circuit analysis: Isolating sparse feature sets linked to behaviors, e.g., hallucinations, jailbreak triggers, or harmful pattern completions (a minimal feature-ablation sketch follows this list)
- Standardized benchmarking of SAE methods: Side-by-side comparisons of JumpReLU, TopK, Gated, or future innovations across Δ loss, FVU, feature frequency, and human ratings
- Quantitative study of width vs. sparsity and activation-frequency distributions, directly informing the superposition hypothesis and cross-layer feature stability
- Robustness diagnostics: Analysis of invariance and transferability between pre-trained and instruction-tuned feature sets, facilitating robust safe prompt engineering and red-teaming detection
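As a minimal illustration of the first point, one SAE latent can be zero-ablated and the edited reconstruction spliced back into the forward pass to probe that feature's causal effect. The sketch below reuses the parameters loaded in the earlier Hub example; the feature index, prompt, and hook name are illustrative.

```python
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("google/gemma-2-9b")
feature_idx = 1234                       # illustrative latent index
hook_name = "blocks.20.hook_resid_post"

def ablate_feature(acts, hook):
    # W_enc, b_enc, W_dec, b_dec, threshold: as loaded in the Hub example above
    pre = acts @ W_enc + b_enc
    f = pre * (pre > threshold)          # JumpReLU encoding
    f[..., feature_idx] = 0.0            # zero-ablate one latent feature
    return f @ W_dec + b_dec             # decode and splice back into the residual stream

tokens = model.to_tokens("Example prompt to probe a behavior.")
baseline = model(tokens, return_type="loss")
ablated = model.run_with_hooks(
    tokens, return_type="loss", fwd_hooks=[(hook_name, ablate_feature)]
)
print(f"Delta loss from ablating feature {feature_idx}: {(ablated - baseline).item():.4f}")
```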
6. Limitations and Future Research Directions
Gemma Scope is bounded primarily by its compute and data pipelines: the activation data per model and layer reaches 100 TiB at 32-bit precision, and storage and streaming bottlenecks are nontrivial. While core metrics (Δ loss, FVU, human/LM interpretability) are widely used, they do not directly capture higher-level semantic or causal abstraction. A plausible implication is that as model scale and dataset diversity increase, further SAE variants or hybrid approaches (e.g., cross-modal or cross-family transfer) will be needed to fully exploit the interpretability landscape. Studies of how width scaling, layer depth, and SAE site selection interact with linguistic or behavioral specificity also remain open.
Gemma Scope establishes a scalable, open, and reproducible framework for large-scale mechanistic interpretability research, providing a platform for both theoretical exploration of dictionary learning and practical tools for LLM alignment and safety (Lieberum et al., 2024).