
Gemma Scope: Sparse Autoencoder Suite

Updated 17 March 2026
  • Gemma Scope is a suite of sparse autoencoders that enables unsupervised dictionary learning in transformer-based large language models.
  • It leverages a JumpReLU architecture with rigorous L0 sparsity constraints to reconstruct activations and assess feature stability across training regimes.
  • The resource provides over 400 SAE checkpoints, detailed tutorials, and performance metrics for circuit discovery, interpretability, and safety diagnostics.

Gemma Scope refers to a publicly released, comprehensive suite of sparse autoencoders (SAEs) specifically trained on all layers and sub-layers of the Gemma 2 family of LLMs—including the 2B, 9B, and selected 27B parameter variants. The project enables large-scale, unsupervised dictionary learning for mechanistic interpretability and safety research in transformer-based LLMs, providing open access to >400 SAE checkpoints, detailed tutorials, and evaluation metrics. Gemma Scope is centered on a JumpReLU-based SAE architecture and is positioned as an enabling resource for quantitative feature analysis, circuit discovery, and practical safety diagnostics in current-generation open models (Lieberum et al., 2024).

1. JumpReLU Sparse Autoencoder Architecture

Gemma Scope uses JumpReLU SAEs, which learn a sparse, non-negative, high-dimensional code $f(x) \in \mathbb{R}^M$ for an activation vector $x \in \mathbb{R}^n$ (with $M \gg n$) and reconstruct $x$ from this code via a linear decoder:

$$f(x) = \sigma(W_{\mathrm{enc}} x + b_{\mathrm{enc}}), \qquad \hat{x} = W_{\mathrm{dec}} f(x) + b_{\mathrm{dec}}$$

where $W_{\mathrm{enc}} \in \mathbb{R}^{M \times n}$, $W_{\mathrm{dec}} \in \mathbb{R}^{n \times M}$, and $\sigma$ is the JumpReLU nonlinearity

$$\sigma(z) = z \odot H(z - \boldsymbol{\theta}), \qquad \boldsymbol{\theta} > 0,$$

with $H$ the Heaviside step function and $\boldsymbol{\theta} \in \mathbb{R}^M$ a vector of learnable per-latent thresholds. Pre-activations below their threshold are zeroed, enforcing strict sparsity and non-negativity.
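As a concrete illustration, here is a minimal NumPy sketch of the JumpReLU encoder/decoder. The dimensions, random initialization, and fixed thresholds are arbitrary toy values for demonstration, not the released checkpoints:

```python
import numpy as np

def jumprelu(z, theta):
    # z * H(z - theta): keep pre-activations above their threshold, zero the rest
    return z * (z > theta)

rng = np.random.default_rng(0)
n, M = 8, 32                                 # activation dim n, dictionary size M (M >> n in practice)
W_enc = rng.normal(size=(M, n))
W_dec = rng.normal(size=(n, M)) / np.sqrt(M)
b_enc, b_dec = np.zeros(M), np.zeros(n)
theta = np.full(M, 0.5)                      # per-latent thresholds (learnable in training, fixed here)

x = rng.normal(size=n)                       # a single activation vector
f = jumprelu(W_enc @ x + b_enc, theta)       # sparse, non-negative code f(x)
x_hat = W_dec @ f + b_dec                    # linear reconstruction
l0 = np.count_nonzero(f)                     # the quantity penalized during training
```

Because only pre-activations strictly above a positive threshold survive, the code is non-negative by construction, and `l0` is typically far below `M`.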

Training minimizes

$$\mathcal{L}(x) = \| x - \hat{x}(f(x)) \|_2^2 + \lambda \, \| f(x) \|_0$$

where $\|f(x)\|_0$ counts the nonzero latents and $\lambda > 0$ controls the sparsity penalty. Both the JumpReLU nonlinearity and the $L_0$ penalty are piecewise constant in the thresholds, so their exact gradients are zero almost everywhere; training instead uses straight-through gradient estimators with kernel-density smoothing of bandwidth $\epsilon = 10^{-3}$.
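A hedged sketch of the straight-through pseudo-gradient for the thresholds, assuming a rectangle kernel (the kernel shape and the toy values below are illustrative; this is not the released training code):

```python
import numpy as np

def rect_kernel(u):
    # Rectangle kernel: 1 on (-1/2, 1/2), 0 elsewhere; integrates to 1
    return ((u > -0.5) & (u < 0.5)).astype(float)

def l0_pseudograd_theta(z_pre, theta, eps=1e-3):
    """Straight-through pseudo-gradient of ||f||_0 w.r.t. theta.

    The exact derivative of H(z - theta) in theta is zero almost everywhere;
    the STE replaces it with a kernel-density estimate of bandwidth eps, so
    only pre-activations within eps/2 of their threshold receive gradient."""
    return -(1.0 / eps) * rect_kernel((z_pre - theta) / eps)

z_pre = np.array([0.4990, 0.4997, 0.5003, 0.6])  # toy pre-activations
theta = np.full(4, 0.5)
g = l0_pseudograd_theta(z_pre, theta)            # nonzero only near the threshold
```

Only the two middle entries, which sit within $\epsilon/2$ of their threshold, receive gradient; the others are unaffected, which is what lets thresholds adapt despite the discontinuous activation.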

2. Training Methodology and Coverage

Training data comprises model activations sampled from Gemma 2 models (both base pre-trained and instruction-tuned variants), with activation vectors shuffled and individually normalized. Three distinct "sites" are targeted within each transformer block:

  • Attention output (concatenated heads, pre-output and pre-RMSNorm)
  • MLP output (post-MLP, post-RMSNorm)
  • Residual stream (post-MLP, pre-next block)

Model widths range from $2^{14}$ (16K) up to $2^{20}$ (1M) latents; all layers of the 2B and 9B models are covered, with selected depths in the 27B model. Key global choices include consistent learning-rate schedules, batch size 4096, normalized decoders, threshold warmup, and large-scale hardware support (TPU v3/v5p with data-parallel sharding and high-throughput activation streaming).
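The per-vector normalization mentioned above can be sketched as follows. One common scheme, assumed here for illustration (the released pipeline may use a different constant), rescales each activation vector to norm $\sqrt{n}$ so that per-coordinate magnitudes are comparable across layers and models:

```python
import numpy as np

def normalize_activations(X):
    """Rescale each activation vector to L2 norm sqrt(n).

    This fixed-norm scheme is one plausible reading of 'individually
    normalized'; it keeps reconstruction losses comparable across sites."""
    n = X.shape[-1]
    norms = np.linalg.norm(X, axis=-1, keepdims=True)
    return X * (np.sqrt(n) / norms)

X = np.random.default_rng(1).normal(size=(4, 16))  # toy batch of activations
Xn = normalize_activations(X)                      # every row now has norm 4
```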

Training targets either raw pre-trained (PT) activations or instruction-tuned (IT) rollouts; the latter verify the transferability of learned features across finetuning regimes. All weights are distributed under CC-BY-4.0 with ready-to-use pipelines and interactive demos (Lieberum et al., 2024).

3. Evaluation: Sparsity–Fidelity Tradeoffs and Performance Metrics

Key performance dimensions include:

  • Sparsity: $L_0$ norm, the average number of active latents per input
  • Reconstruction fidelity:
    • Delta LM loss: change in next-token cross-entropy when the reconstruction $\hat{x}$ replaces $x$
    • Fraction of variance unexplained (FVU): $\mathrm{MSE}/\mathrm{Var}(x)$
  • Interpretability:
    • Human rating (scale 1–5) of the semantic coherence of top-firing features
    • LLM-simulated activation correlation (Pearson $r$)
  • Importance uniformity: effective number of features (Olah et al.)
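Of these metrics, FVU can be computed directly from paired activations and reconstructions (delta LM loss additionally requires a forward pass through the model). A minimal sketch on synthetic data, not Gemma activations:

```python
import numpy as np

def fvu(x, x_hat):
    # Fraction of variance unexplained: reconstruction MSE divided by the
    # variance of the original activations (0 = perfect reconstruction)
    mse = np.mean((x - x_hat) ** 2)
    var = np.mean((x - x.mean(axis=0)) ** 2)
    return mse / var

rng = np.random.default_rng(2)
x = rng.normal(size=(1000, 16))             # stand-in activations
x_hat = x + 0.3 * rng.normal(size=x.shape)  # stand-in SAE reconstruction
```

With noise of standard deviation 0.3 added to unit-variance activations, the FVU comes out near 0.09, in the same regime as the reported 0.10–0.15 range.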

On mid-network sites (2B layer 12, 9B layer 20) at width 131K and $L_0 \approx 50$:

  • Delta LM loss ≈ 0.015 for attention SAEs, ≈ 0.03 for MLP, ≈ 0.08 for residual stream
  • FVU ≈ 0.10–0.15
  • Interpretability ratings ≈ 3.8/5 for JumpReLU, matching or exceeding TopK and Gated SAEs
  • Simulated-activation Pearson $r$ ≈ 0.5
  • Layerwise performance is uniform; width increases strengthen low-frequency feature discovery

Instruction-tuned SAEs reconstruct IT activations as effectively as PT-trained ones, confirming feature-set stability across supervised finetuning. Loss and FVU curves reveal position-sensitivity: first ten tokens are reconstructed least faithfully; subsequent tokens plateau.

4. Usage, Access, and API Integration

Weights, evaluation metrics, and tutorials are released on Hugging Face (gemma-scope), and interactive visualization is available via Neuronpedia.

Example usage with TransformerLens (illustrative: the checkpoint path, loading format, and hook point depend on the actual release):

```python
import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("google/gemma-2-9b")

# Load a residual-stream SAE checkpoint (path is illustrative)
sae = torch.load("path/to/gemma-scope-9b-pt-res/layer_20/width_131k/sae.pt")

# Splice the SAE reconstruction into the residual stream after block 20,
# matching the "res" (residual-stream) site this checkpoint was trained on
model.add_hook("blocks.20.hook_resid_post", lambda acts, hook: sae(acts))
```
Or with Hugging Face Transformers (pseudo-API: `sae_overrides` is not a real `transformers` argument; the snippet only sketches the intended integration):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    # Hypothetical argument mapping a site to an SAE checkpoint
    sae_overrides={"residual_20": "google/gemma-scope-9b-pt-res-layer20-131k"},
)
```
All released SAEs are directly pluggable as post-processing modules or as hooks during forward passes in the supported model architectures.

5. Implications for Interpretability and Safety Research

Gemma Scope enables:

  • Non-interfering circuit analysis: Isolating sparse feature sets linked to behaviors, e.g., hallucinations, jailbreak triggers, or harmful pattern completions
  • Standardized benchmarking of SAE methods: Side-by-side comparisons of JumpReLU, TopK, Gated, or future innovations across Δ loss, FVU, feature frequency, and human ratings
  • Quantitative study of width vs. sparsity and activation-frequency distributions, directly informing the superposition hypothesis and cross-layer feature stability
  • Robustness diagnostics: Analysis of invariance and transferability between pre-trained and instruction-tuned feature sets, facilitating robust safe prompt engineering and red-teaming detection

6. Limitations and Future Research Directions

Gemma Scope is primarily bounded by hardware pipelines: raw activations per model/layer reach ~100 TiB at 32-bit precision, and storage/streaming bottlenecks are nontrivial. While core metrics (Δ loss, FVU, human/LM interpretability) are widely used, they do not directly capture higher-level semantic or causal abstraction. A plausible implication is that as model scale and dataset diversity increase, further SAE variants or hybrid approaches (e.g., cross-modal or cross-family transfer) will be needed to fully exploit the interpretability landscape. Studies of how width scaling, layer depth, and SAE site selection interact with linguistic or behavioral specificity also remain open.
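The ~100 TiB figure is consistent with a simple back-of-envelope count. The token count below is an illustrative assumption, not the exact training configuration; the width is Gemma 2 9B's residual-stream dimension:

```python
# Back-of-envelope: raw float32 activations for one site of one layer.
tokens = 8e9          # assumed number of training tokens (illustrative)
d_model = 3584        # Gemma 2 9B residual-stream width
bytes_per_value = 4   # float32
total_tib = tokens * d_model * bytes_per_value / 2**40  # ~104 TiB
```

At these values the raw dump lands right around 100 TiB per site per layer, which is why streaming activations rather than materializing them is essential at this scale.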

Gemma Scope establishes a scalable, open, and reproducible framework for large-scale mechanistic interpretability research, providing a platform for both theoretical exploration of dictionary learning and practical tools for LLM alignment and safety (Lieberum et al., 2024).

References

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Shah, R., & Nanda, N. (2024). Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv:2408.05147.
