Distilled Matryoshka Sparse Autoencoders
- The paper introduces DMSAEs that iteratively distill and freeze high-value encoder features to form a stable, reusable core for sparse autoencoders.
- It employs an attribution-guided selection method, using a gradient × activation metric to retain the minimal set of latents covering 90% of the cumulative attribution.
- Empirical evaluation on Gemma-2-2B shows that the distilled 197-feature core enhances reconstruction, feature transferability, and downstream metric consistency.
Distilled Matryoshka Sparse Autoencoders (DMSAEs) constitute a training pipeline for sparse autoencoders that extracts a compact, transferable core of robust, human-interpretable features. By iteratively distilling and freezing the directions most consistently useful for a base model’s next-token loss, DMSAEs address the instability and redundancy of standard sparse feature learning. The methodology revolves around an attribution-guided selection process, transferring only the distilled core encoder weight vectors across cycles, while reinitializing the decoder and non-core latents. Empirical evaluation on the Gemma-2-2B model demonstrates that DMSAEs yield a strongly reusable core—comprising 197 stabilized features after seven cycles—which improves consistency, interpretability, and downstream SAEBench metrics relative to conventional Matryoshka Sparse Autoencoders (Martin-Linares et al., 31 Dec 2025).
1. Matryoshka Sparse Autoencoders: Hierarchical Feature Learning
A typical sparse autoencoder represents an overcomplete dictionary through an encoder–decoder architecture, subject to a sparsity constraint:

$$z = \sigma\big(W_{\mathrm{enc}} x + b_{\mathrm{enc}}\big), \qquad \hat{x} = W_{\mathrm{dec}} z + b_{\mathrm{dec}},$$

where $\sigma$ enforces sparsity (commonly through TopK thresholding) and training minimizes the reconstruction error $\lVert x - \hat{x} \rVert_2^2$.
Matryoshka Sparse Autoencoders (MSAEs) extend this framework by introducing a hierarchy of prefix sizes $m_1 < m_2 < \dots < m_G$, with reconstruction objectives over all prefixes:

$$\mathcal{L}_{\mathrm{MSAE}} = \sum_{g=1}^{G} \big\lVert x - \hat{x}^{(m_g)} \big\rVert_2^2,$$

where $\hat{x}^{(m)}$ reconstructs $x$ using only the first $m$ latents. Early latents must encode high-frequency, generalizable content, while later ones specialize. In vanilla MSAEs, no explicit “core” features are preserved across runs, resulting in significant variability and challenging feature reuse (Martin-Linares et al., 31 Dec 2025).
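The nested-prefix objective can be made concrete with a short sketch. The following minimal PyTorch implementation (class and parameter names such as `MatryoshkaSAE`, `prefix_sizes`, and `k` are illustrative, not the authors' released code) shows a TopK encoder and a reconstruction loss summed over Matryoshka prefixes:

```python
# Minimal sketch of an MSAE forward pass with prefix reconstruction losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, prefix_sizes: list, k: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.W_dec = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.prefix_sizes = prefix_sizes  # m_1 < m_2 < ... < m_G (last entry = n_latents)
        self.k = k                        # TopK sparsity budget

    def encode(self, x):
        pre = x @ self.W_enc.T + self.b_enc
        # TopK thresholding: keep the k largest pre-activations, zero the rest.
        vals, idx = pre.topk(self.k, dim=-1)
        return torch.zeros_like(pre).scatter(-1, idx, F.relu(vals))

    def forward(self, x):
        z = self.encode(x)
        loss = 0.0
        for m in self.prefix_sizes:
            # Reconstruct x using only the first m latents (nested prefixes).
            x_hat = z[..., :m] @ self.W_dec[:m] + self.b_dec
            loss = loss + F.mse_loss(x_hat, x)
        return loss
```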
2. Attribution Metric: Gradient × Activation
DMSAEs introduce an attribution-driven basis selection methodology for identifying high-value features. For a given token position $t$, let $x_t$ denote the residual stream activation and $g_t$ the gradient of the next-token loss with respect to it. Encoding and masking to the smallest prefix, for each latent $i$:
- Compute the activation $a_{i,t} = z_{i,t}$
- Normalize the decoder vector: $\hat{d}_i = d_i / \lVert d_i \rVert_2$
- Compute the gradient projection $g_t \cdot \hat{d}_i$
- Attribution score: $A_{i,t} = a_{i,t}\,\big(g_t \cdot \hat{d}_i\big)$
Because the per-position attribution distribution is heavy-tailed, DMSAEs aggregate over positions via a high quantile, yielding a per-latent score $s_i = \operatorname{quantile}_q\big(\{A_{i,t}\}_t\big)$. Latents are sorted by $s_i$; the smallest subset whose cumulative attribution exceeds a coverage threshold $\tau$ (90% in the reported experiments) is retained as the distilled core:

$$\mathcal{C} = \operatorname*{arg\,min}_{S}\; |S| \quad \text{s.t.} \quad \sum_{i \in S} s_i \;\geq\; \tau \sum_{j} s_j.$$
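For concreteness, the scoring and selection steps can be sketched in a few lines of PyTorch. The sketch below assumes per-position latent activations `z` (positions × latents), residual-stream gradients `g` (positions × model dimension), and decoder rows `W_dec` (latents × model dimension); the function names and the quantile level `q = 0.99` are illustrative placeholders, not values from the paper:

```python
import torch

def gxa_scores(z, g, W_dec, q: float = 0.99):
    """Per-latent gradient x activation scores, aggregated by a high quantile."""
    d_hat = W_dec / W_dec.norm(dim=-1, keepdim=True)  # unit-norm decoder directions d_hat_i
    proj = g @ d_hat.T                                # g_t . d_hat_i for every (t, i)
    attr = z * proj                                   # A[t, i] = a_{i,t} * (g_t . d_hat_i)
    # Heavy-tailed per-position attributions -> summarize each latent by a high quantile.
    return torch.quantile(attr, q, dim=0)

def select_core(scores, coverage: float = 0.90):
    """Smallest set of latents whose cumulative score reaches `coverage` of the total."""
    order = scores.argsort(descending=True)
    cum = scores[order].cumsum(0)
    # Count sorted latents needed before the cumulative score crosses tau * total.
    n_core = int((cum < coverage * scores.sum()).sum().item()) + 1
    return order[:n_core]
```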
3. Iterative Distillation and Transfer Protocol
The DMSAE pipeline operates as a multi-cycle, iterative distillation process. The key steps are:
- Initialization: Score the released SAEBench model to select the initial core $\mathcal{C}_0$.
- Per-Cycle Training (cycle $c = 1, \dots, N$):
- Freeze the encoder rows corresponding to the current core $\mathcal{C}_{c-1}$; reinitialize all other parameters.
- Train a two-group MSAE, letting the core latents remain dense while constraining the non-core latents to the TopK sparsity budget (see the sketch below).
- Upon convergence, compute quantile-based GxA attributions for the core and prefix-0 latents.
- Select the smallest core $\mathcal{C}_c$ achieving $\tau$-coverage as above.
- Core Stabilization: After $N$ cycles, the final distilled core is $\mathcal{C}^\star = \mathcal{C}_{N-1} \cap \mathcal{C}_{N}$, containing only the latents that persist through the last two cycles.
This procedure is formalized in the DMSAE high-level pseudocode provided in (Martin-Linares et al., 31 Dec 2025).
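As a complement to that pseudocode, the two mechanical ingredients of a cycle — freezing only the core encoder rows and exempting the core group from the TopK constraint — can be sketched as follows, building on the `MatryoshkaSAE` sketch above. The helper names and the gradient-hook approach are illustrative choices, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def plant_frozen_core(sae, core_rows):
    """Copy distilled core encoder rows into the first positions of a freshly
    initialized SAE and block their gradient updates."""
    n_core = core_rows.shape[0]
    with torch.no_grad():
        sae.W_enc[:n_core] = core_rows       # prefix-0 group holds the transferred core
    def zero_core_grad(grad):
        grad = grad.clone()
        grad[:n_core] = 0.0                  # freeze: no updates to the core rows
        return grad
    # Note: optimizers with weight decay can still move "frozen" rows; a sketch-level caveat.
    sae.W_enc.register_hook(zero_core_grad)
    return n_core

def two_group_encode(sae, x, n_core: int, k_noncore: int):
    """Core latents stay dense; TopK sparsity is applied only to the non-core group."""
    pre = x @ sae.W_enc.T + sae.b_enc
    z = torch.zeros_like(pre)
    z[..., :n_core] = F.relu(pre[..., :n_core])                      # dense core group
    vals, idx = pre[..., n_core:].topk(k_noncore, dim=-1)            # TopK over non-core latents only
    z[..., n_core:] = torch.zeros_like(pre[..., n_core:]).scatter(-1, idx, F.relu(vals))
    return z
```

In this sketch, only `sae.W_enc[:n_core]` would be carried into the next cycle; the decoder and the remaining latents are reinitialized, matching the transfer protocol described above.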
4. Distilled Core Convergence and Empirical Evaluation
Applied to Gemma-2-2B, DMSAEs were trained on layer-12 residual streams over 500M tokens, using an overcomplete dictionary, seven distillation cycles, and a fixed non-core TopK sparsity budget. The same prefix sizes and the 90% cumulative-attribution coverage threshold were used throughout (Martin-Linares et al., 31 Dec 2025).
Across cycles, the selected core stabilized between roughly 200 and 400 latents. The intersection of the cores from the last two cycles, $\mathcal{C}^\star$, yielded a distilled core of 197 features persisting across restarts. Empirical evidence indicates that a randomly chosen core of equal size becomes largely inactive over training, whereas the distilled core remains active and contributes significantly to loss reduction throughout training. This demonstrates that the distilled directions are systematically high-value.
5. Performance on SAEBench and Downstream Metrics
DMSAEs were benchmarked by transferring the distilled core to new SAEs trained at multiple sparsity regimes, freezing only the core encoder rows and reinitializing everything else (see the sketch at the end of this section). SAEBench evaluations included:
- Reconstruction loss
- Fraction of variance explained
- Feature absorption
- RAVEL
- Targeted concept removal
- Spurious correlation removal
- AutoInterp
Results show that across sparsity levels, DMSAEs match or exceed the vanilla MSAE baseline on reconstruction, absorption, and RAVEL, and maintain stable downstream metrics except for a decrease in AutoInterp at the lowest sparsity setting. A sparse-core ablation (a global TopK mask that optionally includes the core) reveals qualitatively similar trends (Martin-Linares et al., 31 Dec 2025).
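In code, this transfer setup amounts to re-applying the freezing sketch from Section 3 at each sparsity budget. The snippet below is purely illustrative: the sparsity budgets, dictionary size, and prefix schedule are placeholders, and `trained_sae` / `core_indices` stand in for the outputs of the distillation cycles.

```python
# Hypothetical transfer of a distilled core to fresh SAEs at several sparsity budgets.
core_rows = trained_sae.W_enc[core_indices].detach().clone()    # only encoder rows are carried over
for k_noncore in (16, 32, 64):                                  # placeholder sparsity regimes
    new_sae = MatryoshkaSAE(d_model=2304, n_latents=16384,      # illustrative sizes, not the paper's
                            prefix_sizes=[256, 4096, 16384], k=k_noncore)
    plant_frozen_core(new_sae, core_rows)                       # freeze core rows; everything else is fresh
    # ... train new_sae as usual, then run SAEBench evaluations on it.
```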
6. Implications: Interpretability, Transfer, and Model Compression
Distilled Matryoshka Sparse Autoencoders yield several key advantages:
- Feature Transferability: Freezing only the most consistently useful encoder directions enables reliable reuse of core features across sparsity budgets, restarts, and tasks.
- Interpretability: The distilled core produces a compact, monosemantic, and non-redundant basis, stabilizing feature semantics and simplifying manual or automated feature annotation. It also reduces feature splitting and absorption.
- Compression: Only the encoder weight rows require preservation for transfer, facilitating a two-stage compression approach: a stable core “backbone” and a lightweight, re-trainable residual dictionary.
A plausible implication is that DMSAEs enable more modular and interpretable model analysis pipelines by decoupling the stable, attribution-maximizing core from non-core features adaptively specialized for different downstream requirements (Martin-Linares et al., 31 Dec 2025).