Quantify how sparse dictionary learning scales with model size

Determine whether the cost of applying sparse dictionary learning methods (such as sparse autoencoders, transcoders, and crosscoders) to all layers and vector spaces of a large neural network grows sub-linearly or supra-linearly with model size and layer coverage, and quantify this training cost relative to the cost of training the original model.

Background

Sparse dictionary learning (SDL) has become a leading approach for decomposing model activations into interpretable latents, but training SDL models for many layers of a large neural network is computationally intensive: the paper notes that a separate small neural network must be trained per layer, and that the resulting SDL dictionaries often contain more parameters than the layer they approximate.
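To make that parameter comparison concrete, here is a minimal sketch of a sparse autoencoder, one common SDL variant, trained on the activations of a single layer. The class name, expansion factor, and L1 coefficient are illustrative assumptions, not settings taken from the paper.

```python
# Minimal sketch of a sparse autoencoder (SAE), one common SDL variant, trained on
# the activations of a single layer. The class name, expansion factor, and L1
# coefficient are illustrative assumptions, not settings taken from the paper.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        d_dict = expansion * d_model                     # dictionary much wider than the layer
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.encoder(acts))         # sparse, non-negative latent codes
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(recon, acts, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most latents toward zero.
    return ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().mean()
```

With d_model = 4096 and a 16x expansion, the encoder and decoder together hold roughly 2 * 4096 * 65,536 ≈ 5.4e8 weights for this single site, several times the parameter count of the MLP sublayer whose activations such an SAE might decompose, and one such model is needed per layer or vector space of interest.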

Despite early deployments of SDL (e.g., on selected layers or components), the authors emphasize that the overall cost profile of comprehensive SDL across an entire large model remains uncharacterized. This uncertainty complicates planning for full-network interpretability and motivates a clear scaling-cost analysis.
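As a rough illustration of the kind of scaling-cost analysis the problem calls for, the sketch below estimates the ratio of SDL training compute to pretraining compute using the common 6 × parameters × tokens FLOP approximation. Every input value (expansion factor, number of vector spaces per layer, token counts, model size) is a hypothetical assumption, not a figure from the paper.

```python
# Back-of-envelope estimate of the quantity the open problem asks for: total SDL
# training compute across all layers and vector spaces, as a fraction of the
# original pretraining compute. The 6 * params * tokens FLOP rule and every input
# value below are rough, hypothetical assumptions, not measurements.

def sdl_relative_cost(n_layers: int,
                      d_model: int,
                      model_params: float,
                      pretrain_tokens: float,
                      sdl_tokens: float,
                      expansion: int = 16,
                      spaces_per_layer: int = 3) -> float:
    """Ratio of SDL training FLOPs (all layers, all vector spaces) to pretraining FLOPs."""
    sae_params = 2 * d_model * (expansion * d_model)     # encoder + decoder weights per SAE
    sdl_flops = 6 * sae_params * sdl_tokens * n_layers * spaces_per_layer
    pretrain_flops = 6 * model_params * pretrain_tokens
    return sdl_flops / pretrain_flops

# Hypothetical 7e9-parameter model: 32 layers, d_model = 4096, pretrained on
# 2e12 tokens, with each SAE trained on 8e9 tokens of activations.
print(sdl_relative_cost(32, 4096, 7e9, 2e12, 8e9))       # ≈ 0.03 under these assumptions
```

The particular number matters less than the fact that the ratio depends strongly on the expansion factor, the number of vector spaces covered, and the tokens needed per dictionary, which is precisely the uncertainty the problem asks to resolve.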

References

The actual relative cost is unclear since there are no public attempts to apply SDL to every vector space in a model, although some work applies SDL to various layers. As AI models become larger, scaling costs of SDL also increase, although it remains unclear whether relative scaling costs are sub- or supra-linear.

Open Problems in Mechanistic Interpretability (arXiv:2501.16496, Sharkey et al., 27 Jan 2025), in "Reverse engineering step 1: Neural network decomposition" (Section 2.1.1, paragraph "SDL methods are expensive to apply to large models").