Quantify how sparse dictionary learning costs scale with model size
Determine whether the cost of applying sparse dictionary learning methods (such as sparse autoencoders, transcoders, and crosscoders) to all layers and vector spaces of a large neural network scales sub-linearly or supra-linearly with model size and layer coverage, and quantify this cost relative to the cost of the original model training run.
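A minimal back-of-envelope sketch of the quantity in question, assuming the standard 6 × (parameters) × (tokens) FLOP rule of thumb for training with forward and backward passes; the model shape, expansion factor, and token counts below are illustrative assumptions, not figures from the paper.

```python
"""Illustrative cost model (a sketch, not from the paper): compare the FLOPs of
training one sparse autoencoder per layer on the residual stream against the
FLOPs of the original pretraining run. Every numeric value is an assumption."""

def sae_training_flops(n_layers: int, d_model: int, expansion: int, tokens: float) -> float:
    # One SAE = encoder (d_model x d_sae) + decoder (d_sae x d_model).
    d_sae = expansion * d_model
    params_per_sae = 2 * d_model * d_sae
    # 6 * params * tokens approximates forward + backward matmul FLOPs.
    return 6 * params_per_sae * tokens * n_layers

def base_training_flops(n_params: float, tokens: float) -> float:
    return 6 * n_params * tokens

# Hypothetical GPT-3-scale model: 96 layers, d_model = 12288, 175e9 parameters,
# pretrained on 300e9 tokens; SAEs with 16x expansion trained on 8e9 tokens each.
sae = sae_training_flops(n_layers=96, d_model=12288, expansion=16, tokens=8e9)
base = base_training_flops(n_params=175e9, tokens=300e9)
print(f"SAE suite:     {sae:.2e} FLOPs")
print(f"Pretraining:   {base:.2e} FLOPs")
print(f"Relative cost: {sae / base:.1%}")
```

Under these assumptions the per-SAE parameter count grows with d_model squared, much like the model's own layers, so the open question reduces to how the expansion factor and SAE token budget must grow with model size, which this sketch deliberately leaves as free parameters.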
References
The actual relative cost is unclear since there are no public attempts to apply SDL to every vector space in a model, although some work applies SDL to various layers. As AI models become larger, scaling costs of SDL also increase, although it remains unclear whether relative scaling costs are sub- or supra-linear.
— Open Problems in Mechanistic Interpretability
(arXiv:2501.16496, Sharkey et al., 27 Jan 2025), in "Reverse engineering step 1: Neural network decomposition" (Section 2.1.1, paragraph "SDL methods are expensive to apply to large models")