Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

Published 28 Oct 2024 in cs.CL and cs.AI | (2410.21508v1)

Abstract: Sparse Autoencoders (SAEs) have recently been employed as an unsupervised approach for understanding the inner workings of LLMs. They reconstruct the model's activations with a sparse linear combination of interpretable features. However, training SAEs is computationally intensive, especially as models grow in size and complexity. To address this challenge, we propose a novel training strategy that reduces the number of trained SAEs from one per layer to one per group of contiguous layers. Our experimental results on Pythia 160M show a speedup of up to 6x without compromising reconstruction quality or performance on downstream tasks. Layer clustering is therefore an efficient approach to training SAEs in modern LLMs.
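
The idea of training one SAE per group of contiguous layers, rather than one per layer, can be illustrated with a minimal sketch. The abstract does not say how groups are chosen, so the equal-size split below is purely an assumption made for illustration, and the function names `contiguous_layer_groups` and `sae_count_reduction` are hypothetical. The 12-layer figure in the usage note is likewise an assumption about the model's depth.

```python
from typing import List


def contiguous_layer_groups(num_layers: int, num_groups: int) -> List[List[int]]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal groups.

    Assumption: equal-size splitting is used here only for illustration;
    the abstract does not specify how layer groups are formed.
    """
    if not 1 <= num_groups <= num_layers:
        raise ValueError("num_groups must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_groups)
    groups: List[List[int]] = []
    start = 0
    for g in range(num_groups):
        # The first `extra` groups absorb one additional layer each.
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def sae_count_reduction(num_layers: int, num_groups: int) -> float:
    """Ratio of SAEs trained with one per layer to SAEs trained with one per group."""
    return num_layers / num_groups


if __name__ == "__main__":
    # Hypothetical setting: a 12-layer model split into 4 groups.
    for group in contiguous_layer_groups(12, 4):
        print(group)
    print(sae_count_reduction(12, 4))
```

Under these assumptions, a 12-layer model split into 4 groups needs 4 SAEs instead of 12. This ratio only counts trained SAEs; it is not a measurement of wall-clock speedup, which the abstract reports separately.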