Analyzing "Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders"
This paper, "Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders," introduces an extensive suite of Sparse Autoencoders (SAEs) trained on the Llama-3.1-8B-Base model, aiming to enhance both model understanding and the wider interpretability of neural networks. The paper details the design and analysis of 256 SAEs across every layer and sublayer of this large-scale LLM, comparing variations with 32K and 128K features. It evaluates their generalizability, computational efficiency, and potential for revealing model insights within the expansive multi-layer architecture of Llama-3.1-8B.
Core Contributions
The paper delineates several contributions that expand the usability and efficiency of SAEs for LLMs at the scale of 8 billion parameters and beyond:
- Architecture and Training Modifications: The paper introduces modified Top-K SAEs in which feature pre-activations are scaled by the 2-norm of the corresponding decoder columns before the Top-K selection. Trained Top-K SAEs are then converted to JumpReLU variants, so the number of active features need not be fixed per token at inference, and a K-annealing schedule gradually reduces the number of activated features during training, aiding convergence without substantial computational overhead (a code sketch of these ideas follows this list).
- Extensive Suite Implementation: The suite comprises 256 SAEs covering every layer and sublayer position of Llama-3.1-8B at both the 32K and 128K feature widths. This coverage enables fine-grained interpretability at each point in the forward pass and supports layer-by-layer analyses of how representations evolve through the model and carry over to fine-tuned variants.
- Comprehensive Evaluation: Evaluation covers classical metrics such as the sparsity-fidelity trade-off and latent firing frequency, and goes further by examining the geometry of the learned SAE latents, their usefulness for feature discovery, and cross-model generalization (a second sketch after this list illustrates the classical measurements).
- Out-of-Distribution Generalization: The SAEs were evaluated beyond their training distribution, both on longer sequence lengths and on activations from instruction-finetuned models, showing only minimal performance degradation in either setting and extending their utility beyond the base model's training regime.
- Open-Source Accessibility: The SAE checkpoints are publicly released, along with scalable training and visualization tools, fostering a collaborative research ecosystem for mechanistic interpretability. This availability reduces redundant training effort across the community and provides a shared basis for comparing and building on SAE features.
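
The following minimal PyTorch sketch (not the authors' released code) illustrates the architectural ideas described above: a Top-K SAE whose pre-activations are scaled by decoder column norms during Top-K selection, a K-annealing schedule, and a JumpReLU-style thresholded encoder for inference. Class and function names such as `TopKSAE` and `k_schedule` are illustrative assumptions, not identifiers from the paper's codebase.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Illustrative Top-K sparse autoencoder (a sketch, not the official Llama Scope code)."""

    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.encoder(x))                 # feature pre-activations
        # Scale by decoder column norms so Top-K ranks features by their
        # contribution to the reconstruction, per the modification described above.
        col_norms = self.decoder.weight.norm(dim=0)       # shape: (d_sae,)
        scores = pre * col_norms
        topk = torch.topk(scores, self.k, dim=-1)
        mask = torch.zeros_like(pre).scatter_(-1, topk.indices, 1.0)
        return pre * mask                                 # keep only the K selected features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))

    @torch.no_grad()
    def encode_jumprelu(self, x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        """JumpReLU-style inference: a threshold (scalar or per-feature) replaces the fixed K."""
        pre = torch.relu(self.encoder(x))
        return pre * (pre * self.decoder.weight.norm(dim=0) > theta)


def k_schedule(step: int, total_steps: int, k_start: int, k_final: int, anneal_frac: float = 0.1) -> int:
    """K-annealing: start with many active features and decay linearly to the target K."""
    anneal_steps = max(int(total_steps * anneal_frac), 1)
    if step >= anneal_steps:
        return k_final
    return int(round(k_start + (step / anneal_steps) * (k_final - k_start)))
```

In the same spirit, here is a small sketch of the classical sparsity-fidelity measurements mentioned above: L0 (average active features per token), explained variance of the reconstruction, and latent firing frequency. The function name and interface are hypothetical; fidelity can also be measured by patching the reconstruction back into the language model and measuring the loss increase, which this sketch does not cover.

```python
import torch


@torch.no_grad()
def sparsity_fidelity(sae, activations: torch.Tensor) -> dict:
    """L0 sparsity, explained variance, and dead-latent fraction for a batch.

    `activations` has shape (n_tokens, d_model); `sae` exposes encode() and a
    linear decoder, as in the sketch above (a hypothetical interface).
    """
    feats = sae.encode(activations)                   # (n_tokens, d_sae)
    recon = sae.decoder(feats)                        # (n_tokens, d_model)

    # L0: average number of features firing per token (the sparsity axis).
    l0 = (feats != 0).float().sum(dim=-1).mean().item()

    # Explained variance: 1 - MSE / Var(x) (the fidelity axis).
    mse = (recon - activations).pow(2).mean()
    var = (activations - activations.mean(dim=0)).pow(2).mean()
    explained_variance = (1.0 - mse / var).item()

    # Latent firing frequency: fraction of tokens on which each feature is
    # active; features that never fire in the batch are counted as dead.
    freq = (feats != 0).float().mean(dim=0)
    dead_fraction = (freq == 0).float().mean().item()

    return {"l0": l0, "explained_variance": explained_variance, "dead_fraction": dead_fraction}
```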
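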
Implications and Future Directions
The work carries significant implications for both theoretical AI research and practical model applications:
- Theoretical Implications: The detailed exploration of feature geometry in a model as large as Llama-3.1-8B broadens the paradigm for mechanistic interpretability. Sparse representations help isolate monosemantic units, which in turn gives analyses of different models a common vocabulary, enabling feature comparisons and shared interpretability metrics.
- Practical Applications: With the efficiency and scalability improvements demonstrated in Llama Scope, SAEs become a more practical tool for the contextual insights needed to debug, optimize, and build safety mechanisms into LLMs.
Future research will likely focus on scaling the approach to models with even larger parameter counts and on implementation efficiencies for other model paradigms such as Mixture-of-Experts (MoE). Extending neuron-level universality analyses to these SAE features could significantly further our understanding of how LLMs represent knowledge.
Overall, "Llama Scope" sets a robust platform from which to explore interpretability within deep neural networks, substantially contributing to the research continuum in AI model introspection.