Automatically Interpreting Millions of Features in Large Language Models (2410.13928v2)

Published 17 Oct 2024 in cs.LG and cs.CL

Abstract: While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of explanations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a feature, which we find explains features that are not recalled by existing methods. We propose guidelines for generating better explanations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. We use our explanations to measure the semantic similarity of independently trained SAEs, and find that SAEs trained on nearby layers of the residual stream are highly similar. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons, even when neurons are sparsified using top-$k$ postprocessing. Our code is available at https://github.com/EleutherAI/sae-auto-interp, and our explanations are available at https://huggingface.co/datasets/EleutherAI/auto_interp_explanations.

The paper, "Automatically Interpreting Millions of Features in Large Language Models," presents a novel automated pipeline for generating and evaluating natural language interpretations of sparse autoencoder (SAE) latents. The focus is on using LLMs to make the interpretation of millions of features scalable. Various aspects of the pipeline and the implications of its findings are explored below:

Key Contributions and Methodology

  1. Sparse Autoencoders (SAEs):
    • SAEs are used to transform the activations of neurons in LLMs into a sparse, higher-dimensional latent space. This space is posited to be more interpretable than the raw neuron activations, addressing polysemanticity, in which a single neuron fires in many unrelated contexts (see the sketch after this list, which also shows how top-activating examples are turned into an explainer prompt).
  2. Automated Interpretation Pipeline:
    • A comprehensive framework is introduced that leverages LLMs to interpret SAE latents. The pipeline involves collecting activations over a broad dataset, generating interpretations using an LLM explainer model, and scoring these interpretations through multiple techniques.
    • The pipeline can consistently handle varying sizes of SAEs and activation functions, demonstrating flexibility across different models and architectures.
  3. Scoring Techniques:
    • Five new techniques are proposed to evaluate interpretation quality with an emphasis on cost efficiency. One of them, intervention scoring, evaluates whether the downstream effects of intervening on a latent are themselves interpretable, which surfaces latents that input-based methods fail to recall.
    • These methods are substantially cheaper to run than simulation scoring, the previous state of the art, which asks an LLM to predict per-token activations and compares them against the real ones.
  4. Evaluation and Guidelines:
    • Insights are provided into generating broad interpretations applicable across different contexts, addressing pitfalls in current scoring approaches. Interpretations are validated against datasets like RedPajama-v2, ensuring their robustness.
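
The encoding and prompt-building steps are straightforward to sketch. The snippet below is a minimal illustration, assuming a top-k SAE acting on residual-stream activations; the dimensions, the `<<token>>` delimiter format, and the `build_explainer_prompt` helper are illustrative stand-ins rather than the repository's actual API.

```python
import torch

# Hypothetical sizes; the paper's SAEs can have millions of latents.
d_model, d_sae, k = 768, 32_768, 32

W_enc = torch.randn(d_sae, d_model) * 0.02   # encoder weights (random stand-ins)
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_model, d_sae) * 0.02   # decoder weights
b_dec = torch.zeros(d_model)

def encode_topk(resid: torch.Tensor) -> torch.Tensor:
    """Map residual-stream activations (n_tokens, d_model) to sparse latents."""
    pre = resid @ W_enc.T + b_enc
    vals, idx = pre.topk(k, dim=-1)              # keep only the k largest pre-activations
    latents = torch.zeros_like(pre)
    latents.scatter_(-1, idx, torch.relu(vals))  # all other latents are exactly zero
    return latents

def decode(latents: torch.Tensor) -> torch.Tensor:
    """Reconstruct the original activations from the sparse code."""
    return latents @ W_dec.T + b_dec

def build_explainer_prompt(examples: list[tuple[list[str], list[float]]]) -> str:
    """Assemble an explainer prompt from a latent's top-activating token windows.

    Activating tokens are wrapped in <<...>> so the explainer LLM can see where
    the latent fires; the exact delimiter format here is an illustrative choice.
    """
    lines = ["Summarise what the marked tokens have in common."]
    for tokens, acts in examples:
        lines.append("".join(f"<<{t}>>" if a > 0 else t for t, a in zip(tokens, acts)))
    return "\n".join(lines)
```

In the actual pipeline, latent activations are cached over a large corpus (such as RedPajama-v2), top-activating windows are selected per latent, and the resulting prompt is sent to the explainer LLM, whose summary becomes the candidate interpretation.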

Numerical and Qualitative Results

  • Interpretation Performance:
    • The paper emphasizes that interpreting SAE latents, owing to their sparsity and potential monosemanticity, yields more meaningful interpretations than interpreting neurons directly, even when neurons are sparsified with top-k postprocessing.
    • Agreement between scoring methods is quantified with Pearson and Spearman correlations; fuzzing scores correlate strongly with simulation scores, indicating that the cheaper technique captures much of the same signal about latent activation behavior.
  • Implementation and Scalability:
    • The cost of evaluation is significantly reduced: fuzzing and detection scoring are substantially less expensive than simulation-based methods (a rough sketch of detection-style scoring follows this list).
    • Detailed discussions explain how different configurations—such as the number of examples shown, explainer model size, and context length—affect interpretation quality.
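
To make the cost comparison concrete, the following is a rough sketch of detection-style scoring. It assumes the scorer LLM is wrapped as a `judge(explanation, context) -> bool` callable; that interface, and the placeholder numbers in the correlation example, are illustrative assumptions rather than the repository's API.

```python
import random
from scipy.stats import spearmanr

def detection_score(explanation, activating, non_activating, judge):
    """Fraction of held-out contexts the judge labels correctly from the explanation alone.

    `judge(explanation, context) -> bool` stands in for a single LLM call asking
    "would this latent fire on this context?", which is far cheaper than asking
    the model to simulate per-token activations as in simulation scoring.
    """
    examples = [(c, True) for c in activating] + [(c, False) for c in non_activating]
    random.shuffle(examples)
    correct = sum(judge(explanation, ctx) == label for ctx, label in examples)
    return correct / len(examples)

# Agreement between two scorers over the same latents can then be summarised with a
# rank correlation, mirroring the paper's fuzzing-vs-simulation comparison.
fuzzing_scores = [0.91, 0.62, 0.78, 0.55]       # placeholder values
simulation_scores = [0.88, 0.58, 0.81, 0.50]
rho, _ = spearmanr(fuzzing_scores, simulation_scores)
print(f"Spearman rank correlation between scorers: {rho:.2f}")
```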

Challenges and Future Directions

  • Polysemantic Latents:
    • While SAEs reduce polysemanticity, not all latents can be fully explained through input correlations alone. The need for causal interpretation, particularly for output-focused latents, is highlighted.
  • Generality vs. Specificity:
    • The paper notes that interpretation generality depends on the sampling strategy: using only top-activating examples tends to produce overly specific explanations, whereas sampling across activation quantiles yields explanations that remain valid for a broader set of activating contexts.
  • Towards Richer Evaluations:
    • The authors propose moving beyond single-technique evaluation models to a multi-score evaluation paradigm to identify and compensate for potential weaknesses in interpretation scores.

The research advances the state of interpretability for LLMs by making feature-level analysis more accessible and automatable, providing practitioners with a scalable toolset for interpreting the complex activations of contemporary LLMs. The open-sourcing of code and interpretations serves as a valuable resource for continued exploration and refinement in this domain.

Authors (4)
  1. Gonçalo Paulo (11 papers)
  2. Alex Mallen (10 papers)
  3. Caden Juang (4 papers)
  4. Nora Belrose (19 papers)