Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control (2405.08366v3)
Abstract: Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against \emph{supervised} feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model. Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features). We hope that our framework will provide a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.
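To make the object of evaluation concrete, below is a minimal sketch of the kind of sparse autoencoder the abstract refers to, assuming the standard architecture (linear encoder, ReLU, linear decoder, L1 sparsity penalty), together with one possible "approximation" metric. The class and function names (`SparseAutoencoder`, `sae_loss`, `fraction_variance_explained`) and all hyperparameters are illustrative assumptions, not the authors' implementation or the exact metrics used in the paper.

```python
# Minimal sketch of a sparse autoencoder (SAE) trained on model activations,
# plus a simple approximation metric. All names/hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: n_features is typically several times d_model.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        # acts: (batch, d_model) activations, e.g. from a residual stream site.
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)             # reconstruction of the original activations
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages each input
    # to be explained by only a few dictionary features.
    mse = ((recon - acts) ** 2).sum(-1).mean()
    sparsity = features.abs().sum(-1).mean()
    return mse + l1_coeff * sparsity

def fraction_variance_explained(recon, acts):
    # One possible approximation metric: how much of the activation variance
    # the dictionary reconstruction accounts for (1.0 = perfect reconstruction).
    resid = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(dim=0)) ** 2).sum()
    return 1.0 - (resid / total).item()
```

In the framework described above, a metric like this would be computed for both the unsupervised SAE and a supervised feature dictionary on the same task distribution, so that the SAE's approximation quality can be read relative to a known-good baseline rather than in isolation.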
- Can language models encode perceptual structure without grounding? A case study in color. arXiv preprint arXiv:2109.06129, 2021.
- Understanding intermediate layers using linear classifier probes. ArXiv, abs/1610.01644, 2016. URL https://api.semanticscholar.org/CorpusID:9794990.
- Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.
- Layer normalization. ArXiv, abs/1607.06450, 2016. URL https://api.semanticscholar.org/CorpusID:8236317.
- Leace: Perfect linear concept erasure in closed form. ArXiv, abs/2306.03819, 2023. URL https://api.semanticscholar.org/CorpusID:259088549.
- Pythia: A suite for analyzing large language models across training and scaling. ArXiv, abs/2304.01373, 2023. URL https://api.semanticscholar.org/CorpusID:257921893.
- Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
- An interpretability illusion for BERT. arXiv preprint arXiv:2104.07143, 2021.
- Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
- A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.
- Toy models of superposition. Transformer Circuits Thread, 2022a. URL https://transformer-circuits.pub/2022/toy_model/index.html.
- Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022b.
- Sparse overcomplete word vector representations. In Annual Meeting of the Association for Computational Linguistics, 2015. URL https://api.semanticscholar.org/CorpusID:9397697.
- Interpreting CLIP's image representation via text-based decomposition. arXiv preprint arXiv:2310.05916, 2023.
- Finding alignments between interpretable causal variables and distributed neural representations. arXiv preprint arXiv:2303.02536, 2023.
- Gabriel Goh. Decoding the thought vector. 2016. URL https://gabgoh.github.io/ThoughtVectors/.
- OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- Successor heads: Recurring, interpretable attention heads in the wild. ArXiv, abs/2312.09230, 2023. URL https://api.semanticscholar.org/CorpusID:266210012.
- Semantic projection: Recovering human knowledge of multiple, distinct object features from word embeddings. arXiv preprint arXiv:1802.01241, 2018.
- Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
- A circuit for Python docstrings in a 4-layer attention-only transformer. Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only.
- Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.
- Attention SAEs scale to GPT-2 Small. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr.
- Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737, 2021.
- Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023a.
- Circuit breaking: Removing model behaviors with targeted ablation. arXiv preprint arXiv:2309.05973, 2023b.
- Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla. arXiv preprint arXiv:2307.09458, 2023.
- Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024.
- The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
- Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
- Distributed representations of words and phrases and their compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 3111–3119, 2013a. URL https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.
- Linguistic regularities in continuous space word representations. In North American Chapter of the Association for Computational Linguistics, 2013b. URL https://api.semanticscholar.org/CorpusID:7478738.
- Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023.
- Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001.
- Christopher Olah. Interpretability dreams. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/interpretability-dreams/index.html.
- Circuits updates - January 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update.
- Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997. URL https://api.semanticscholar.org/CorpusID:14208692.
- In-context learning and induction heads. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
- OpenAI. GPT-4 technical report, 2023.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [Interim research report] Taking features out of superposition with sparse autoencoders. Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition.
- SPINE: Sparse interpretable neural embeddings. ArXiv, abs/1711.08792, 2017. URL https://api.semanticscholar.org/CorpusID:19143983.
- Codebook features: Sparse and discrete interpretability for neural networks. arXiv preprint arXiv:2310.17230, 2023.
- Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154, 2023.
- Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
- Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020.
- Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Workshop on Knowledge Extraction and Integration for Deep Learning Architectures; Deep Learning Inside Out, 2021. URL https://api.semanticscholar.org/CorpusID:232417301.