Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control (2405.08366v3)
Abstract: Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against \emph{supervised} feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model. Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features). We hope that our framework will provide a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.
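To make the object of evaluation concrete, below is a minimal sketch of the kind of sparse autoencoder the abstract refers to, assuming the standard architecture (linear encoder, ReLU, linear decoder, L1 sparsity penalty), together with one possible "approximation" metric. The class and function names (`SparseAutoencoder`, `sae_loss`, `fraction_variance_explained`) and all hyperparameters are illustrative assumptions, not the authors' implementation or the exact metrics used in the paper.

```python
# Minimal sketch of a sparse autoencoder (SAE) trained on model activations,
# plus a simple approximation metric. All names/hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: n_features is typically several times d_model.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        # acts: (batch, d_model) activations, e.g. from a residual stream site.
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)             # reconstruction of the original activations
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages each input
    # to be explained by only a few dictionary features.
    mse = ((recon - acts) ** 2).sum(-1).mean()
    sparsity = features.abs().sum(-1).mean()
    return mse + l1_coeff * sparsity

def fraction_variance_explained(recon, acts):
    # One possible approximation metric: how much of the activation variance
    # the dictionary reconstruction accounts for (1.0 = perfect reconstruction).
    resid = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(dim=0)) ** 2).sum()
    return 1.0 - (resid / total).item()
```

In the framework described above, a metric like this would be computed for both the unsupervised SAE and a supervised feature dictionary on the same task distribution, so that the SAE's approximation quality can be read relative to a known-good baseline rather than in isolation.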
- Can language models encode perceptual structure without grounding? A case study in color. arXiv preprint arXiv:2109.06129, 2021.
- Understanding intermediate layers using linear classifier probes. ArXiv, abs/1610.01644, 2016. URL https://api.semanticscholar.org/CorpusID:9794990.
- Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.
- Layer normalization. ArXiv, abs/1607.06450, 2016. URL https://api.semanticscholar.org/CorpusID:8236317.
- Leace: Perfect linear concept erasure in closed form. ArXiv, abs/2306.03819, 2023. URL https://api.semanticscholar.org/CorpusID:259088549.
- Pythia: A suite for analyzing large language models across training and scaling. ArXiv, abs/2304.01373, 2023. URL https://api.semanticscholar.org/CorpusID:257921893.
- Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
- An interpretability illusion for BERT. arXiv preprint arXiv:2104.07143, 2021.
- Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
- A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.
- Toy models of superposition. Transformer Circuits Thread, 2022a. URL https://transformer-circuits.pub/2022/toy_model/index.html.
- Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022b.
- Sparse overcomplete word vector representations. In Annual Meeting of the Association for Computational Linguistics, 2015. URL https://api.semanticscholar.org/CorpusID:9397697.
- Interpreting CLIP's image representation via text-based decomposition. arXiv preprint arXiv:2310.05916, 2023.
- Finding alignments between interpretable causal variables and distributed neural representations. arXiv preprint arXiv:2303.02536, 2023.
- Gabriel Goh. Decoding the thought vector. 2016. URL https://gabgoh.github.io/ThoughtVectors/.
- OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- Successor heads: Recurring, interpretable attention heads in the wild. ArXiv, abs/2312.09230, 2023. URL https://api.semanticscholar.org/CorpusID:266210012.
- Semantic projection: Recovering human knowledge of multiple, distinct object features from word embeddings. arXiv preprint arXiv:1802.01241, 2018.
- Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
- A circuit for Python docstrings in a 4-layer attention-only transformer. Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only.
- Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.
- Attention SAEs scale to GPT-2 Small. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr.
- Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737, 2021.
- Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023a.
- Circuit breaking: Removing model behaviors with targeted ablation. arXiv preprint arXiv:2309.05973, 2023b.
- Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla. arXiv preprint arXiv:2307.09458, 2023.
- Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024.
- The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
- Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
- Distributed representations of words and phrases and their compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 3111–3119, 2013a. URL https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.
- Linguistic regularities in continuous space word representations. In North American Chapter of the Association for Computational Linguistics, 2013b. URL https://api.semanticscholar.org/CorpusID:7478738.
- Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023.
- Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001.
- Christopher Olah. Interpretability dreams. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/interpretability-dreams/index.html.
- Circuits updates - January 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update.
- Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997. URL https://api.semanticscholar.org/CorpusID:14208692.
- In-context learning and induction heads. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
- OpenAI. GPT-4 technical report, 2023.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [Interim research report] Taking features out of superposition with sparse autoencoders. Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition.
- SPINE: Sparse interpretable neural embeddings. ArXiv, abs/1711.08792, 2017. URL https://api.semanticscholar.org/CorpusID:19143983.
- Codebook features: Sparse and discrete interpretability for neural networks. arXiv preprint arXiv:2310.17230, 2023.
- Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154, 2023.
- Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
- Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020.
- Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Workshop on Knowledge Extraction and Integration for Deep Learning Architectures; Deep Learning Inside Out, 2021. URL https://api.semanticscholar.org/CorpusID:232417301.