Interpretable Feature Circuits Discovered by Transcoders in LLMs
This essay discusses the core contributions and methodologies presented in the paper, "Transcoders Find Interpretable LLM Feature Circuits" by Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. The paper aims to advance mechanistic interpretability in LLMs by introducing and exploring transcoders. Transcoders are designed to approximate the densely activating MLP layers in transformers with wider, sparsely-activating MLP layers. This approximation facilitates fine-grained circuit analysis, enabling the identification of interpretable feature circuits responsible for model behaviors.
Key Contributions
The paper makes several primary contributions:
- Introduction of Transcoders: The authors present transcoders as a novel tool for approximating MLP layers in transformer models. Transcoders are trained with an L1 regularization penalty to encourage sparsity, which aids interpretability while largely preserving the fidelity of the original MLP's computation.
- Comparison with Sparse Autoencoders (SAEs): Extensive evaluations demonstrate that transcoders perform on par with or better than SAEs regarding interpretability, sparsity, and faithfulness.
- Circuit-Finding Methodology: A novel method for using transcoders in circuit analysis is introduced, leveraging the disentangling property of transcoders to cleanly factorize circuits into input-dependent and input-invariant terms.
- Empirical Evaluations: The paper applies transcoders to a range of tasks and models, including reverse-engineering the "greater-than" circuit in GPT2-small, and presents detailed case studies that showcase the practical utility of transcoders for mechanistic interpretability.
Methodology
Transcoder Training and Architecture
Transcoders replace a dense MLP sublayer with a wider, sparsely activating approximation. Training minimizes a loss that balances faithfulness (mean squared error between the transcoder's output and the original MLP's output) against sparsity (an L1 penalty on the hidden activations). Architecturally, a transcoder is a single-hidden-layer MLP with an encoder-decoder structure:

$$z_{\text{TC}}(x) = \operatorname{ReLU}(W_{\text{enc}} x + b_{\text{enc}})$$

$$\text{TC}(x) = W_{\text{dec}}\, z_{\text{TC}}(x) + b_{\text{dec}}$$
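The following is a minimal PyTorch sketch of this setup, not the authors' released code; the dimension names and the l1_coeff hyperparameter are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Transcoder(nn.Module):
    """Single-hidden-layer approximation of an MLP sublayer:
    TC(x) = W_dec ReLU(W_enc x + b_enc) + b_dec."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse feature activations z_TC(x)
        return torch.relu(x @ self.W_enc.T + self.b_enc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encode(x) @ self.W_dec.T + self.b_dec


def transcoder_loss(tc: Transcoder, mlp_in: torch.Tensor,
                    mlp_out: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
    """Faithfulness (MSE to the original MLP's output on the same input)
    plus an L1 sparsity penalty on the hidden activations."""
    z = tc.encode(mlp_in)
    recon = z @ tc.W_dec.T + tc.b_dec
    faithfulness = (recon - mlp_out).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return faithfulness + l1_coeff * sparsity
```

Note that, unlike an SAE, the reconstruction target is the MLP's output rather than the transcoder's own input.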
Comparison to SAEs
Transcoders are evaluated against SAEs trained on MLP outputs across multiple LLMs, including GPT2-small, Pythia-410M, and Pythia-1.4B. The metrics used for evaluation include interpretability (human-judged), sparsity (mean L0 norm of activations), and faithfulness (cross-entropy loss difference when transcoders replace MLPs).
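As a rough illustration of how the sparsity and faithfulness numbers might be computed, here is a sketch assuming the transcoder activations and per-token cross-entropy losses have already been collected; the function and argument names are illustrative.

```python
import torch


def mean_l0(z: torch.Tensor) -> torch.Tensor:
    """Mean number of nonzero transcoder features per token.
    z: feature activations of shape (batch, seq, d_hidden)."""
    return (z > 0).float().sum(dim=-1).mean()


def ce_loss_difference(ce_original: torch.Tensor, ce_patched: torch.Tensor) -> torch.Tensor:
    """Faithfulness proxy: increase in per-token cross-entropy loss when the
    transcoder's output is substituted for the MLP's output.
    Both inputs: per-token losses of shape (batch, seq)."""
    return (ce_patched - ce_original).mean()
```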
Circuit Analysis with Transcoders
The paper introduces a method to perform circuit analysis using transcoders:
- Attribution Calculation: The attribution of an earlier-layer feature to a later-layer feature is the earlier feature's activation multiplied by the dot product between its decoder vector and the later feature's encoder vector (see the sketch after this list).
- Computational Subgraphs: Important computational paths are identified by iteratively following the largest attributions backward through the layers.
- De-Embeddings: De-embedding vectors, obtained by multiplying the token embedding matrix with a feature's encoder vector, quantify the direct effect of each vocabulary token on a transcoder feature, providing input-invariant insight into model behavior.
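A minimal sketch of these two quantities under simplifying assumptions (LayerNorm scaling is omitted and the tensor names are illustrative):

```python
import torch


def feature_attribution(z_early: torch.Tensor, w_dec_early: torch.Tensor,
                        w_enc_late: torch.Tensor) -> torch.Tensor:
    """Input-dependent attribution of one earlier-layer transcoder feature to
    one later-layer transcoder feature: activation * (decoder . encoder).
    z_early: scalar activation of the earlier feature at a given token.
    w_dec_early, w_enc_late: vectors of shape (d_model,)."""
    return z_early * torch.dot(w_dec_early, w_enc_late)


def de_embedding(W_E: torch.Tensor, w_enc_feature: torch.Tensor) -> torch.Tensor:
    """Input-invariant 'de-embedding': for each vocabulary token, the direct
    contribution of its embedding to a transcoder feature's pre-activation.
    W_E: embedding matrix of shape (vocab, d_model)."""
    return W_E @ w_enc_feature
```

The factorization is visible in the first function: the activation term depends on the specific input, while the dot product of encoder and decoder vectors is fixed by the trained weights.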
Empirical Results
Blind Case Studies
Several blind case studies are conducted, in which the authors infer the semantics of hidden features purely through circuit analysis. One notable case involved reverse-engineering a feature in a GPT2-small transcoder and correctly identifying it as firing on semicolons within parenthetical citations.
Greater-Than Circuit in GPT2-small
The authors revisit the "greater-than circuit" previously analyzed by \citet{hanna_how_2023}. Using transcoders, they not only corroborate earlier findings but also identify the relevant MLP10 features and show how these features contribute to the model's behavior when predicting plausible end years. They demonstrate that transcoders yield a sparser and more interpretable computational subgraph than neuron-level analysis.
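For context, the greater-than task scores the model with a probability-difference metric over the two-digit year tokens; the sketch below reflects my reading of the metric from \citet{hanna_how_2023}, and the exact cutoff handling may differ.

```python
import torch


def prob_difference(year_probs: torch.Tensor, yy: int) -> torch.Tensor:
    """Given probabilities over the two-digit year tokens '00'..'99' after a
    prompt like 'The war lasted from the year 17YY to the year 17', return
    P(year > YY) - P(year <= YY)."""
    return year_probs[yy + 1:].sum() - year_probs[: yy + 1].sum()
```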
Implications and Speculations on Future Developments
Practical Implications
The introduction of transcoders has significant practical implications for debugging and understanding LLM behaviors. By providing a clear and sparse approximation of MLP sublayers, transcoders make fine-grained circuit analysis more tractable. This can lead to better model interpretability, facilitate identification of emergent behaviors, and potentially guide the development of more reliable and controllable AI systems.
Theoretical Implications
The ability of transcoders to disentangle input-dependent and input-invariant components of model behavior offers a profound theoretical tool for understanding neural networks. This factorization might enable the formulation of new hypotheses about how higher-level cognitive tasks are represented within transformer models.
Future Directions
Future research could explore extensions of transcoders to other neural architectures beyond transformers or employ transcoders in understanding attention mechanisms. Additionally, enhancing the scalability of transcoders to larger models and datasets will be crucial for generalizing their applicability.
Conclusion
Transcoders mark a significant step forward in mechanistic interpretability of LLMs, providing a bridge between the dense computations of MLP layers and sparse, human-interpretable circuits. The paper's rigorous methodology and comprehensive evaluations offer compelling evidence of the utility of transcoders in fine-grained circuit analysis, providing both practical tools and theoretical insights into deep model behaviors.