Interpretable Feature Circuits Discovered by Transcoders in LLMs
This essay discusses the core contributions and methodologies presented in the paper, "Transcoders Find Interpretable LLM Feature Circuits" by Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. The paper aims to advance mechanistic interpretability in LLMs by introducing and exploring transcoders. Transcoders are designed to approximate the densely activating MLP layers in transformers with wider, sparsely-activating MLP layers. This approximation facilitates fine-grained circuit analysis, enabling the identification of interpretable feature circuits responsible for model behaviors.
Key Contributions
The paper makes several primary contributions:
- Introduction of Transcoders: The authors present transcoders as a novel tool for approximating MLP layers in transformer models. Transcoders are trained with an L1 regularization penalty to encourage sparsity, which aids interpretability while largely preserving the fidelity of the original MLP's computation.
- Comparison with Sparse Autoencoders (SAEs): Extensive evaluations demonstrate that transcoders perform on par with or better than SAEs regarding interpretability, sparsity, and faithfulness.
- Circuit-Finding Methodology: A novel method for using transcoders in circuit analysis is introduced, leveraging the disentangling property of transcoders to cleanly factorize circuits into input-dependent and input-invariant terms.
- Empirical Evaluations: The paper applies transcoders to a range of tasks and models, including reverse-engineering the "greater-than" circuit in GPT2-small, and presents detailed case studies that showcase the practical utility of transcoders for mechanistic interpretability.
Methodology
Transcoder Training and Architecture
Transcoders replace a dense MLP sublayer with a wider, sparsely activating approximation. Training minimizes a loss that balances faithfulness (mean squared error between the transcoder's output and the original MLP's output) against sparsity (an L1 penalty on the hidden activations). Architecturally, a transcoder is a single-hidden-layer MLP with an encoder-decoder structure:

$$z_{\text{TC}}(x) = \operatorname{ReLU}(W_{\text{enc}} x + b_{\text{enc}})$$

$$\text{TC}(x) = W_{\text{dec}}\, z_{\text{TC}}(x) + b_{\text{dec}}$$
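The following is a minimal PyTorch sketch of this setup, not the authors' released code; the dimension names and the l1_coeff hyperparameter are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Transcoder(nn.Module):
    """Single-hidden-layer approximation of an MLP sublayer:
    TC(x) = W_dec ReLU(W_enc x + b_enc) + b_dec."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse feature activations z_TC(x)
        return torch.relu(x @ self.W_enc.T + self.b_enc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encode(x) @ self.W_dec.T + self.b_dec


def transcoder_loss(tc: Transcoder, mlp_in: torch.Tensor,
                    mlp_out: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
    """Faithfulness (MSE to the original MLP's output on the same input)
    plus an L1 sparsity penalty on the hidden activations."""
    z = tc.encode(mlp_in)
    recon = z @ tc.W_dec.T + tc.b_dec
    faithfulness = (recon - mlp_out).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return faithfulness + l1_coeff * sparsity
```

Note that, unlike an SAE, the reconstruction target is the MLP's output rather than the transcoder's own input.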
Comparison to SAEs
Transcoders are evaluated against SAEs trained on MLP outputs across multiple LLMs, including GPT2-small, Pythia-410M, and Pythia-1.4B. The metrics used for evaluation include interpretability (human-judged), sparsity (mean L0 norm of activations), and faithfulness (cross-entropy loss difference when transcoders replace MLPs).
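As a rough illustration of how the sparsity and faithfulness numbers might be computed, here is a sketch assuming the transcoder activations and per-token cross-entropy losses have already been collected; the function and argument names are illustrative.

```python
import torch


def mean_l0(z: torch.Tensor) -> torch.Tensor:
    """Mean number of nonzero transcoder features per token.
    z: feature activations of shape (batch, seq, d_hidden)."""
    return (z > 0).float().sum(dim=-1).mean()


def ce_loss_difference(ce_original: torch.Tensor, ce_patched: torch.Tensor) -> torch.Tensor:
    """Faithfulness proxy: increase in per-token cross-entropy loss when the
    transcoder's output is substituted for the MLP's output.
    Both inputs: per-token losses of shape (batch, seq)."""
    return (ce_patched - ce_original).mean()
```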
Circuit Analysis with Transcoders
The paper introduces a method to perform circuit analysis using transcoders:
- Attribution Calculation: The attribution of an earlier-layer feature to a later-layer feature is the earlier feature's activation multiplied by the dot product between its decoder vector and the later feature's encoder vector (see the sketch after this list).
- Computational Subgraphs: Important computational paths are identified by iteratively following the largest attributions backward through the layers.
- De-Embeddings: De-embedding vectors, obtained by multiplying the token embedding matrix with a feature's encoder vector, quantify the direct effect of each vocabulary token on a transcoder feature, providing input-invariant insight into model behavior.
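A minimal sketch of these two quantities under simplifying assumptions (LayerNorm scaling is omitted and the tensor names are illustrative):

```python
import torch


def feature_attribution(z_early: torch.Tensor, w_dec_early: torch.Tensor,
                        w_enc_late: torch.Tensor) -> torch.Tensor:
    """Input-dependent attribution of one earlier-layer transcoder feature to
    one later-layer transcoder feature: activation * (decoder . encoder).
    z_early: scalar activation of the earlier feature at a given token.
    w_dec_early, w_enc_late: vectors of shape (d_model,)."""
    return z_early * torch.dot(w_dec_early, w_enc_late)


def de_embedding(W_E: torch.Tensor, w_enc_feature: torch.Tensor) -> torch.Tensor:
    """Input-invariant 'de-embedding': for each vocabulary token, the direct
    contribution of its embedding to a transcoder feature's pre-activation.
    W_E: embedding matrix of shape (vocab, d_model)."""
    return W_E @ w_enc_feature
```

The factorization is visible in the first function: the activation term depends on the specific input, while the dot product of encoder and decoder vectors is fixed by the trained weights.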
Empirical Results
Blind Case Studies
Several blind case studies are conducted, in which the authors infer the semantics of hidden features purely through circuit analysis. One notable case involved reverse-engineering a feature in a GPT2-small transcoder and correctly identifying it as firing on semicolons within parenthetical citations.
Greater-Than Circuit in GPT2-small
The authors revisit the "greater-than circuit" previously analyzed by \citet{hanna_how_2023}. Using transcoders, they not only corroborate earlier findings but also identify the relevant MLP10 features and show how these features contribute to the model's behavior when predicting plausible end years. They demonstrate that transcoders yield a sparser and more interpretable computational subgraph than neuron-level analysis.
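For context, the greater-than task scores the model with a probability-difference metric over the two-digit year tokens; the sketch below reflects my reading of the metric from \citet{hanna_how_2023}, and the exact cutoff handling may differ.

```python
import torch


def prob_difference(year_probs: torch.Tensor, yy: int) -> torch.Tensor:
    """Given probabilities over the two-digit year tokens '00'..'99' after a
    prompt like 'The war lasted from the year 17YY to the year 17', return
    P(year > YY) - P(year <= YY)."""
    return year_probs[yy + 1:].sum() - year_probs[: yy + 1].sum()
```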
Implications and Speculations on Future Developments
Practical Implications
The introduction of transcoders has significant practical implications for debugging and understanding LLM behaviors. By providing a clear and sparse approximation of MLP sublayers, transcoders make fine-grained circuit analysis more tractable. This can lead to better model interpretability, facilitate identification of emergent behaviors, and potentially guide the development of more reliable and controllable AI systems.
Theoretical Implications
The ability of transcoders to disentangle input-dependent and input-invariant components of model behavior offers a profound theoretical tool for understanding neural networks. This factorization might enable the formulation of new hypotheses about how higher-level cognitive tasks are represented within transformer models.
Future Directions
Future research could explore extensions of transcoders to other neural architectures beyond transformers or employ transcoders in understanding attention mechanisms. Additionally, enhancing the scalability of transcoders to larger models and datasets will be crucial for generalizing their applicability.
Conclusion
Transcoders mark a significant step forward in mechanistic interpretability of LLMs, providing a bridge between the dense computations of MLP layers and sparse, human-interpretable circuits. The paper's rigorous methodology and comprehensive evaluations offer compelling evidence of the utility of transcoders in fine-grained circuit analysis, providing both practical tools and theoretical insights into deep model behaviors.