The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks (2405.10928v2)
Abstract: Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing: individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis, the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions: it drops irrelevant activation directions, aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers, and scales features according to their importance for downstream computation, producing an interaction graph that shows all computationally-relevant features and interactions in a model. We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models, finding that it identifies more computationally-relevant features that interact more sparsely than those found by principal component analysis. However, LIB does not yield substantial improvements in interpretability or interaction sparsity when applied to LLMs. We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but that in its current form it is not applicable to LLMs.
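To make the abstract's description concrete, the PyTorch sketch below illustrates one way such a basis transformation could look: drop low-variance activation directions, take the singular value decomposition of the Jacobian of the next layer, rotate the retained activations into the right-singular-vector basis, and scale each direction by its singular value. This is a minimal sketch based only on the abstract, not the paper's actual procedure or API; the function name `local_interaction_basis`, the variance threshold used as a proxy for "irrelevant directions", and the single-point Jacobian are all assumptions (the full method presumably aggregates interactions over a dataset rather than evaluating one Jacobian at the mean activation).

```python
# Minimal sketch of a LIB-style basis transformation, assembled from the abstract's
# description. Names, thresholds, and the single-point Jacobian are illustrative
# assumptions, not the paper's implementation.
import torch


def local_interaction_basis(acts: torch.Tensor, next_layer, var_threshold: float = 1e-6):
    """acts: (batch, d) activations at layer l; next_layer: callable mapping layer l -> l+1."""
    # 1. Drop directions with negligible variance (assumed proxy for "irrelevant directions").
    acts_centered = acts - acts.mean(dim=0, keepdim=True)
    cov = acts_centered.T @ acts_centered / acts.shape[0]
    eigvals, eigvecs = torch.linalg.eigh(cov)
    P = eigvecs[:, eigvals > var_threshold]          # (d, k) projection onto retained subspace

    # 2. Jacobian of the next layer, evaluated at the mean activation (single-point
    #    approximation; the paper's method would estimate this over the dataset).
    x0 = acts.mean(dim=0)
    J = torch.autograd.functional.jacobian(next_layer, x0)   # (d_out, d_in)

    # 3. SVD of the Jacobian restricted to the retained subspace. Right singular vectors
    #    define the new basis; singular values scale directions by downstream importance.
    U, S, Vh = torch.linalg.svd(J @ P, full_matrices=False)
    basis = (P @ Vh.T) * S                           # columns = scaled LIB directions (assumed form)

    # Coordinates of each activation in the new basis, plus the basis itself.
    return acts_centered @ basis, basis


# Hypothetical usage with a toy two-layer MLP:
# next_block = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU())
# lib_acts, lib_basis = local_interaction_basis(layer1_acts, next_block)
```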