The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks (2405.10928v2)

Published 17 May 2024 in cs.LG

Abstract: Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis - the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions. Our method drops irrelevant activation directions and aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers. It also scales features based on their importance for downstream computation, producing an interaction graph that shows all computationally-relevant features and interactions in a model. We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models, finding that it identifies more computationally-relevant features that interact more sparsely, compared to principal component analysis. However, LIB does not yield substantial improvements in interpretability or interaction sparsity when applied to LLMs. We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but in its current form is not applicable to LLMs.


Summary

  • The paper introduces the Local Interaction Basis (LIB), a method that transforms activations via PCA and Jacobian-based SVD to isolate sparse, computationally-relevant features in neural networks.
  • On a modular-addition transformer and a CIFAR-10 model, LIB produces sparser interaction graphs and more distinct computational modules than PCA.
  • Although LIB improves interpretability for these simpler models, its limited impact on language models highlights the need for further refinement of the method.

Understanding Neural Networks with the Local Interaction Basis

Background: Mechanistic Interpretability

Let's start by revisiting the idea of mechanistic interpretability. The goal is to reverse-engineer neural networks to understand exactly how they perform their computations. While great strides have been made on simple models and toy tasks, understanding complex networks, especially those used in NLP such as LLMs, remains challenging.

The Problem with Current Methods

Standard methods that inspect individual neurons or model components often fall short. Neural networks have layers and layers of activations, and merely looking at principal components or inspecting individual neurons doesn't give a clear picture. These components don't correspond neatly to distinct features or functions—they're often polysemantic (i.e., they represent multiple features or functions at once).

In other words, techniques like principal component analysis (PCA) may decompose the network's activations but still leave us scratching our heads about what each component really means in terms of the network's computations.

What is the Local Interaction Basis (LIB)?

This paper introduces a method called the Local Interaction Basis (LIB). LIB aims to identify computationally-relevant features within neural networks by eliminating irrelevant activation directions and aligning the remaining directions with the computation between layers. Essentially, the LIB method transforms the network's activations into a new basis that:

  1. Removes irrelevant activation directions
  2. Aligns with the singular vectors of the Jacobian matrix between adjacent layers
  3. Scales features based on their importance for downstream computations

The ultimate outcome is a cleaner interaction graph that highlights computationally important features and their interactions, giving us a brighter flashlight to navigate the dark corridors of neural networks.

Methodology Breakdown

Step 1: Transformation into Local Interaction Basis (LIB)

LIB starts by selecting a subset of network layers to transform. Here's a bird's-eye view of what happens next:

  1. Initial Transformation Using PCA: Each layer's activations are transformed using PCA, which drops nearly-zero variance directions and whitens the activations.
  2. Refinement Using Jacobian Matrices: Next, the method computes how these activations feed into the subsequent layer using Jacobian matrices. Singular value decomposition (SVD) of the Jacobian rotates the activations into directions aligned with computationally-relevant features (a minimal code sketch of both steps follows).
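
The paper does not ship code with this summary, but a minimal sketch of the two-stage change of basis for a single layer might look like the following (PyTorch; `lib_basis` and `next_layer_fn` are hypothetical names, the variance threshold and sample size are illustrative, and the paper's actual construction differs in details such as how the Jacobians are averaged):

```python
import torch

def lib_basis(acts, next_layer_fn, var_threshold=1e-6, n_samples=16):
    """Sketch of a LIB-style change of basis for one layer.

    acts:          (batch, d) activations of the current layer
    next_layer_fn: callable mapping a (d,) activation vector to the next
                   layer's activation vector
    Returns a (d, k) matrix whose columns span the new basis directions.
    """
    # Step 1a: PCA -- drop near-zero-variance directions and whiten.
    centered = acts - acts.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (acts.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)            # ascending eigenvalues
    keep = eigvals > var_threshold
    whiten = eigvecs[:, keep] / eigvals[keep].sqrt()     # (d, k) whitening map

    # Step 1b: align with the singular vectors of the Jacobian to the
    # next layer, averaged over a small sample of data points.
    jac = torch.stack([
        torch.autograd.functional.jacobian(next_layer_fn, x) @ whiten
        for x in acts[:n_samples]
    ]).mean(dim=0)                                       # (d_next, k)
    _, svals, vt = torch.linalg.svd(jac, full_matrices=False)

    # Scale each direction by its singular value, i.e. by how strongly
    # it feeds into the downstream computation.
    return whiten @ vt.T * svals
```

In practice the Jacobian would be averaged over many more inputs and attention and MLP blocks handled separately; the sketch only illustrates the PCA-then-Jacobian-SVD structure of Step 1.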

Step 2: Integrated Gradient Attributions

To quantify the importance of interactions between features, the method utilizes integrated gradients. Integrated gradients help attribute the importance of one feature to another, capturing how upstream layers influence downstream ones. This step builds a robust graph of interactions, ensuring the attributions are reliable and invariant under different implementations.
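
As a rough illustration of the idea rather than the paper's exact attribution formula, integrated gradients for a single scalar downstream feature can be approximated by averaging gradients along a straight path from a baseline. The helper below is hypothetical and assumes `downstream_fn` maps an upstream feature vector to that scalar:

```python
import torch

def integrated_gradients(downstream_fn, x, baseline=None, steps=64):
    """Approximate integrated gradients of one scalar downstream feature.

    downstream_fn: callable mapping an upstream feature vector (d,) to a scalar
    x:             (d,) upstream feature values for one data point
    baseline:      reference point on the integration path; zeros by default
    """
    if baseline is None:
        baseline = torch.zeros_like(x)

    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        downstream_fn(point).backward()
        grads.append(point.grad)

    # Average gradient along the path, scaled by the displacement from the
    # baseline; entry i attributes the downstream feature to upstream feature i.
    return (x - baseline) * torch.stack(grads).mean(dim=0)
```

Aggregating such attributions over a dataset, for every pair of upstream and downstream LIB features, yields the edge weights of the interaction graph analyzed in the next step.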

Step 3: Interaction Graph Analysis

The final step involves analyzing the resulting interaction graph to identify sparse and modular computational structures:

  • Sparse Interactions: By systematically ablating (i.e., removing) edges, the method checks how sparse the interactions really are.
  • Modularity: Using the Leiden algorithm, the method identifies clusters or modules in the interaction graph. These modules potentially represent distinct computational circuits within the network (a sketch of this analysis follows the list).
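
A minimal sketch of this graph analysis, assuming the per-layer attribution matrices have already been computed, might look as follows (python-igraph and leidenalg; `cluster_interaction_graph` and the keep fraction are illustrative, and the paper's sparsity test ablates edges in the running model rather than merely thresholding them):

```python
import numpy as np
import igraph as ig
import leidenalg as la

def cluster_interaction_graph(attributions, keep_fraction=0.1):
    """Prune an interaction graph to its strongest edges and find modules.

    attributions: list of 2D arrays, one per pair of adjacent layers;
                  attributions[l][i, j] is the attribution of feature j in
                  layer l + 1 to feature i in layer l.
    """
    # Give every feature in every layer a global node index.
    sizes = [a.shape[0] for a in attributions] + [attributions[-1].shape[1]]
    offsets = np.concatenate([[0], np.cumsum(sizes)])

    edges, weights = [], []
    for layer, attr in enumerate(attributions):
        for i in range(attr.shape[0]):
            for j in range(attr.shape[1]):
                edges.append((int(offsets[layer] + i), int(offsets[layer + 1] + j)))
                weights.append(float(abs(attr[i, j])))

    # Crude sparsity check: keep only the strongest fraction of edges.
    cutoff = np.quantile(weights, 1.0 - keep_fraction)
    kept = [(e, w) for e, w in zip(edges, weights) if w >= cutoff]

    graph = ig.Graph(n=int(offsets[-1]), edges=[e for e, _ in kept])
    graph.es["weight"] = [w for _, w in kept]

    # Leiden clustering of the pruned graph into candidate modules.
    return la.find_partition(graph, la.ModularityVertexPartition, weights="weight")
```

The returned partition's membership list assigns each feature to a module, which can then be inspected for coherent function, such as the distinct circuits the paper reports in the modular-addition model.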

Evaluating LIB: Examples and Results

The paper applies LIB to several models: a transformer trained on modular addition, a CIFAR-10 model, and two LLMs (GPT-2 small and TinyStories-1M).

Modular Addition Transformer

LIB outperformed PCA in several respects on this toy task:

  • Functional Relevance: LIB excluded irrelevant features better than PCA.
  • Sparsity: Computations in the LIB basis were more sparsely interacting compared to the PCA basis, particularly in the attention layers.
  • Modularity: The method successfully identified distinct computational modules.

CIFAR-10 Model

On a CIFAR-10 model:

  • Sparsity: LIB interactions were considerably sparser than PCA interactions.
  • Isolating Features: LIB isolated specific features, like an “animal vs. vehicle” feature, more effectively than PCA.

LLMs

Results were mixed for LLMs:

  • Positional Features: LIB captured positional features well, but so did PCA.
  • Sparsity: LIB did produce sparser interactions in certain layers compared to PCA, but the gains were marginal and noisy.
  • Interpretability: Neither LIB nor PCA produced highly interpretable features.

Implications and Future Directions

While LIB showed promise for simpler models, it didn't substantially advance interpretability for complex LLMs. This suggests that future directions could involve:

  • Generalizing to Overcomplete Bases: Allowing for the possibility of features represented in superposition.
  • Finer-Grained Techniques for LLMs: Exploring methods beyond linear transformations and singular value analysis to capture more intricate relationships.

In conclusion, while the LIB method provides a new and promising approach for interpreting neural networks, especially simpler models, its application to LLMs needs further refinement. This work lays the groundwork for understanding and disentangling the complex features hidden within neural networks, pointing to a future where AI interpretability becomes more practical and tangible.