The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks (2405.10928v2)
Abstract: Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing: individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis, the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions: it drops irrelevant activation directions, aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers, and scales features according to their importance for downstream computation, producing an interaction graph that shows all computationally-relevant features and interactions in a model. We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models, finding that it identifies more computationally-relevant features that interact more sparsely than those found by principal component analysis. However, LIB does not yield substantial improvements in interpretability or interaction sparsity when applied to LLMs. We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but that in its current form it is not applicable to LLMs.
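To make the abstract's description concrete, the PyTorch sketch below illustrates one way such a basis transformation could look: drop low-variance activation directions, take the singular value decomposition of the Jacobian of the next layer, rotate the retained activations into the right-singular-vector basis, and scale each direction by its singular value. This is a minimal sketch based only on the abstract, not the paper's actual procedure or API; the function name `local_interaction_basis`, the variance threshold used as a proxy for "irrelevant directions", and the single-point Jacobian are all assumptions (the full method presumably aggregates interactions over a dataset rather than evaluating one Jacobian at the mean activation).

```python
# Minimal sketch of a LIB-style basis transformation, assembled from the abstract's
# description. Names, thresholds, and the single-point Jacobian are illustrative
# assumptions, not the paper's implementation.
import torch


def local_interaction_basis(acts: torch.Tensor, next_layer, var_threshold: float = 1e-6):
    """acts: (batch, d) activations at layer l; next_layer: callable mapping layer l -> l+1."""
    # 1. Drop directions with negligible variance (assumed proxy for "irrelevant directions").
    acts_centered = acts - acts.mean(dim=0, keepdim=True)
    cov = acts_centered.T @ acts_centered / acts.shape[0]
    eigvals, eigvecs = torch.linalg.eigh(cov)
    P = eigvecs[:, eigvals > var_threshold]          # (d, k) projection onto retained subspace

    # 2. Jacobian of the next layer, evaluated at the mean activation (single-point
    #    approximation; the paper's method would estimate this over the dataset).
    x0 = acts.mean(dim=0)
    J = torch.autograd.functional.jacobian(next_layer, x0)   # (d_out, d_in)

    # 3. SVD of the Jacobian restricted to the retained subspace. Right singular vectors
    #    define the new basis; singular values scale directions by downstream importance.
    U, S, Vh = torch.linalg.svd(J @ P, full_matrices=False)
    basis = (P @ Vh.T) * S                           # columns = scaled LIB directions (assumed form)

    # Coordinates of each activation in the new basis, plus the basis itself.
    return acts_centered @ basis, basis


# Hypothetical usage with a toy two-layer MLP:
# next_block = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU())
# lib_acts, lib_basis = local_interaction_basis(layer1_acts, next_block)
```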