Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization (2506.10920v1)

Published 12 Jun 2025 in cs.CL and cs.LG

Abstract: A central goal for mechanistic interpretability has been to identify the right units of analysis in LLMs that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.

Summary

  • The paper proposes an unsupervised SNMF approach that decomposes MLP activations into sparse, human-understandable features influencing language model behavior.
  • It leverages matrix factorization with winner-take-all sparsity to reveal additive neuron contributions, overcoming limitations of sparse autoencoders.
  • Experimental evaluations on Llama, Gemma, and GPT-2 demonstrate enhanced concept detection and robust causal interventions compared to SAE baselines.

This paper introduces a novel unsupervised method for identifying interpretable features within LLMs by decomposing Multi-Layer Perceptron (MLP) activations using Semi-Nonnegative Matrix Factorization (SNMF). The core idea is to find directions in the MLP activation space that correspond to human-understandable concepts and causally influence model behavior. This approach aims to overcome limitations of existing methods like Sparse Autoencoders (SAEs), such as difficulties in causal evaluations and lack of intrinsic interpretability tied to model computations.

Core Method: Decomposing MLP Activations with SNMF

The method focuses on the MLP layers of a transformer model. An MLP layer's output is typically $\mathrm{MLP}(\mathbf{h}) = W_V \sigma(W_K \mathbf{h})$, where $\mathbf{a} = \sigma(W_K \mathbf{h})$ is the vector of neuron activations. The paper proposes collecting these activation vectors $\mathbf{a}_j$ for $n$ input tokens, forming a matrix $A \in \mathbb{R}^{d_a \times n}$, where $d_a$ is the MLP's inner dimension.
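
Below is a minimal sketch of how such an activation matrix could be collected with a forward hook. It assumes a Hugging Face `transformers` GPT-2-style model; the module path (`transformer.h[i].mlp.act`) and the example sentences are illustrative, not the paper's own collection code, and the path differs for Llama 3.1 or Gemma 2.

```python
# Minimal sketch: collecting MLP neuron activations a = sigma(W_K h) with a forward
# hook. Module path follows Hugging Face's GPT-2 implementation; illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # any decoder-only LM with an accessible MLP nonlinearity
layer_idx = 6         # hypothetical choice of layer to decompose

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

activations = []  # one d_a-dimensional vector per token position

def save_activations(module, inputs, output):
    # Output of the MLP's activation function has shape (batch, seq_len, d_a);
    # flatten batch and sequence dimensions into token rows.
    activations.append(output.detach().reshape(-1, output.shape[-1]))

# Hook the nonlinearity inside the MLP so we capture a = sigma(W_K h), not the MLP output.
handle = model.transformer.h[layer_idx].mlp.act.register_forward_hook(save_activations)

sentences = [
    "The concert resonated deeply with the audience.",
    "The police enacted curfews across the city.",
]
with torch.no_grad():
    for s in sentences:
        model(**tok(s, return_tensors="pt"))
handle.remove()

# A has shape (d_a, n): each column is one token's MLP activation vector.
A = torch.cat(activations, dim=0).T.numpy()
```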

SNMF is then used to factorize this activation matrix $A$ into two matrices:

  1. $Z \in \mathbb{R}^{d_a \times k}$: The MLP feature matrix, where each column $\mathbf{z}_i$ is an "MLP feature." These features represent sparse linear combinations of co-activated neurons; $k$ is a hyperparameter setting the number of features.
  2. $Y \in \mathbb{R}_{\geq 0}^{k \times n}$: The nonnegative coefficient matrix. Each entry $Y_{i,j}$ indicates how strongly MLP feature $i$ contributes to reconstructing the activation of token $j$. This nonnegativity ensures an additive, parts-based decomposition.

The factorization approximates $A \approx ZY$.
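
As a toy illustration of the parts-based structure this factorization imposes, the snippet below reconstructs one token's activation as a nonnegative combination of MLP features; all shapes and random values are illustrative, not the paper's settings.

```python
# Toy illustration of A ≈ Z Y and the additive, parts-based reconstruction of a single
# token's activation vector.
import numpy as np

d_a, n, k = 3072, 200, 100               # MLP width, number of tokens, number of features
rng = np.random.default_rng(0)
Z = rng.standard_normal((d_a, k))        # MLP features: signed combinations of neurons
Y = np.abs(rng.standard_normal((k, n)))  # nonnegative feature-to-token coefficients

A_hat = Z @ Y                            # reconstruction of the activation matrix, (d_a, n)

# Token j's activation is a nonnegative (additive) combination of the MLP features.
j = 0
a_j = sum(Y[i, j] * Z[:, i] for i in range(k))
assert np.allclose(a_j, A_hat[:, j])
```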

Implementation of SNMF:

  • Initialization: Entries of $Z$ are drawn from $\mathcal{U}(0, 1)$, and entries of $Y$ from $\mathcal{N}(0, 1)$.
  • Optimization: The Multiplicative Updates scheme from Ding et al. (2010) is used, alternating between a closed-form update for $Z$ and a multiplicative update for $Y$ to minimize $\|A - ZY\|_F^2$:
    • $Z \leftarrow A Y^T (YY^T + \lambda I)^{-1}$
    • $Y \leftarrow Y \odot \sqrt{\frac{[Z^T A]_+ + [Z^T Z]_- Y}{[Z^T A]_- + [Z^T Z]_+ Y}}$, where $[X]_+$ and $[X]_-$ are the element-wise positive and negative parts and $\odot$ is the Hadamard product.
  • Sparsity: A winner-take-all (WTA) operator is applied to the columns of $Z$, keeping only the largest $p\%$ of entries by absolute value (e.g., $p = 1\%$) and zeroing the rest.
  • Normalization: Rows of $Y$ are normalized to unit $\ell_2$ norm, with a corresponding rescaling of the columns of $Z$.
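
The following NumPy sketch puts these pieces together under stated assumptions: the iteration count, $\lambda$, the nonnegative initialization of $Y$, and applying WTA once at the end are guesses rather than the paper's exact settings, and the function name `snmf` is ours.

```python
# Minimal NumPy sketch of the SNMF procedure summarized above: multiplicative updates
# in the style of Ding et al. (2010), winner-take-all (WTA) sparsity, and normalization.
import numpy as np

def snmf(A, k, n_iter=300, lam=1e-4, p=0.01, seed=0):
    rng = np.random.default_rng(seed)
    d_a, n = A.shape
    eps = 1e-9
    Z = rng.uniform(0.0, 1.0, size=(d_a, k))       # unconstrained MLP feature matrix
    Y = np.abs(rng.standard_normal((k, n)))        # coefficients, kept nonnegative (assumption)
    pos = lambda X: np.maximum(X, 0.0)             # element-wise positive part [X]_+
    neg = lambda X: np.maximum(-X, 0.0)            # element-wise negative part [X]_-

    for _ in range(n_iter):
        # Closed-form (ridge-regularized) update:  Z <- A Y^T (Y Y^T + lam I)^{-1}
        Z = A @ Y.T @ np.linalg.inv(Y @ Y.T + lam * np.eye(k))
        # Multiplicative update (preserves nonnegativity of Y):
        # Y <- Y * sqrt( ([Z^T A]_+ + [Z^T Z]_- Y) / ([Z^T A]_- + [Z^T Z]_+ Y) )
        ZtA, ZtZ = Z.T @ A, Z.T @ Z
        Y *= np.sqrt((pos(ZtA) + neg(ZtZ) @ Y) / (neg(ZtA) + pos(ZtZ) @ Y + eps))

    # Winner-take-all sparsity: keep only the top p% of entries (by magnitude) per column of Z.
    keep = max(1, int(np.ceil(p * d_a)))
    for i in range(k):
        col = Z[:, i]
        thresh = np.sort(np.abs(col))[-keep]
        Z[:, i] = np.where(np.abs(col) >= thresh, col, 0.0)

    # Absorb per-feature scales: unit-norm rows of Y, with the scale moved into Z's columns.
    norms = np.linalg.norm(Y, axis=1, keepdims=True) + eps   # shape (k, 1)
    Y /= norms
    Z *= norms.T                                             # rescale column i of Z by ||Y_i||
    return Z, Y

# Usage: Z, Y = snmf(A, k=100); columns of Z are MLP features, rows of Y map them to tokens.
```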

An MLP feature $\mathbf{z}_i$ can be mapped to a residual stream feature $\mathbf{f}_i \in \mathbb{R}^d$ by $\mathbf{f}_i = W_V \mathbf{z}_i$. This $\mathbf{f}_i$ can then be used for causal interventions.
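
A hedged sketch of this mapping and a simple steering intervention is shown below; the GPT-2 module paths and the fixed coefficient `alpha` are illustrative stand-ins (the paper controls steering strength via a target KL divergence, as described later).

```python
# Sketch of mapping an MLP feature into the residual stream (f_i = W_V z_i) and adding
# it during generation. Module paths follow Hugging Face's GPT-2 naming; illustrative only.
import torch

def residual_feature(model, layer_idx, z):
    # For GPT-2's Conv1D, c_proj.weight has shape (d_a, d), so z @ W_V gives f_i in R^d.
    W_V = model.transformer.h[layer_idx].mlp.c_proj.weight
    return (torch.as_tensor(z, dtype=W_V.dtype) @ W_V).detach()

def steer(model, tok, layer_idx, f, alpha=8.0, prompt="I think that", max_new_tokens=20):
    # Add alpha * f to the residual stream at the output of the chosen transformer block.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * f.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=True)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)

# Usage (hypothetical): f = residual_feature(model, 6, Z[:, i]); print(steer(model, tok, 6, f))
```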

Key Advantages over SAEs Highlighted:

  • Intrinsic Interpretability: The coefficient matrix $Y$ directly links learned features to the input tokens that activate them.
  • Grounded in Model Mechanisms: Features are defined as combinations of existing neuron activation patterns, rather than arbitrary directions learned from scratch in the residual stream.

Experimental Evaluation

The SNMF-derived features were evaluated on Llama 3.1, Gemma 2, and GPT-2 Small across two main axes:

  1. Concept Detection:
    • Goal: Measure if features consistently activate for concept-related inputs.
    • Procedure:

    1. Describe concepts for each SNMF feature using GPT-4o-mini, based on the top-activating input contexts identified via the $Y$ matrix.
    2. For each description, generate 5 "activating" sentences and 5 "neutral" sentences.
    3. For each sentence, compute the maximum cosine similarity between the MLP feature $\mathbf{z}_i$ and the token activations $\mathbf{a}_j$.
    4. Compute the Concept Detection score $S_{CD} = \log \frac{\bar{a}_{\text{activating}}}{\bar{a}_{\text{neutral}}}$ (a sketch of this scoring appears after this list).
    • Baselines: Publicly available SAEs trained on MLP outputs (SAE out), and SAEs trained on MLP activations using the same dataset as SNMF (SAE act).
    • Results: SNMF features generally achieved positive $S_{CD}$ scores (over 75% of features), indicating meaningful ties to input concepts. Performance was comparable to or better than SAE out and significantly better than SAE act (which struggled with fewer features and smaller datasets). The high interpretability of the $Y$ matrix was noted as a potential factor in SNMF's strong performance.

    Example Concepts (Llama-3.1-8B):

    | Layer | Concept | Top Activating Contexts |
    | :---- | :------ | :---------------------- |
    | 0 | The word "resonate" and variations | "...nostalgia and innovation that resonates", "Pont Mirabeau... resonating with the..." |
    | 12 | Actions related to implementing/establishing | "...police enacted curfews...", "In JavaScript, establishing a coherent code layout..." |
    | 23 | Historical documentation | "...stone carvings, striving to reveal...", "...rich historical narratives preserved by..." |

  2. Concept Steering:

    • Goal: Evaluate causal influence on model output generation while preserving fluency.
    • Procedure:

    1. Use the prompt "<BOS> I think that".
    2. Amplify the residual stream feature $\mathbf{f}_i = W_V \mathbf{z}_i$ (or its negation $-\mathbf{f}_i$) during inference.
    3. Control steering strength via a target KL divergence between the steered and unsteered logits.
    4. Generate 112 completions per feature (7 KL values × 2 signs × 8 samples each).
    5. Score generations using GPT-4o-mini for "concept expression" and "fluency" (each on a 0-2 scale).
    6. Metrics: the concept score, and the harmonic mean of the concept and fluency scores.
    • Baselines: SAE out, SAE act, and Difference-in-Means (DiffMeans, a supervised method).
    • Results: SNMF consistently outperformed SAEs and often matched or exceeded the supervised DiffMeans baseline. Early layers showed lower final scores due to fluency degradation, even when concept scores were high. SNMF's parts-based decomposition was deemed more robust to noise than DiffMeans.
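
A minimal sketch of the concept detection scoring described above is given here; `collect_activations` is a hypothetical helper returning per-token MLP activations for a sentence (e.g., via the hook shown earlier), and edge cases such as non-positive means are ignored.

```python
# Sketch of the Concept Detection score S_CD: for each sentence, take the maximum cosine
# similarity between an MLP feature z_i and its per-token activations, then compare the
# means over activating vs. neutral sentences.
import numpy as np

def sentence_score(z, acts):
    # acts: (n_tokens, d_a); returns max_j cos(z, a_j)
    sims = (acts @ z) / (np.linalg.norm(acts, axis=1) * np.linalg.norm(z) + 1e-9)
    return sims.max()

def concept_detection_score(z, activating, neutral, collect_activations):
    a_act = np.mean([sentence_score(z, collect_activations(s)) for s in activating])
    a_neu = np.mean([sentence_score(z, collect_activations(s)) for s in neutral])
    return np.log(a_act / a_neu)  # S_CD = log(mean activating / mean neutral)
```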

Analysis of Neuron Compositionality

The paper investigates how neurons combine to form these features:

  1. Feature Merging via Recursive SNMF:

    • Method: Apply SNMF recursively to the learned feature matrix $Z$, with progressively smaller $k$. This is described as finding a hierarchy $A \approx Z_L Y_L \cdots Y_1 Y_0$.
    • Observation: Fine-grained concepts (e.g., "Monday", "Tuesday") merge into more general concepts (e.g., "middle of the week", "day of week"). This "feature merging" is presented as the inverse of "feature splitting" observed in SAEs.
    • Implementation Example (Weekdays on GPT-2 Small):
      • Level 1: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
      • Level 2: Merges into "Weekend", "Mid-week", "Early-week"
      • Level 3: Merges into "Day of Week"
  2. Semantically-Related Features Share Neuron Structures:
    • Method: Binarize $Z$ (top $p\%$ active neurons per feature set to 1, the rest to 0) to obtain $\bar{Z}$, then compute $M = \bar{Z}^T\bar{Z}$, where $M_{i,j}$ is the number of top-activating neurons shared between features $i$ and $j$ (see the sketch after this list).
    • Observation (Weekday example):
      • Weekday features share a "core" set of neurons (representing a general "day" concept).
      • Specific days (e.g., Monday) or groups (e.g., weekend vs. weekday) have additional "exclusive" neurons.
    • Causal Intervention (GPT-2 Large, prompt "I think that"):
      • Amplifying the "core weekday" neurons (shared across all weekday features) increased logits for all weekday tokens.
      • Amplifying "exclusive neurons" for a specific day (e.g., Monday's exclusive neurons) promoted its token (e.g., "Monday") and suppressed other weekday tokens.
      | Neuron group | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
      | :----------- | -----: | ------: | --------: | -------: | -----: | -------: | -----: |
      | Monday (exclusive) | 2.0 | -0.8 | -1.0 | -1.2 | -0.8 | -1.1 | -0.1 |
      | ... | ... | ... | ... | ... | ... | ... | ... |
      | Core weekday | 5.8 | 5.7 | 5.4 | 5.8 | 6.0 | 5.4 | 4.7 |

      (Values are changes in logits; positive promotes, negative suppresses.)
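
The snippet below sketches this neuron-sharing computation; the thresholding convention is an assumption, and recursive feature merging (the previous item) amounts to re-running the same factorization on $Z$ with a smaller $k$.

```python
# Sketch of the neuron-sharing analysis: binarize Z to its top-p% neurons per feature
# and count shared top-activating neurons between pairs of features.
import numpy as np

def shared_neuron_counts(Z, p=0.01):
    d_a, k = Z.shape
    keep = max(1, int(np.ceil(p * d_a)))
    Z_bar = np.zeros_like(Z)
    for i in range(k):
        top = np.argsort(np.abs(Z[:, i]))[-keep:]  # indices of the top-|.| neurons of feature i
        Z_bar[top, i] = 1.0
    # M[i, j] = number of top-activating neurons shared by features i and j.
    return Z_bar.T @ Z_bar

# Large off-diagonal entries (e.g., between "Monday" and "Tuesday" features) would indicate a
# shared "core weekday" neuron set; entries unique to one feature correspond to exclusive neurons.
```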

Conclusions and Implications

SNMF offers a simple yet effective unsupervised method for discovering interpretable features in LLM MLP layers. These features, representing combinations of co-activated neurons, demonstrate strong concept alignment and causal steering capabilities, often outperforming SAEs and a strong supervised baseline (DiffMeans). The analysis reveals a hierarchical and additive composition of concepts within the MLP, where neuron groups act as building blocks. This provides insight into how MLPs construct representations and supports the idea that feature splitting in SAEs might reflect the model's inherent compositional structure.

Practical Considerations:

  • Computational Cost: SNMF involves iterative matrix factorization; the size of the activation matrix $A$ ($d_a \times n$) and the number of features $k$ influence the cost.
  • Hyperparameter Tuning: $k$ (number of features) and $p$ (the WTA sparsity percentage) are the key hyperparameters. The paper explored $k$ up to 400.
  • Dataset for Activations: The quality and diversity of the texts used to collect the activations $A$ can affect the learned features. The paper used 200 sentences for SNMF training.
  • Interpretability Tools: The $Y$ matrix is crucial for linking features to their activating text spans, aiding manual or automated (e.g., LLM-based summarization) interpretation of features.
  • Downstream Applications: Identified features can be used for model steering, understanding concept representation, and potentially debugging or improving models.

The code for this method is released at https://github.com/ordavid-s/snmf-mlp-decomposition.

Limitations Noted:

  • Evaluations were limited to $k < 500$; scalability to thousands of features needs further study.
  • Optimization aspects of SNMF (initialization strategies, update rules like projected gradient descent for regularization) were not exhaustively explored. Different initializations (e.g., K-means, SVD-based) might yield different results.