Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
(2506.10920v1)
Published 12 Jun 2025 in cs.CL and cs.LG
Abstract: A central goal for mechanistic interpretability has been to identify the right units of analysis in LLMs that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.
Summary
The paper proposes an unsupervised SNMF approach that decomposes MLP activations into sparse, human-understandable features influencing language model behavior.
It leverages matrix factorization with winner-take-all sparsity to reveal additive neuron contributions, overcoming limitations of sparse autoencoders.
Experimental evaluations on Llama, Gemma, and GPT-2 demonstrate enhanced concept detection and robust causal interventions compared to SAE baselines.
This paper introduces a novel unsupervised method for identifying interpretable features within LLMs by decomposing Multi-Layer Perceptron (MLP) activations using Semi-Nonnegative Matrix Factorization (SNMF). The core idea is to find directions in the MLP activation space that correspond to human-understandable concepts and causally influence model behavior. This approach aims to overcome limitations of existing methods like Sparse Autoencoders (SAEs), such as difficulties in causal evaluations and lack of intrinsic interpretability tied to model computations.
Core Method: Decomposing MLP Activations with SNMF
The method focuses on the MLP layers of a transformer model. An MLP layer's output is typically MLP(h) = W_V σ(W_K h), where a = σ(W_K h) is the vector of neuron activations. The paper proposes collecting these activation vectors a_j for n input tokens, forming a matrix A ∈ R^(d_a × n), where d_a is the MLP's inner dimension.
SNMF is then used to factorize this activation matrix A into two matrices:
Z ∈ R^(d_a × k): The MLP feature matrix, where each column z_i is an "MLP feature." These features represent sparse linear combinations of co-activated neurons. k is a hyperparameter setting the number of features.
Y ∈ R_≥0^(k × n): The nonnegative coefficient matrix. Each entry Y_(i,j) indicates how strongly MLP feature i contributes to reconstructing the activation of token j. This non-negativity ensures an additive, parts-based decomposition.
The factorization approximates A≈ZY.
Implementation of SNMF:
Initialization: Entries of Z are drawn from U(0,1), and entries of Y from N(0,1).
Optimization: The multiplicative updates scheme of Ding et al. (2010) is used, alternating between a closed-form update for Z and a multiplicative update for Y to minimize ||A − ZY||_F^2:
Z ← A Yᵀ (Y Yᵀ + λI)⁻¹
Y ← Y ⊙ √(([ZᵀA]₊ + [ZᵀZ]₋ Y) / ([ZᵀA]₋ + [ZᵀZ]₊ Y))
where [X]₊ and [X]₋ denote the element-wise positive and negative parts of X, ⊙ is the Hadamard product, and the square root and division are applied element-wise.
Sparsity: A winner-take-all (WTA) operator is applied to columns of Z, keeping only the largest p% of entries by absolute value (e.g., p=1%) and zeroing the rest.
Normalization: Each feature's coefficient vector (a row of Y) is normalized to unit ℓ2 norm, with the corresponding column of Z rescaled so that the product ZY is unchanged.
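The fitting loop above can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the authors' code: the function name, iteration count, ridge term λ, and the nonnegative initialization of Y are assumptions.

```python
import numpy as np

def snmf(A, k, n_iter=200, p=0.01, lam=1e-6, seed=0):
    """Semi-NMF A ≈ Z Y with nonnegative Y and winner-take-all sparsity on Z.

    A: (d_a, n) matrix of MLP activations; k: number of features.
    """
    rng = np.random.default_rng(seed)
    d_a, n = A.shape
    Z = rng.uniform(0.0, 1.0, size=(d_a, k))   # Z ~ U(0, 1)
    Y = np.abs(rng.standard_normal((k, n)))    # Y must start (and stay) nonnegative

    for _ in range(n_iter):
        # Closed-form ridge-regularized update for Z.
        Z = A @ Y.T @ np.linalg.inv(Y @ Y.T + lam * np.eye(k))

        # Winner-take-all: keep only the top p% of entries (by |.|) per column.
        keep = max(1, int(p * d_a))
        cutoff = -np.sort(-np.abs(Z), axis=0)[keep - 1]   # per-column threshold
        Z = np.where(np.abs(Z) >= cutoff, Z, 0.0)

        # Multiplicative update for Y; all factors are nonnegative, so Y stays >= 0.
        ZtA, ZtZ = Z.T @ A, Z.T @ Z
        pos = np.maximum(ZtA, 0) + np.maximum(-ZtZ, 0) @ Y
        neg = np.maximum(-ZtA, 0) + np.maximum(ZtZ, 0) @ Y
        Y = Y * np.sqrt(pos / (neg + 1e-12))

        # Renormalize each feature's coefficients to unit l2 norm and push the
        # scale into the matching column of Z, leaving the product Z Y unchanged.
        scale = np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12
        Y = Y / scale
        Z = Z * scale.T
    return Z, Y
```

With p = 1% and a few hundred iterations this matches the setup described above in spirit; convergence details differ from the paper's implementation.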
An MLP feature z_i can be mapped to a residual stream feature f_i ∈ R^d by f_i = W_V z_i. This f_i can then be used for causal interventions.
Key Advantages over SAEs Highlighted:
Intrinsic Interpretability: The coefficient matrix Y directly links learned features to the input tokens that activate them.
Grounded in Model Mechanisms: Features are defined as combinations of existing neuron activation patterns, rather than arbitrary directions learned from scratch in the residual stream.
Experimental Evaluation
The SNMF-derived features were evaluated on Llama 3.1, Gemma 2, and GPT-2 Small across two main axes:
Concept Detection:
Goal: Measure if features consistently activate for concept-related inputs.
Procedure:
1. Describe concepts for each SNMF feature using GPT-4o-mini based on top-activating input contexts identified via the Y matrix.
2. For each description, generate 5 "activating" sentences and 5 "neutral" sentences.
3. Calculate the maximum cosine similarity between the MLP feature z_i and the token activations a_j within each sentence.
4. Compute the Concept Detection score: S_CD = log(ā_activating / ā_neutral), where ā_activating and ā_neutral are the mean per-sentence maximum similarities over the activating and neutral sentences, respectively.
* Baselines: Publicly available SAEs trained on MLP outputs (SAE out), and SAEs trained on MLP activations using the same dataset as SNMF (SAE act).
* Results: SNMF features generally achieved positive SCD scores (over 75% of features), indicating meaningful ties to input concepts. Performance was comparable to or better than SAE out and significantly better than SAE act (which struggled with fewer features/smaller datasets). The high interpretability of the Y matrix was noted as a potential factor for SNMF's strong performance.
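Steps 3–4 of the procedure can be sketched as follows. This is a minimal numpy illustration; the function name and the per-sentence activation matrices are hypothetical, and both mean similarities are assumed positive so the log is defined.

```python
import numpy as np

def concept_detection_score(z, activating_acts, neutral_acts):
    """S_CD = log(a_activating / a_neutral), where each mean is taken over the
    per-sentence maximum cosine similarity between feature z and the
    sentence's token activations."""
    def max_cos(acts):
        # acts: (d_a, n_tokens) neuron activations for one sentence.
        sims = (z @ acts) / (np.linalg.norm(z) * np.linalg.norm(acts, axis=0) + 1e-12)
        return sims.max()

    a_act = np.mean([max_cos(s) for s in activating_acts])
    a_neu = np.mean([max_cos(s) for s in neutral_acts])
    # Positive score: the feature fires more strongly on concept-related text.
    return float(np.log(a_act / a_neu))
```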
Example Concepts (Llama-3.1-8B):
| Layer | Concept | Top Activating Contexts |
| :---- | :------------------------------------------ | :----------------------------------------------------------------------------------------------- |
| 0 | The word "resonate" and variations | "...nostalgia and innovation that resonates", "Pont Mirabeau... resonating with the..." |
| 12 | Actions related to implementing/establishing | "...police enacted curfews...", "In JavaScript, establishing a coherent code layout..." |
| 23 | Historical documentation | "...stone carvings, striving to reveal...", "...rich historical narratives preserved by..." |
Concept Steering:
Goal: Evaluate causal influence on model output generation while preserving fluency.
Procedure:
1. Use the prompt "<BOS> I think that".
2. Amplify the residual stream feature f_i = W_V z_i (or its negation −f_i) during inference.
3. Strength controlled by target KL divergence between steered and unsteered logits.
4. Generate 112 completions per feature (over 7 KL values, 2 signs, 8 samples each).
5. Score generations using GPT-4o-mini for "concept expression" and "fluency" (0-2 scale).
6. Metrics: Concept score; Harmonic mean of concept and fluency scores.
* Baselines: SAE out, SAE act, Difference-in-Means (DiffMeans, a supervised method).
* Results: SNMF consistently outperformed SAEs and often matched or exceeded the supervised DiffMeans baseline. Early layers showed lower final scores due to fluency degradation, even with high concept scores. SNMF's parts-based decomposition was deemed more robust to noise than DiffMeans.
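The KL-controlled strength in step 3 can be found by simple bisection. In the sketch below, `logits_fn(alpha)` stands in for a forward pass with α·f_i added to the residual stream; the function, search range, and the assumption that KL grows monotonically with α are illustrative, not taken from the paper.

```python
import numpy as np

def kl(p_logits, q_logits):
    """KL(p || q) between the softmax distributions of two logit vectors."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum())

def calibrate_strength(logits_fn, target_kl, alpha_max=100.0, n_steps=30):
    """Bisect over the steering strength alpha until the KL divergence between
    steered and unsteered next-token logits matches target_kl."""
    base = logits_fn(0.0)  # unsteered logits
    lo, hi = 0.0, alpha_max
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if kl(logits_fn(mid), base) < target_kl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With a toy linear `logits_fn`, the returned α reproduces the requested KL closely; in practice `logits_fn` would wrap a hooked model forward pass.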
Analysis of Neuron Compositionality
The paper investigates how neurons combine to form these features:
Feature Merging via Recursive SNMF:
Method: Apply SNMF recursively to the learned feature matrix Z with progressively smaller k, yielding a hierarchy A ≈ Z_L Y_L ⋯ Y_1 Y_0.
Observation: Fine-grained concepts merge into progressively more general ones. This "feature merging" is presented as the inverse of the "feature splitting" observed in SAEs. For the weekday example:
Level 1: Individual day features (e.g., "Monday", "Tuesday")
Level 2: Merge into "Weekend", "Mid-week", "Early-week"
Level 3: Merge into "Day of Week"
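The recursion can be sketched as follows, reusing a minimal (unregularized, non-sparse) semi-NMF for brevity; function names, iteration counts, and the k-schedule are illustrative assumptions.

```python
import numpy as np

def semi_nmf(A, k, n_iter=100, seed=0):
    """Minimal semi-NMF A ≈ Z Y (with Y >= 0) via multiplicative updates."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(size=(A.shape[0], k))
    Y = np.abs(rng.standard_normal((k, A.shape[1])))
    for _ in range(n_iter):
        Z = A @ np.linalg.pinv(Y)  # least-squares update for Z
        ZtA, ZtZ = Z.T @ A, Z.T @ Z
        Y = Y * np.sqrt((np.maximum(ZtA, 0) + np.maximum(-ZtZ, 0) @ Y)
                        / (np.maximum(-ZtA, 0) + np.maximum(ZtZ, 0) @ Y + 1e-12))
    return Z, Y

def recursive_snmf(A, ks):
    """Factorize A ≈ Z_L Y_L ... Y_1 Y_0 by repeatedly re-factorizing the
    feature matrix with a smaller k, coarsening the features at each level."""
    Z, Ys = A, []
    for k in ks:            # e.g. ks = [8, 4, 2]: progressively fewer features
        Z, Y = semi_nmf(Z, k)
        Ys.append(Y)
    return Z, Ys
```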
Semantically-Related Features Share Neuron Structures:
Method: Binarize Z by setting each feature's top p% of active neurons to 1 and the rest to 0, yielding Z̄. Compute M = Z̄Z̄ᵀ, where M_(i,j) is the count of top-activating neurons shared between features i and j.
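With features stored as columns of Z (the convention used earlier), the overlap matrix is a single matrix product; a small numpy sketch, where p plays the same role as the WTA sparsity fraction:

```python
import numpy as np

def neuron_overlap(Z, p=0.01):
    """Count shared top-activating neurons between every pair of features.

    Z: (d_a, k) feature matrix with features as columns. Returns a (k, k)
    matrix M where M[i, j] is the number of top-p% neurons shared by
    features i and j (diagonal entries give each feature's neuron count).
    """
    d_a, _ = Z.shape
    keep = max(1, int(p * d_a))
    Zb = np.zeros_like(Z)
    top = np.argsort(-np.abs(Z), axis=0)[:keep]   # indices of top neurons per feature
    np.put_along_axis(Zb, top, 1.0, axis=0)
    return Zb.T @ Zb
```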
Observation (Weekday example):
Weekday features share a "core" set of neurons (representing a general "day" concept).
Specific days (e.g., Monday) or groups (e.g., weekend vs. weekday) have additional "exclusive" neurons.
Causal Intervention (GPT-2 Large, prompt "I think that"):
Amplifying the "core weekday" neurons (shared across all weekday features) increased logits for all weekday tokens.
Amplifying "exclusive neurons" for a specific day (e.g., Monday's exclusive neurons) promoted its token (e.g., "Monday") and suppressed other weekday tokens.
| Neuron group       | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
| :----------------- | -----: | ------: | --------: | -------: | -----: | -------: | -----: |
| Monday (exclusive) |    2.0 |    -0.8 |      -1.0 |     -1.2 |   -0.8 |     -1.1 |   -0.1 |
| ...                |    ... |     ... |       ... |      ... |    ... |      ... |    ... |
| Core weekday       |    5.8 |     5.7 |       5.4 |      5.8 |    6.0 |      5.4 |    4.7 |

(Values are changes in logits; positive promotes, negative suppresses the token.)
Conclusions and Implications
SNMF offers a simple yet effective unsupervised method for discovering interpretable features in LLM MLP layers. These features, representing combinations of co-activated neurons, demonstrate strong concept alignment and causal steering capabilities, often outperforming SAEs and strong supervised baselines. The analysis reveals a hierarchical and additive composition of concepts within the MLP, where neuron groups act as building blocks. This provides insights into how MLPs construct representations and supports the idea that feature splitting in SAEs might reflect the model's inherent compositional structure.
Practical Considerations:
Computational Cost: SNMF involves matrix factorizations. The size of the activation matrix A (da×n) and the number of features k will influence cost.
Hyperparameter Tuning: k (number of features) and p (sparsity percentage for WTA) are key hyperparameters. The paper explored k up to 400.
Dataset for Activations: The quality and diversity of texts used to collect activations A can impact the learned features. The paper used 200 sentences for SNMF training.
Interpretability Tools: The Y matrix is crucial for linking features to activating text spans, aiding manual or automated (e.g., using LLMs for summarization) interpretation of features.
Downstream Applications: Identified features can be used for model steering, understanding concept representation, and potentially debugging or improving models.
The code for this method is released at https://github.com/ordavid-s/snmf-mlp-decomposition.
Limitations Noted:
Evaluations were limited to k < 500; scalability to thousands of features needs further study.
Optimization aspects of SNMF (initialization strategies, update rules like projected gradient descent for regularization) were not exhaustively explored. Different initializations (e.g., K-means, SVD-based) might yield different results.