- The paper presents a novel Group Bias Adaptation (GBA) algorithm that provably recovers monosemantic features from polysemantic LLM activations.
- It introduces a rigorous statistical framework with ε-identifiability to ensure reliable feature recovery under targeted sparsity and balance conditions.
- Empirical results on Qwen2.5-1.5B show that GBA achieves superior sparsity-loss trade-offs, robustness, and consistent feature discovery.
This paper introduces a novel approach to training Sparse Autoencoders (SAEs) for LLMs with the goal of recovering interpretable, monosemantic features from polysemantic activations. The authors address key limitations of existing SAE training methods, such as the lack of theoretical guarantees for feature recovery, sensitivity to hyperparameters, and training instability.
The core contributions include:
- A new statistical framework for feature recovery, modeling polysemantic activations X as sparse, non-negative linear combinations of underlying monosemantic features (the rows of V) with a non-negative coefficient matrix H, i.e., X ≈ HV. This framework introduces a rigorous notion of feature ϵ-identifiability, accounting for ambiguities like permutation, scaling, and feature splitting.
- A novel SAE training algorithm called Group Bias Adaptation (GBA). GBA directly controls neuron activation sparsity by adaptively adjusting the bias parameters b_m for groups of neurons, aiming to meet pre-defined Target Activation Frequencies (TAFs) for each group.
- Theoretical proof that a simplified version of GBA (Modified Bias Adaptation) can provably recover all true monosemantic features when input data is sampled from their proposed statistical model under specific conditions on network width, bias range, and feature balance.
- Empirical demonstration of GBA's superior performance on LLMs up to 1.5 billion parameters (Qwen2.5-1.5B), achieving a better sparsity-loss trade-off and learning more consistent features across runs compared to L1 regularization and TopK activation methods.
Statistical Framework and Feature Identifiability
The paper models an LLM's internal activation vector x ∈ R^d as a sparse, non-negative linear combination of n monosemantic feature vectors v_i ∈ R^d (the rows of V ∈ R^{n×d}). Stacking N data points row-wise gives X = HV, where H ∈ R^{N×n} is the non-negative coefficient matrix and each row of H is s-sparse. The goal is to recover V.
To address the inherent non-uniqueness of this factorization, ϵ-identifiability is defined. A feature matrix V is ϵ-identifiable if any alternative factorization X = H′V′ implies that V′ is equivalent to V up to permutation, feature splitting (where a feature in V is a positive linear combination of features in V′), and small cosine-similarity deviations bounded by ϵ. Theorem 5.3 states that under certain conditions on H (row-wise sparsity, non-degeneracy, low co-occurrence ρ_2 < n^{-1/2}) and V (incoherence), V is ϵ-identifiable with ϵ = o(1).
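As a concrete illustration, the generative model X = HV can be simulated in a few lines of NumPy. This is only a sketch: the dimensions, sparsity level s, and coefficient distribution below are placeholder choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d, s = 10_000, 512, 128, 4   # data points, features, activation dim, sparsity (placeholders)

# Ground-truth monosemantic features: rows of V, roughly incoherent for large d.
V = rng.normal(size=(n, d)) / np.sqrt(d)

# Sparse, non-negative coefficient matrix H: each row activates s random features.
H = np.zeros((N, n))
for i in range(N):
    support = rng.choice(n, size=s, replace=False)
    H[i, support] = rng.uniform(0.5, 1.5, size=s)   # non-negative coefficients

X = H @ V   # polysemantic activations: each row mixes s monosemantic features
```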
Group Bias Adaptation (GBA) Algorithm
The GBA algorithm aims to overcome the limitations of traditional sparsity-inducing methods like L1 regularization (which causes activation shrinkage) and TopK activation (which can be sensitive to initialization).
Key Ideas:
- Bias Adaptation: Instead of an explicit sparsity penalty in the loss function, GBA directly controls the activation frequency of each neuron. The bias b_m of neuron m (entering the pre-activation y_m = w_m^T (x − b_pre) + b_m) is adjusted periodically.
- If a neuron activates too frequently (actual frequency p_m > TAF p_k), its bias b_m is decreased.
- If a neuron activates too rarely (p_m < ϵ), its bias b_m is increased.
- Neuron Grouping: Neurons are divided into K groups, each assigned a different TAF p_k. TAFs are typically set in an exponentially decaying sequence (e.g., p_1 = 0.1, p_2 = 0.05, …). This allows the SAE to capture features with varying natural occurrence frequencies.
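Setting up the groups and the TAF schedule is straightforward; a minimal sketch is below. The factor-of-two decay, the highest TAF of 0.1, and the even split of neurons are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

M, K, highest_taf = 4096, 10, 0.1          # SAE width, number of groups, highest TAF (placeholders)
tafs = highest_taf * 0.5 ** np.arange(K)   # exponentially decaying TAFs: 0.1, 0.05, 0.025, ...
groups = np.array_split(np.arange(M), K)   # neuron index sets G_1, ..., G_K (even split assumed)
```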
Algorithm 1: Group Bias Adaptation (GBA)
- Input: Data X, initial SAE parameters Θ^(0), neuron groups {G_k, p_k}, optimizer Opt.
- Hyperparameters: Iterations T, batch size L, buffer size B, bias adaptation rates γ_+, γ_−, rarity threshold ϵ.
- Initialize buffers B_m = ∅ for each neuron m.
- For t = 1, …, T:
  a. Sample mini-batch X_t and normalize its rows.
  b. Compute pre-activations y^(t).
  c. Compute the reconstruction loss L^(t).
  d. Update the SAE parameters Θ^(t) (except the biases {b_m}) using Opt.
  e. Add the pre-activations y_m^(t) to the buffers B_m.
  f. If |B_1| ≥ B (buffer full):
     i. Update the biases b^(t) using subroutine A_t(b^(t−1), B) (Algorithm 2).
     ii. Empty all buffers B_m.
- Return the final SAE parameters Θ^(T).
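A condensed PyTorch-style sketch of this loop follows. The `sae` module, data `loader`, optimizer, and the `adapt_biases` helper (sketched after Algorithm 2 below) are assumed stand-ins for the paper's implementation, and the bias adaptation rates are placeholders; it also uses the efficient bookkeeping discussed later, tracking only per-neuron activation counts and maximum pre-activations instead of a full buffer.

```python
import torch
import torch.nn.functional as F

def train_gba(sae, loader, optimizer, groups, tafs, steps_per_adapt=50,
              gamma_plus=5e-3, gamma_minus=5e-3, eps=1e-6):
    """Sketch of Algorithm 1. Assumes sae(x) returns (reconstruction, pre_activations)
    and that the neuron biases sae.b are excluded from `optimizer`; they are changed
    only by the bias-adaptation subroutine (Algorithm 2)."""
    M = sae.b.numel()
    act_count = torch.zeros(M)    # running count of activations per neuron (buffer proxy)
    r_max = torch.zeros(M)        # running max pre-activation r_m per neuron
    seen = 0
    for step, x in enumerate(loader, start=1):
        x = F.normalize(x, dim=-1)                    # (a) normalize rows (L2 assumed here)
        x_hat, pre_acts = sae(x)                      # (b) pre-activations y^(t)
        loss = F.mse_loss(x_hat, x)                   # (c) reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                              # (d) update parameters except biases
        with torch.no_grad():                         # (e) accumulate buffer statistics
            act_count += (pre_acts > 0).float().sum(dim=0)
            r_max = torch.maximum(r_max, pre_acts.max(dim=0).values.clamp(min=0))
            seen += x.shape[0]
        if step % steps_per_adapt == 0:               # (f) buffer "full": adapt biases, reset
            with torch.no_grad():
                adapt_biases(sae.b, act_count / seen, r_max, groups, tafs,
                             gamma_plus, gamma_minus, eps)
            act_count.zero_()
            r_max.zero_()
            seen = 0
    return sae
```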
Algorithm 2: GBA Subroutine At (Bias Adaptation)
- Input: Current biases b, buffers B = {B_m}, groups {G_k, p_k}, hyperparameters γ_+, γ_−, ϵ.
- For each neuron m:
  a. Compute the activation frequency p_m = |B_m|^(−1) Σ_{y ∈ B_m} 1(y > 0).
  b. Compute the maximum pre-activation r_m = max{max_{y ∈ B_m} y, 0}.
- For each group k:
  a. Compute the average maximum pre-activation over the group's active neurons, T_k = (Σ_{m ∈ G_k} 1(r_m > 0))^(−1) Σ_{m ∈ G_k} r_m.
- For each group k and each neuron m ∈ G_k:
  a. If p_m > p_k: b_m ← max{b_m − γ_− r_m, −1}.
  b. If p_m < ϵ: b_m ← min{b_m + γ_+ T_k, 0}.
- Return the updated biases b.
The biases are clamped to [−1, 0]: the upper bound at 0 maintains sparsity, while the lower bound at −1 prevents over-sparsification. An efficient implementation iteratively updates p_m and r_m rather than storing all pre-activations.
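A matching sketch of the bias-adaptation subroutine, written against the accumulated statistics p_m and r_m rather than raw buffers; the function and argument names are illustrative, not the paper's code.

```python
import torch

def adapt_biases(b, p, r, groups, tafs, gamma_plus, gamma_minus, eps):
    """Sketch of Algorithm 2. b: neuron biases (updated in place); p: activation
    frequencies p_m; r: max pre-activations r_m (already clamped at 0);
    groups: per-group neuron index arrays G_k; tafs: target frequencies p_k."""
    for G, p_k in zip(groups, tafs):
        G = torch.as_tensor(G, dtype=torch.long)
        r_g = r[G]
        active = r_g > 0
        # T_k: average max pre-activation over the group's neurons that ever activated.
        T_k = r_g[active].mean() if active.any() else torch.tensor(0.0)
        too_frequent = p[G] > p_k            # firing above the group's TAF
        too_rare = p[G] < eps                # effectively dead neurons
        # Too frequent: push the bias down (never below -1); too rare: lift it (never above 0).
        b[G[too_frequent]] = (b[G[too_frequent]] - gamma_minus * r_g[too_frequent]).clamp(min=-1.0)
        b[G[too_rare]] = (b[G[too_rare]] + gamma_plus * T_k).clamp(max=0.0)
```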
Theoretical Recovery Guarantees
For a simplified "Modified BA" algorithm (a single neuron group, a fixed bias b implying a fixed TAF p = Φ(−b), where Φ denotes the standard Gaussian CDF, a smooth ReLU-like activation, and vanishingly small output scales a_m → 0), Theorem 6.1 establishes a provable feature recovery guarantee.
It states that if:
- The data X=HV is decomposable with i.i.d. Gaussian features V.
- The network width M is sufficiently large: log M / log n ≳ (1 − ϵ)^{-2} h_*^{-2} b^{-2}, where h_* relates to coefficient concentration.
- The bias b lies in a specific range, implying a TAF p with n^{-1} < p < min{n^{-(1+h_*^2)/2}, d·n^{-1}}. This range depends on the superposition regime (d vs. n).
- A "Feature Balance" condition holds (all features appear sufficiently often with sufficiently large coefficients).
Then, Modified BA recovers all monosemantic features v_i with high probability. The proof involves showing good initialization, approximately Gaussian pre-activations via Gaussian conditioning, and analyzing the dynamics of weight alignment using Efron-Stein inequalities.
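Collected in display form (as reconstructed from the conditions above; the exact constants and exponents should be checked against the paper), the width and TAF requirements read:

```latex
\frac{\log M}{\log n} \;\gtrsim\; (1-\epsilon)^{-2}\, h_*^{-2}\, b^{-2},
\qquad
n^{-1} \;<\; p = \Phi(-b) \;<\; \min\!\left\{ n^{-(1+h_*^2)/2},\; \frac{d}{n} \right\}.
```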
Empirical Results
Experiments were conducted on the Qwen2.5-1.5B model, extracting MLP activations from layers 2, 13, and 26 on the Pile GitHub and Wikipedia datasets. JumpReLU was used as the activation function for all methods.
- Sparsity-Loss Frontier: GBA performs comparably to the best TopK (post-activation sparsity) and significantly outperforms L1 regularization and a non-grouped Bias Adaptation (BA) baseline.
- Hyperparameter Robustness: GBA is nearly tuning-free. Its performance is robust to the number of groups K and the specific TAFs, provided the Highest TAF (HTF) is adequately high (e.g., 0.1-0.5) and K is sufficiently large (e.g., 10-20). This is a significant practical advantage.
- Feature Consistency: Measured by Maximum Cosine Similarity (MCS) across runs with different random seeds (a sketch of this metric follows this list), GBA learns significantly more consistent features than TopK. L1 is generally more consistent overall, but GBA surpasses L1 for the most active (top 0.05%) features.
- Feature Analysis: Scatter plots of Z-scores vs. other metrics (max activation, activation fraction, MCS) and a feature dashboard example show that GBA learns sparse, selective, and consistent features. For example, high Z-score GBA neurons often correspond to specific, infrequent concepts and show high MCS.
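The MCS metric can be computed between the decoder (feature) matrices of two runs roughly as follows; this is the standard maximum-cosine-similarity comparison and may differ from the paper's exact evaluation details.

```python
import numpy as np

def max_cosine_similarity(V_a, V_b):
    """For each feature (row) of V_a, the maximum cosine similarity to any feature
    of V_b. V_a, V_b: (num_features, d) decoder matrices from two training runs."""
    A = V_a / np.linalg.norm(V_a, axis=1, keepdims=True)
    B = V_b / np.linalg.norm(V_b, axis=1, keepdims=True)
    return (A @ B.T).max(axis=1)    # higher values = more consistent features across seeds
```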
Implementation Considerations
- Computational Cost: Training SAEs is generally expensive. GBA adds minimal overhead compared to standard SAE training; the bias adaptation step is efficient.
- Buffer Management: The bias adaptation step requires a buffer of pre-activations. The paper suggests updating the biases every 50 gradient steps with the largest batch size the hardware permits. In practice only p_m and r_m need to be tracked, not the full buffer.
- Activation Function: While the theory uses smooth ReLU-like activations, the experiments use JumpReLU, which empirically works well with GBA, especially because it decouples a neuron's output magnitude from its bias (a minimal sketch follows this list).
- Deployment: Trained SAEs can be used to replace MLP layers in LLMs for interpretability or potentially to steer model behavior. The features learned by GBA are more consistent, making them more reliable for downstream interpretability tasks.
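For reference, a minimal JumpReLU (a thresholded identity) is shown below; the learnable threshold parameterization and the straight-through gradient estimator typically used to train it are omitted, so this is only a sketch.

```python
import torch

def jump_relu(y: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: pass y through unchanged where y > theta, zero it elsewhere.
    Unlike a shifted ReLU, surviving activations are not shrunk toward zero,
    which helps decouple a neuron's output magnitude from its bias."""
    return y * (y > theta).to(y.dtype)
```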
Practical Implications
This research offers a more robust and theoretically grounded method for training SAEs.
- Improved Interpretability: By recovering more consistent and potentially more monosemantic features, GBA can enhance our understanding of LLM internal workings.
- Reduced Tuning Effort: The near tuning-free nature of GBA makes it more practical for researchers and practitioners to apply SAEs without extensive hyperparameter searches.
- Reliable Feature Discovery: Higher consistency implies that the features discovered are less likely to be artifacts of random initialization, leading to more trustworthy interpretations.
The paper lays a strong foundation by bridging theoretical understanding and practical application of SAEs, paving the way for more transparent and trustworthy AI systems. Future work includes extending theoretical guarantees to more general settings and using the learned features for model interventions and circuit discovery.