Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders (2506.14002v1)

Published 16 Jun 2025 in cs.LG, cs.AI, cs.IT, math.IT, and stat.ML

Abstract: We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of LLMs. Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. To address these issues, we first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability by modeling polysemantic features as sparse mixtures of underlying monosemantic concepts. Building on this framework, we introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically prove that this algorithm correctly recovers all monosemantic features when input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and demonstrate its superior performance against benchmark methods when applied to LLMs with up to 1.5 billion parameters. This work represents a foundational step in demystifying SAE training by providing the first SAE algorithm with theoretical recovery guarantees, thereby advancing the development of more transparent and trustworthy AI systems through enhanced mechanistic interpretability.

Summary

  • The paper presents a novel Group Bias Adaptation (GBA) algorithm that provably recovers monosemantic features from polysemantic LLM activations.
  • It introduces a rigorous statistical framework with ε-identifiability to ensure reliable feature recovery under targeted sparsity and balance conditions.
  • Empirical results on Qwen2.5-1.5B show that GBA achieves superior sparsity-loss trade-offs, robustness, and consistent feature discovery.

This paper introduces a novel approach to training Sparse Autoencoders (SAEs) for LLMs with the goal of recovering interpretable, monosemantic features from polysemantic activations. The authors address key limitations of existing SAE training methods, such as the lack of theoretical guarantees for feature recovery, sensitivity to hyperparameters, and training instability.

The core contributions include:

  1. A new statistical framework for feature recovery, modeling polysemantic activations as sparse linear combinations of underlying monosemantic features (the rows of $V$) with non-negative coefficients $H$, i.e., $X \approx HV$ for the stacked activation matrix $X$. This framework introduces a rigorous notion of feature $\epsilon$-identifiability, accounting for ambiguities such as permutation, scaling, and feature splitting.
  2. A novel SAE training algorithm called Group Bias Adaptation (GBA). GBA directly controls neuron activation sparsity by adaptively adjusting the bias parameters $b_m$ for groups of neurons, aiming to meet pre-defined Target Activation Frequencies (TAFs) for each group.
  3. A theoretical proof that a simplified version of GBA (Modified Bias Adaptation) recovers all true monosemantic features when the input data is sampled from the proposed statistical model, under specific conditions on network width, bias range, and feature balance.
  4. Empirical demonstration of GBA's superior performance on LLMs up to 1.5 billion parameters (Qwen2.5-1.5B), achieving a better sparsity-loss trade-off and learning more consistent features across runs compared to L1 regularization and TopK activation methods.

Statistical Framework and Feature Identifiability

The paper models an LLM's internal activation vector $x \in \mathbb{R}^d$ as a sparse, non-negative linear combination of $n$ monosemantic feature vectors $v_i \in \mathbb{R}^d$ (the rows of $V \in \mathbb{R}^{n \times d}$). Stacking $N$ such activations row-wise into $X \in \mathbb{R}^{N \times d}$ gives $X = HV$, where $H \in \mathbb{R}^{N \times n}$ is the non-negative coefficient matrix with each row $s$-sparse. The goal is to recover $V$.

To address the inherent non-uniqueness of this factorization, $\epsilon$-identifiability is defined. A feature matrix $V$ is $\epsilon$-identifiable if any alternative factorization $X = H'V'$ implies that $V'$ is equivalent to $V$ up to permutation, feature splitting (where a feature in $V$ is a positive linear combination of features in $V'$), and small cosine-similarity deviations bounded by $\epsilon$. Theorem 5.3 states that under certain conditions on $H$ (row-wise sparsity, non-degeneracy, low co-occurrence $\rho^2 < n^{-1/2}$) and on $V$ (incoherence), $V$ is $\epsilon$-identifiable with $\epsilon = o(1)$.
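
As a concrete illustration of this data model, the following minimal sketch (all sizes and coefficient ranges are illustrative assumptions, not values from the paper) draws unit-norm feature directions $V$, sparse non-negative coefficients $H$, and forms $X = HV$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N, s = 512, 256, 10_000, 8       # features, activation dim, samples, row sparsity (assumed)

# Monosemantic feature directions: rows of V, normalized to unit norm.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Sparse, non-negative coefficients: each row of H has exactly s active entries.
H = np.zeros((N, n))
for i in range(N):
    support = rng.choice(n, size=s, replace=False)
    H[i, support] = rng.uniform(0.5, 1.5, size=s)

X = H @ V                              # synthetic "polysemantic" activations for an SAE
```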

Group Bias Adaptation (GBA) Algorithm

The GBA algorithm aims to overcome the limitations of traditional sparsity-inducing methods like L1 regularization (which causes activation shrinkage) and TopK activation (which can be sensitive to initialization).

Key Ideas:

  1. Bias Adaptation: Instead of an explicit sparsity penalty in the loss function, GBA directly controls the activation frequency of each neuron. The bias $b_m$ of a neuron $m$ (entering the pre-activation $y_m = w_m^\top (x - b_{pre}) + b_m$) is adjusted periodically.
    • If a neuron activates too frequently (actual frequency $p_m >$ TAF $p_k$), its bias $b_m$ is decreased.
    • If a neuron activates too rarely ($p_m < \epsilon$), its bias $b_m$ is increased.
  2. Neuron Grouping: Neurons are divided into $K$ groups, each assigned a different TAF $p_k$. TAFs are typically set in an exponentially decaying sequence (e.g., $p_1 = 0.1, p_2 = 0.05, \dots$), allowing the SAE to capture features with varying natural occurrence frequencies (see the sketch after this list).
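
A minimal sketch of this grouping scheme (the width, number of groups, highest TAF, and decay factor below are assumed for illustration):

```python
import numpy as np

M, K = 4096, 10                              # SAE width and number of neuron groups (assumed)
highest_taf, decay = 0.1, 0.5                # highest TAF and geometric decay factor (assumed)

# Exponentially decaying target activation frequencies, one per group.
tafs = highest_taf * decay ** np.arange(K)   # [0.1, 0.05, 0.025, ...]

# Assign neurons to equal-sized contiguous groups and look up each neuron's TAF.
group_of = np.repeat(np.arange(K), M // K)
target_freq = tafs[group_of]                 # per-neuron target p_k
```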

Algorithm 1: Group Bias Adaptation (GBA)

  1. Input: Data $X$, initial SAE parameters $\Theta^{(0)}$, neuron groups $\{G_k, p_k\}$, optimizer Opt.
  2. Hyperparameters: Iterations $T$, batch size $L$, buffer size $B$, bias adaptation rates $\gamma_+, \gamma_-$, rarity threshold $\epsilon$.
  3. Initialize buffers $B_m = \emptyset$ for each neuron $m$.
  4. For $t = 1, \dots, T$:
     a. Sample mini-batch $X_t$ and normalize its rows.
     b. Compute pre-activations $y^{(t)}$.
     c. Compute the reconstruction loss $\mathcal{L}^{(t)}$.
     d. Update the SAE parameters $\Theta^{(t)}$ (except the biases $\{b_m\}$) using Opt.
     e. Append the pre-activations $y_m^{(t)}$ to the buffers $B_m$.
     f. If $|B_1| \ge B$ (buffer full):
        i. Update the biases $b^{(t)}$ via the subroutine $A_t(b^{(t-1)}, B)$ (Algorithm 2).
        ii. Empty all buffers $B_m$.
  5. Return the final SAE parameters $\Theta^{(T)}$.
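
The listing below is a condensed PyTorch-style sketch of this loop under assumed shapes and a plain ReLU encoder; `data_loader`, `BUFFER_SIZE`, `group_of`, `tafs`, and `adapt_biases` (a stand-in for Algorithm 2, sketched further below) are placeholders, not names from the paper:

```python
import torch

d, M = 256, 4096                                   # activation dim and SAE width (assumed)
W_enc = torch.nn.Parameter(torch.randn(M, d) * d ** -0.5)
W_dec = torch.nn.Parameter(torch.randn(d, M) * M ** -0.5)
b_pre = torch.nn.Parameter(torch.zeros(d))
b = torch.zeros(M)                                 # neuron biases: updated only by bias adaptation
opt = torch.optim.Adam([W_enc, W_dec, b_pre], lr=1e-3)

p_count = torch.zeros(M)                           # running activation counts (stands in for the buffer)
r_max = torch.zeros(M)                             # running max pre-activation per neuron
seen = 0

for x in data_loader:                              # x: (L, d) mini-batch of LLM activations
    x = x / x.norm(dim=1, keepdim=True)            # (a) normalize rows
    y = (x - b_pre) @ W_enc.T + b                  # (b) pre-activations
    a = torch.relu(y)                              # (the paper's experiments use JumpReLU instead)
    x_hat = a @ W_dec.T + b_pre
    loss = ((x - x_hat) ** 2).mean()               # (c) reconstruction loss only, no sparsity penalty
    opt.zero_grad(); loss.backward(); opt.step()   # (d) update all parameters except the biases b

    with torch.no_grad():                          # (e) accumulate buffer statistics
        p_count += (y > 0).float().sum(dim=0)
        r_max = torch.maximum(r_max, y.max(dim=0).values)
        seen += x.shape[0]
        if seen >= BUFFER_SIZE:                    # (f) buffer full: run Algorithm 2, then reset
            b = adapt_biases(b, p_count / seen, r_max, group_of, tafs)
            p_count.zero_(); r_max.zero_(); seen = 0
```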

Algorithm 2: GBA Subroutine $A_t$ (Bias Adaptation)

  1. Input: Current biases $b$, buffers $B = \{B_m\}$, groups $\{G_k, p_k\}$, hyperparameters $\gamma_+, \gamma_-, \epsilon$.
  2. For each neuron $m$:
     a. Compute the activation frequency $p_m = |B_m|^{-1} \sum_{y \in B_m} \mathbf{1}(y > 0)$.
     b. Compute the maximum pre-activation $r_m = \max\{\max_{y \in B_m} y,\, 0\}$.
  3. For each group $k$: compute the average maximum pre-activation over the group's active neurons, $T_k = \left(\sum_{m \in G_k} \mathbf{1}(r_m > 0)\right)^{-1} \sum_{m \in G_k} r_m$.
  4. For each group $k$ and each neuron $m \in G_k$:
     a. If $p_m > p_k$: $b_m \leftarrow \max\{b_m - \gamma_- r_m,\, -1\}$.
     b. If $p_m < \epsilon$: $b_m \leftarrow \min\{b_m + \gamma_+ T_k,\, 0\}$.
  5. Return the updated biases $b$.

The biases are clamped to $[-1, 0]$ to maintain sparsity while preventing over-sparsification. An efficient implementation tracks $p_m$ and $r_m$ with running updates instead of storing all pre-activations.
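
A minimal sketch of this subroutine, written against the running statistics $p_m$ and $r_m$ rather than a stored buffer (the step sizes and rarity threshold are assumed values):

```python
import torch

def adapt_biases(b, p, r, group_of, tafs,
                 gamma_plus=0.01, gamma_minus=0.01, eps=1e-4):
    """One bias-adaptation step in the spirit of Algorithm 2 (a sketch, not the exact implementation).

    b: (M,) current biases; p: (M,) observed activation frequencies;
    r: (M,) max pre-activations over the buffer; group_of: (M,) integer group index;
    tafs: (K,) target activation frequencies, one per group.
    """
    b = b.clone()
    for k, p_k in enumerate(tafs):
        in_group = torch.as_tensor(group_of) == k
        active = in_group & (r > 0)
        # T_k: average max pre-activation over the group's active neurons.
        T_k = r[active].mean() if active.any() else torch.tensor(0.0)
        too_frequent = in_group & (p > p_k)
        too_rare = in_group & (p < eps)
        # Too frequent: lower the bias, clamped below at -1.
        b[too_frequent] = torch.clamp(b[too_frequent] - gamma_minus * r[too_frequent], min=-1.0)
        # Too rare: raise the bias, clamped above at 0.
        b[too_rare] = torch.clamp(b[too_rare] + gamma_plus * T_k, max=0.0)
    return b
```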

Theoretical Recovery Guarantees

For a simplified "Modified BA" algorithm (single neuron group, fixed bias bb implying a fixed TAF p=Φ(b)p = \Phi(-b), smooth ReLU-like activation, and vanishingly small output scales am0a_m \to 0), Theorem 6.1 provides provable feature recovery. It states that if:

  • The data X=HVX=HV is decomposable with i.i.d. Gaussian features VV.
  • Network width MM is sufficiently large: logM/logn(1ϵ)2h2b2\log M / \log n \gtrsim (1-\epsilon)^{-2} h_*^{-2} b^{-2}, where hh_* relates to coefficient concentration.
  • Bias bb is in a specific range, implying a TAF pp such that n1<p<min{n(1+h2)/2,dn1}n^{-1} < p < \min\{n^{-(1+h_*^2)/2}, dn^{-1}\}. This range depends on the superposition regime (dd vs nn).
  • A "Feature Balance" condition holds (all features appear sufficiently often with sufficiently large coefficients).

Then, Modified BA recovers all monosemantic features $v_i$ with high probability. The proof establishes a good initialization, shows that pre-activations are approximately Gaussian via Gaussian conditioning, and analyzes the dynamics of weight alignment using Efron-Stein inequalities.
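
As a rough numeric illustration of the admissible TAF window (the values of $n$, $d$, and $h_*$ below are assumptions chosen only to make the arithmetic concrete):

```python
n, d, h_star = 10_000, 2_000, 0.5        # assumed: number of features, activation dim, concentration h_*

lower = 1.0 / n                          # the TAF must exceed 1/n
upper = min(n ** (-(1 + h_star ** 2) / 2), d / n)
print(f"admissible TAF range: ({lower:.1e}, {upper:.1e})")
# Here lower = 1.0e-04 and upper = min(10^-2.5, 0.2) ≈ 3.2e-03, so any target
# activation frequency in roughly (1e-4, 3.2e-3) satisfies the stated condition.
```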

Empirical Results

Experiments were conducted on the Qwen2.5-1.5B model, extracting MLP activations from layers 2, 13, and 26 on the Pile GitHub and Wikipedia datasets. JumpReLU was used as the activation function for all methods.
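
For reference, JumpReLU passes a pre-activation through unchanged only when it exceeds a threshold and outputs zero otherwise; a minimal sketch (the per-neuron threshold handling is simplified relative to published JumpReLU variants):

```python
import torch

def jump_relu(y: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: keep y where y > theta, zero elsewhere (no shrinkage of the kept values)."""
    return y * (y > theta).float()
```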

  • Sparsity-Loss Frontier: GBA performs comparably to the best TopK (post-activation sparsity) and significantly outperforms L1 regularization and a non-grouped Bias Adaptation (BA) baseline.
  • Hyperparameter Robustness: GBA is nearly tuning-free. Its performance is robust to the number of groups $K$ and to the specific TAFs, provided the highest TAF (HTF) is adequately high (e.g., 0.1-0.5) and $K$ is sufficiently large (e.g., 10-20). This is a significant practical advantage.
  • Feature Consistency: Measured by the Maximum Cosine Similarity (MCS) of features across runs with different random seeds, GBA learns significantly more consistent features than TopK. L1 is generally more consistent overall, but GBA surpasses L1 for the most active (top 0.05%) features. (A sketch of the MCS computation follows this list.)
  • Feature Analysis: Scatter plots of Z-scores vs. other metrics (max activation, activation fraction, MCS) and a feature dashboard example show that GBA learns sparse, selective, and consistent features. For example, high Z-score GBA neurons often correspond to specific, infrequent concepts and show high MCS.
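
A minimal sketch of the consistency metric, assuming `V1` and `V2` hold the decoder feature directions (one feature per row) from two independently seeded runs:

```python
import numpy as np

def max_cosine_similarity(V1: np.ndarray, V2: np.ndarray) -> np.ndarray:
    """For each feature (row) of V1, its best cosine match among the features of V2."""
    A = V1 / np.linalg.norm(V1, axis=1, keepdims=True)
    B = V2 / np.linalg.norm(V2, axis=1, keepdims=True)
    return (A @ B.T).max(axis=1)          # shape: (number of features in run 1,)
```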

Implementation Considerations

  • Computational Cost: Training SAEs is generally expensive. GBA adds minimal overhead compared to standard SAE training; the bias adaptation step is efficient.
  • Buffer Management: The bias adaptation step requires a buffer of pre-activations. The paper suggests updating biases every 50 gradient steps with the largest batch size the hardware permits. In practice, only $p_m$ and $r_m$ need to be tracked, not the full buffer.
  • Activation Function: While theory uses smooth ReLU-like functions, experiments use JumpReLU, which empirically works well with GBA, especially for decoupling neuron output magnitude from its bias.
  • Deployment: Trained SAEs can be used to replace MLP layers in LLMs for interpretability or potentially to steer model behavior. The features learned by GBA are more consistent, making them more reliable for downstream interpretability tasks.

Practical Implications

This research offers a more robust and theoretically grounded method for training SAEs.

  • Improved Interpretability: By recovering more consistent and potentially more monosemantic features, GBA can enhance our understanding of LLM internal workings.
  • Reduced Tuning Effort: The near tuning-free nature of GBA makes it more practical for researchers and practitioners to apply SAEs without extensive hyperparameter searches.
  • Reliable Feature Discovery: Higher consistency implies that the features discovered are less likely to be artifacts of random initialization, leading to more trustworthy interpretations.

The paper lays a strong foundation by bridging theoretical understanding and practical application of SAEs, paving the way for more transparent and trustworthy AI systems. Future work includes extending theoretical guarantees to more general settings and using the learned features for model interventions and circuit discovery.