Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders (2505.11756v1)
Abstract: It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is more narrow than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Our work shows there remain fundamental issues with SAEs, but we are hopeful that highlighting feature hedging will catalyze future advances that allow SAEs to achieve their full potential of interpreting LLMs at scale.
Summary
- The paper introduces feature hedging, where narrow Sparse Autoencoders mix correlated features and degrade the interpretability of LLM activations.
- It demonstrates that hedging arises from MSE reconstruction loss in conditions with correlated and hierarchical feature activations, quantified via a novel hedging degree metric.
- The study proposes balanced Matryoshka SAEs to counteract both hedging and absorption, yielding improved feature separation through controlled experiments.
This paper introduces "feature hedging," a phenomenon in Sparse Autoencoders (SAEs) that can degrade their ability to learn monosemantic (interpretable) features from LLM activations. Feature hedging occurs when an SAE is narrower (has fewer latent dimensions) than the number of "true" underlying features in the data, and these true features are correlated. In such cases, the SAE's reconstruction loss incentivizes its latents to represent a mixture of correlated features rather than a single, pure feature. This is problematic because a core goal of SAEs is to decompose polysemantic LLM activations into distinct, interpretable components.
The authors differentiate feature hedging from a previously identified issue, "feature absorption."
| Aspect | Feature Hedging | Feature Absorption |
|---|---|---|
| Effect | Mixes correlated features into latents | Learns gerrymandered latents (general feature suppressed by specific one) |
| Cause | MSE reconstruction loss | Sparsity loss |
| Feature representation | Both features partially mixed in one latent | Specific feature fully represented; the more general feature only partially |
| Encoder/decoder impact | Affects encoder and decoder symmetrically | Affects encoder and decoder asymmetrically |
| SAE width | Gets worse the narrower the SAE | Gets worse the wider the SAE |
| Feature requirement | Requires only correlation between features | Requires hierarchical features (e.g., parent/child) |
Given that LLM SAEs are almost always narrower than the true number of underlying features and that features in LLMs are likely highly correlated, the paper argues that feature hedging is a prevalent issue.
Studying Hedging in Toy Models
The paper first investigates feature hedging in a controlled setting using a single-latent SAE trained on activations generated from two true features (f1, f2) in ℝ^50; a minimal code sketch of the hierarchical case follows the list below.
- Independent Features: If f1 and f2 fire independently, the SAE latent correctly learns f1. However, the decoder bias b_dec incorrectly learns a component of f2, scaled by its firing probability. This suggests b_dec can be seen as tracking an always-on feature.
- Hierarchical Features: If f2 only fires when f1 fires (f2⟹f1), the single SAE latent exhibits hedging by merging a component of f2 into its representation of f1. This mixing occurs symmetrically in both the encoder and decoder. Increasing L1 penalty doesn't solve this because adding a positive component of f2 (which only co-occurs with f1) to the encoder doesn't increase the latent's firing frequency.
- Positively Correlated Features: If f2 is more likely to fire with f1 but can also fire independently, hedging occurs with low L1 penalty. A sufficiently high L1 penalty can mitigate hedging if the correlation is low, as hedging would slightly increase the L0 norm (number of active latents). A full-width SAE (number of latents = number of true features) learns true features despite correlation.
- Anti-correlated Features: If f2 is less likely to fire with f1 than on its own, the SAE latent merges a negative component of f2. Increasing L1 penalty doesn't help, as this negative component doesn't increase L0. Again, a full-width SAE resolves this.
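To make the hierarchical case concrete, here is a minimal training sketch (not the paper's code; the feature directions, firing probabilities, learning rate, and L1 coefficient are illustrative assumptions):

```python
import torch

# Toy hierarchical setup: f1 fires with p = 0.25, f2 fires only when f1 does.
torch.manual_seed(0)
d, n_steps, batch = 50, 5000, 512
f1, f2 = torch.zeros(d), torch.zeros(d)
f1[0], f2[1] = 1.0, 1.0                      # orthonormal for an easy readout

# Encoder and decoder initialized identically with norm 0.1, as in the paper.
w0 = torch.randn(1, d)
w0 = 0.1 * w0 / w0.norm()
W_enc = w0.T.clone().requires_grad_(True)    # (d, 1)
W_dec = w0.clone().requires_grad_(True)      # (1, d)
b_enc = torch.zeros(1, requires_grad=True)
b_dec = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([W_enc, W_dec, b_enc, b_dec], lr=1e-3)

for _ in range(n_steps):
    parent = (torch.rand(batch, 1) < 0.25).float()
    child = parent * (torch.rand(batch, 1) < 0.5).float()   # f2 only fires with f1
    x = parent * f1 + child * f2
    acts = torch.relu((x - b_dec) @ W_enc + b_enc)           # single latent
    recon = acts @ W_dec + b_dec
    mse = ((recon - x) ** 2).sum(-1).mean()
    l1 = acts.abs().sum(-1).mean()
    loss = mse + 3e-4 * l1
    opt.zero_grad(); loss.backward(); opt.step()

dec = W_dec[0] / W_dec[0].norm()
print("decoder component on f1:", round((dec @ f1).item(), 3))
print("decoder component on f2:", round((dec @ f2).item(), 3))  # > 0 indicates hedging
```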
The authors demonstrate that hedging is driven by the MSE reconstruction loss. By analyzing the loss landscape for a single-latent tied SAE with hierarchical features, parameterized as l = α·f2 + (1−α)·f1, they show the loss minimum occurs for 0 < α < 1, indicating a hedged solution (a mix of f1 and f2) is preferred over representing f1 purely (α = 0).
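The same conclusion can be checked numerically. The sketch below simplifies the setup (orthonormal features, no biases, an unnormalized tied direction l = (1−α)·f1 + α·f2, and assumed firing probabilities) and sweeps α to locate the expected-MSE minimum:

```python
import numpy as np

p_parent = 0.25             # P(f1 fires)
p_child_given_parent = 0.5  # P(f2 fires | f1 fires); f2 never fires alone

def expected_mse(alpha):
    l = np.array([1 - alpha, alpha])     # tied direction in the (f1, f2) plane
    x_parent_only = np.array([1.0, 0.0])
    x_both = np.array([1.0, 1.0])
    def err(x):
        act = max(float(l @ x), 0.0)     # ReLU activation of the single latent
        return float(np.sum((x - act * l) ** 2))
    q_parent_only = p_parent * (1 - p_child_given_parent)
    q_both = p_parent * p_child_given_parent
    return q_parent_only * err(x_parent_only) + q_both * err(x_both)  # x = 0 adds no error

alphas = np.linspace(0.0, 1.0, 1001)
losses = np.array([expected_mse(a) for a in alphas])
print("MSE-optimal alpha:", alphas[losses.argmin()])
```

For these probabilities the minimizer lands strictly inside (0, 1), i.e., the MSE-optimal single latent is a hedged mixture rather than pure f1.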
Quantifying Hedging in LLM SAEs
To measure hedging in SAEs trained on LLMs, the paper introduces the "hedging degree" metric, h. The intuition is that if hedging occurs, adding a new latent to an SAE should cause existing latents to "shed" components of the feature now captured by the new latent. The metric measures this change in the existing decoder latents (the decoder is used rather than the encoder to distinguish hedging from absorption, which primarily affects the encoder).
The hedging degree h is calculated as follows:
- Train an SAE s0 with L latents.
- Create s1 by adding N new latents to s0 and continue training. Also, continue training s0 (without new latents) on the same data to get s0′.
- Let W_dec^{s0′}[0:L] be the decoder weights of the original L latents in s0′.
- Let W_dec^{s1}[0:L] be the decoder weights of the original L latents in s1.
- Let W_dec^{s1}[L:L+N] be the decoder weights of the N new latents in s1.
- Calculate the difference in the original L latents: δ_L = W_dec^{s1}[0:L] − W_dec^{s0′}[0:L].
- For each original latent i, the hedging contribution h_i is the magnitude of δ_L[i] projected onto the subspace spanned by the N new latents, minus its projection onto N random latents (to account for noise):
  h_i = ||Proj(δ_L[i], W_dec^{s1}[L:L+N])|| − ||Proj(δ_L[i], W_rand[0:N])||
- The total hedging degree h is the sum of h_i over the L original latents. A value h > 0 indicates hedging beyond random noise. The paper uses N = 64; a minimal code sketch of this computation follows.
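A minimal sketch of this computation, assuming decoder matrices with one row per latent and using random Gaussian directions as the noise baseline (an illustrative choice, not necessarily the paper's exact procedure):

```python
import numpy as np

def hedging_degree(W_dec_s0_prime, W_dec_s1, n_new, seed=0):
    """W_dec_s0_prime: (L, d) decoder of the continued original SAE s0'.
    W_dec_s1: (L + n_new, d) decoder of the extended SAE s1."""
    L = W_dec_s0_prime.shape[0]
    delta = W_dec_s1[:L] - W_dec_s0_prime          # change in the original L latents
    new_dirs = W_dec_s1[L:L + n_new]               # decoder rows of the N new latents
    rand_dirs = np.random.default_rng(seed).standard_normal(new_dirs.shape)

    def proj_norm(v, basis):
        # Norm of v projected onto the subspace spanned by the rows of `basis`.
        Q, _ = np.linalg.qr(basis.T)               # orthonormal basis of that subspace
        return np.linalg.norm(Q.T @ v)

    # h_i = ||Proj(delta_i, new latents)|| - ||Proj(delta_i, random directions)||
    return sum(proj_norm(delta[i], new_dirs) - proj_norm(delta[i], rand_dirs)
               for i in range(L))
```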
Experiments on Gemma-2-2b and Llama-3.2-1b SAEs show:
- Hedging degree is significantly higher for narrower SAEs (e.g., width ≤4096).
- Even at widths up to 65536, hedging doesn't reach zero.
- L0 (average number of active latents) has a minor effect, with very low L0 slightly increasing hedging for BatchTopK SAEs.
- LLM layer doesn't have a massive impact on hedging degree.
- BatchTopK SAEs tend to have more hedging than L1 SAEs, possibly because L1 loss can mitigate hedging from positively correlated features.
Case Study: Adding a New Latent
A case study is presented where an L1 SAE (width 8192, Gemma-2-2b layer 12) is trained, then one new latent is added, and training continues.
- The new latent (latent 8192) appears to fire on CSS scripts in HTML.
- The existing latent (latent 3094) that lost the largest component projecting onto this new latent originally tracked the "rel" HTML attribute, often used for linking CSS scripts. This illustrates hedging: latent 3094 was likely hedging by incorporating aspects of "CSS scripts" (a child concept of "rel" attribute usage) before the new, more specific latent was introduced.
Balancing Hedging and Absorption in Matryoshka SAEs
Matryoshka SAEs were proposed to combat feature absorption by using nested SAE loss terms, forcing inner (narrower) levels to reconstruct inputs. However, these narrow inner levels are prone to feature hedging. Thus, Matryoshka SAEs trade absorption for hedging.
The paper notes that for hierarchical features, hedging adds a positive component of child features to the parent encoder latent, while absorption adds a negative component. These opposing effects could potentially cancel out. The authors propose the "balance matryoshka SAE," which modifies the Matryoshka SAE loss:
L = Σ_{m∈M} β_m (||a − â_m||² + λ·S_m) + α·L_aux
where β_m ≥ 0 is a scaling coefficient for the loss of each matryoshka level m, a is the input activation, â_m is the reconstruction from level m, S_m is the sparsity penalty for level m, and L_aux is an auxiliary loss.
- If all β_m = 1, this reduces to a standard Matryoshka SAE.
- If β_m = 0 for all but the outermost level, this reduces to a standard SAE.
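A sketch of this weighted loss, assuming an L1 sparsity term for S_m and omitting the auxiliary term α·L_aux; the level widths and β schedule shown are illustrative values matching the geometric multiplier used in the LLM experiments described below:

```python
import torch

def balance_matryoshka_loss(x, acts, W_dec, b_dec, prefix_sizes, betas, lam):
    """x: (batch, d) inputs; acts: (batch, n_latents) latent activations;
    W_dec: (n_latents, d). Each matryoshka prefix must reconstruct x on its own."""
    loss = x.new_zeros(())
    for m, beta in zip(prefix_sizes, betas):
        recon_m = acts[:, :m] @ W_dec[:m] + b_dec          # reconstruction from level m
        mse_m = ((recon_m - x) ** 2).sum(-1).mean()
        sparsity_m = acts[:, :m].abs().sum(-1).mean()      # S_m (L1 here for simplicity)
        loss = loss + beta * (mse_m + lam * sparsity_m)
    return loss

# Geometric beta schedule: beta_m / beta_{m+1} = 0.75, outermost level fixed at 1.
prefix_sizes = [2048, 4096, 8192, 16384, 32768]   # inner widths are illustrative
multiplier = 0.75
betas = [multiplier ** (len(prefix_sizes) - 1 - i) for i in range(len(prefix_sizes))]
# Setting all betas to 1 recovers a standard matryoshka SAE; zeroing all but the
# outermost beta recovers a standard SAE.
```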
Toy Model Demonstration:
A toy model with 4 hierarchical features (1 parent, 3 children) and a Matryoshka SAE with one inner level (1 latent) is used.
- High β (for the inner level) leads to hedging (positive child components in parent encoder).
- Low β (effectively a standard SAE) leads to absorption (negative child components).
- A balanced β (e.g., 0.25 in the experiment) results in these effects largely cancelling, yielding a near-perfect representation.
LLM Balance Matryoshka SAEs:
Experiments on Gemma-2-2b layer 12 with BatchTopK Matryoshka SAEs (5 levels, outermost width 32768) vary the multiplier β_m/β_{m+1} (with the outermost β_5 = 1).
- A multiplier around 0.75 generally improves performance on metrics like Targeted Probe Perturbation (TPP), feature splitting, and k-sparse probing (including a custom Parts of Speech dataset) compared to standard Matryoshka SAEs (multiplier=1) or standard SAEs (multiplier=0).
- This balanced approach still performs well on absorption metrics. The paper acknowledges that a single multiplier might not perfectly balance hedging and absorption for all features, especially if child features have varying firing probabilities.
Implementation Considerations
- SAE Training: Standard SAE training procedures are used (e.g., SAELens library, Adam optimizer, L1 penalty for L1 SAEs, TopK for BatchTopK SAEs).
- Initialization: Encoder and decoder weights identical, latents with norm 0.1.
- L1 SAEs: Learning rate 7e-5, L1 warm-up.
- BatchTopK SAEs: Learning rate 3e-4.
- Hedging Degree Calculation: Requires training two SAEs (original and extended) for a substantial number of tokens after the extension point. This can be computationally intensive.
- When extending L1 SAEs, especially with a high L1 penalty, new latents can die. A re-warm-up of the L1 penalty (with a floor at a minimum λ_min) is used to prevent this while minimizing disturbance to existing latents (see the sketch after this list).
- Balance Matryoshka SAEs:
- The primary change is the introduction of βm coefficients in the loss function. This requires modifying the training loop to weight the reconstruction and sparsity losses for each Matryoshka prefix.
- Finding optimal βm values (or a common multiplier) likely requires hyperparameter tuning based on downstream interpretability or performance metrics. The paper explores a geometric progression for βm.
- Computational Requirements: Training large SAEs (e.g., >65k latents) on substantial data (e.g., 250M-500M tokens) requires significant GPU resources (e.g., H100).
- Potential Limitations of the Study:
- Experiments are limited to SAEs up to 65k latents and LLMs up to ~2B parameters.
- The hedging degree metric is computationally expensive.
- The balancing in Matryoshka SAEs might not be perfect for all features simultaneously with a single set of βm coefficients, suggesting per-latent or more adaptive balancing schemes as future work.
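As a concrete illustration of the extension and L1 re-warm-up steps described above, here is a minimal sketch; the function names, weight-layout conventions (encoder columns and decoder rows as latent directions), and linear warm-up schedule are assumptions rather than details from the paper:

```python
import torch

def extend_sae(W_enc, W_dec, b_enc, n_new, init_norm=0.1):
    """Append n_new latents to a trained SAE. W_enc: (d, L); W_dec: (L, d)."""
    d = W_enc.shape[0]
    new_dirs = torch.randn(n_new, d)
    new_dirs = init_norm * new_dirs / new_dirs.norm(dim=-1, keepdim=True)
    W_enc_ext = torch.cat([W_enc, new_dirs.T], dim=1)   # new encoder/decoder start identical
    W_dec_ext = torch.cat([W_dec, new_dirs], dim=0)
    b_enc_ext = torch.cat([b_enc, torch.zeros(n_new)])
    return W_enc_ext, W_dec_ext, b_enc_ext

def l1_coefficient(step, warmup_steps, lam_full, lam_min):
    # Re-warm the L1 penalty after extension, but never drop below lam_min,
    # so new latents are not killed while existing latents stay mostly undisturbed.
    frac = min(step / max(warmup_steps, 1), 1.0)
    return max(lam_min, frac * lam_full)
```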
The paper concludes that feature hedging is a significant issue hindering SAE performance and that understanding it can lead to improved SAE architectures like the proposed balance matryoshka SAE. The code is available at https://github.com/chanind/feature-hedging-paper.
Related Papers
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (2025)
- Sparse Autoencoders Do Not Find Canonical Units of Analysis (2025)
- Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures (2025)
- Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders (2025)
- Dense SAE Latents Are Features, Not Bugs (2025)