Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders (2505.11756v1)
Abstract: It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is more narrow than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Our work shows there remain fundamental issues with SAEs, but we are hopeful that highlighting feature hedging will catalyze future advances that allow SAEs to achieve their full potential of interpreting LLMs at scale.
Summary
- The paper introduces feature hedging, where narrow Sparse Autoencoders mix correlated features and degrade the interpretability of LLM activations.
- It demonstrates that hedging arises from MSE reconstruction loss in conditions with correlated and hierarchical feature activations, quantified via a novel hedging degree metric.
- The study proposes balanced Matryoshka SAEs to counteract both hedging and absorption, yielding improved feature separation through controlled experiments.
This paper introduces "feature hedging," a phenomenon in Sparse Autoencoders (SAEs) that can degrade their ability to learn monosemantic (interpretable) features from LLM activations. Feature hedging occurs when an SAE is narrower (has fewer latent dimensions) than the number of "true" underlying features in the data, and these true features are correlated. In such cases, the SAE's reconstruction loss incentivizes its latents to represent a mixture of correlated features rather than a single, pure feature. This is problematic because a core goal of SAEs is to decompose polysemantic LLM activations into distinct, interpretable components.
The authors differentiate feature hedging from a previously identified issue, "feature absorption."
| Aspect | Feature Hedging | Feature Absorption |
|---|---|---|
| Effect | Mixes correlated features into latents | Learns gerrymandered latents (general feature suppressed by specific one) |
| Cause | MSE reconstruction loss | Sparsity loss |
| Feature representation | Both features partially mixed in one latent | Specific feature fully represented; the more general feature only partially |
| Encoder/decoder impact | Affects encoder and decoder symmetrically | Affects encoder and decoder asymmetrically |
| SAE width | Gets worse the narrower the SAE | Gets worse the wider the SAE |
| Feature requirement | Requires only correlation between features | Requires hierarchical features (e.g., parent/child) |
Given that LLM SAEs are almost always narrower than the true number of underlying features and that features in LLMs are likely highly correlated, the paper argues that feature hedging is a prevalent issue.
Studying Hedging in Toy Models
The paper first investigates feature hedging in a controlled setting using a single-latent SAE trained on activations generated from two true features (f1, f2) in ℝ^50; a minimal code sketch of the hierarchical case follows the list below.
- Independent Features: If f1 and f2 fire independently, the SAE latent correctly learns f1. However, the decoder bias b_dec incorrectly learns a component of f2, scaled by its firing probability. This suggests b_dec can be seen as tracking an always-on feature.
- Hierarchical Features: If f2 only fires when f1 fires (f2⟹f1), the single SAE latent exhibits hedging by merging a component of f2 into its representation of f1. This mixing occurs symmetrically in both the encoder and decoder. Increasing L1 penalty doesn't solve this because adding a positive component of f2 (which only co-occurs with f1) to the encoder doesn't increase the latent's firing frequency.
- Positively Correlated Features: If f2 is more likely to fire with f1 but can also fire independently, hedging occurs with low L1 penalty. A sufficiently high L1 penalty can mitigate hedging if the correlation is low, as hedging would slightly increase the L0 norm (number of active latents). A full-width SAE (number of latents = number of true features) learns true features despite correlation.
- Anti-correlated Features: If f2 is less likely to fire with f1 than on its own, the SAE latent merges a negative component of f2. Increasing L1 penalty doesn't help, as this negative component doesn't increase L0. Again, a full-width SAE resolves this.
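To make the hierarchical case concrete, here is a minimal training sketch (not the paper's code; the feature directions, firing probabilities, learning rate, and L1 coefficient are illustrative assumptions):

```python
import torch

# Toy hierarchical setup: f1 fires with p = 0.25, f2 fires only when f1 does.
torch.manual_seed(0)
d, n_steps, batch = 50, 5000, 512
f1, f2 = torch.zeros(d), torch.zeros(d)
f1[0], f2[1] = 1.0, 1.0                      # orthonormal for an easy readout

# Encoder and decoder initialized identically with norm 0.1, as in the paper.
w0 = torch.randn(1, d)
w0 = 0.1 * w0 / w0.norm()
W_enc = w0.T.clone().requires_grad_(True)    # (d, 1)
W_dec = w0.clone().requires_grad_(True)      # (1, d)
b_enc = torch.zeros(1, requires_grad=True)
b_dec = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([W_enc, W_dec, b_enc, b_dec], lr=1e-3)

for _ in range(n_steps):
    parent = (torch.rand(batch, 1) < 0.25).float()
    child = parent * (torch.rand(batch, 1) < 0.5).float()   # f2 only fires with f1
    x = parent * f1 + child * f2
    acts = torch.relu((x - b_dec) @ W_enc + b_enc)           # single latent
    recon = acts @ W_dec + b_dec
    mse = ((recon - x) ** 2).sum(-1).mean()
    l1 = acts.abs().sum(-1).mean()
    loss = mse + 3e-4 * l1
    opt.zero_grad(); loss.backward(); opt.step()

dec = W_dec[0] / W_dec[0].norm()
print("decoder component on f1:", round((dec @ f1).item(), 3))
print("decoder component on f2:", round((dec @ f2).item(), 3))  # > 0 indicates hedging
```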
The authors demonstrate that hedging is driven by the MSE reconstruction loss. By analyzing the loss landscape for a single-latent tied SAE with hierarchical features, parameterized as l = α·f2 + (1−α)·f1, they show the loss minimum occurs for 0 < α < 1, indicating a hedged solution (a mix of f1 and f2) is preferred over representing f1 purely (α = 0).
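The same conclusion can be checked numerically. The sketch below simplifies the setup (orthonormal features, no biases, an unnormalized tied direction l = (1−α)·f1 + α·f2, and assumed firing probabilities) and sweeps α to locate the expected-MSE minimum:

```python
import numpy as np

p_parent = 0.25             # P(f1 fires)
p_child_given_parent = 0.5  # P(f2 fires | f1 fires); f2 never fires alone

def expected_mse(alpha):
    l = np.array([1 - alpha, alpha])     # tied direction in the (f1, f2) plane
    x_parent_only = np.array([1.0, 0.0])
    x_both = np.array([1.0, 1.0])
    def err(x):
        act = max(float(l @ x), 0.0)     # ReLU activation of the single latent
        return float(np.sum((x - act * l) ** 2))
    q_parent_only = p_parent * (1 - p_child_given_parent)
    q_both = p_parent * p_child_given_parent
    return q_parent_only * err(x_parent_only) + q_both * err(x_both)  # x = 0 adds no error

alphas = np.linspace(0.0, 1.0, 1001)
losses = np.array([expected_mse(a) for a in alphas])
print("MSE-optimal alpha:", alphas[losses.argmin()])
```

For these probabilities the minimizer lands strictly inside (0, 1), i.e., the MSE-optimal single latent is a hedged mixture rather than pure f1.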
Quantifying Hedging in LLM SAEs
To measure hedging in SAEs trained on LLMs, the paper introduces the "hedging degree" metric, h. The intuition is that if hedging occurs, adding a new latent to an SAE should cause existing latents to "shed" components of the feature now captured by the new latent. The metric measures this change in the existing decoder latents (the decoder is used rather than the encoder to distinguish hedging from absorption, which primarily affects the encoder).
The hedging degree h is calculated as follows:
- Train an SAE s0 with L latents.
- Create s1 by adding N new latents to s0 and continue training. Also, continue training s0 (without new latents) on the same data to get s0′.
- Let W_dec^{s0′}[0:L] be the decoder weights of the original L latents in s0′.
- Let W_dec^{s1}[0:L] be the decoder weights of the original L latents in s1.
- Let W_dec^{s1}[L:L+N] be the decoder weights of the N new latents in s1.
- Calculate the difference in the original L latents: δ_L = W_dec^{s1}[0:L] − W_dec^{s0′}[0:L].
- For each original latent i, the hedging contribution h_i is the magnitude of δ_L[i] projected onto the subspace spanned by the N new latents, minus its projection onto N random latents (to account for noise):
  h_i = ||Proj(δ_L[i], W_dec^{s1}[L:L+N])|| − ||Proj(δ_L[i], W_rand[0:N])||
- The total hedging degree h is the sum of h_i over the L original latents. A value h > 0 indicates hedging beyond random noise. The paper uses N = 64; a minimal code sketch of this computation follows.
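A minimal sketch of this computation, assuming decoder matrices with one row per latent and using random Gaussian directions as the noise baseline (an illustrative choice, not necessarily the paper's exact procedure):

```python
import numpy as np

def hedging_degree(W_dec_s0_prime, W_dec_s1, n_new, seed=0):
    """W_dec_s0_prime: (L, d) decoder of the continued original SAE s0'.
    W_dec_s1: (L + n_new, d) decoder of the extended SAE s1."""
    L = W_dec_s0_prime.shape[0]
    delta = W_dec_s1[:L] - W_dec_s0_prime          # change in the original L latents
    new_dirs = W_dec_s1[L:L + n_new]               # decoder rows of the N new latents
    rand_dirs = np.random.default_rng(seed).standard_normal(new_dirs.shape)

    def proj_norm(v, basis):
        # Norm of v projected onto the subspace spanned by the rows of `basis`.
        Q, _ = np.linalg.qr(basis.T)               # orthonormal basis of that subspace
        return np.linalg.norm(Q.T @ v)

    # h_i = ||Proj(delta_i, new latents)|| - ||Proj(delta_i, random directions)||
    return sum(proj_norm(delta[i], new_dirs) - proj_norm(delta[i], rand_dirs)
               for i in range(L))
```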
Experiments on Gemma-2-2b and Llama-3.2-1b SAEs show:
- Hedging degree is significantly higher for narrower SAEs (e.g., width ≤4096).
- Even at widths up to 65536, hedging doesn't reach zero.
- L0 (average number of active latents) has a minor effect, with very low L0 slightly increasing hedging for BatchTopK SAEs.
- LLM layer doesn't have a massive impact on hedging degree.
- BatchTopK SAEs tend to have more hedging than L1 SAEs, possibly because L1 loss can mitigate hedging from positively correlated features.
Case Study: Adding a New Latent
A case study is presented where an L1 SAE (width 8192, Gemma-2-2b layer 12) is trained, then one new latent is added, and training continues.
- The new latent (latent 8192) appears to fire on CSS scripts in HTML.
- The existing latent (latent 3094) that lost the largest component projecting onto this new latent originally tracked the "rel" HTML attribute, often used for linking CSS scripts. This illustrates hedging: latent 3094 was likely hedging by incorporating aspects of "CSS scripts" (a child concept of "rel" attribute usage) before the new, more specific latent was introduced.
Balancing Hedging and Absorption in Matryoshka SAEs
Matryoshka SAEs were proposed to combat feature absorption by using nested SAE loss terms, forcing inner (narrower) levels to reconstruct inputs. However, these narrow inner levels are prone to feature hedging. Thus, Matryoshka SAEs trade absorption for hedging.
The paper notes that for hierarchical features, hedging adds a positive component of child features to the parent encoder latent, while absorption adds a negative component. These opposing effects could potentially cancel out. The authors propose the "balance matryoshka SAE," which modifies the Matryoshka SAE loss:
L = Σ_{m∈M} β_m (||a − â_m||² + λ·S_m) + α·L_aux
where β_m ≥ 0 is a scaling coefficient for the loss of each matryoshka level m, a is the input activation, â_m is the reconstruction from level m, S_m is the sparsity penalty for level m, and L_aux is an auxiliary loss.
- If all β_m = 1, this reduces to a standard Matryoshka SAE.
- If β_m = 0 for all but the outermost level, this reduces to a standard SAE.
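A sketch of this weighted loss, assuming an L1 sparsity term for S_m and omitting the auxiliary term α·L_aux; the level widths and β schedule shown are illustrative values matching the geometric multiplier used in the LLM experiments described below:

```python
import torch

def balance_matryoshka_loss(x, acts, W_dec, b_dec, prefix_sizes, betas, lam):
    """x: (batch, d) inputs; acts: (batch, n_latents) latent activations;
    W_dec: (n_latents, d). Each matryoshka prefix must reconstruct x on its own."""
    loss = x.new_zeros(())
    for m, beta in zip(prefix_sizes, betas):
        recon_m = acts[:, :m] @ W_dec[:m] + b_dec          # reconstruction from level m
        mse_m = ((recon_m - x) ** 2).sum(-1).mean()
        sparsity_m = acts[:, :m].abs().sum(-1).mean()      # S_m (L1 here for simplicity)
        loss = loss + beta * (mse_m + lam * sparsity_m)
    return loss

# Geometric beta schedule: beta_m / beta_{m+1} = 0.75, outermost level fixed at 1.
prefix_sizes = [2048, 4096, 8192, 16384, 32768]   # inner widths are illustrative
multiplier = 0.75
betas = [multiplier ** (len(prefix_sizes) - 1 - i) for i in range(len(prefix_sizes))]
# Setting all betas to 1 recovers a standard matryoshka SAE; zeroing all but the
# outermost beta recovers a standard SAE.
```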
Toy Model Demonstration:
A toy model with 4 hierarchical features (1 parent, 3 children) and a Matryoshka SAE with one inner level (1 latent) is used.
- High β (for the inner level) leads to hedging (positive child components in parent encoder).
- Low β (effectively a standard SAE) leads to absorption (negative child components).
- A balanced β (e.g., 0.25 in the experiment) results in these effects largely cancelling, yielding a near-perfect representation.
LLM Balance Matryoshka SAEs:
Experiments on Gemma-2-2b layer 12 with BatchTopK Matryoshka SAEs (5 levels, outermost width 32768) vary the multiplier β_m/β_{m+1} (with the outermost β_5 = 1).
- A multiplier around 0.75 generally improves performance on metrics like Targeted Probe Perturbation (TPP), feature splitting, and k-sparse probing (including a custom Parts of Speech dataset) compared to standard Matryoshka SAEs (multiplier=1) or standard SAEs (multiplier=0).
- This balanced approach still performs well on absorption metrics. The paper acknowledges that a single multiplier might not perfectly balance hedging and absorption for all features, especially if child features have varying firing probabilities.
Implementation Considerations
- SAE Training: Standard SAE training procedures are used (e.g., SAELens library, Adam optimizer, L1 penalty for L1 SAEs, TopK for BatchTopK SAEs).
- Initialization: Encoder and decoder weights identical, latents with norm 0.1.
- L1 SAEs: Learning rate 7e-5, L1 warm-up.
- BatchTopK SAEs: Learning rate 3e-4.
- Hedging Degree Calculation: Requires training two SAEs (original and extended) for a substantial number of tokens after the extension point. This can be computationally intensive.
- When extending L1 SAEs, especially with a high L1 penalty, new latents can die. A re-warm-up of the L1 penalty (with a floor at a minimum λ_min) is used to prevent this while minimizing disturbance to existing latents (see the sketch after this list).
- Balance Matryoshka SAEs:
- The primary change is the introduction of βm coefficients in the loss function. This requires modifying the training loop to weight the reconstruction and sparsity losses for each Matryoshka prefix.
- Finding optimal βm values (or a common multiplier) likely requires hyperparameter tuning based on downstream interpretability or performance metrics. The paper explores a geometric progression for βm.
- Computational Requirements: Training large SAEs (e.g., >65k latents) on substantial data (e.g., 250M-500M tokens) requires significant GPU resources (e.g., H100).
- Potential Limitations of the Study:
- Experiments are limited to SAEs up to 65k latents and LLMs up to ~2B parameters.
- The hedging degree metric is computationally expensive.
- The balancing in Matryoshka SAEs might not be perfect for all features simultaneously with a single set of βm coefficients, suggesting per-latent or more adaptive balancing schemes as future work.
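As a concrete illustration of the extension and L1 re-warm-up steps described above, here is a minimal sketch; the function names, weight-layout conventions (encoder columns and decoder rows as latent directions), and linear warm-up schedule are assumptions rather than details from the paper:

```python
import torch

def extend_sae(W_enc, W_dec, b_enc, n_new, init_norm=0.1):
    """Append n_new latents to a trained SAE. W_enc: (d, L); W_dec: (L, d)."""
    d = W_enc.shape[0]
    new_dirs = torch.randn(n_new, d)
    new_dirs = init_norm * new_dirs / new_dirs.norm(dim=-1, keepdim=True)
    W_enc_ext = torch.cat([W_enc, new_dirs.T], dim=1)   # new encoder/decoder start identical
    W_dec_ext = torch.cat([W_dec, new_dirs], dim=0)
    b_enc_ext = torch.cat([b_enc, torch.zeros(n_new)])
    return W_enc_ext, W_dec_ext, b_enc_ext

def l1_coefficient(step, warmup_steps, lam_full, lam_min):
    # Re-warm the L1 penalty after extension, but never drop below lam_min,
    # so new latents are not killed while existing latents stay mostly undisturbed.
    frac = min(step / max(warmup_steps, 1), 1.0)
    return max(lam_min, frac * lam_full)
```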
The paper concludes that feature hedging is a significant issue hindering SAE performance and that understanding it can lead to improved SAE architectures like the proposed balance matryoshka SAE. The code is available at https://github.com/chanind/feature-hedging-paper.
Related Papers
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (2025)
- Sparse Autoencoders Do Not Find Canonical Units of Analysis (2025)
- Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures (2025)
- Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders (2025)
- Dense SAE Latents Are Features, Not Bugs (2025)