Dense SAE Latents Are Features, Not Bugs (2506.15679v1)

Published 18 Jun 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Sparse autoencoders (SAEs) are designed to extract interpretable features from LLMs by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are *dense*), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs -- suggesting that high density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and finally to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in LLM computation and should not be dismissed as training noise.

Authors (7)
  1. Xiaoqing Sun (11 papers)
  2. Alessandro Stolfo (12 papers)
  3. Joshua Engels (14 papers)
  4. Ben Wu (16 papers)
  5. Senthooran Rajamanoharan (11 papers)
  6. Mrinmaya Sachan (124 papers)
  7. Max Tegmark (133 papers)

Summary

Sparse autoencoders (SAEs) are a technique used to find interpretable features within the activations of LLMs by training an autoencoder with a sparsity constraint on its hidden layer. While ideally, these "latents" (features) would be sparse and easily interpretable, a significant number of latents often activate frequently (i.e., are "dense"). This paper investigates these dense SAE latents and argues that they represent meaningful, intrinsically dense features in the LLM's residual stream, rather than being mere artifacts of the SAE training process.
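To make "dense" concrete: a latent's activation density is the fraction of tokens on which it fires. The sketch below estimates per-latent density for a TopK-style SAE; the weight shapes, the TopK activation, and the 10% threshold are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def latent_activation_density(acts, W_enc, b_enc, k=64):
    """Estimate per-latent activation density for a TopK-style SAE.

    acts:  (n_tokens, d_model) residual-stream activations
    W_enc: (d_model, n_latents) encoder weights
    b_enc: (n_latents,) encoder bias
    Returns, for each latent, the fraction of tokens on which it fires.
    """
    pre = acts @ W_enc + b_enc                   # (n_tokens, n_latents)
    # TopK activation: keep only the k largest pre-activations per token.
    topk = torch.topk(pre, k, dim=-1)
    latents = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
    return (latents > 0).float().mean(dim=0)     # density per latent

# Latents firing on a large fraction of tokens (e.g. > 10%, an illustrative
# threshold) are the "dense" latents studied in the paper:
# density = latent_activation_density(acts, sae.W_enc, sae.b_enc)
# dense_idx = torch.where(density > 0.1)[0]
```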

The authors present several lines of evidence for the functional role of dense latents:

  1. Intrinsic Nature: In an ablation experiment on Gemma 2 2B residual stream activations, the authors zero-ablate the subspace spanned by the dense latents of a pre-trained SAE and retrain an SAE on the ablated activations. The retrained SAE learns far fewer dense latents than SAEs trained on the original activations or on activations with a sparse-latent subspace ablated (a sketch of this ablation appears under the practical considerations below). This suggests that dense latents emerge because they reconstruct an inherently dense subspace of the LLM's residual stream, rather than arising as training noise. The number of dense latents also stabilizes quickly during SAE training, further supporting this.
  2. Antipodal Pairing: Dense latents frequently appear in antipodal pairs, meaning their encoder and decoder vectors are nearly opposite in direction. The authors introduce an antipodality score $s_i$ for latent $i$, calculated as the maximum product of encoder and decoder cosine similarities with any other latent $j$:

    $$s_i := \max_{j \neq i} \left( \operatorname{sim}(W_{\mathrm{enc},i}, W_{\mathrm{enc},j}) \cdot \operatorname{sim}(W_{\mathrm{dec},i}, W_{\mathrm{dec},j}) \right)$$

    High values of $s_i$ indicate an antipodal pair. The paper shows a strong positive correlation between activation density and antipodality score across different models and SAE architectures (JumpReLU, TopK), indicating that dense latents often form such pairs to reconstruct a 1-dimensional direction in the residual stream. A minimal implementation sketch of this score appears after this list.

  3. Functional Taxonomy: The paper identifies several classes of dense latents based on their activation patterns and function:
    • Position Latents: These track the token's position relative to structural boundaries like the beginning of the context, paragraph, or sentence. Identified using Spearman correlation between latent activation projections and distance from boundaries. These are prominent in early layers.
    • Context-Binding Latents: Found primarily in middle layers, these latents activate on coherent "chunks" of text and seem to track context-dependent semantic concepts. A steering experiment is used to show that ablating or amplifying the direction corresponding to an antipodal pair of these latents can influence model completions to align with specific concepts that activated the latent within that context. This suggests they might act as "registers" for active ideas.
    • Nullspace Latents: These are identified by having a significant portion of their encoder weight norm aligned with the final singular vectors of the model's unembedding matrix $W$. The fraction of encoder norm $\alpha_k$ lying in the bottom $k$ singular directions is used as a metric (see the sketch after this list). These latents are often uninterpretable based on token activations but are shown to be functionally related to output entropy regulation, potentially via interaction with RMSNorm. Ablating these latents can significantly change output entropy, and this effect is reduced when RMSNorm scaling is frozen.
    • Alphabet Latents: Predominantly in the final layers, these latents influence the logit scores of next tokens based on their initial letter or prefix. Identified by analyzing the decoder weight projections onto the vocabulary space. They represent output-oriented signals related to lexical structure.
    • Meaningful-Word Latents: These latents correlate with the Part-of-Speech tag of tokens, particularly "meaningful words" (nouns, verbs, adjectives, adverbs). Identified by calculating the AUC of predicting latent activation from binary POS category membership. More prevalent in early layers.
    • PCA Latents: While one might expect dense latents to reconstruct principal components of the residual stream, the paper finds that only the first principal component is consistently reconstructed by an antipodal pair of latents.
  4. Layer-wise Dynamics: The distribution and characteristics of dense latents change across model layers. Early layers feature more structural/token-dependent latents (position, meaningful-word). Middle layers contain more conceptual/context-dependent latents (context-binding). Late layers show an increase in dense latents tied to output signals (alphabet, nullspace). Analysis of the principal angles between dense-latent subspaces across layers reveals distinct clusters in early (layers 0-4), middle (layers 10-22), and late layers, indicating structural shifts in the dense features learned (a sketch of the principal-angle computation appears after this list).
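As a concrete reference for the antipodality score in item 2, here is a minimal PyTorch sketch. The weight-shape conventions are assumptions, and for large dictionaries the pairwise similarity matrices should be computed in chunks.

```python
import torch
import torch.nn.functional as F

def antipodality_scores(W_enc, W_dec):
    """s_i = max_{j != i} sim(W_enc_i, W_enc_j) * sim(W_dec_i, W_dec_j).

    W_enc: (d_model, n_latents) encoder weights (one column per latent)
    W_dec: (n_latents, d_model) decoder weights (one row per latent)
    """
    enc = F.normalize(W_enc.T, dim=-1)       # (n_latents, d_model), unit rows
    dec = F.normalize(W_dec, dim=-1)
    sim_enc = enc @ enc.T                    # pairwise cosine similarities
    sim_dec = dec @ dec.T
    prod = sim_enc * sim_dec
    prod.fill_diagonal_(float("-inf"))       # exclude j == i
    return prod.max(dim=-1).values           # (n_latents,)
```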
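For the nullspace latents in item 3, a sketch of the $\alpha_k$ metric under assumed tensor layouts; the value of $k$ is illustrative.

```python
import torch

def nullspace_fraction(w_enc_i, W_U, k=512):
    """Fraction of a latent's encoder-vector norm lying in the bottom-k
    singular directions of the unembedding matrix (its effective nullspace).

    w_enc_i: (d_model,) encoder vector of one latent
    W_U:     (d_model, vocab_size) unembedding matrix
    k:       number of smallest singular directions treated as the nullspace
    """
    # Columns of U span residual-stream directions, ordered by decreasing
    # singular value, so the last k columns are the "nullspace" directions.
    U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)
    bottom = U[:, -k:]                            # (d_model, k)
    proj = bottom.T @ w_enc_i                     # coordinates in that subspace
    return (proj.norm() / w_enc_i.norm()).item()  # alpha_k
```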
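For the layer-wise analysis in item 4, principal angles between dense-latent subspaces can be computed with SciPy; summarizing them as a mean cosine is an illustrative choice, not necessarily the paper's exact statistic.

```python
import numpy as np
from scipy.linalg import subspace_angles

def dense_subspace_overlap(dec_a, dec_b):
    """Compare the dense-latent subspaces of two layers via principal angles.

    dec_a, dec_b: (n_dense, d_model) decoder rows of the dense latents at
    two layers. Returns the mean cosine of the principal angles
    (1.0 = identical subspaces, 0.0 = orthogonal subspaces).
    """
    # subspace_angles expects matrices whose *columns* span each subspace.
    angles = subspace_angles(dec_a.T, dec_b.T)
    return float(np.cos(angles).mean())
```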

Practical Implementation Considerations:

  • SAE Training: The paper uses both JumpReLU (Rajamanoharan et al., 19 Jul 2024) and TopK (Gao et al., 2024) SAE architectures, trained on large text corpora (OpenWebText, C4). Implementing such training requires significant computational resources; the authors estimate around 30 A6000 GPU-hours for training and activation capture across their experiments.
  • Analysis Techniques: Identifying and analyzing dense latents involves standard linear algebra (cosine similarity, SVD, principal angles), statistical methods (Spearman correlation, AUC-ROC), and interpretability techniques such as analyzing logit contributions and using LLMs for auto-interpretation and steering experiments. Tools like PyTorch, TransformerLens, NumPy, Pandas, and Plotly are practical for these analyses.
  • LLM Judging & Steering: Techniques like causal steering and using LLM judges (e.g., Gemini 2.5 Flash Preview via OpenRouter) for interpretation require careful experimental design and can incur API costs (the paper estimates under $20 for judging). Steering involves modifying residual stream activations along latent directions (a minimal sketch follows this list).
  • Subspace Ablation: The ablation experiments demonstrate that manipulating specific subspaces derived from SAE latents can alter model behavior or change what a retrained SAE learns, a technique useful for probing the functional significance of identified features (a minimal sketch follows this list).
  • Limitations: The paper notes that interpreting dense latents is challenging. Not all dense latents are explained by the proposed taxonomy. Some may represent noisy aggregations of sparse features or a complex basis that doesn't perfectly align with "true" dense features due to the nature of linear combinations and reconstruction objectives. The analysis primarily focuses on Gemma 2 2B with specific SAE configurations, and results may vary across models and hyperparameters.
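A minimal sketch of the subspace zero-ablation referenced above (item 1 of the evidence list and the Subspace Ablation bullet), assuming the dense directions are taken from decoder rows and are linearly independent:

```python
import torch

def zero_ablate_subspace(acts, directions):
    """Remove the component of activations lying in a given latent subspace.

    acts:       (n_tokens, d_model) residual-stream activations
    directions: (n_dense, d_model) e.g. decoder rows of the dense latents
    """
    # Orthonormal basis for the span of the directions (assumes full rank).
    Q, _ = torch.linalg.qr(directions.T)     # (d_model, n_dense)
    coords = acts @ Q                        # components inside the subspace
    return acts - coords @ Q.T               # project them out

# Retraining an SAE on zero_ablate_subspace(acts, dense_decoder_rows) is the
# setup the paper uses to test whether dense latents re-emerge.
```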
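And a minimal sketch of the steering intervention from the LLM Judging & Steering bullet; the hook placement, the choice of direction, and the value of alpha are assumptions.

```python
import torch

def steer(acts, direction, alpha=5.0):
    """Add a scaled latent direction to residual-stream activations.

    acts:      (n_tokens, d_model) activations at the chosen layer
    direction: (d_model,) e.g. the decoder vector of a dense latent
    alpha:     steering strength (sign and magnitude are tuning choices)
    """
    direction = direction / direction.norm()
    return acts + alpha * direction

# In practice this runs inside a forward hook at one layer (plain PyTorch
# hooks or TransformerLens both work), and alpha is swept to find the
# smallest intervention that reliably shifts completions.
```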

The core practical implication is that dense SAE latents should not be automatically discarded or suppressed (e.g., via loss penalties), as they capture meaningful and functionally relevant information within LLMs. This motivates the development of future feature extraction techniques that can effectively identify and interpret both sparse and dense representations.