Sparse Autoencoder Feature Labels

Updated 4 July 2026

Sparse autoencoder feature labels are human-readable descriptions attached to specific latent dimensions in an overcomplete, sparse representation, intended to name the concept or pattern each dimension encodes.
They are generated through methods that inspect top activations, apply intervention protocols, and leverage training geometry and hierarchical structuring for robust validation.
Recent research emphasizes that while these labels capture dominant activation patterns, they remain approximate and require careful contextual interpretation to guide further analysis.

Sparse autoencoder feature labels are human-readable descriptions assigned to latent dimensions in a sparse autoencoder (SAE), typically with the aim of identifying the concept, pattern, or function that a feature encodes. In contemporary mechanistic-interpretability practice, an SAE feature is usually understood as a latent coordinate in an overcomplete sparse code together with its associated learned decoder direction, rather than as a base-model neuron or raw hidden-state dimension (Zhang et al., 25 May 2026). Research on feature labels has increasingly emphasized that labeling quality depends not only on inspection of top activations, but also on the training geometry of the SAE, the stability of feature identities, the intervention protocol used for validation, and the possibility that semantic structure is hierarchical rather than flat (Li et al., 9 Oct 2025). Across recent work, the field converges on a cautious position: SAE features are often labelable, but labels are generally approximate descriptions of dominant activation patterns rather than exact names for isolated causal units (Sharma, 13 Jun 2026).

1. Formal status of an SAE feature

In recent SAE-based language-model work, a feature is most naturally defined as a single latent coordinate in a sparse, overcomplete representation, together with its decoder direction in activation space (Zhang et al., 25 May 2026). In the Gated SAE used by SAE-FD, given an activation vector $\mathbf{h} \in \mathbb{R}^d$, the encoder computes

$\mathbf{g} = \sigma(W_{\text{gate}} (\mathbf{h} - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{gate}})$

$\mathbf{m} = W_{\text{enc}} (\mathbf{h} - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{enc}}$

$\mathbf{f}_{\text{pre}} = \mathbf{g} \odot \mathbf{m}$

$\mathbf{f} = \text{ReLU}(\mathbf{f}_{\text{pre}})$

with reconstruction

$\hat{\mathbf{h}} = W_{\text{dec}} \mathbf{f} + \mathbf{b}_{\text{dec}}.$

The paper explicitly interprets SAEs as dictionary-learning systems that decompose dense activations into sparse linear combinations of learned feature directions, and states that the decoder matrix has unit-norm rows (Zhang et al., 25 May 2026).
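The encoder and reconstruction equations above can be traced in a few lines of NumPy. This is a minimal sketch with random, untrained weights and toy sizes, so the reconstruction is not meaningful; the function name `gated_sae_forward` and the storage layout (feature directions as columns of a `(d, D)` matrix) are assumptions made for illustration, not details taken from SAE-FD.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_sae_forward(h, W_gate, b_gate, W_enc, b_enc, W_dec, b_dec):
    """Return (f, h_hat) for a single activation vector h of shape (d,)."""
    centered = h - b_dec                     # subtract the decoder bias
    g = sigmoid(W_gate @ centered + b_gate)  # gate values in (0, 1), shape (D,)
    m = W_enc @ centered + b_enc             # magnitude pre-activations, shape (D,)
    f = np.maximum(g * m, 0.0)               # ReLU of the elementwise product
    h_hat = W_dec @ f + b_dec                # linear reconstruction, shape (d,)
    return f, h_hat

rng = np.random.default_rng(0)
d, D = 8, 64                                 # toy sizes with 8x expansion
W_gate = rng.normal(size=(D, d))
W_enc = rng.normal(size=(D, d))
W_dec = rng.normal(size=(d, D))
# Assumed layout: each column of W_dec is one feature direction, scaled to unit norm.
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
b_gate, b_enc, b_dec = np.zeros(D), np.zeros(D), np.zeros(d)

h = rng.normal(size=d)
f, h_hat = gated_sae_forward(h, W_gate, b_gate, W_enc, b_enc, W_dec, b_dec)
active = np.flatnonzero(f)                   # indices of latents that fired on h
```

Each index in `active` identifies one candidate object for a label: a latent coordinate together with the corresponding column of `W_dec`.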

This formalization matters for labeling because it locates the object being named. A label does not ordinarily attach to an input token, output behavior, or raw neuron; it attaches to a latent unit in the SAE dictionary. In SAE-FD, the activation space is the final transformer-layer MLP output space of the base LLM, with $d=4096$, latent width $D=32768$, and $8\times$ expansion (Zhang et al., 25 May 2026). In other work the exact encoded space differs: for example, ATM uses Gemma-2-2B layer-12 activations with $d=2304$ (Li et al., 9 Oct 2025), and the aircraft vision experiments use ConvNeXt-Tiny final-stage patch activations with input dimension 768 and latent dimension 8192 (Sharma, 13 Jun 2026). The labeling target nevertheless remains the same: a sparse latent direction.

The literature repeatedly motivates these features as more decoupled than dense coordinates. SAE-FD states that sparse, overcomplete representations “disentangle superimposed concepts,” and that the resulting features are “more decoupled than the raw dimensions” (Zhang et al., 25 May 2026). This suggests that SAE features are better candidates for labeling than dense hidden-state axes, but it does not by itself establish that a given latent admits a clean human-readable name.

2. How labels are assigned in practice

The standard labeling workflow in SAE interpretability has been to inspect top-activating contexts for a feature and assign a concise description, sometimes followed by a steering-based sanity check (Riegler et al., 4 May 2026). That protocol remains common, but recent work has shown that it is insufficient as a general validation criterion. The pairwise-matrix paper argues that top-context labels are often only locally true: they may describe one activation regime while missing the broader causal axis governed by the decoder direction (Riegler et al., 4 May 2026).

In domains where qualitative grounding is more explicit, labels are often assigned by combining several evidence sources. SALVE, for example, does not implement automatic textual naming, but supports human labeling through class-conditional latent activation statistics, activation maximization, Grad-FAM localization, top-activating examples, and causal intervention (Flovik, 17 Dec 2025). In vision, Grad-FAM attributes a latent activation rather than a class logit.

This produces a feature-specific heatmap and supports labels such as “golf ball,” “golf ball surface texture,” “church,” or “Tower Feature” when those descriptions are corroborated by localization and intervention effects (Flovik, 17 Dec 2025).

A related visual workflow appears in aircraft representation analysis. There, features are ranked by mean activation over their top 50 patches, and the top-activating patches plus full source images are inspected manually (Sharma, 13 Jun 2026). The resulting labels are explicitly described as approximate descriptions of dominant activation patterns. Reported examples include “Dominant pattern of flap track fairings; Secondary activations on various regions of fuselage,” “Dominant pattern of cockpit region of old warplanes; Occasional activations on the nose section,” and “Dominant pattern of struts in between the upper and lower wings of biplanes” (Sharma, 13 Jun 2026). This methodology makes clear that a feature label is often a structured summary of recurrent evidence, not a definitive ontological statement.
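The ranking step in this workflow (mean activation over each feature's top patches) is simple to state in code. The sketch below uses only the standard library; the function names, the dictionary layout, and the toy activations are illustrative assumptions, and `k=2` stands in for the top-50 cutoff so the example stays small.

```python
def top_k_mean(values, k):
    """Mean of the k largest values (assumes a nonempty list)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def rank_features(activations, k=50):
    """Order feature ids by the mean of their top-k patch activations.

    activations: dict mapping feature id -> list of per-patch activations.
    """
    scores = {fid: top_k_mean(vals, k) for fid, vals in activations.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy activations for three hypothetical features over four patches.
toy = {
    "f0": [0.1, 0.2, 0.0, 0.3],
    "f1": [2.0, 0.0, 0.0, 1.0],
    "f2": [0.9, 0.8, 0.7, 0.6],
}
ranking = rank_features(toy, k=2)
```

The ranking only decides which features are inspected first; the label itself still comes from manual inspection of the patches and their source images.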

In non-text domains where direct human inspection is harder, labeling may rely more heavily on supervised probes. In voice-embedding work, the authors first found raw audio inspection unexpectedly difficult and instead trained logistic regression on SAE latents for externally labeled concepts such as language and music (Pluth et al., 31 Jan 2025). A single latent in a 200-dimensional TopK SAE acted as a Spanish detector, while another acted as an IVR music detector, each assessed by its precision and recall (Pluth et al., 31 Jan 2025). The inferred labels were then refined using error analysis and steering. This suggests that in some domains the most reliable labels are probe-validated, dataset-conditioned semantic descriptors rather than names assigned directly from top examples.
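To make the detector framing concrete, the sketch below scores one latent against external labels. It swaps in a simpler rule than the logistic-regression probes described above: a latent "detects" the concept whenever its activation exceeds a threshold. The function name, the threshold, and the toy data are assumptions for illustration only.

```python
def precision_recall(latent_values, labels, threshold=0.0):
    """Precision and recall of the rule `latent_value > threshold`.

    latent_values: activations of one SAE latent, one per example.
    labels: booleans, True where the external concept label applies.
    """
    predicted = [v > threshold for v in latent_values]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: six examples, three of which carry the concept label.
latent = [0.9, 0.0, 0.4, 0.0, 0.7, 0.0]
is_concept = [True, False, True, True, False, False]
precision, recall = precision_recall(latent, is_concept)
```

The false positives and false negatives counted here are the examples an error analysis would examine before the label is refined.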

3. Reliability, instability, and failure modes of labels

A central recent theme is that labelability depends on feature stability. The ATM paper makes feature absorption a first-class training pathology: one latent may subsume multiple distinct concepts because doing so reduces sparsity cost, producing inconsistent or partial feature coverage (Li et al., 9 Oct 2025). The paper’s motivating examples include a “starts with E” feature being absorbed into a narrower “elephant” feature, or a narrower “India” concept being absorbed into a broader “Asia” concept (Li et al., 9 Oct 2025). For labeling, this means a feature can appear monosemantic from its top activations yet fail to fire where its label would predict.

ATM proposes Adaptive Temporal Masking to reduce this instability by tracking temporal feature statistics and masking low-importance features probabilistically rather than using rigid instantaneous thresholds (Li et al., 9 Oct 2025). Its main quantitative evidence is a substantially lower absorption score:

ATM: 0.0068
TopK SAE: 0.1402
JumpReLU: 0.0114
SAE: 0.0161 (Li et al., 9 Oct 2025)

The paper does not measure labels directly, but it explicitly frames reduced absorption as a prerequisite for consistent feature analysis. A plausible implication is that lower absorption should yield labels with fewer hidden exceptions and more reproducible activation boundaries (Li et al., 9 Oct 2025).

A closely related line of work is OrtSAE, which targets feature absorption and feature composition by penalizing high pairwise cosine similarity between decoder features (Korznikov et al., 26 Sep 2025). OrtSAE reports 9% more distinct features, 65% less absorption, and 15% less composition than a BatchTopK baseline (Korznikov et al., 26 Sep 2025). Qualitative examples are directly relevant to label quality: a BatchTopK “Queen” feature decomposes into a token-specific “Queen” feature and a broader “royal titles / royalty concepts” feature; a “jaw” feature decomposes into “jaw token,” “mouth/oral concepts,” and “aw” token (Korznikov et al., 26 Sep 2025). These cases show how ambiguous, mixed labels can be replaced by shorter and more specific descriptions when training yields more atomic features.
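A penalty on pairwise cosine similarity between decoder features can be sketched as follows. The exact form of the OrtSAE penalty is not given here, so this is a generic stand-in: the mean squared off-diagonal cosine similarity over all ordered feature pairs. The function name and the row-per-feature layout are assumptions.

```python
import numpy as np

def pairwise_cosine_penalty(directions):
    """Mean squared cosine similarity over all ordered pairs of distinct features.

    directions: array of shape (D, d), one decoder feature per row, with D >= 2.
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    cos = unit @ unit.T                       # (D, D) cosine similarities
    off_diag = cos - np.diag(np.diag(cos))    # zero out self-similarity
    num_features = directions.shape[0]
    return float((off_diag ** 2).sum() / (num_features * (num_features - 1)))

orthogonal = np.eye(3)                        # mutually orthogonal features
duplicated = np.array([[1.0, 0.0], [2.0, 0.0]])  # two features on the same axis
low = pairwise_cosine_penalty(orthogonal)
high = pairwise_cosine_penalty(duplicated)
```

The penalty is zero for mutually orthogonal features and reaches one when every pair is collinear, so adding it to a training loss pushes decoder directions apart.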

AEN-SAE frames the same problem geometrically. It argues that dead features, unstable support, and shrinkage bias arise from the instability of $\ell_1$-induced sparse coding in coherent overcomplete dictionaries, and proposes an adaptive elastic-net objective, combining $\ell_1$ and $\ell_2$ penalties on the code, with adaptive reweighting and a Lipschitz-continuous code map (Chaudhry et al., 6 May 2026). The paper’s contribution to labeling is indirect but substantial: it argues that a feature can only be a good object for naming if it is sufficiently utilized, not arbitrarily unstable, and not permanently dead. Empirically, AEN-SAE reduces dead features and spreads usage more evenly than TopK, increasing the number of features that can plausibly be inspected and named (Chaudhry et al., 6 May 2026).

4. Why top-activation labels are often insufficient

The strongest critique of conventional feature labeling is the argument that top-activation inspection plus single-feature steering only probes “one corner of the matrix” (Riegler et al., 4 May 2026). The pairwise-matrix protocol expands validation along two axes: the steering coefficient and the joint-condition structure. The key claim is that a feature label derived from top activations can be only locally correct.

The canonical example is a Qwen feature initially labeled “AI self-disclaimer” from top contexts such as “I don’t have personal thoughts or emotions, but…” (Riegler et al., 4 May 2026). Under a coefficient sweep, the paper tracks both the disclaimer rate and the philosophy-cluster rate at each steering coefficient (Riegler et al., 4 May 2026). In part of the sweep, the feature does not strengthen the disclaimer; it instead produces a coherent contemplative-philosopher voice. The original label is therefore incomplete: it names only one operating regime of a broader causal direction (Riegler et al., 4 May 2026).

The same paper shows that single-feature suppression can miss structural dependence revealed only by joint interventions. Three near-orthogonal features that individually steer philosophy-of-mind content appear harmless when suppressed one at a time, but jointly suppressing them causes grounded control prompts to collapse into placeholder-like output such as “BASIC TOMOATO SOUP RECIOPLEY” and “Clamp Clamp (CCL)” (Riegler et al., 4 May 2026). This demonstrates that a feature’s causal role may be distributed over a small subspace rather than localized to a single latent.

For labeling, the consequence is methodological. Top contexts alone support at best a local observational label. A stronger label should survive:

a coefficient sweep,
coherence checks at the inflection,
joint perturbation with nearby or cluster-related features,
unrelated control prompts,
matched-geometry controls (Riegler et al., 4 May 2026).

This does not invalidate top-context labels; it reclassifies them as provisional descriptions until causal probing has established whether the feature truly corresponds to the named concept across regimes.
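A coefficient sweep of this kind reduces to a small harness: generate under each steering coefficient, then measure how often each behavior of interest appears. In the sketch below, `toy_generate` is an entirely artificial stand-in for a steered model, and its thresholds and strings are arbitrary choices rather than measured values; `sweep_rates` and the substring detectors are likewise illustrative names, not part of the cited protocol.

```python
def sweep_rates(generate, prompts, coefficients, detectors):
    """Return {coefficient: {behavior name: fraction of outputs showing it}}."""
    results = {}
    for coeff in coefficients:
        outputs = [generate(prompt, coeff) for prompt in prompts]
        results[coeff] = {
            name: sum(detect(out) for out in outputs) / len(outputs)
            for name, detect in detectors.items()
        }
    return results

def toy_generate(prompt, coeff):
    # Artificial stand-in for a steered model; the cutoffs 1.0 and 3.0 are arbitrary.
    if coeff < 1.0:
        return f"Answer to: {prompt}"
    if coeff < 3.0:
        return "I don't have personal thoughts or emotions, but here is an answer."
    return "What is it to think? Perhaps the question is itself the answer."

detectors = {
    "disclaimer": lambda text: "I don't have personal" in text,
    "philosophy": lambda text: "What is it to think" in text,
}
rates = sweep_rates(toy_generate, ["prompt a", "prompt b"], [0.0, 2.0, 4.0], detectors)
```

A label that holds only at one coefficient shows up as a rate that rises and then falls across the sweep, which is the pattern a single steering check cannot reveal.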

5. Hierarchical and relational labeling

Several papers argue that many labeling difficulties arise because SAE features are not best understood as a flat set of isolated atoms. “The Geometry of Concepts” shows that SAE decoder vectors form a structured “concept universe” at three scales: local crystals, intermediate-scale lobes, and large-scale anisotropic geometry (Li et al., 2024). At the small scale, local parallelogram- or trapezoid-like structures support relation labels, such as country-to-capital transformations; at intermediate scale, clusters of co-occurring features form functionally coherent lobes, such as code/math, short dialogue, or scientific papers (Li et al., 2024). This suggests that labels should often be relational or group-level rather than feature-local.

The paper’s mesoscale results are especially relevant. Using Gemma-2-2B layer-12 residual-stream SAEs, it finds that features that co-occur within 256-token blocks also cluster geometrically far more than chance, with the Phi coefficient producing the strongest lobe alignment and significance levels of 954 standard deviations for adjusted mutual information and 74 standard deviations for geometry-to-lobe prediction (Li et al., 2024). A plausible implication is that unlabeled features can inherit strong priors from their neighborhood or lobe membership. A feature in the code/math lobe is not yet labeled, but it is already situated in a semantically narrower hypothesis space.
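The Phi coefficient mentioned above is the standard association measure for two binary variables, here whether each of two features fired within a block. A minimal standard-library sketch follows; the function name and the toy firing patterns are assumptions, and the block construction itself is not reproduced.

```python
import math

def phi_coefficient(a, b):
    """Phi coefficient between two equal-length boolean sequences.

    a[i] and b[i] record whether each feature fired in block i.
    """
    n11 = sum(x and y for x, y in zip(a, b))
    n10 = sum(x and not y for x, y in zip(a, b))
    n01 = sum((not x) and y for x, y in zip(a, b))
    n00 = sum((not x) and (not y) for x, y in zip(a, b))
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A feature that always or never fires has no defined association; return 0.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

fires_a = [True, True, False, False]
same = phi_coefficient(fires_a, [True, True, False, False])
opposite = phi_coefficient(fires_a, [False, False, True, True])
unrelated = phi_coefficient(fires_a, [True, False, True, False])
```

Computing this value for every feature pair gives an affinity matrix that can be clustered into co-occurrence groups and compared with the geometry of the decoder vectors.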

The most explicit treatment of hierarchical labeling is HSAE, which jointly trains multiple SAEs at increasing dictionary sizes and learns parent-child relations between features (Luo et al., 12 Feb 2026). Each level encodes and reconstructs the same input with its own dictionary, and a parent-child structural constraint encourages a parent feature to be approximated by the sum of its children whenever the child set is nonempty (Luo et al., 12 Feb 2026). The full objective adds this structural loss across all adjacent levels (Luo et al., 12 Feb 2026).

The practical implication is that labels should often be hierarchical. In the Science hierarchy, a broad feature “Science, Technology, and Research” branches into “Scientific Disciplines,” which then splits into a lexical “science/scientists” pattern and field names like biology or physics (Luo et al., 12 Feb 2026). In the Time hierarchy, a broad temporal feature splits into Daily Timescale and Longer Timescale, then into “Today/Tonight,” “Day,” or “Week” (Luo et al., 12 Feb 2026). Such cases show that apparent feature splitting is not necessarily label instability; it may be granularity refinement. A flat labeler faces the question “Is this feature science, scientists, or scientific terminology?” HSAE answers that these labels belong at different abstraction levels.
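One way to record labels at several abstraction levels is a small tree in which each node carries its own label. The sketch below illustrates only the bookkeeping, not HSAE training; the `LabelNode` class and the `paths` helper are hypothetical names, and the labels are the Science examples quoted above.

```python
from dataclasses import dataclass, field

@dataclass
class LabelNode:
    label: str
    children: list = field(default_factory=list)

def paths(node, prefix=()):
    """List every root-to-leaf chain of labels as a tuple."""
    here = prefix + (node.label,)
    if not node.children:
        return [here]
    found = []
    for child in node.children:
        found.extend(paths(child, here))
    return found

science = LabelNode(
    "Science, Technology, and Research",
    [
        LabelNode(
            "Scientific Disciplines",
            [
                LabelNode("science/scientists"),
                LabelNode("field names (biology, physics)"),
            ],
        )
    ],
)
chains = paths(science)
```

Each chain reads from the broadest label to the most specific one, so "science" and "scientists" become labels at different depths of one branch rather than competing names for one feature.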

6. Training geometry, architecture, and their effects on labelability

Recent work increasingly treats labeling quality as downstream of SAE training geometry. One prominent example is cosine-scored SAEs. Standard SAE encoders score features by inner product, so activation scales with both alignment and input norm: $\mathbf{w}_i^\top \mathbf{x} = \|\mathbf{w}_i\|\,\|\mathbf{x}\|\cos\theta_i$, where $\theta_i$ is the angle between the encoder direction $\mathbf{w}_i$ and the input $\mathbf{x}$. On normalized transformer representations, the paper argues that this is the wrong geometry: sublayers read direction after RMSNorm, but the encoder still rewards norm, causing high-norm tokens to monopolize BatchTopK slots and producing norm detectors rather than content features (Naihin et al., 13 Jun 2026). The proposed cosine-style score discounts the input norm, with one setting of its parameter giving cosine-like scoring and another recovering inner-product dependence up to scale (Naihin et al., 13 Jun 2026).
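The difference between the two score geometries can be checked numerically. The sketch below contrasts a plain inner-product score with a plain cosine score; it does not implement the paper's parameterized score, and the function names and random toy data are assumptions for illustration.

```python
import numpy as np

def inner_product_scores(W, x):
    """One score per encoder row; grows with the norm of x."""
    return W @ x

def cosine_scores(W, x, eps=1e-12):
    """Inner product divided by both norms; depends on direction only."""
    return (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + eps)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))   # five hypothetical encoder directions
x = rng.normal(size=4)        # one input vector
x_loud = 10.0 * x             # same direction, ten times the norm

ip_ratio = inner_product_scores(W, x_loud) / inner_product_scores(W, x)
cos_shift = cosine_scores(W, x_loud) - cosine_scores(W, x)
```

Here every entry of `ip_ratio` is 10, while `cos_shift` is zero up to rounding: under inner-product scoring a high-norm token outranks a low-norm token of identical direction, which is the slot-monopolizing effect described above.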

At matched reconstruction on Qwen3-8B layer 18, the per-feature adaptive cosine SAE achieves top-1 sparse probing 0.815 versus 0.667 for standard scoring, with only a negligible difference in FVE (fraction of variance explained) (Naihin et al., 13 Jun 2026). The authors interpret this as evidence that cosine scoring allocates more dictionary slots to human-recognizable concepts and fewer to norm detectors. For labeling, the implication is direct: the forward-pass score geometry affects not only reconstruction but what kinds of features exist to be named.

Feature-allocation mechanisms also matter. The “Adaptive Sparse Allocation” paper reframes sparsity as token-feature resource allocation and proposes Feature Choice and Mutual Choice SAEs, which relax the TopK constraint of a fixed number of features per token (Ayonrinde, 2024). Feature Choice allows each feature to match a limited number of tokens, often following a Zipf-like prevalence schedule; Mutual Choice allocates a global activation budget to the strongest token-feature affinities (Ayonrinde, 2024). The paper’s most salient result for labeling is utilization: Feature Choice achieves 0% dead features at 6k, 16m, and 34m scales, whereas other methods retain substantial dead-feature rates (Ayonrinde, 2024). This implies far broader label coverage, because a dead feature cannot be labeled and a severely underused feature provides too little evidence for robust naming.
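The dead-feature rate behind this comparison is straightforward to compute from a sample of sparse codes. A minimal sketch follows, assuming each code is stored as a dictionary from feature index to activation; the function name, the storage format, and the toy codes are illustrative assumptions.

```python
def dead_feature_rate(codes, num_features):
    """Fraction of features that never take a positive value in the sample.

    codes: iterable of sparse codes, each a dict {feature index: activation}.
    """
    fired = set()
    for code in codes:
        fired.update(index for index, value in code.items() if value > 0)
    return 1.0 - len(fired) / num_features

# Three toy codes over a hypothetical 8-feature dictionary.
codes = [{0: 1.2, 3: 0.4}, {3: 0.9}, {1: 0.0, 5: 2.0}]
rate = dead_feature_rate(codes, num_features=8)
```

Only features 0, 3, and 5 fire in this sample, so five of the eight features would have no activating examples from which to derive a label.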

KronSAE alters the unit of labeling by factorizing the encoder into head-wise base and extension pre-latents whose pairwise interactions generate the actual post-latent features (Kurochkin et al., 28 May 2025). Each head computes its own base and extension pre-latents, and each post-latent is formed from one base pre-latent paired with one extension pre-latent within that head.

The paper’s qualitative analysis suggests that post-latents are usually more interpretable and more specific than their pre-latent parents, while pre-latents are broader and often polysemantic (Kurochkin et al., 28 May 2025). This suggests that in factorized SAE architectures, the most natural labels attach to composed post-latents, with parent-factor descriptions serving as explanatory metadata rather than primary labels.

7. Domain-specific evidence and the limits of semantic labels

Across modalities, the literature supports the possibility of feature labels but also exposes important limits.

In language-model continual learning, SAE-FD uses SAE features functionally rather than interpretively. It stores per-token last-layer MLP activations, encodes them with a frozen SAE, and preserves representation direction and active-feature magnitudes across tasks via feature-space distillation (Zhang et al., 25 May 2026). The method assumes that the active feature set

$\mathcal{A}(\mathbf{h}) = \{\, i : f_i > 0 \,\}$

is the subset worth protecting (Zhang et al., 25 May 2026). This is indirect evidence for selective semantic meaning, but the paper does not enumerate or label individual features.

In voice embeddings, feature labels are demonstrably possible but dataset-conditioned. A single feature can act as “Spanish” or “IVR music,” yet the error analysis shows that “Spanish” partly captures Spanish-speaker acoustic properties rather than a purely transcript-level language concept (Pluth et al., 31 Jan 2025). Moreover, as latent width increases, “Spanish” splits into male and female subfeatures, illustrating that a label can be correct only at a particular dictionary size (Pluth et al., 31 Jan 2025). This argues for cautious, domain-aware naming such as “Spanish-associated speech” rather than universal conceptual claims.

In aircraft vision, feature labels such as “rear landing gear,” “fuselage windows on commercial airliners,” or “cockpit region of old warplanes” are meaningful and class-relevant, but feature-space suppression effects are small relative to input-space ablations (Sharma, 13 Jun 2026). The features are thus useful pointers to model-relevant evidence without necessarily being independently decisive causal units. The paper explicitly concludes that sparse features are partially interpretable yet still polysemantic and coarsely localized (Sharma, 13 Jun 2026).

These cases converge on a general pattern. Feature labels are most trustworthy when supported by several of the following:

coherent top-activating examples,
stability across many examples rather than only the strongest few,
meaningful error analysis,
causal relevance under intervention,
robustness across nearby activation regimes,
geometric or hierarchical context.

Even then, labels often remain approximate. The aircraft paper states this most plainly, but the same caution is implicit throughout the literature (Sharma, 13 Jun 2026).
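The checklist above can be kept alongside each label as a small record, so that a label's standing is visible wherever it is used. The sketch below is one hypothetical layout: the class name, the evidence keys, the feature index, and the default cutoff of three supporting kinds are all assumptions rather than anything specified in the cited work.

```python
from dataclasses import dataclass, field

# One key per evidence type in the checklist above (names are illustrative).
EVIDENCE_KINDS = (
    "top_examples",
    "broad_stability",
    "error_analysis",
    "causal_intervention",
    "regime_robustness",
    "structural_context",
)

@dataclass
class FeatureLabel:
    feature_id: int
    text: str
    evidence: dict = field(default_factory=dict)  # evidence kind -> short note

    def supported_by(self):
        """Evidence kinds that have a non-empty note, in checklist order."""
        return [kind for kind in EVIDENCE_KINDS if self.evidence.get(kind)]

    def status(self, minimum=3):
        """'supported' once at least `minimum` evidence kinds are present."""
        return "supported" if len(self.supported_by()) >= minimum else "provisional"

label = FeatureLabel(
    feature_id=0,  # hypothetical index
    text="cockpit region of old warplanes",
    evidence={"top_examples": "manual inspection of top-activating patches"},
)
```

With a single kind of evidence recorded, `label.status()` returns `"provisional"`, which matches the treatment of top-context labels as descriptions awaiting further validation.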

8. Common misconceptions and points of controversy

A recurring misconception is that an SAE feature label is equivalent to proof of monosemanticity. Recent work repeatedly rejects this. Features can be partially interpretable yet still mixed, absorption-prone, context-dependent, or only locally well described (Li et al., 9 Oct 2025). The pairwise-matrix results show that a feature may admit a plausible top-context label while governing a broader causal axis than that label suggests (Riegler et al., 4 May 2026).

Another misconception is that better reconstruction guarantees better labels. AEN-SAE argues the opposite: reconstruction alone can coexist with dead features, unstable support, heavy redundancy, or collapsed usage, all of which degrade the effective labelable inventory (Chaudhry et al., 6 May 2026). ATM and OrtSAE similarly show that geometric pathologies like absorption and composition can undermine label reliability even when reconstruction quality remains strong (Li et al., 9 Oct 2025).

A further misconception is that there should be a single canonical label per feature across runs or dictionary sizes. Feature splitting, basis variation, and non-canonical decompositions argue against this (Luo et al., 12 Feb 2026). In SALVE, exact latent indices vary across seeds even when the intervention effect on the underlying concept is stable (Flovik, 17 Dec 2025). In voice embeddings, “Spanish” becomes “Spanish male” and “Spanish female” at higher latent widths (Pluth et al., 31 Jan 2025). This suggests that stability may occur more at the concept level than at the exact-coordinate level.

A final controversy concerns the appropriate validation metric. Sparse probing often indicates strong concept alignment for cosine-scored SAEs, yet LLM-based per-feature interpretability rates can be roughly tied with standard SAEs when alive counts are matched (Naihin et al., 13 Jun 2026). This suggests that “easier to label,” “better aligned with benchmark concepts,” and “more interpretable to an automated judge” are not identical properties. The field has not converged on a single definitive metric for feature-label quality.

9. Outlook

Recent work suggests that sparse autoencoder feature labels are best understood not as isolated names attached to latent indices, but as structured hypotheses about latent semantics. The strongest current evidence indicates that labeling quality depends on upstream choices about score geometry, sparsity allocation, anti-absorption mechanisms, and dictionary organization (Naihin et al., 13 Jun 2026). Training procedures such as ATM, OrtSAE, AEN-SAE, Feature Choice, and HSAE all target properties that are upstream of labeling—stability, utilization, atomicity, hierarchy, or concept alignment—rather than text generation of labels themselves (Li et al., 9 Oct 2025).

At the same time, the field is moving from flat, one-feature-at-a-time naming toward richer labeling protocols. These include hierarchical labeling of parent and child features (Luo et al., 12 Feb 2026), relation-aware labeling using geometry and local feature “crystals” (Li et al., 2024), grounding through Grad-FAM or top-activating patches in vision (Flovik, 17 Dec 2025), and matrix-style causal validation that tests a feature across coefficients and joint perturbation conditions (Riegler et al., 4 May 2026). A plausible implication is that future feature-labeling systems will not merely output a phrase, but will attach that phrase to a richer evidential object: activation exemplars, causal response curves, neighborhood relations, and hierarchical placement.

The present consensus is therefore cautious but affirmative. SAE features are often labelable latent units in principle, and recent methods can make them more stable, more atomic, more utilized, or more concept-aligned. But a feature label remains, in most cases, an approximate and method-dependent semantic summary rather than a complete description of a self-contained causal mechanism (Zhang et al., 25 May 2026).