AbsTopK: Bidirectional Sparse Autoencoder
- AbsTopK is a variant of sparse autoencoders that removes non-negativity constraints by applying hard thresholding to the k largest absolute activations, enabling bidirectional semantic encoding.
- It is derived by reformulating the SAE encoder as a one-step proximal gradient update on the dictionary learning objective, leading to improved reconstruction fidelity and a compact latent representation.
- Experimental evaluations show that AbsTopK enhances interpretability and semantic steering in LLMs, matching or surpassing supervised methods on various probing tasks.
AbsTopK is a variant of sparse autoencoders (SAEs) designed for representing and interpreting latent bidirectional features in the activations of LLMs. The method emerges from a principled reformulation of SAEs in terms of proximal gradient steps applied to the dictionary learning objective, and addresses a key limitation of conventional SAE regularizers: the enforced non-negativity that fragments semantic axes into separate, redundant features. AbsTopK applies hard thresholding over the k largest-magnitude activations, preserving both positive and negative values and enabling a single factor to encode contrasting semantic concepts. Experimental results demonstrate improved reconstruction fidelity, interpretability, and bidirectional conceptual representation relative to prior SAE variants, with AbsTopK matching or surpassing supervised methods such as Difference-in-Mean on a range of LLMs and probing/steering tasks (Zhu et al., 1 Oct 2025).
1. Dictionary Learning and Proximal Gradient Unrolling
The AbsTopK framework is grounded in the classical dictionary learning objective:
$$
\min_{z}\;\tfrac{1}{2}\,\| x - W z - b \|_2^2 + \lambda\, R(z)
$$

Here, $x \in \mathbb{R}^d$ is a hidden state vector from an LLM, $W \in \mathbb{R}^{d \times m}$ is a learnable dictionary (or feature matrix), $b \in \mathbb{R}^d$ is a bias vector, $z \in \mathbb{R}^m$ is the latent (sparse) code, and $R(\cdot)$ is a sparsity-inducing regularizer with weight $\lambda$. The optimal code can be obtained (in principle) by iterative algorithms, but the forward computation in SAEs is instead approximated with a single proximal gradient update. For step size $\eta$ and initial code $z^{(0)}$, the update is:

$$
z = \operatorname{prox}_{\eta\lambda R}\!\left( z^{(0)} + \eta\, W^{\top}\!\left( x - W z^{(0)} - b \right) \right),
$$

which for $z^{(0)} = 0$ reduces to $z = \operatorname{prox}_{\eta\lambda R}\!\left( \eta\, W^{\top} (x - b) \right)$.
In conventional SAEs, this is implemented as an encoder of the form $z = \sigma(W_{\mathrm{enc}} x + b_{\mathrm{enc}})$ with learnable $W_{\mathrm{enc}}$ and $b_{\mathrm{enc}}$. The choice of $R$ in the prox operator determines the variant.
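The correspondence between the one-step proximal update and a standard SAE encoder can be made concrete in a few lines. The following PyTorch sketch is illustrative only; the function and variable names (`one_step_prox_encoder`, `prox`, etc.) are assumptions for exposition, not the paper's released code, and the single-example shapes are chosen for readability.

```python
import torch

def one_step_prox_encoder(x, W, b, prox, eta=1.0):
    """One unrolled proximal gradient step on the dictionary learning
    objective 0.5 * ||x - W z - b||^2 + lambda * R(z), starting from z = 0.

    x    : (d,) hidden state vector from the LLM
    W    : (d, m) learnable dictionary (decoder matrix)
    b    : (d,) bias vector
    prox : callable implementing the proximal operator of eta * lambda * R
    eta  : gradient step size
    """
    # The gradient step from z = 0 lands at eta * W^T (x - b);
    # this is the SAE preactivation.
    pre_activation = eta * (W.T @ (x - b))
    # The choice of prox (shifted ReLU, TopK, AbsTopK, ...) fixes the variant.
    return prox(pre_activation)
```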
2. Limitations of Conventional SAE Regularizers
Prominent SAE variants (ReLU SAE, JumpReLU SAE, TopK SAE) rely on regularizers with non-negativity constraints:
- ReLU SAE: $R(z) = \|z\|_1 + \iota_{\{z \ge 0\}}(z)$, an $\ell_1$ penalty plus a non-negativity constraint, with corresponding prox operator $\operatorname{prox}(v) = \max(v - \eta\lambda,\, 0)$, i.e., a shifted ReLU.
- TopK SAE: only the top $k$ nonnegative components are retained; all others are set to zero.
This structural constraint means that each latent feature can capture only one direction of a semantic axis. For example, to represent the opposing poles “masculine” and “feminine” of a single axis in feature space, such designs must allocate a distinct feature to each pole, fragmenting the representation and doubling the number of features needed to cover antonymic or dichotomous semantic axes.
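For concreteness, the two non-negative prox operators described above can be sketched as follows. This is an illustrative PyTorch rendering under the standard forms assumed here, not the paper's reference implementation (here `lam` plays the role of $\eta\lambda$).

```python
import torch

def relu_sae_prox(v, lam):
    """Prox of an l1 penalty plus a non-negativity constraint:
    a shifted ReLU that zeroes out everything below the threshold."""
    return torch.clamp(v - lam, min=0.0)

def topk_sae_prox(v, k):
    """Keep only the k largest non-negative entries; zero out the rest."""
    v = torch.clamp(v, min=0.0)          # enforce non-negativity first
    vals, idx = torch.topk(v, k)         # k largest remaining entries
    out = torch.zeros_like(v)
    out[idx] = vals
    return out
```

Both operators can only return non-negative codes, which is precisely the restriction AbsTopK removes.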
3. AbsTopK: Hard Thresholding in Magnitude
AbsTopK removes the non-negativity constraint and is derived from a pure sparsity constraint ($\|z\|_0 \le k$) with no sign restriction. Its encoder is defined by:

$$
z_i = \begin{cases} v_i, & i \in \Omega_k(v) \\ 0, & \text{otherwise,} \end{cases}
$$

where $v = W_{\mathrm{enc}} x + b_{\mathrm{enc}}$ is the preactivation vector and $\Omega_k(v)$ denotes the indices of the $k$ largest entries of $v$ in absolute value. This operator preserves both positive and negative activations, enabling a single latent feature to encode a full semantic axis (e.g., with negative values for “feminine” and positive for “masculine”).
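A minimal sketch of this operator in PyTorch (illustrative; the function name `abstopk` and the single-vector shape are assumptions, not the paper's API):

```python
import torch

def abstopk(v, k):
    """Hard-threshold to the k largest-magnitude entries of v,
    keeping their signs; all other entries are set to zero."""
    idx = torch.topk(v.abs(), k).indices   # indices of the k largest |v_i|
    out = torch.zeros_like(v)
    out[idx] = v[idx]                      # preserve signed values
    return out
```

Unlike the TopK prox above, a strongly negative preactivation survives thresholding here, which is what allows a single feature to fire in opposite directions for the two poles of a semantic axis.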
4. Experimental Evaluation: Fidelity and Interpretation
AbsTopK was benchmarked against TopK and JumpReLU SAEs across four LLMs (GPT-2 Small, Pythia-70M, Gemma2-2B, and Qwen3-4B) on seven probing and steering tasks. Key findings include:
- Reconstruction Fidelity: AbsTopK achieves lower training MSE and lower normalized reconstruction error, indicating that the sparse latent codes more faithfully reconstruct the original activations.
- Interpretability: In qualitative analyses with antonymic sentence pairs (e.g., “man”/“woman”), a single AbsTopK feature shows strong bidirectional responses, in contrast to fragmented encoding in other SAE variants.
- Semantic Steering: Manipulating the activation of an AbsTopK feature results in bidirectional control over generative outputs, facilitating interpretable interventions (a minimal sketch of such an intervention follows this list).
- Comparison with Supervised DiM: AbsTopK equals or surpasses the Difference-in-Mean method (a label-dependent, supervised probe previously reported to outperform unsupervised SAEs) without requiring labeled data for each concept. This suggests that the bidirectional sparsity constraint recovers semantically meaningful axes with higher efficiency.
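A hypothetical sketch of the steering intervention described above: shift a hidden state along one learned dictionary direction, with the sign of the coefficient selecting the pole of the semantic axis. The function name, scaling convention, and choice of intervention site are illustrative assumptions, not details from the paper.

```python
import torch

def steer_hidden_state(h, W, feature_idx, alpha):
    """Shift a hidden state along one dictionary direction.

    h           : (d,) hidden state at the chosen layer
    W           : (d, m) learned dictionary (decoder matrix)
    feature_idx : index of the bidirectional feature to manipulate
    alpha       : signed steering strength; positive pushes toward one pole
                  of the semantic axis, negative toward the opposite pole
    """
    direction = W[:, feature_idx]
    return h + alpha * direction
```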
5. Broader Implications and Applications
AbsTopK’s design closes the gap between unsupervised and supervised interpretability techniques for LLMs. By enabling a compact and bidirectional representation, AbsTopK provides:
- Enhanced feature economy: fewer latent units are required to span the conceptual space, since both ends of a semantic axis can be encoded by a single feature.
- Improved downstream utility: feature manipulation and attribution analyses benefit from the ability to target a full spectrum of concepts with a single feature.
- Richer interpretability: direct mapping between semantic axes and single latent features simplifies the construction of concept probes and model steering interventions.
A plausible implication is that AbsTopK could become the standard unsupervised method for extracting and manipulating interpretable semantic features in transformer-based LLMs, especially where bidirectionality is crucial.
6. Theoretical and Methodological Significance
The unrolling of proximal gradient descent in the dictionary learning objective provides a unifying perspective for understanding existing SAE architectures as one-step proximal updates with varying regularizers. AbsTopK is, in this context, the canonical application of the hard sparsity constraint in the absence of non-negativity. This principled derivation positions AbsTopK as a theoretically grounded baseline for future work on sparse interpretable models in high-dimensional neural representations.
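Assuming the illustrative sketches from the earlier sections (`one_step_prox_encoder`, `relu_sae_prox`, `topk_sae_prox`, `abstopk`), this perspective can be summarized in a short usage example: each SAE variant is the same one-step encoder with a different prox operator. Dimensions and values below are placeholders.

```python
import torch

d, m, k = 768, 16384, 32                 # illustrative dimensions
x = torch.randn(d)                       # a hidden state from the LLM
W = torch.randn(d, m) / d ** 0.5         # dictionary (decoder matrix)
b = torch.zeros(d)

# Same unrolled update, three different prox operators -> three SAE variants.
z_relu    = one_step_prox_encoder(x, W, b, lambda v: relu_sae_prox(v, lam=0.1))
z_topk    = one_step_prox_encoder(x, W, b, lambda v: topk_sae_prox(v, k))
z_abstopk = one_step_prox_encoder(x, W, b, lambda v: abstopk(v, k))
```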
7. Comparison with Related Operators
AbsTopK is functionally distinct from recent differentiable top-k-in-magnitude operators developed for neural network sparsification (see Sander et al., 2023). While those approaches enable end-to-end gradient flow for tasks like pruning and routing, AbsTopK is directly optimized for feature interpretability in the context of LLM hidden states, and is structurally simpler, using exact hard thresholding rather than a smooth convex relaxation. This suggests complementary but non-overlapping use cases: AbsTopK for conceptual feature extraction and interpretability, differentiable magnitude top-k for sparse routing and end-to-end optimization.
In summary, AbsTopK rethinks the architecture of sparse autoencoders for interpretable feature extraction in LLMs by applying hard magnitude-based sparsity without non-negativity constraints. This enables bidirectional, compact, and interpretable semantic features, bridging gaps between unsupervised and supervised methods and setting a principled foundation for semantic axis discovery and manipulation in neural representations (Zhu et al., 1 Oct 2025).