AbsTopK: Bidirectional Sparse Autoencoder
- AbsTopK is a variant of sparse autoencoders that removes non-negativity constraints by applying hard thresholding to the k largest absolute activations, enabling bidirectional semantic encoding.
- It is derived by reformulating the SAE encoder as a one-step proximal gradient update on the dictionary learning objective, leading to improved reconstruction fidelity and a compact latent representation.
- Experimental evaluations show that AbsTopK enhances interpretability and semantic steering in LLMs, matching or surpassing supervised methods on various probing tasks.
AbsTopK is a variant of sparse autoencoders (SAEs) designed for representing and interpreting latent bidirectional features in the activations of LLMs. The method emerges from a principled reformulation of SAEs in terms of proximal gradient steps applied to the dictionary learning objective, and addresses a key limitation of conventional SAE regularizers: the enforced non-negativity that fragments semantic axes into separate, redundant features. AbsTopK applies hard thresholding over the k largest-magnitude activations, preserving both positive and negative values and enabling a single factor to encode contrasting semantic concepts. Experimental results demonstrate improved reconstruction fidelity, interpretability, and bidirectional conceptual representation relative to prior SAE variants, with AbsTopK matching or surpassing supervised methods such as Difference-in-Mean on a range of LLMs and probing/steering tasks (Zhu et al., 1 Oct 2025).
1. Dictionary Learning and Proximal Gradient Unrolling
The AbsTopK framework is grounded in the classical dictionary learning objective:
$$
\min_{z}\;\tfrac{1}{2}\,\| x - W z - b \|_2^2 + \lambda\, R(z)
$$

Here, $x \in \mathbb{R}^d$ is a hidden state vector from an LLM, $W \in \mathbb{R}^{d \times m}$ is a learnable dictionary (or feature matrix), $b \in \mathbb{R}^d$ is a bias vector, $z \in \mathbb{R}^m$ is the latent (sparse) code, and $R(\cdot)$ is a sparsity-inducing regularizer with weight $\lambda$. The optimal code can be obtained (in principle) by iterative algorithms, but the forward computation in SAEs is instead approximated with a single proximal gradient update. For step size $\eta$ and initial code $z^{(0)}$, the update is:

$$
z = \operatorname{prox}_{\eta\lambda R}\!\left( z^{(0)} + \eta\, W^{\top}\!\left( x - W z^{(0)} - b \right) \right),
$$

which for $z^{(0)} = 0$ reduces to $z = \operatorname{prox}_{\eta\lambda R}\!\left( \eta\, W^{\top} (x - b) \right)$.
In conventional SAEs, this is implemented as an encoder of the form $z = \sigma(W_{\mathrm{enc}} x + b_{\mathrm{enc}})$ with learnable $W_{\mathrm{enc}}$ and $b_{\mathrm{enc}}$. The choice of $R$ in the prox operator determines the variant.
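The correspondence between the one-step proximal update and a standard SAE encoder can be made concrete in a few lines. The following PyTorch sketch is illustrative only; the function and variable names (`one_step_prox_encoder`, `prox`, etc.) are assumptions for exposition, not the paper's released code, and the single-example shapes are chosen for readability.

```python
import torch

def one_step_prox_encoder(x, W, b, prox, eta=1.0):
    """One unrolled proximal gradient step on the dictionary learning
    objective 0.5 * ||x - W z - b||^2 + lambda * R(z), starting from z = 0.

    x    : (d,) hidden state vector from the LLM
    W    : (d, m) learnable dictionary (decoder matrix)
    b    : (d,) bias vector
    prox : callable implementing the proximal operator of eta * lambda * R
    eta  : gradient step size
    """
    # The gradient step from z = 0 lands at eta * W^T (x - b);
    # this is the SAE preactivation.
    pre_activation = eta * (W.T @ (x - b))
    # The choice of prox (shifted ReLU, TopK, AbsTopK, ...) fixes the variant.
    return prox(pre_activation)
```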
2. Limitations of Conventional SAE Regularizers
Prominent SAE variants (ReLU SAE, JumpReLU SAE, TopK SAE) rely on regularizers with non-negativity constraints:
- ReLU SAE: $R(z) = \|z\|_1 + \iota_{\{z \ge 0\}}(z)$, an $\ell_1$ penalty plus a non-negativity constraint, with corresponding prox operator $\operatorname{prox}(v) = \max(v - \eta\lambda,\, 0)$, i.e., a shifted ReLU.
- TopK SAE: only the top $k$ nonnegative components are retained; all others are set to zero.
This structural constraint means that each latent feature can capture only one direction of a semantic axis. For example, to represent the opposing poles “masculine” and “feminine” of a single axis in feature space, such designs must allocate a distinct feature to each pole, fragmenting the representation and doubling the number of features needed to cover antonymic or dichotomous semantic axes.
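For concreteness, the two non-negative prox operators described above can be sketched as follows. This is an illustrative PyTorch rendering under the standard forms assumed here, not the paper's reference implementation (here `lam` plays the role of $\eta\lambda$).

```python
import torch

def relu_sae_prox(v, lam):
    """Prox of an l1 penalty plus a non-negativity constraint:
    a shifted ReLU that zeroes out everything below the threshold."""
    return torch.clamp(v - lam, min=0.0)

def topk_sae_prox(v, k):
    """Keep only the k largest non-negative entries; zero out the rest."""
    v = torch.clamp(v, min=0.0)          # enforce non-negativity first
    vals, idx = torch.topk(v, k)         # k largest remaining entries
    out = torch.zeros_like(v)
    out[idx] = vals
    return out
```

Both operators can only return non-negative codes, which is precisely the restriction AbsTopK removes.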
3. AbsTopK: Hard Thresholding in Magnitude
AbsTopK removes the non-negativity constraint and is derived from a pure sparsity constraint ($\|z\|_0 \le k$) with no sign restriction. Its encoder is defined by:

$$
z_i = \begin{cases} v_i, & i \in \Omega_k(v) \\ 0, & \text{otherwise,} \end{cases}
$$

where $v = W_{\mathrm{enc}} x + b_{\mathrm{enc}}$ is the preactivation vector and $\Omega_k(v)$ denotes the indices of the $k$ largest entries of $v$ in absolute value. This operator preserves both positive and negative activations, enabling a single latent feature to encode a full semantic axis (e.g., with negative values for “feminine” and positive for “masculine”).
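A minimal sketch of this operator in PyTorch (illustrative; the function name `abstopk` and the single-vector shape are assumptions, not the paper's API):

```python
import torch

def abstopk(v, k):
    """Hard-threshold to the k largest-magnitude entries of v,
    keeping their signs; all other entries are set to zero."""
    idx = torch.topk(v.abs(), k).indices   # indices of the k largest |v_i|
    out = torch.zeros_like(v)
    out[idx] = v[idx]                      # preserve signed values
    return out
```

Unlike the TopK prox above, a strongly negative preactivation survives thresholding here, which is what allows a single feature to fire in opposite directions for the two poles of a semantic axis.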
4. Experimental Evaluation: Fidelity and Interpretation
AbsTopK was benchmarked against TopK and JumpReLU SAEs across four LLMs (GPT-2 Small, Pythia-70M, Gemma2-2B, and Qwen3-4B) on seven probing and steering tasks. Key findings include:
- Reconstruction Fidelity: AbsTopK achieves lower training MSE and lower normalized reconstruction error, indicating that the sparse latent codes more faithfully reconstruct the original activations.
- Interpretability: In qualitative analyses with antonymic sentence pairs (e.g., “man”/“woman”), a single AbsTopK feature shows strong bidirectional responses, in contrast to fragmented encoding in other SAE variants.
- Semantic Steering: Manipulating the activation of an AbsTopK feature results in bidirectional control over generative outputs, facilitating interpretable interventions (a minimal sketch of such an intervention follows this list).
- Comparison with Supervised DiM: AbsTopK equals or surpasses the Difference-in-Mean method (a label-dependent, supervised probe previously reported to outperform unsupervised SAEs) without requiring labeled data for each concept. This suggests that the bidirectional sparsity constraint recovers semantically meaningful axes with higher efficiency.
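A hypothetical sketch of the steering intervention described above: shift a hidden state along one learned dictionary direction, with the sign of the coefficient selecting the pole of the semantic axis. The function name, scaling convention, and choice of intervention site are illustrative assumptions, not details from the paper.

```python
import torch

def steer_hidden_state(h, W, feature_idx, alpha):
    """Shift a hidden state along one dictionary direction.

    h           : (d,) hidden state at the chosen layer
    W           : (d, m) learned dictionary (decoder matrix)
    feature_idx : index of the bidirectional feature to manipulate
    alpha       : signed steering strength; positive pushes toward one pole
                  of the semantic axis, negative toward the opposite pole
    """
    direction = W[:, feature_idx]
    return h + alpha * direction
```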
5. Broader Implications and Applications
AbsTopK’s design closes the gap between unsupervised and supervised interpretability techniques for LLMs. By enabling a compact and bidirectional representation, AbsTopK provides:
- Enhanced feature economy: fewer latent units are required to span the conceptual space, since both ends of a semantic axis can be encoded by a single feature.
- Improved downstream utility: feature manipulation and attribution analyses benefit from the ability to target a full spectrum of concepts with a single feature.
- Richer interpretability: direct mapping between semantic axes and single latent features simplifies the construction of concept probes and model steering interventions.
A plausible implication is that AbsTopK could become the standard unsupervised method for extracting and manipulating interpretable semantic features in transformer-based LLMs, especially where bidirectionality is crucial.
6. Theoretical and Methodological Significance
The unrolling of proximal gradient descent in the dictionary learning objective provides a unifying perspective for understanding existing SAE architectures as one-step proximal updates with varying regularizers. AbsTopK is, in this context, the canonical application of the hard sparsity constraint in the absence of non-negativity. This principled derivation positions AbsTopK as a theoretically grounded baseline for future work on sparse interpretable models in high-dimensional neural representations.
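Assuming the illustrative sketches from the earlier sections (`one_step_prox_encoder`, `relu_sae_prox`, `topk_sae_prox`, `abstopk`), this perspective can be summarized in a short usage example: each SAE variant is the same one-step encoder with a different prox operator. Dimensions and values below are placeholders.

```python
import torch

d, m, k = 768, 16384, 32                 # illustrative dimensions
x = torch.randn(d)                       # a hidden state from the LLM
W = torch.randn(d, m) / d ** 0.5         # dictionary (decoder matrix)
b = torch.zeros(d)

# Same unrolled update, three different prox operators -> three SAE variants.
z_relu    = one_step_prox_encoder(x, W, b, lambda v: relu_sae_prox(v, lam=0.1))
z_topk    = one_step_prox_encoder(x, W, b, lambda v: topk_sae_prox(v, k))
z_abstopk = one_step_prox_encoder(x, W, b, lambda v: abstopk(v, k))
```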
7. Comparison with Related Operators
AbsTopK is functionally distinct from recent differentiable top-k-in-magnitude operators developed for neural network sparsification (see Sander et al., 2023). While those approaches enable end-to-end gradient flow for tasks like pruning and routing, AbsTopK is directly optimized for feature interpretability in the context of LLM hidden states, and is structurally simpler, using exact hard thresholding rather than a smooth convex relaxation. This suggests complementary but non-overlapping use cases: AbsTopK for conceptual feature extraction and interpretability, differentiable magnitude top-k for sparse routing and end-to-end optimization.
In summary, AbsTopK rethinks the architecture of sparse autoencoders for interpretable feature extraction in LLMs by applying hard magnitude-based sparsity without non-negativity constraints. This enables bidirectional, compact, and interpretable semantic features, bridging gaps between unsupervised and supervised methods and setting a principled foundation for semantic axis discovery and manipulation in neural representations (Zhu et al., 1 Oct 2025).