AbsTopK Sparse Autoencoders
- The paper introduces AbsTopK SAEs that apply an absolute top-K operator to select bidirectional activations, offering precise control over feature sparsity.
- These methods include variants like BatchTopK and Top-AFA, which address limitations in hyperparameter selection and semantic redundancy through adaptive sparsity.
- Empirical results show improved reconstruction fidelity, interpretability, and fairness in LLMs, demonstrating the practical benefits of the approach.
AbsTopK SAEs
Sparse autoencoders (SAEs) are a principal tool in mechanistic interpretability, enabling the decomposition of neural network activations—particularly from LLMs—into sparse, human-interpretable features. "AbsTopK SAE" denotes a family of architectures and methods that enforce hard sparsity by selecting features with the largest-magnitude (absolute-value) activations, without necessarily imposing a non-negativity constraint, and often with additional innovations that address the drawbacks of classic TopK SAEs, including hyperparameter selection, semantic redundancy, and feature selection for control or fairness purposes. The term “AbsTopK” is used both in a strict per-sample top-K-magnitude setting and as a shorthand for related variants such as BatchTopK and top-AFA SAEs, depending on context.
1. Mathematical Definition and Variants
The canonical AbsTopK SAE variant applies an absolute-value top-K operator to the latent code produced by the encoder. For a sample activation $x \in \mathbb{R}^d$ and encoder parameters $W_{\mathrm{enc}} \in \mathbb{R}^{m \times d}$, $b_{\mathrm{enc}} \in \mathbb{R}^m$:
- Pre-activation: $u = W_{\mathrm{enc}} x + b_{\mathrm{enc}}$
- AbsTopK projection: $z_i = u_i$ if $i \in H$, else $z_i = 0$,
where $H$ is the set of indices of the $K$ largest $|u_i|$.
This hard-thresholding operator preserves both positive and negative high-magnitude components, ensuring bidirectionality of features—i.e., a single learned feature can represent opposing conceptual directions (“man” versus “woman”) by the sign of its activation (Zhu et al., 1 Oct 2025).
Typical decoding proceeds via $\hat{x} = W_{\mathrm{dec}}\, z + b_{\mathrm{dec}}$ with $W_{\mathrm{dec}} \in \mathbb{R}^{d \times m}$.
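As a concrete illustration with hypothetical numbers: for $u = (2.0, -3.5, 0.4, -0.1)$ and $K = 2$, AbsTopK gives $H = \{1, 2\}$ and $z = (2.0, -3.5, 0, 0)$, retaining the large negative component; a non-negative TopK scheme would instead keep $(2.0, 0, 0.4, 0)$ and require a second feature to represent the opposing direction.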
Variants include:
- BatchTopK: Instead of enforcing top-K per-token, a global batch-level top-K (among all pre-activations in a batch) is selected, exactly allocating a fixed average sparsity budget while permitting per-example adaptivity (Bussmann et al., 2024).
- Sampled-SAE: Interpolates between BatchTopK and AbsTopK by restricting per-sample top-K selection to a candidate feature pool defined by batch-wide feature statistics (e.g., per-feature activation norms) (Oozeer et al., 29 Aug 2025).
- Top-AFA: Eliminates the $K$ hyperparameter entirely, choosing the minimal number of active features whose decoder-weighted norm matches the input norm, per an explicit quasi-orthogonality-derived criterion (Lee et al., 31 Mar 2025).
2. Theoretical Basis and Motivations
AbsTopK SAEs are motivated by several theoretical and empirical findings:
- Linear Representation Hypothesis (LRH): Neural activations in deep models can be recast as (typically high-dimensional) sparse linear expansions: $x \approx W_{\mathrm{dec}}\, z$, with $z$ sparse (Lee et al., 31 Mar 2025).
- Superposition Hypothesis (SH): The dictionary of $m$ features can (and should) substantially exceed the embedding size $d$, producing an overcomplete representation: $m \gg d$.
- Problem of Heuristic $K$: Classic TopK enforces a fixed per-example $K$, which has no theoretical justification, and can lead to under- or over-activation of features depending on input complexity and actual dictionary geometry.
- Quasi-Orthogonality and Norm Preservation: For nearly-orthogonal decoders (small off-diagonal entries in $W_{\mathrm{dec}}^\top W_{\mathrm{dec}}$), the norm of the encoded feature vector $\|z\|_2$ is well-approximated by $\|x\|_2$ (the dense input norm), with explicit error bounds. This justifies norm-matching constraints in activation selection and informs new architectures (e.g., top-AFA) (Lee et al., 31 Mar 2025).
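The norm-preservation property can be checked numerically. The sketch below (not from the cited work) uses random unit-norm decoder columns as a stand-in for a trained, nearly orthogonal dictionary; all dimensions and names are illustrative:

```python
# Numerical check: for nearly-orthogonal unit-norm decoder columns, ||W_dec z|| ~ ||z||.
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 768, 4096, 32                                  # embedding dim, dictionary size, active features

W_dec = rng.standard_normal((d, m))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)    # unit-norm columns, small off-diagonal Gram entries

z = np.zeros(m)
active = rng.choice(m, size=k, replace=False)
z[active] = rng.standard_normal(k)                       # sparse code with both signs

x = W_dec @ z                                            # dense activation implied by the sparse code
rel_err = abs(np.linalg.norm(x) - np.linalg.norm(z)) / np.linalg.norm(z)
print(f"relative norm mismatch: {rel_err:.3f}")          # small when the decoder is quasi-orthogonal
```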
3. Algorithmic Implementations
AbsTopK Forward Pass Pseudocode
```python
import numpy as np

def AbsTopK_Encode(x, W, b_e, k):
    u = W.T @ x + b_e                    # Pre-activation (W has shape (d, m), x has shape (d,))
    a = np.abs(u)                        # Magnitudes
    H = np.argpartition(a, -k)[-k:]      # Indices of the top-k |u_i|
    z = np.zeros_like(u)
    z[H] = u[H]                          # Preserve the sign of selected activations
    return z
```
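A brief usage illustration of the sketch above; dimensions and weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 768, 4096, 32
x = rng.standard_normal(d)                     # a model activation
W = rng.standard_normal((d, m)) / np.sqrt(d)   # placeholder encoder weights
b_e = np.zeros(m)

z = AbsTopK_Encode(x, W, b_e, k)
print(np.count_nonzero(z))                     # exactly k nonzero entries
print((z > 0).sum(), (z < 0).sum())            # both signs can appear (bidirectionality)
```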
BatchTopK
Given a batch of $n$ samples:
- Flatten all pre-activations $\{u^{(j)}_i\}$ across the batch and keep the top $nK$ entries.
- For each sample $j$, $z^{(j)}_i = u^{(j)}_i$ if $u^{(j)}_i \ge \tau$, otherwise $0$, where $\tau$ is the $nK$-th largest entry over the batch (Bussmann et al., 2024).
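A minimal sketch of this selection, assuming a matrix of pre-activations for the whole batch; training details such as threshold caching for inference are omitted:

```python
# Batch-level top-K: the sparsity budget n*k is shared across the batch rather than per sample.
import numpy as np

def batch_topk(U, k):
    """U: (n, m) encoder pre-activations for n samples; keep the n*k largest entries
    over the whole batch and zero out the rest (ties may admit a few extra entries)."""
    n = U.shape[0]
    budget = n * k
    tau = np.partition(U.ravel(), -budget)[-budget]   # value of the (n*k)-th largest entry
    return np.where(U >= tau, U, 0.0)
```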
Top-AFA SAE
- Compute per-feature contributions $c_i = |u_i| \cdot \|W_{\mathrm{dec},\,i}\|_2$ (decoder-weighted magnitudes), sort in descending order, and accumulate.
- Set $K$ to the minimal $k$ such that the accumulated contribution of the top $k$ features reaches $\|x\|_2$ (norm-matching).
- Mask out all but the top $K$ features.
Loss includes both reconstruction and an auxiliary AFA loss enforcing the norm-matching condition (Lee et al., 31 Mar 2025).
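The selection rule can be sketched as follows; this illustrates the accumulate-until-norm-matched idea described above rather than the exact criterion of Lee et al. (31 Mar 2025), and the contribution weighting and names are assumptions:

```python
import numpy as np

def top_afa_select(u, W_dec, x):
    """u: (m,) pre-activations; W_dec: (d, m) decoder; x: (d,) input activation."""
    contrib = np.abs(u) * np.linalg.norm(W_dec, axis=0)   # decoder-weighted magnitudes (assumed form)
    order = np.argsort(-contrib)                          # sort features by descending contribution
    cum = np.cumsum(contrib[order])
    k = int(np.searchsorted(cum, np.linalg.norm(x))) + 1  # minimal count whose cumulative contribution reaches ||x||
    z = np.zeros_like(u)
    keep = order[:k]
    z[keep] = u[keep]                                     # preserve the signs of selected features
    return z
```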
4. Architectural and Semantic Consequences
Bidirectionality
AbsTopK preserves both signs, enabling single features to encode conceptual axes. This removes the redundancy imposed by non-negative-only schemes (TopK, JumpReLU), where concepts like “male” and “female” would require two features instead of one bidirectional axis. This design enables richer, more compact representations (Zhu et al., 1 Oct 2025).
Adaptivity and Interpretability
- BatchTopK: Directly specifies average sparsity, allocating more latent “budget” to information-rich tokens, less to trivial ones.
- Top-AFA: Input-dependent selection, with no global hyperparameter, gives adaptive per-sample sparsity, aligning the activated feature norm to the input’s actual norm per encoding geometry (Lee et al., 31 Mar 2025).
- Sampled-SAE: Allows trade-off between global feature consistency and token-specific expressivity, controlled by a candidate pool multiplier (Oozeer et al., 29 Aug 2025).
Feature Selection and Control
Unconventional selection strategies (sometimes also called AbsTopK) operate directly on encoder feature salience metrics (e.g., learned probe coefficients, Wasserstein distances), projecting out attribute directions for control or debiasing independently of decoder weight statistics (Bărbălau et al., 13 Sep 2025).
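The projection step underlying such control can be sketched generically; this is not the exact SₚTopK procedure, and the attribute axis v is assumed to come from a salience-ranked selection as described above:

```python
import numpy as np

def project_out(x, v):
    """Remove the component of activation x along an attribute axis v."""
    v = v / np.linalg.norm(v)        # ensure the axis is unit norm
    return x - np.dot(v, x) * v      # orthogonal projection away from v
```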
5. Empirical Performance and Comparison to Alternatives
Experiments demonstrate consistent empirical improvements of AbsTopK and its generalizations over classical TopK and JumpReLU SAEs:
- Reconstruction fidelity: AbsTopK attains lower normalized MSE and higher loss-recovered rates across four LLMs (Zhu et al., 1 Oct 2025), and outperforms TopK on both standard and domain-specific applications (e.g., protein LLMs (Haque et al., 5 Dec 2025)).
- Interpretability: Higher mean disentanglement scores (SCR, TPP, Sparse Probing), fewer redundant or fragmented features, and enhanced conceptual alignment (Zhu et al., 1 Oct 2025).
- Downstream utility: No loss in main-task performance metrics (e.g., MMLU, HarmBench) while achieving higher fairness in retrieval and attribute control tasks (Bărbălau et al., 13 Sep 2025).
- Steering and control: When paired with feature ranking strategies based on output effect, top AbsTopK features enable effective linear steering, equalling or surpassing strong supervised methods (Difference-in-Means, LoRA) (Arad et al., 26 May 2025).
- Parameter efficiency: AbsTopK achieves higher performance at lower feature dimensionalities due to feature compactness (Zhu et al., 1 Oct 2025).
The table below summarizes select empirical outcomes:
| Variant | Reconstruction (MSE ↓) | Interpretability (e.g., SCR) | Bidirectional? | Need to tune $K$? |
|---|---|---|---|---|
| TopK | Moderate | Good in simple cases | No | Yes |
| BatchTopK | Lower | Comparable or better | No | Yes (mean only) |
| AbsTopK | Lowest | Highest, compact axes | Yes | Yes |
| Top-AFA | Lowest (or best) | Highest, most adaptive | No | No |
*Lower values denote better performance where indicated.
6. Applications: Steering, Debiasing, and Domain-Specific Uses
- Steering LLMs: Ranking AbsTopK features by downstream output change enables concept-level steering with superior controllability and interpretability (Arad et al., 26 May 2025); a minimal steering sketch follows this list.
- Debiasing VLMs: The Select-and-Project TopK (SₚTopK) framework selects encoder features encoding protected attributes and removes them by constructing and projecting out linear axes, achieving 3.2× improvements in fairness metrics relative to decoder-based masking (Bărbălau et al., 13 Sep 2025).
- Biological Sequence Modelling: In settings such as antibody LLMs, TopK SAEs efficiently extract highly interpretable, residue-level features but may fragment higher-order concepts, whereas Ordered SAEs provide causal steering for more abstract concepts (Haque et al., 5 Dec 2025).
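As referenced in the first item above, a minimal sketch of linear steering with a selected feature direction, assuming access to the decoder matrix; the scaling parameter and names are illustrative:

```python
import numpy as np

def steer(h, W_dec, feature_idx, alpha):
    """Shift activation h (shape (d,)) along the decoder direction of one feature."""
    direction = W_dec[:, feature_idx]
    direction = direction / np.linalg.norm(direction)   # unit-norm feature axis
    return h + alpha * direction                        # alpha > 0 promotes, alpha < 0 suppresses the concept
```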
7. Practical Considerations, Limitations, and Outlook
- Selection of $K$: While AbsTopK variants improve flexibility, traditional TopK still requires heuristic $K$-selection. BatchTopK and top-AFA mitigate this via adaptive or theoretically grounded approaches (Bussmann et al., 2024, Lee et al., 31 Mar 2025).
- Cross-layer sparsity: Standard AbsTopK/TopK trained layerwise can result in feature duplication and dense circuits across layers, which can be mitigated by more advanced architectures (e.g., Staircase SAEs, JSAEs) (Fillingham et al., 10 Nov 2025).
- Activation lottery and absorption: Large candidate pools in batch-based variants risk rare-feature dominance, diluting interpretability; tuning the candidate pool size in Sampled-SAE allows a fine-grained trade-off (Oozeer et al., 29 Aug 2025).
- Steerability vs. interpretability: Classical TopK is effective for domain concept mapping but less reliable for causal sequence control. New hierarchical designs (Ordered SAEs) and bidirectional AbsTopK further improve on this axis (Haque et al., 5 Dec 2025, Zhu et al., 1 Oct 2025).
A plausible implication is that AbsTopK-style selection will remain foundational but will increasingly be incorporated into hybrid or adaptive mechanisms driven by theoretical insights from quasi-orthogonality and empirical results in downstream control, fairness, and interpretability benchmarks.