- The paper introduces AbsTopK, a sparse autoencoder derived from the proximal gradient method that removes the non-negativity constraint of prior SAEs in order to encode bidirectional features.
- The method applies ℓ0-constrained hard thresholding to the largest-magnitude pre-activations, preserving both positive and negative values and improving reconstruction error and downstream model fidelity.
- Empirical results on models such as Qwen3-4B show AbsTopK outperforming TopK and JumpReLU variants on steering and probing tasks while enhancing semantic interpretability.
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
Introduction
The paper introduces a framework for designing Sparse Autoencoders (SAEs) by leveraging the proximal gradient method for sparse coding. Traditional SAEs such as ReLU, JumpReLU, and TopK are limited in capturing bidirectional semantic concepts because their non-negativity constraints force each pole of a concept onto a separate feature. The proposed AbsTopK overcomes this by applying hard thresholding to the largest-magnitude pre-activations, preserving both positive and negative values. This design allows richer conceptual representations and aligns more closely with the inherent bidirectionality of many latent semantic concepts in LLMs.
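To make the mechanism concrete, below is a minimal PyTorch sketch of an AbsTopK-style activation as described above: it keeps the k largest-magnitude entries per row, signs included, and zeroes the rest. The function name and shapes are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def abs_topk(z: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries of each row; zero the rest.

    Unlike TopK, which retains only the k largest non-negative values,
    this preserves the sign of the retained activations.
    """
    # Indices of the k entries with the largest absolute value per row.
    _, idx = z.abs().topk(k, dim=-1)
    mask = torch.zeros_like(z).scatter_(-1, idx, 1.0)
    return z * mask

# With k = 2 both +3.0 and -2.5 survive, so one feature can point in
# either semantic direction depending on its sign.
z = torch.tensor([[3.0, -2.5, 0.4, -0.1]])
print(abs_topk(z, k=2))  # tensor([[ 3.0000, -2.5000,  0.0000,  0.0000]])
```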
Methodology
The authors propose a unified framework that uses proximal operators to derive SAEs from the dictionary-learning problem. This establishes a formal connection between sparsity-inducing regularizers and the activation functions commonly used in SAEs.
Figure 1: AbsTopK enables single latent features to encode opposing concepts by leveraging both positive and negative activations.
Proximal Gradient Method
The proximal gradient method is applied to the sparse coding problem, yielding the update
$$z^{(1)} = \operatorname{prox}_{\lambda R}\bigl(W^{\top} x + b_e\bigr),$$
where $R$ is a sparsity-inducing regularizer, $W$ the dictionary, and $b_e$ the encoder bias. This single step instantiates a one-layer neural network, the characteristic form of an SAE. The newly proposed AbsTopK removes the non-negativity constraint inherent in other SAEs by using an ℓ0-constrained hard thresholding operator, so latent features can take either sign and directly encode bidirectional semantic features.
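A minimal sketch of this correspondence, assuming standard PyTorch: each choice of proximal operator below turns the single update above into a different SAE encoder. The prox_* definitions follow common conventions for these regularizers and may differ in detail from the paper's exact formulation.

```python
import torch

def prox_l1_nonneg(u: torch.Tensor, lam: float) -> torch.Tensor:
    # Prox of lam * ||z||_1 with z >= 0: a ReLU shifted by the threshold lam.
    return torch.relu(u - lam)

def prox_topk(u: torch.Tensor, k: int) -> torch.Tensor:
    # Prox of the constraint ||z||_0 <= k with z >= 0:
    # keep the k largest values per row, clipped at zero.
    vals, idx = u.topk(k, dim=-1)
    return torch.zeros_like(u).scatter_(-1, idx, torch.relu(vals))

def sae_encode(x: torch.Tensor, W: torch.Tensor, b_e: torch.Tensor, prox, **kw):
    # One proximal gradient step z = prox(W^T x + b_e): a single-layer
    # SAE encoder. W has shape (d_model, d_dict).
    return prox(x @ W + b_e, **kw)

# AbsTopK drops the sign constraint entirely: reuse abs_topk from the
# sketch in the Introduction, e.g. z = sae_encode(x, W, b_e, abs_topk, k=32)
```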
Experiments
The empirical validation involved testing across four LLMs, including Qwen3-4B and Gemma2-2B, comparing AbsTopK against TopK and JumpReLU variants. Metrics such as MSE training loss, normalized reconstruction error, and cross-entropy loss recovery were used to evaluate the approaches; a sketch of these metrics follows the figures below.
Figure 2: Performance comparison of JumpReLU, TopK, and AbsTopK SAEs on Qwen3-4B Layer 20, illustrating significant error reduction and model fidelity preservation.
Figure 3: Performance comparison of SAE variants across tasks on Qwen3-4B Layer 18, indicating overall superior task performance by AbsTopK.
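As a point of reference, the sketch below shows one common way the reconstruction and fidelity metrics above are computed for SAEs; the paper's exact normalizations may differ.

```python
import torch

def normalized_reconstruction_error(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # ||x - x_hat||^2 / ||x||^2, averaged over a batch of activations.
    return ((x - x_hat).pow(2).sum(-1) / x.pow(2).sum(-1)).mean()

def ce_loss_recovered(ce_clean: float, ce_sae: float, ce_ablated: float) -> float:
    # Fraction of the cross-entropy gap recovered when the SAE reconstruction
    # replaces the original activation: 1.0 means the model behaves as if
    # untouched, 0.0 means no better than ablating the activation entirely.
    return (ce_ablated - ce_sae) / (ce_ablated - ce_clean)
```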
AbsTopK consistently outperformed the other variants, showing greater representational capacity and interpretability: because activations carry sign, a single feature can compactly encode contrasting semantic directions (e.g., male vs. female).
Results on Steering and Probing Tasks
AbsTopK was assessed on steering tasks, where it enabled precise, controlled interventions in LLMs without severely degrading broader model capabilities. This contrasts with simpler baselines such as Difference-in-Means (DiM), which require labeled datasets and struggle to capture the multidimensional and bidirectional features inherent in LLMs. A sketch of this kind of intervention follows.
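A hedged sketch of feature-based steering with an SAE, assuming a PyTorch model: add a scaled unit decoder direction of one feature to a layer's output via a forward hook. The function name, argument layout, and module path in the usage comment are assumptions; with a bidirectional AbsTopK feature, flipping the sign of the coefficient steers in the opposite direction.

```python
import torch

def add_steering_hook(layer_module, W_dec: torch.Tensor, feat_idx: int, alpha: float):
    # W_dec is assumed to have shape (d_dict, d_model): one decoder
    # direction per latent feature.
    direction = W_dec[feat_idx]
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks may return a tuple; steer the hidden states only.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer_module.register_forward_hook(hook)

# Hypothetical usage: steer one layer, generate, then clean up.
# handle = add_steering_hook(model.layers[18], W_dec, feat_idx=123, alpha=4.0)
# ... run generation ...
# handle.remove()
```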
Implications and Future Work
The introduction of AbsTopK advances interpretability research for LLMs by enabling feature representations that align with the natural bidirectionality of semantic concepts. Future work could explore multi-step proximal updates to capture finer-grained semantic structure, and could improve the computational efficiency of these richer encoders in large-scale models.
Conclusion
AbsTopK addresses a fundamental limitation of existing SAEs by eliminating the non-negativity constraint, making bidirectional features within LLMs easier to identify and manipulate. Its efficacy across multiple models demonstrates its potential to improve the interpretability and controllability of contemporary AI systems, and it points toward more nuanced, broader-spectrum feature encodings. The work also lays groundwork for further exploration of efficient proximal algorithms and their applicability beyond natural language processing.