
Task-Conditioned Sparse Autoencoder

Updated 18 March 2026
  • Task-Conditioned Sparse Autoencoders are overcomplete models that enforce sparsity to produce interpretable, monosemantic latent features for analyzing and steering neural network behaviors.
  • They integrate statistical filtering and optimization techniques, such as ANOVA tests, Pearson correlation, and ℓ1 regularization, to select features aligned with downstream objectives.
  • Empirical studies show that these SAEs improve prediction accuracy and enable effective causal interventions, offering robust tools for model interpretability and control in large language models.

A Task-Conditioned Sparse Autoencoder (SAE) is a high-dimensional, overcomplete autoencoder architecture, often applied atop neural network or LLM activations, in which activation sparsity is explicitly or implicitly encouraged to yield monosemantic latent dimensions. When deployed for interpretability, steering, or downstream prediction, these latents act as a sparse feature basis for decomposing or controlling model behaviors. A system is “task-conditioned” if specific features or steering coefficients are selected or optimized based on their statistical relationship to a defined downstream objective, such as a supervised label or correctness indicator, rather than through fully unsupervised pretraining alone. The resulting pipelines enable feature selection, causal intervention, and fine-grained analysis of the internal computation in LLMs and prediction architectures.

1. Architecture and Training of Sparse Autoencoders

At the core of SAE-enabled pipelines is an overcomplete autoencoder with latent dimensionality $k \gg d$ relative to the input dimension $d$, enabling the extraction of a rich, fine-grained sparse feature vocabulary. Given a residual-stream or attention-output vector $x \in \mathbb{R}^d$ from a transformer layer, the encoder and decoder are structured as:

  • Encoder: $z = \sigma(W_{\mathrm{enc}}\,x + b_{\mathrm{enc}}) \in \mathbb{R}^k$, where $\sigma$ may be a nonlinearity such as ReLU, TopK-ReLU, JumpReLU, or a soft-gated variant, and $W_{\mathrm{enc}} \in \mathbb{R}^{k \times d}$.
  • Sparsity Enforcement: This is achieved either by an explicit $\ell_1$ penalty in the loss (as in CorrSteer and ICL circuit analysis) or structurally via topology, e.g., enforcing exactly $K$ nonzero components with TopK-ReLU in SAELens/SAE-FiRE (Zhang et al., 20 May 2025, Cho et al., 18 Aug 2025, Kharlapenko et al., 18 Apr 2025).
  • Decoder: $\hat{x} = W_{\mathrm{dec}}\,z + b_{\mathrm{dec}} \in \mathbb{R}^d$, with $W_{\mathrm{dec}} \in \mathbb{R}^{d \times k}$.

The canonical SAE loss is $\mathcal{L}_{\mathrm{SAE}} = \|x - \hat{x}\|_2^2 + \lambda\,\|z\|_1$, with $\lambda$ controlling the sparsity-accuracy tradeoff (Kharlapenko et al., 18 Apr 2025, Cho et al., 18 Aug 2025).
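A minimal NumPy sketch of the encoder, TopK sparsification, decoder, and canonical loss defined above (variable names are illustrative and not taken from the cited implementations):

```python
import numpy as np

def topk_relu(z, k):
    """Keep only the k largest activations per row; zero the rest (TopK sparsity)."""
    z = np.maximum(z, 0.0)
    if k < z.shape[-1]:
        thresh = np.partition(z, -k, axis=-1)[..., -k][..., None]  # k-th largest value per row
        z = np.where(z >= thresh, z, 0.0)
    return z

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=None):
    """Encode x (shape [n, d]) into sparse latents z (shape [n, k_latent]), then decode."""
    pre = x @ W_enc.T + b_enc                               # W_enc: [k_latent, d]
    z = topk_relu(pre, k) if k else np.maximum(pre, 0.0)    # TopK-ReLU or plain ReLU
    x_hat = z @ W_dec.T + b_dec                             # W_dec: [d, k_latent]
    return z, x_hat

def sae_loss(x, x_hat, z, lam=1e-3):
    """Canonical SAE objective: squared reconstruction error + lambda * L1 sparsity."""
    recon = np.sum((x - x_hat) ** 2, axis=-1).mean()
    return recon + lam * np.abs(z).sum(axis=-1).mean()
```

With the TopK variant, the explicit $\ell_1$ term can be dropped (sparsity is structural), which matches the SAE-FiRE configuration in Table 1.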

Table 1: Key architectural parameters (SAE-FiRE example)

| SAE Variant   | Input Dim $d$ | Latent Dim $k$ | Sparsity Mechanism      |
|---------------|---------------|----------------|-------------------------|
| Gemma2-2B SAE | variable      | 16,000         | TopK-ReLU, $K \ll k$    |
| Gemma2-9B SAE | variable      | 131,000        | TopK-ReLU, $K \ll k$    |

The SAE is typically trained on massive, unsupervised datasets of model activations, with hyperparameters (including $k$, $\lambda$, or $K$) chosen so that only a small fraction (1–5%) of latents activates per input (Kharlapenko et al., 18 Apr 2025).

2. Task Conditioning: Selection and Interpretation of Sparse Features

Task conditioning in sparse autoencoder workflows refers to the process of selecting, interpreting, or constructing features whose relevance is specific to a predefined task or downstream statistical target, rather than utilizing all SAE latents indiscriminately. Current methodologies include:

  • Statistical Filtering: In SAE-FiRE, after projecting input sequences to the sparse code, a post-hoc selection retains only the top-$k$ SAE latents showing the strongest association with the prediction label, as computed by an ANOVA F-test or tree-based importance metrics (Zhang et al., 20 May 2025).
  • Correlation-based Selection: CorrSteer implements fully automated feature selection by computing the Pearson correlation between each SAE feature’s activation and the task correctness signal, keeping only those latents with highest positive correlation (Cho et al., 18 Aug 2025).
  • Optimization-driven Decomposition: In ICL circuit studies, a “task vector” $t$ (e.g., the average activation associated with a behavioral prompt type) is decomposed into a minimal sparse sum of SAE latents via an $\ell_1$-regularized loss to recover task-relevant directions $\theta$, subject to fidelity constraints on model behavior (Kharlapenko et al., 18 Apr 2025).

In all cases, these mechanisms aim to prune the high-dimensional latent space to a task-relevant, interpretable, and computationally tractable subset of features.
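A minimal sketch of correlation-based task conditioning in the style of CorrSteer, assuming a binary correctness signal; the function name and ranking scheme here are illustrative, not the paper's exact implementation:

```python
import numpy as np

def select_features_by_correlation(Z, y, n_keep=10):
    """Rank SAE latents by Pearson correlation between each feature's activations
    Z[:, j] (Z has shape [n_samples, n_latents]) and a binary task signal y
    (1 = correct, 0 = incorrect); keep the most positively correlated latents."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Zc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12  # avoid /0
    corr = (Zc * yc[:, None]).sum(axis=0) / denom                      # per-latent Pearson r
    order = np.argsort(-corr)                                          # descending correlation
    return order[:n_keep], corr
```

The ANOVA F-test filtering used in SAE-FiRE follows the same pattern, with the F-statistic replacing the correlation score as the ranking criterion.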

3. Downstream Use: Steering, Classification, and Causal Analysis

Downstream uses of task-conditioned SAE features fall into two principal domains, prediction and steering, with faithfulness evaluation applied to both:

  • Prediction and Classification: Features selected from SAE latents are passed to logistic regression or multilayer perceptron classifiers, as in SAE-FiRE, to predict financial outcomes such as earnings surprises. Here, a strong statistical association between selected features and target labels is shown to improve accuracy and AUC relative to baseline vectorizations (Zhang et al., 20 May 2025).
  • Steering and Causal Intervention: Selected SAE features can be used to generate “steering vectors” injected into the model’s residual stream to modify behavior in a controlled manner. CorrSteer computes a scaling coefficient $c_i$ for each selected feature based on its mean activation on correct samples, producing steering vectors $v_i = c_i\, W_{\mathrm{dec}}[:, i]$ and modifying $x \leftarrow x + \sum_{i \in \mathcal{S}} v_i$ (Cho et al., 18 Aug 2025). In ICL feature circuit analysis, sparse sums of SAE latents produce task vectors or targeted subcircuit interventions for interpretability and behavior control (Kharlapenko et al., 18 Apr 2025).
  • Faithfulness and Specificity: Success in downstream control is evaluated by ablation or insertion studies, measuring the effect of feature intervention on metrics like model accuracy, negative log-likelihood, AUC, and the faithfulness ratio.
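The steering construction above can be sketched as follows, assuming decoder columns as feature directions and mean-activation coefficients; the helper names are hypothetical:

```python
import numpy as np

def build_steering_vectors(W_dec, selected, Z_correct):
    """For each selected latent i, scale the decoder column W_dec[:, i]
    (W_dec has shape [d, k_latent]) by c_i, the feature's mean activation
    over correct samples: v_i = c_i * W_dec[:, i]."""
    coeffs = Z_correct[:, selected].mean(axis=0)    # c_i per selected feature
    return W_dec[:, selected] * coeffs[None, :]     # shape [d, |S|]

def apply_steering(x, V):
    """Inject the summed steering vectors into a residual-stream vector x."""
    return x + V.sum(axis=1)
```

Because the vectors are precomputed and static, the intervention adds only one vector addition per forward pass at the chosen layer.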

4. Experimental Validation and Comparative Performance

Empirical findings consistently show the practical value of task-conditioned SAE pipelines:

  • Prediction (SAE-FiRE): On the earnings surprise task, SAE-FiRE outperforms baselines. For Gemma2-2B (1500 features), accuracy is 0.793 vs. 0.761 (last hidden state), with AUC improvement from 0.628 to 0.657. Larger SAEs and more features yield even better results (Zhang et al., 20 May 2025).
  • Steering (CorrSteer): CorrSteer demonstrates task performance improvement (e.g., +4.1% on MMLU, +22.9% on HarmBench) while maintaining lower side-effect ratios than contrastive or tuning-based methods. This suggests the importance of positive-correlation feature selection and the effectiveness of static, sparse steering on some tasks (Cho et al., 18 Aug 2025).
  • ICL Feature Circuits: Task decomposition via SAE yields a sparse, interpretable set of execution and detection latents, with ablations/interventions verifying causality. For simple tasks, steering with just 3–5 latents retains nearly all of the full task vector’s behavioral effect (Kharlapenko et al., 18 Apr 2025).
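The $\ell_1$-regularized task-vector decomposition from Section 2 can be sketched with ISTA (proximal gradient descent); the cited work's actual optimizer and fidelity constraints may differ:

```python
import numpy as np

def decompose_task_vector(t, D, lam=0.1, steps=500):
    """Approximately solve min_theta ||t - D @ theta||^2 + lam * ||theta||_1,
    where D has shape [d, k]: its columns are SAE decoder directions.
    Uses ISTA: gradient step on the quadratic term, then soft-thresholding."""
    theta = np.zeros(D.shape[1])
    sigma = np.linalg.norm(D, 2)                 # spectral norm of D
    step = 1.0 / (2.0 * sigma ** 2 + 1e-8)       # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        grad = 2.0 * D.T @ (D @ theta - t)
        u = theta - step * grad
        theta = np.sign(u) * np.maximum(np.abs(u) - lam * step, 0.0)  # soft-threshold
    return theta
```

For simple tasks, the nonzero entries of $\theta$ correspond to the handful of execution and detection latents reported above.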

Ablation studies reinforce that post-hoc statistical feature selection (vs. using all latents) substantially increases classification AUC and reduces extraneous activations. Middle transformer layers are empirically optimal for extracting high-utility SAE features (Zhang et al., 20 May 2025).

5. Methodological Limitations and Interpretability Implications

  • Inductive Bias and Pretraining: SAE feature interpretability and specificity depend on the autoencoder’s architectural choices and pretraining corpus. Sparse solutions may omit subtle behaviors if $k$ or the activation constraint is too tight (Kharlapenko et al., 18 Apr 2025).
  • Static Steering Constraints: Static, per-layer additions may fail to propagate optimal information for every task. For example, reasoning benchmarks requiring adaptive, mid-sequence behavior are not substantially improved by static CorrSteer interventions (Cho et al., 18 Aug 2025).
  • Task Simplicity: Most causal and interpretability results rely on simple, token-to-token tasks (e.g., arrow-mapping); generalization to structured or generative tasks remains unproven (Kharlapenko et al., 18 Apr 2025).
  • Scalability: While streaming approaches (CorrSteer) offer $O(1)$-memory selection, full end-to-end circuit tracing across depth is not yet tractable for the largest model scales (Kharlapenko et al., 18 Apr 2025).

Notwithstanding these limitations, SAE-based pipelines provide uniquely interpretable, monosemantic features that align with human concepts (“neutrality in discourse,” “legal refusal,” or ICL task execution) and can underpin direct, causally grounded model interventions.

6. Broader Impact and Future Directions

SAEs, when combined with principled task conditioning, form a versatile interface between black-box neural representations and statistical, symbolic, or causal downstream analysis.

Future directions include dynamic steering, orthogonal feature projection to further reduce side effects, and scaling interpretability methodologies to cover longer context windows, creative behaviors, and larger architectures.

References:

(Zhang et al., 20 May 2025): SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection
(Cho et al., 18 Aug 2025): CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection
(Kharlapenko et al., 18 Apr 2025): Scaling sparse feature circuit finding for in-context learning
