SAE Output Modification with Scores

Updated 17 August 2025

SAE output modification using scores is a framework where quantitative metrics guide feature selection, output adjustment, and model interpretability.
It leverages input and output scores to steer activations, ensuring improved semantic alignment and enhanced fairness through precise interventions.
The approach integrates multi-objective metrics and statistical calibration, achieving two- to three-fold improvements over traditional unsupervised techniques.

SAE output modification using scores refers to the principled use of scoring mechanisms—quantitative metrics or score functions—to select, adjust, reweight, and evaluate the activation or output of sparse autoencoders (SAEs) or broader model components. In contemporary machine learning, scores serve as both optimization objectives and interpretable diagnostic tools for pushing model outputs toward desired properties or controlling their behavior post-training. SAE output modification via scores is now central in applications including model steering, topic alignment, fairness optimization, multi-objective aggregation, and rigorous statistical evaluation.

1. Conceptual Foundations: Scores in SAE Output Control

Scores are quantitative functions designed to measure aspects central to model function—such as interpretability, semantic alignment, output influence, or calibration. In the context of SAE output modification, scores can act as:

Selection criteria—identifying which features or components should be activated, suppressed, or modulated.
Adjustment metrics—determining the magnitude or direction of feature intervention for output steering.
Evaluation measures—quantifying the degree of success or fidelity after modification.

The underlying mathematical structure of a score varies with application: it can be an activation overlap (e.g., input score), intervention effect (e.g., output score), semantic distance, entropy, score change under variable transformation, or a fairness discrepancy.

2. Score-Driven Feature Selection and Steering in SAEs

Recent models of steering in SAEs distinguish “input features” and “output features,” using scores to target the latter for controlling LLM output (Arad et al., 26 May 2025). Two scores central to this paradigm are:

Input Score ( $S_{in}$ ): Measures the token-activation overlap between SAE feature activations and logit lens projections. High $S_{in}$ indicates features tied to input patterns.
Output Score ( $S_{out}$ ): Captures the effect of feature intervention on model output probabilities. Computed as the difference in rank-weighted token probabilities with and without feature intervention:

$S_{\text{out}} = P(\mathcal{M}_{\text{intervened}}) - P(\mathcal{M})$

with $P(\mathcal{M}) = \left(1 - \frac{r(\ell^*, \mathcal{M})}{|V|}\right) p(\ell^*, \mathcal{M})$ .

Empirically, the decoupling between input and output scores enables identification of features that are more effective at steering output (high $S_{out}$ , low $S_{in}$ ), yielding two- to three-fold improvements over unsupervised interventions when filtered appropriately (Arad et al., 26 May 2025).

3. Topic Alignment: Semantic Scoring and Layer-Level Modification

Semantic alignment methodologies utilize scores to quantify neuron-specific affinity to an alignment topic or text (Joshi et al., 14 Jun 2025). The scoring workflow comprises:

Activation summary: For each neuron $i$ , normalize the aggregated activation across prompt tokens:

$\text{summary}(p)_i = \frac{\sum_t \sigma(\gamma(p_t)_i)}{g}$

where $g = \sum_{i,t} \sigma(\gamma(p_t)_i)$ .

Semantic distance: Embed prompts and reference alignment text, compute minimum distance $dist(p, p')$ .
Score aggregation: For neuron $h_i$ ,

$g(h_i) = \frac{\sum_{p \in h_i} \text{summary}(p)_i \cdot dist(p, p')}{\sum_{p \in h_i} \text{summary}(p)_i}$

The final score is normalized between $0$ and $1$.

Score-weighted modification utilizes these neuron scores to emphasize topic-aligned neurons. The “Swap” approach applies context-sensitive selection:

$\sigma'(x) = \gamma(p_t) \odot \left[ \sigma(\gamma(p_t) \odot \text{scores}) \neq 0 \right]$

This construct enables dynamic, token-aware steering towards the alignment topic. The contamination metric quantifies misalignment as:

$\text{contamination} = \gamma(p_t) \times (1 - \text{score}(h_i))$

Experiments confirm the approach enhances language acceptability and reduces training/inference time compared to classical fine-tuning (Joshi et al., 14 Jun 2025).

4. Dense Latents: Activation Density and Antipodality Scores

Dense SAE latents—units that fire on 10–50% of tokens—are analyzed via their activation density and geometric antipodality (Sun et al., 18 Jun 2025). Key scoring mechanisms include:

Antipodality Score ( $s_i$ ):

$s_i := \max_{j \neq i} \left[ \text{sim}(W_e, W_e) \cdot \text{sim}(W_d, W_d) \right]$

where $\text{sim}(u, v)$ is the cosine similarity.

Nullspace Coefficient ( $\alpha_k$ ):

$\alpha_k = \frac{\sum_{j=1}^k U_{-j}^\top W}{\| W \|}$

Dense latents with high $\alpha_k$ regulate entropy via alignment with the quasi-nullspace of the unembedding matrix.

Dense latents are classified into position (context tracking), context-binding, nullspace, alphabet, POS, and PCA categories. Their functional roles shift from structure in early layers to semantic/contextual and output-oriented signals in later layers, and interventions guided by activation or antipodality scores enable fine-grained SAE output control (Sun et al., 18 Jun 2025).

5. Optimization Objectives: Score-Based Training and Fairness

In DNN-based source enhancement, “score” refers to non-differentiable objective metrics (e.g., PESQ, STOI) for perceptual quality. Rather than minimizing MSE, the policy gradient method optimizes the DNN to maximize expected OSQA scores by:

Sampling outputs from the DNN-estimated PDF,
Computing OSQA scores for each sample,
Subtracting the baseline average score,
Updating DNN parameters with a stochastic gradient weighted by score improvement:

$\nabla_\Theta \mathcal{J}(\Theta) \approx \frac{1}{T} \sum_{\tau=1}^T \left[ \frac{1}{K} \sum_{k=1}^K \mathcal{B}(\hat{S}_\tau^{(k)}, X_\tau) \nabla_\Theta \ln p(\hat{S}_\tau^{(k)} | X_\tau, \Theta) \right]$

This framework yields measurable improvements in OSQA metrics, even when MSE is not minimized (Koizumi et al., 2018).

For fairness in probabilistic classification, score transformation modifies outputs $r(x)$ to fair scores $r'(x)$ using a closed-form solution derived via Lagrangian duality. The transformation is parameterized by a vector $\lambda$ encoding fairness corrections:

$r^*(\mu(x); r(x)) = \begin{cases} \dfrac{1 + \mu(x) - \sqrt{(1 + \mu(x))^2 - 4r(x)\mu(x)}}{2\mu(x)}, & \text{if }\mu(x)\neq0 \ r(x), & \text{if }\mu(x)=0 \end{cases}$

Score-based postprocessing or pre-processing yields outputs with bounded fairness disparity and preserves score-based metrics (Brier score, AUC) (Wei et al., 2019).

6. Multi-Objective Score Spaces and Preference Aggregation

Applying the probability integral transform (PIT), raw objectives are mapped into comparable scores:

$S_f(x) = 1 - \text{CDF}_f(f(x))$

This uniform scaling enables aggregation and ordering of multi-objective outputs for SAE systems. Trade-off mappings between preference space and score space are learned via neural networks, accommodating inherent non-linearities. This process improves proximity to desired trade-offs and eliminates auxiliary scaling parameters, facilitating more interpretable and user-aligned output modifications (Hönel et al., 2022).

7. Statistical Calibration and Score Aggregation

The scale-invariant Continuous Ranked Probability Score (CRPS) is reparameterized using the probability integral transform:

$\text{CRPS}(F_U, v) = \frac{v^3}{3} - \frac{(v-1)^3}{3}$

with $v=F_x(y)$ the quantile. The transformed score

$Z(U, v) = 4\,\text{CRPS}(F_U, v) - \frac{1}{3}$

lies in $[0,1]$ and aggregates across variables via convolution:

$S_n = \sum_{i=1}^n Z_i$

Closed-form CDFs using Fresnel integrals enable exact p-value tests for statistical accuracy, transferable to SAE output evaluation. The method is robust to scale and sensitive to median forecast proximity, though insensitive to location bias (Nane et al., 2023).

8. Theoretical Advances: Change of Variables and Flexible Score Matching

Score functions transform under smooth invertible mappings $\mathbf{y} = \phi(\mathbf{x})$ as:

$\nabla_{\mathbf{y}} \log q(\mathbf{y}) = J_{\phi^{-1}(\mathbf{y})}^\top \nabla_{\mathbf{x}} \log p(\mathbf{x}) + \nabla_{\mathbf{x}} \cdot [J_{\phi^{-1}(\phi(\mathbf{x}))}^\top]\bigg|_{\mathbf{x} = \phi^{-1}(\mathbf{y})}$

This enables decoupling the SAE training and generation spaces, guiding output correction via the transformed score gradient. Generalized Sliced Score Matching further extends loss projection from linear to arbitrary smooth functions, capturing nonlinear data structures for more flexible SAE output refinement (Robbins, 2024).

Conclusion

SAE output modification using scores is a multifaceted framework encompassing selection, adaptation, fairness correction, semantic alignment, aggregation, and real-time steering. It harnesses quantitative, interpretable functions—from activation-based criteria and semantic distances to statistical calibration and fairness constraints—for principled intervention in autoencoder models and broader DNNs. Theoretical advances in score transformation, variable change, and matching losses broaden the scope for fine-grained, user-aligned, and performance-competitive output control. Empirical results consistently demonstrate enhancements over baseline and unsupervised methods, while ongoing research continues to refine the targeting and interpretability of these score-driven interventions.