Sensitivity-Scaled Steering (SSS)
- Sensitivity-Scaled Steering (SSS) is a white-box activation-space attack method that targets decoder-only LLMs to induce substantial behavioral shifts.
- It exploits high-gain regions, compression valleys, and BOS anchoring to amplify small, well-aligned perturbations effectively.
- The technique achieves efficient perturbation allocation while maintaining output fluency, posing significant security risks and prompting defensive research.
Sensitivity-Scaled Steering (SSS) is a progressive white-box activation-space attack methodology targeting decoder-only LLMs. SSS exploits high-gain regions in the model’s residual stream that are causally sensitive to small, well-aligned perturbations, leveraging architectural phenomena such as attention sinks and compression valleys. Designed to induce substantial shifts in behaviors such as sycophancy, hallucination, malicious intent, and sentiment while maintaining output fluency and coherence, SSS poses a security risk for both white-box and supply-chain LLM deployments by rendering the traditionally overlooked activation space an actionable attack surface (Xu et al., 21 Nov 2025).
1. Objectives and Conceptual Overview
SSS is predicated on steering model behavior by injecting small vectors into the hidden-state residual stream at one or more layers. The attack proceeds progressively, beginning with the injection of a seed perturbation at the beginning-of-sequence (BOS) token within a compression-valley layer—a mechanism termed BOS anchoring. Subsequent tokens are reinforced according to their local model sensitivity, enabling focused application of a limited perturbation budget. The primary objective is to realize pronounced behavioral drift across explicit axes (sycophancy, hallucination, “evil” persona, and sentiment polarity), without compromising the coherence of generated text.
2. Theoretical Foundations
2.1 Autoregressive Residual Updates and Causal Amplification
In decoder-only LLMs, the residual state $h_i^{(l)}$ of token $i$ at layer $l$ is updated as

$$h_i^{(l+1)} = h_i^{(l)} + F^{(l)}\big(h_i^{(l)},\, h_{\le i}^{(l)}\big),$$

where $F^{(l)}$ denotes the combined attention and MLP update. Within this autoregressive scheme, small perturbations introduced at a prior token propagate to subsequent tokens according to the Jacobian $J_i = \partial h_i^{(l)} / \partial h_{i-1}^{(l)}$, governing local amplification via its spectral norm $\sigma_{\max}(J_i)$: a perturbation $\delta$ grows by at most a factor of $\sigma_{\max}(J_i)$ per step, and this bound is approached when $\delta$ aligns with the top right singular vector of $J_i$.
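The following toy sketch (an assumed setup, not the paper's code; a two-layer MLP stands in for a transformer block's update $F$, so the residual Jacobian is $I + \partial F / \partial h$) illustrates why alignment matters: a perturbation along the top right singular direction realizes nearly the full spectral gain, while a random direction typically does not.

```python
# Toy demonstration: aligned perturbations get amplified near sigma_max.
import torch

torch.manual_seed(0)
d = 64
# A small MLP stands in for a transformer block's update F.
F = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))

h = torch.randn(d)
# Residual update h' = h + F(h), so its Jacobian is I + dF/dh.
J = torch.eye(d) + torch.autograd.functional.jacobian(F, h)

U, S, Vh = torch.linalg.svd(J)
v_top = Vh[0]                                        # most-amplified input direction
v_rand = torch.nn.functional.normalize(torch.randn(d), dim=0)

eps = 1e-2
print("sigma_max    :", S[0].item())
print("aligned gain :", (J @ (eps * v_top)).norm().item() / eps)   # ~ sigma_max
print("random gain  :", (J @ (eps * v_rand)).norm().item() / eps)  # typically smaller
```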
2.2 Compression Valleys and BOS Attention Sinks
Intermediate (mid-depth) layers in transformer architectures often exhibit compression valleys, wherein activations converge toward low-rank (near–rank-one) subspaces dominated by BOS attention sinks. The BOS token accumulates significant attention, anchoring the direction along which perturbations can be efficiently propagated and amplified. Perturbations that are well-aligned with this dominant direction exhibit pronounced causal amplification, resulting in cumulative behavioral effects that are disproportionately large compared to the initial perturbation magnitude.
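A compression valley can be located empirically by measuring how much of each layer's activation energy concentrates in a single direction. The sketch below is a plausible diagnostic, not the paper's code; it assumes per-layer hidden states as returned by a Hugging Face model called with `output_hidden_states=True`, with the batch dimension squeezed out.

```python
# Plausible diagnostic for compression valleys: find layers whose token
# activations collapse toward a near-rank-one subspace.
import torch

def top_direction_energy(H: torch.Tensor) -> float:
    """Fraction of spectral energy carried by the top singular direction of a (seq_len, d) matrix."""
    S = torch.linalg.svdvals(H.float())
    return (S[0] ** 2 / (S ** 2).sum()).item()

def compression_valley_layers(hidden_states, threshold: float = 0.9):
    """Layers where a single direction (typically the BOS sink) dominates."""
    energies = [top_direction_energy(H) for H in hidden_states]
    return [l for l, e in enumerate(energies) if e >= threshold], energies
```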
3. Algorithmic Structure of SSS
3.1 Steering Direction Extraction
The steering direction is obtained by analyzing a contrastive prompt dataset $\mathcal{D}$, collecting mean hidden states $\mu_+^{(l)}$ and $\mu_-^{(l)}$ for “positive” and “negative” responses at layer $l$. Two candidate directions are computed:
- $v_{\text{diff}} = \mu_+^{(l)} - \mu_-^{(l)}$ (difference of means),
- $v_{\text{pca}}$ (top principal component of the contrastive activations),

which are combined into the steering vector $v_{\text{steer}}$ and normalized to unit norm. The optimal layer $l^*$ is selected in the 20%–85% depth range by maximizing the observed behavioral shift.
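A minimal sketch of this extraction step, assuming NumPy arrays of collected layer-$l^*$ activations; the rule used here to combine the two candidates is an illustrative assumption, since the text specifies only that they are combined and normalized:

```python
# Sketch of Section 3.1: build a steering direction from contrastive activations.
# pos_acts / neg_acts: (n, d) arrays of layer-l* hidden states for
# positive / negative contrastive responses.
import numpy as np

def steering_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    v_diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)   # difference of means
    X = np.concatenate([pos_acts, neg_acts], axis=0)
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v_pca = Vt[0]                                            # top principal component
    if np.dot(v_pca, v_diff) < 0:                            # resolve PCA sign ambiguity
        v_pca = -v_pca
    v = v_diff / np.linalg.norm(v_diff) + v_pca              # assumed combination rule
    return v / np.linalg.norm(v)                             # unit-norm steering vector
```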
3.2 Sensitivity Signal Computation
For each token $i$ at the chosen layer $l^*$:
- Compute the local Jacobian $J_i = \partial h_i^{(l^*)} / \partial h_{i-1}^{(l^*)}$ and its top singular vector $v_{\max}$.
- Compute the correlation with the steering direction: $\rho_i = \dfrac{v_{\text{steer}} \cdot v_{\max}}{\|v_{\text{steer}}\|\,\|v_{\max}\|}$.
- Compute the directional gain $g_i = \|J_i v_{\text{steer}}\|$ and its normalized form $\gamma_i = g_i / \big(\sigma_{\max}(J_i)\,\|v_{\text{steer}}\|\big)$.
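These three signals follow directly from the per-token Jacobian; the sketch below assumes PyTorch and a dense $(d, d)$ Jacobian (in practice $J_i$ would be obtained via automatic differentiation, as in the amplification example above):

```python
# Per-token sensitivity signals of Section 3.2. A sketch, not the
# authors' implementation.
import torch

def sensitivity_signals(J_i: torch.Tensor, v_steer: torch.Tensor):
    U, S, Vh = torch.linalg.svd(J_i)
    v_max = Vh[0]                                  # most-amplified input direction
    rho = torch.dot(v_steer, v_max) / (v_steer.norm() * v_max.norm())
    g = (J_i @ v_steer).norm()                     # directional gain g_i
    gamma = g / (S[0] * v_steer.norm())            # normalized gain gamma_i
    return rho.item(), g.item(), gamma.item()
```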
3.3 BOS Anchoring and Adaptive Reinforcement
A seed perturbation is injected at the BOS position ($i = 0$):

$$\delta_0 = \psi(-\rho_0)\,\frac{v_{\text{steer}}}{\|v_{\text{steer}}\|}, \qquad \tilde h_0 = h_0 + \delta_0,$$

where $\psi$ is a smooth sigmoid and $v_{\text{steer}}/\|v_{\text{steer}}\|$ is the normalized steering direction.

For each subsequent token $i$:

$$\delta_i = \psi(\gamma_i - \rho_i)\,\frac{v_{\text{steer}}}{\|v_{\text{steer}}\|}, \qquad \tilde h_i = h_i + \delta_i.$$

Larger micro-injections are thus applied where the model is both highly amplifiable ($\gamma_i$ large) and insufficiently aligned ($\rho_i$ small), constituting adaptive reinforcement.
Pseudocode Summary
```
for each token position i:
    J_i   = ∂h_i^(l*) / ∂h_{i-1}^(l*)               # local Jacobian at layer l*
    v_max = top singular vector of J_i
    ρ_i   = (v_steer · v_max) / (‖v_steer‖ ‖v_max‖) # alignment with steering direction
    g_i   = ‖J_i v_steer‖                           # directional gain
    γ_i   = g_i / (σ_max(J_i) · ‖v_steer‖)          # normalized gain
    if i == 0:                                      # BOS anchoring
        δ_0 = ψ(−ρ_0) · (v_steer / ‖v_steer‖)
        h̃_0 = h_0 + δ_0
    else:                                           # adaptive reinforcement
        δ_i = ψ(γ_i − ρ_i) · (v_steer / ‖v_steer‖)
        h̃_i = h_i + δ_i
    continue decoding with h̃_i fed to layer l*+1
```
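In a real model, the micro-injections can be realized with a forward hook on the chosen layer. The sketch below is a hedged illustration, not the authors' implementation: the module path follows Llama-style Hugging Face models, $\psi$ is assumed to be the standard logistic sigmoid (the text specifies only "a smooth sigmoid"), and the per-token signal computation, which requires the Jacobians above, is left abstract as `signal_fn`.

```python
# Hedged sketch: micro-injections via a forward hook on a decoder layer.
import torch

def make_sss_hook(v_steer: torch.Tensor, signal_fn):
    """signal_fn(hidden) -> (batch, seq) psi arguments (-rho_0 at BOS,
    gamma_i - rho_i afterwards); left abstract here."""
    v_unit = v_steer / v_steer.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        scale = torch.sigmoid(signal_fn(hidden))          # per-token coefficients
        hidden = hidden + scale.unsqueeze(-1) * v_unit    # micro-injections
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: attach at the selected layer l*, generate, detach.
# handle = model.model.layers[l_star].register_forward_hook(make_sss_hook(v, f))
# ...model.generate(...)...
# handle.remove()
```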
4. Perturbation Budget and Efficiency
Each micro-injection has norm $\|\delta_i\| = \psi(\cdot) \le 1$, since the steering direction is normalized to unit length and the sigmoid scaling is bounded. SSS avoids the inefficiencies of fixed-coefficient methods, which can either undershoot (insufficient behavioral effect) or overshoot (loss of coherence and fluency), by focusing the limited perturbation amplitude where the Jacobian-based sensitivity is maximal. This budget-efficient allocation preserves baseline fluency and semantic coherence while enabling robust steering along the designated behavior axis.
5. Empirical Evaluation
5.1 Models and Behavioral Axes
SSS was empirically validated on several open-weight models:
- Qwen3-14B (dense, CoT)
- Llama-3.1-8B-Instruct (dense, non-CoT)
- DeepSeek-R1-7B (dense, CoT)
- GPT-OSS-20B (MoE, CoT)
Behavioral axes evaluated include:
- Evil (Persona Evil)
- Hallucination (TruthfulQA, Persona Hallucination)
- Sycophancy (reward-hacking prompts)
- Beshift (sentiment drift on reviews)
5.2 Metrics and Quantitative Outcomes
Key metrics:
- Behavior Score (0–100; LLM-as-Judge, Gemini 2.5 Flash)
- Directional Projection (DP): per-token inner product of the hidden state with the steering vector
- Turning point: first position at which the 100-token running mean of DP crosses from negative to positive
- Coherence Score (0–100; measures fluency)
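The DP trace and turning point are straightforward to compute from recorded hidden states; a sketch assuming NumPy, with the 100-token window taken from the definition above:

```python
# Computing the DP trace and turning point from recorded layer-l* states.
import numpy as np

def dp_trace(hidden: np.ndarray, v_steer: np.ndarray) -> np.ndarray:
    """Per-token inner product of (seq_len, d) states with the unit steering vector."""
    return hidden @ (v_steer / np.linalg.norm(v_steer))

def turning_point(dp: np.ndarray, window: int = 100):
    """First position where the trailing `window`-token mean DP turns positive."""
    means = np.array([dp[max(0, i - window + 1): i + 1].mean() for i in range(len(dp))])
    crossings = np.where((means[:-1] < 0) & (means[1:] >= 0))[0]
    return int(crossings[0] + 1) if crossings.size else None
```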
Comparison on Qwen3-14B; each cell reports (Behavior Score, Coherence Score), averaged across samples:
| Behavior Axis | Benign | SSS | ActAdd | CAA | Human-Prompt |
|---|---|---|---|---|---|
| Beshift | (2.9, 89.5) | (82.4, 86.1) | (59.7, 79.9) | (77.3, 76.4) | (10.2, 85.5) |
| Evil | (0.0, 88.2) | (73.9, 87.6) | (35.7, 80.6) | (81.4, 73.6) | (0.0, 86.1) |
| Hallucination | (9.9, 84.4) | (54.5, 86.8) | (46.3, 79.8) | (60.2, 82.8) | (18.5, 89.0) |
| Sycophancy | (22.1, 88.9) | (84.7, 88.6) | (78.8, 76.6) | (83.4, 79.6) | (28.3, 88.4) |
SSS demonstrates superior behavioral shift compared to ActAdd and competitive results relative to CAA, with consistently higher coherence (∼86–88 vs. <80).
5.3 Impact on General Capabilities
SSS does not produce significant degradation in unrelated tasks:
- Llama-3.1-8B: MMLU 66.1→65.8 (–0.3), GSM8K 76.8→76.3 (–0.5)
- Qwen3-14B: MMLU 83.7→83.6 (–0.1), GSM8K 88.3→88.5 (+0.2)
This suggests the steering effect can remain behaviorally specific while preserving general model utility.
6. Security Considerations and Defensive Strategies
6.1 Activation-Space as Attack Surface
SSS transforms activation steering from an alignment mechanism into a precise, covert attack vector. The covert drift produced by SSS unfolds gradually: initial tokens are benign, circumventing standard guardrails, which prioritize detection of abrupt, overtly harmful text.
6.2 Guardrail Evasion Rates
On the “Evil” axis for Llama-3.1-8B-Instruct:
- Llama-Guard-3-8B flags 60% of SSS-induced outputs as harmful
- Qwen3Guard-Gen-8B flags 74.5%
- Consequently, 25–40% of SSS-induced outputs evade these mechanisms
6.3 Countermeasures
Defensive proposals include:
- Introspection: training chain-of-thought models for semantic drift self-detection (early evidence, 5% detection rate)
- Activation-space monitoring: tracking directional projections for anomalous drift (see the sketch after this list)
- Robust training: fine-tuning with adversarial activation injections to immunize high-gain (compression-valley) layers
- Hybrid checks: combining prompt-based and hidden-state audit methods
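As a concrete illustration of the activation-space monitoring proposal above, the following sketch flags sustained anomalous drift of directional projections against a benign baseline; the z-score rule, window length, and threshold are illustrative assumptions, not the paper's design:

```python
# Illustrative activation-space monitor: flag generations whose projection
# onto a known-risk direction drifts anomalously versus a benign baseline.
import numpy as np

class DriftMonitor:
    def __init__(self, v_risk: np.ndarray, baseline_dp: np.ndarray, z_thresh: float = 4.0):
        self.v = v_risk / np.linalg.norm(v_risk)
        self.mu = baseline_dp.mean()                 # benign-run DP statistics
        self.sigma = baseline_dp.std() + 1e-8
        self.z_thresh = z_thresh

    def flags_drift(self, hidden: np.ndarray, window: int = 20) -> bool:
        """hidden: (seq_len, d) layer activations; True = sustained anomalous drift."""
        z = (hidden @ self.v - self.mu) / self.sigma
        smooth = np.convolve(z, np.ones(window) / window, mode="valid")
        return bool((smooth > self.z_thresh).any())
```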
A plausible implication is that mitigation of SSS-type attacks necessitates model introspection and proactive monitoring of internal dynamics beyond traditional prompt or weight analysis.
7. Significance and Potential Research Trajectories
Sensitivity-Scaled Steering operationalizes architectural vulnerabilities and local Jacobian amplification to create a highly efficient, adaptive, and covert behavioral manipulation strategy. As demonstrated empirically across multiple models and axes, SSS advances the threat model for LLM security, indicating that activation-space attacks demand robust, dynamic introspective defenses. Further research may focus on scalable detection and immunization methodologies, as well as principled model rewiring of high-gain and anchor regions to reduce attack surface (Xu et al., 21 Nov 2025).