
Sensitivity-Scaled Steering (SSS)

Updated 28 November 2025
  • Sensitivity-Scaled Steering (SSS) is a white-box activation-space attack method that targets decoder-only LLMs to induce substantial behavioral shifts.
  • It exploits high-gain regions, compression valleys, and BOS anchoring to amplify small, well-aligned perturbations effectively.
  • The technique achieves efficient perturbation allocation while maintaining output fluency, posing significant security risks and prompting defensive research.

Sensitivity-Scaled Steering (SSS) is a progressive white-box activation-space attack methodology targeting decoder-only LLMs. SSS exploits high-gain regions in the model’s residual stream that are causally sensitive to small, well-aligned perturbations, leveraging architectural phenomena such as attention sinks and compression valleys. Designed to induce substantial shifts in behaviors (sycophancy, hallucination, malicious intent, and sentiment) while maintaining output fluency and coherence, SSS introduces a security risk for both white-box and supply-chain LLM deployments by turning the traditionally overlooked activation space into an actionable attack surface (Xu et al., 21 Nov 2025).

1. Objectives and Conceptual Overview

SSS is predicated on steering model behavior by injecting small vectors into the hidden-state residual stream at one or more layers. The attack proceeds progressively, beginning with the injection of a seed perturbation at the beginning-of-sequence (BOS) token within a compression-valley layer—a mechanism termed BOS anchoring. Subsequent tokens are reinforced according to their local model sensitivity, enabling focused application of a limited perturbation budget. The primary objective is to realize pronounced behavioral drift across explicit axes (sycophancy, hallucination, “evil” persona, and sentiment polarity), without compromising the coherence of generated text.

2. Theoretical Foundations

2.1 Autoregressive Residual Updates and Causal Amplification

In decoder-only LLMs, the residual state of token $i$ at layer $l$ is updated as

$$h_i^{(l)} = h_i^{(l-1)} + \mathcal{F}^{(l)}\!\left(h_{\le i}^{(l-1)}\right),$$

where $\mathcal{F}^{(l)}$ denotes the combined attention and MLP update. Within this autoregressive scheme, small perturbations introduced at a prior token propagate to subsequent tokens according to the Jacobian $J_i^{(l)} = \partial h_i^{(l)} / \partial h_{i-1}^{(l)}$, whose spectral norm $\|J_i^{(l)}\|_2$ governs local amplification.
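These quantities are directly computable with automatic differentiation. The sketch below is a minimal illustration under assumptions (a toy single-token residual update standing in for $\mathcal{F}^{(l)}$, a small hidden size), not the authors' code: it extracts the Jacobian of the update and reads off its spectral norm.

```python
import torch

torch.manual_seed(0)
d = 16  # assumed hidden size for illustration

# Toy stand-in for the combined attention+MLP update F^(l); any
# differentiable map works the same way with autograd.
W1 = torch.randn(d, d) / d**0.5
W2 = torch.randn(d, d) / d**0.5

def residual_update(h_prev: torch.Tensor) -> torch.Tensor:
    # h^(l) = h^(l-1) + F^(l)(h^(l-1)), collapsed to a single token
    return h_prev + W2 @ torch.tanh(W1 @ h_prev)

h_prev = torch.randn(d)
J = torch.autograd.functional.jacobian(residual_update, h_prev)  # shape (d, d)

# Spectral norm ||J||_2 = largest singular value, the local amplification factor.
sigma_max = torch.linalg.svdvals(J)[0]
print(f"||J||_2 = {float(sigma_max):.3f}")
```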

2.2 Compression Valleys and BOS Attention Sinks

Intermediate (mid-depth) layers in transformer architectures often exhibit compression valleys, wherein activations converge toward low-rank (near–rank-one) subspaces dominated by BOS attention sinks. The BOS token accumulates significant attention, anchoring the direction along which perturbations can be efficiently propagated and amplified. Perturbations that are well-aligned with this dominant direction exhibit pronounced causal amplification, resulting in cumulative behavioral effects that are disproportionately large compared to the initial perturbation magnitude.
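A simple diagnostic for this structure is to measure how close a layer's token activations are to rank one and how strongly the dominant direction aligns with the BOS state. The sketch below is a hedged illustration on synthetic activations (real use would substitute hidden states from a forward pass; the dominance statistic and all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 64, 32  # assumed sequence length and hidden size

# Synthetic stand-in for layer-l hidden states: one strong shared direction
# (mimicking a compression valley) plus small token-specific noise.
anchor = rng.normal(size=d)
H = np.outer(rng.normal(1.0, 0.1, size=T), anchor) + 0.05 * rng.normal(size=(T, d))

s = np.linalg.svd(H, compute_uv=False)
dominance = s[0] ** 2 / np.sum(s ** 2)       # ~1 for near-rank-one activations

# Alignment of the dominant right-singular direction with the BOS state.
_, _, Vt = np.linalg.svd(H, full_matrices=False)
bos = H[0] / np.linalg.norm(H[0])
alignment = abs(Vt[0] @ bos)

print(f"rank-one dominance: {dominance:.3f}, |cos(top dir, BOS)|: {alignment:.3f}")
```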

3. Algorithmic Structure of SSS

3.1 Steering Direction Extraction

The steering direction $v^{(l)}_\mathrm{steer}$ is obtained by analyzing a contrastive prompt dataset $D_\mathrm{dir} = \{(x^+, x^-)\}$, collecting mean hidden states for “positive” and “negative” responses at layer $l$. The candidate direction is computed via

  • $d^{(l)} = \overline{U^{(l)}} - \overline{V^{(l)}}$ (difference of means),
  • $p^{(l)} = \mathrm{PCA}_1(U^{(l)} \cup V^{(l)})$ (top principal component),

combined as

$$v^{(l)}_\mathrm{steer} = \frac{1}{2}\left(d^{(l)} + p^{(l)}\right),$$

with normalization to unit norm. The optimal layer $l^*$ is selected in the 20%–85% depth range by maximizing the observed behavioral shift $\Delta S$.
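A minimal sketch of this extraction, assuming matrices U and V of layer-$l$ hidden states for positive and negative responses have already been collected (the names, and the sign-alignment of the PCA axis, are added assumptions, since a principal component's sign is arbitrary):

```python
import numpy as np

def steering_direction(U: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Combine difference-of-means and top principal component; unit-normalize.

    U: (n_pos, d) hidden states for positive responses at layer l.
    V: (n_neg, d) hidden states for negative responses at layer l.
    """
    d_vec = U.mean(axis=0) - V.mean(axis=0)      # d^(l): difference of means
    X = np.vstack([U, V])
    X -= X.mean(axis=0)                          # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    p_vec = Vt[0]                                # p^(l): top principal component
    if p_vec @ d_vec < 0:                        # resolve PCA sign ambiguity by
        p_vec = -p_vec                           # aligning with the mean difference
    v = 0.5 * (d_vec + p_vec)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
U = rng.normal(+0.5, 1.0, size=(100, 64))
V = rng.normal(-0.5, 1.0, size=(100, 64))
v_steer = steering_direction(U, V)
print(v_steer.shape, round(float(np.linalg.norm(v_steer)), 3))  # (64,) 1.0
```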

3.2 Sensitivity Signal Computation

For each token $i$ at the chosen layer $l^*$:

  • Compute the top singular vector $v_{\max}$ of $J_i^{(l^*)}$.
  • Compute its correlation $\rho_i$ with the steering direction:

$$\rho_i = \frac{\langle v_\mathrm{steer}, v_{\max}\rangle}{\|v_\mathrm{steer}\|_2\, \|v_{\max}\|_2} \in [-1, 1].$$

  • Compute the directional gain $g_i$ and its normalized form $\gamma_i$:

$$g_i = \|J_i^{(l^*)} v_\mathrm{steer}\|_2, \qquad \gamma_i = \frac{g_i}{\sigma_{\max}(J_i^{(l^*)})\, \|v_\mathrm{steer}\|_2} \in [0, 1].$$
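Both signals reduce to one SVD and one matrix-vector product per token. A minimal numpy sketch (function and variable names assumed):

```python
import numpy as np

def sensitivity_signals(J: np.ndarray, v_steer: np.ndarray) -> tuple[float, float]:
    """Return (rho_i, gamma_i) for one token from its Jacobian J at layer l*."""
    _, sigmas, Vt = np.linalg.svd(J)
    v_max, sigma_max = Vt[0], sigmas[0]          # top input direction and top gain

    # rho_i in [-1, 1]: cosine between steering and maximum-gain directions.
    rho = (v_steer @ v_max) / (np.linalg.norm(v_steer) * np.linalg.norm(v_max))

    # gamma_i in [0, 1]: gain along v_steer relative to the best possible gain.
    g = np.linalg.norm(J @ v_steer)
    gamma = g / (sigma_max * np.linalg.norm(v_steer))
    return float(rho), float(gamma)

rng = np.random.default_rng(0)
J = rng.normal(size=(64, 64)) / 8.0
v_steer = rng.normal(size=64)
v_steer /= np.linalg.norm(v_steer)
rho, gamma = sensitivity_signals(J, v_steer)
print(f"rho={rho:+.3f}, gamma={gamma:.3f}")
```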

3.3 BOS Anchoring and Adaptive Reinforcement

A seed perturbation is injected at the BOS position ($i = 0$):

$$\delta_\mathrm{BOS} = \psi(-\rho_0)\, \hat v_\mathrm{steer}, \qquad \tilde h^{(l^*)}_0 = h^{(l^*)}_0 + \delta_\mathrm{BOS},$$

where $\psi$ is a smooth sigmoid and $\hat v_\mathrm{steer}$ is the unit-normalized steering direction.

For each subsequent token $i \geq 1$:

$$\delta_i = \psi(\gamma_i - \rho_i)\, \hat v_\mathrm{steer}, \qquad \tilde h^{(l^*)}_i = h^{(l^*)}_i + \delta_i.$$

Larger micro-injections are applied where the model is both highly amplifiable ($\gamma_i$ large) and insufficiently aligned ($\rho_i$ small), constituting adaptive reinforcement.

Pseudocode Summary

for each token position i:
    J_i = ∂h_i^(l*) / ∂h_{i-1}^(l*)
    v_max = top singular vector of J_i
    ρ_i = (v_steer · v_max) / (‖v_steer‖ · ‖v_max‖)
    g_i = ‖J_i v_steer‖
    γ_i = g_i / (σ_max(J_i) · ‖v_steer‖)

δ_0 = ψ(−ρ_0) · (v_steer / ‖v_steer‖)        # BOS anchoring
h̃_0 = h_0 + δ_0

δ_i = ψ(γ_i − ρ_i) · (v_steer / ‖v_steer‖)   # adaptive reinforcement, i ≥ 1
h̃_i = h_i + δ_i
continue decoding with h̃_i fed to layer l*+1
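In practice, the injection step maps onto a forward hook at layer $l^*$. The following is a minimal, hedged sketch of that step only, assuming $\rho_i$ and $\gamma_i$ have already been computed as above; the logistic form of $\psi$ and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def psi(x: float, temperature: float = 1.0) -> float:
    """Assumed logistic form of the smooth sigmoid gate; maps into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x / temperature))

def inject(h: np.ndarray, i: int, rho: float, gamma: float, v_hat: np.ndarray) -> np.ndarray:
    """SSS micro-injection for token i's hidden state at layer l*.

    h: (d,) hidden state; v_hat: (d,) unit-norm steering direction.
    """
    if i == 0:
        delta = psi(-rho) * v_hat            # BOS anchoring seed
    else:
        delta = psi(gamma - rho) * v_hat     # adaptive reinforcement
    return h + delta                         # h-tilde, passed on to layer l*+1

# Toy usage: unit-norm direction in d=64, seed injection at BOS.
v_hat = np.ones(64) / 8.0                    # ||v_hat|| = 1
h_tilde = inject(np.zeros(64), i=0, rho=0.2, gamma=0.0, v_hat=v_hat)
print(float(np.linalg.norm(h_tilde)))        # = psi(-0.2) < 1, within budget
```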

4. Perturbation Budget and Efficiency

Each micro-injection $\delta_i$ has norm at most 1: since $\psi(\cdot) \in (0, 1)$ and $\hat v_\mathrm{steer}$ is unit-norm, $\|\delta_i\|_2 = \psi(\cdot) < 1$. SSS avoids the inefficiencies of fixed-coefficient methods, which can either undershoot (insufficient behavioral effect) or overshoot (loss of coherence and fluency), by concentrating the limited perturbation amplitude where the Jacobian-based sensitivity is maximal. This budget-efficient allocation preserves baseline fluency and semantic coherence while enabling robust steering along the designated behavior axis.

5. Empirical Evaluation

5.1 Models and Behavioral Axes

SSS was empirically validated on several open-weight models, including Llama-3.1-8B and Qwen3-14B.

Behavioral axes evaluated include:

  • Evil (Persona Evil)
  • Hallucination (TruthfulQA, Persona Hallucination)
  • Sycophancy (reward-hacking prompts)
  • Beshift (sentiment drift on reviews)

5.2 Metrics and Quantitative Outcomes

Key metrics:

  • Behavior Score (0–100, LLM-as-Judge, Gemini 2.5 Flash), with shift $\Delta S = |S_\mathrm{post} - S_\mathrm{pre}|$
  • Directional Projection (DP): per-token inner product of the hidden state with the steering vector $r$ (see the sketch after this list)
  • Turning point $t^*$: first position at which the 100-token mean DP crosses from negative to positive
  • Coherence Score (0–100, fluency)
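Both DP and $t^*$ are easy to compute from per-token hidden states. A hedged sketch, assuming a (T, d) array H of hidden states and a trailing 100-token mean per the definition above (array and function names are illustrative):

```python
import numpy as np

def directional_projection(H: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Per-token inner product of hidden states H (T, d) with steering vector r."""
    return H @ r

def turning_point(dp: np.ndarray, window: int = 100):
    """First position whose trailing `window`-token mean DP crosses 0 upward."""
    for t in range(window, len(dp)):
        prev = dp[t - window:t].mean()
        curr = dp[t - window + 1:t + 1].mean()
        if prev <= 0.0 < curr:
            return t
    return None  # no sign change: steering never took hold

# Toy usage: hidden states whose projection drifts from negative to positive.
rng = np.random.default_rng(0)
T, d = 400, 64
H = rng.normal(size=(T, d)) + np.linspace(-1, 1, T)[:, None]
r = np.ones(d) / np.sqrt(d)
dp = directional_projection(H, r)
print(turning_point(dp))  # roughly mid-sequence for this synthetic drift
```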

Comparison on Qwen3-14B, metrics averaged across samples; each cell reports (Behavior Score, Coherence Score):

Behavior Axis    Benign        SSS           ActAdd        CAA           Human-Prompt
Beshift          (2.9, 89.5)   (82.4, 86.1)  (59.7, 79.9)  (77.3, 76.4)  (10.2, 85.5)
Evil             (0.0, 88.2)   (73.9, 87.6)  (35.7, 80.6)  (81.4, 73.6)  (0.0, 86.1)
Hallucination    (9.9, 84.4)   (54.5, 86.8)  (46.3, 79.8)  (60.2, 82.8)  (18.5, 89.0)
Sycophancy       (22.1, 88.9)  (84.7, 88.6)  (78.8, 76.6)  (83.4, 79.6)  (28.3, 88.4)

SSS demonstrates superior behavioral shift compared to ActAdd and competitive results relative to CAA, with consistently higher coherence (∼86–88 vs. <80).

5.3 Impact on General Capabilities

SSS does not produce significant degradation in unrelated tasks:

  • Llama-3.1-8B: MMLU 66.1→65.8 (–0.3), GSM8K 76.8→76.3 (–0.5)
  • Qwen3-14B: MMLU 83.7→83.6 (–0.1), GSM8K 88.3→88.5 (+0.2)

This suggests the steering effect can remain behaviorally specific while preserving general model utility.

6. Security Considerations and Defensive Strategies

6.1 Activation-Space as Attack Surface

SSS transforms activation steering from an alignment mechanism into a precise, covert attack vector. The covert drift produced by SSS unfolds gradually; initial tokens are benign, circumventing standard guardrails which prioritize detection of abrupt harmful text.

6.2 Guardrail Evasion Rates

On the “Evil” axis for Llama-3.1-8B–Instruct:

  • Llama-Guard-3-8B flags 60% as harmful
  • Qwen3Guard-Gen-8B flags 74.5%
  • 25–40% of SSS-induced outputs evade these mechanisms

6.3 Countermeasures

Defensive proposals include:

  • Introspection: training chain-of-thought models for semantic drift self-detection (early evidence, <5% detection rate)
  • Activation-space monitoring: tracking directional projections for anomalous drift (a minimal monitoring sketch follows this list)
  • Robust training: fine-tuning with adversarial activation injections to immunize high-gain (compression-valley) layers
  • Hybrid checks: combining prompt-based and hidden-state audit methods
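Of these, activation-space monitoring is the most directly implementable: track per-token directional projections onto known steering directions and alarm on sustained drift. The sketch below is a minimal illustration; the baseline window and z-score threshold are assumptions, not values from the paper.

```python
import numpy as np

def drift_alarm(dp: np.ndarray, window: int = 50, z_thresh: float = 3.0) -> bool:
    """Flag a generation whose recent directional projection drifts far from
    its early-sequence baseline (mean/std over the first `window` tokens)."""
    if len(dp) < 2 * window:
        return False                             # too short to establish a baseline
    base = dp[:window]
    mu, sd = base.mean(), base.std() + 1e-8      # epsilon guards constant baselines
    recent = dp[-window:].mean()
    return bool(abs(recent - mu) / sd > z_thresh)
```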

A plausible implication is that mitigation of SSS-type attacks necessitates model introspection and proactive monitoring of internal dynamics beyond traditional prompt or weight analysis.

7. Significance and Potential Research Trajectories

Sensitivity-Scaled Steering operationalizes architectural vulnerabilities and local Jacobian amplification to create a highly efficient, adaptive, and covert behavioral manipulation strategy. As demonstrated empirically across multiple models and axes, SSS advances the threat model for LLM security, indicating that activation-space attacks demand robust, dynamic introspective defenses. Further research may focus on scalable detection and immunization methodologies, as well as principled model rewiring of high-gain and anchor regions to reduce attack surface (Xu et al., 21 Nov 2025).

Get notified by email when new papers are published related to Sensitivity-Scaled Steering (SSS).