Sensitivity-Scaled Steering (SSS)
- Sensitivity-Scaled Steering (SSS) is a white-box activation-space attack method that targets decoder-only LLMs to induce substantial behavioral shifts.
- It exploits high-gain regions, compression valleys, and BOS anchoring to amplify small, well-aligned perturbations effectively.
- The technique achieves efficient perturbation allocation while maintaining output fluency, posing significant security risks and prompting defensive research.
Sensitivity-Scaled Steering (SSS) is a progressive white-box activation-space attack methodology targeting decoder-only LLMs. SSS exploits high-gain regions in the model’s residual stream that are causally sensitive to small, well-aligned perturbations, leveraging architectural phenomena such as attention sinks and compression valleys. Designed to induce substantial shifts in behaviors such as sycophancy, hallucination, malicious intent, and sentiment while maintaining output fluency and coherence, SSS poses a security risk for both white-box and supply-chain LLM deployments by rendering the traditionally overlooked activation space an actionable attack surface (Xu et al., 21 Nov 2025).
1. Objectives and Conceptual Overview
SSS is predicated on steering model behavior by injecting small vectors into the hidden-state residual stream at one or more layers. The attack proceeds progressively, beginning with the injection of a seed perturbation at the beginning-of-sequence (BOS) token within a compression-valley layer—a mechanism termed BOS anchoring. Subsequent tokens are reinforced according to their local model sensitivity, enabling focused application of a limited perturbation budget. The primary objective is to realize pronounced behavioral drift across explicit axes (sycophancy, hallucination, “evil” persona, and sentiment polarity), without compromising the coherence of generated text.
2. Theoretical Foundations
2.1 Autoregressive Residual Updates and Causal Amplification
In decoder-only LLMs, the residual state $h_i^{(l)}$ of token $i$ at layer $l$ is updated as

$$h_i^{(l+1)} = h_i^{(l)} + F^{(l)}\big(h_i^{(l)},\, h_{\le i}^{(l)}\big),$$

where $F^{(l)}$ denotes the combined attention and MLP update. Within this autoregressive scheme, small perturbations introduced at a prior token propagate to subsequent tokens according to the Jacobian $J_i = \partial h_i^{(l)} / \partial h_{i-1}^{(l)}$, governing local amplification via its spectral norm $\sigma_{\max}(J_i)$: a perturbation $\delta$ grows by at most a factor of $\sigma_{\max}(J_i)$ per step, and this bound is approached when $\delta$ aligns with the top right singular vector of $J_i$.
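The following toy sketch (an assumed setup, not the paper's code; a two-layer MLP stands in for a transformer block's update $F$, so the residual Jacobian is $I + \partial F / \partial h$) illustrates why alignment matters: a perturbation along the top right singular direction realizes nearly the full spectral gain, while a random direction typically does not.

```python
# Toy demonstration: aligned perturbations get amplified near sigma_max.
import torch

torch.manual_seed(0)
d = 64
# A small MLP stands in for a transformer block's update F.
F = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))

h = torch.randn(d)
# Residual update h' = h + F(h), so its Jacobian is I + dF/dh.
J = torch.eye(d) + torch.autograd.functional.jacobian(F, h)

U, S, Vh = torch.linalg.svd(J)
v_top = Vh[0]                                        # most-amplified input direction
v_rand = torch.nn.functional.normalize(torch.randn(d), dim=0)

eps = 1e-2
print("sigma_max    :", S[0].item())
print("aligned gain :", (J @ (eps * v_top)).norm().item() / eps)   # ~ sigma_max
print("random gain  :", (J @ (eps * v_rand)).norm().item() / eps)  # typically smaller
```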
2.2 Compression Valleys and BOS Attention Sinks
Intermediate (mid-depth) layers in transformer architectures often exhibit compression valleys, wherein activations converge toward low-rank (near–rank-one) subspaces dominated by BOS attention sinks. The BOS token accumulates significant attention, anchoring the direction along which perturbations can be efficiently propagated and amplified. Perturbations that are well-aligned with this dominant direction exhibit pronounced causal amplification, resulting in cumulative behavioral effects that are disproportionately large compared to the initial perturbation magnitude.
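A compression valley can be located empirically by measuring how much of each layer's activation energy concentrates in a single direction. The sketch below is a plausible diagnostic, not the paper's code; it assumes per-layer hidden states as returned by a Hugging Face model called with `output_hidden_states=True`, with the batch dimension squeezed out.

```python
# Plausible diagnostic for compression valleys: find layers whose token
# activations collapse toward a near-rank-one subspace.
import torch

def top_direction_energy(H: torch.Tensor) -> float:
    """Fraction of spectral energy carried by the top singular direction of a (seq_len, d) matrix."""
    S = torch.linalg.svdvals(H.float())
    return (S[0] ** 2 / (S ** 2).sum()).item()

def compression_valley_layers(hidden_states, threshold: float = 0.9):
    """Layers where a single direction (typically the BOS sink) dominates."""
    energies = [top_direction_energy(H) for H in hidden_states]
    return [l for l, e in enumerate(energies) if e >= threshold], energies
```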
3. Algorithmic Structure of SSS
3.1 Steering Direction Extraction
The steering direction is obtained by analyzing a contrastive prompt dataset $\mathcal{D}$, collecting mean hidden states $\mu_+^{(l)}$ and $\mu_-^{(l)}$ for “positive” and “negative” responses at layer $l$. Two candidate directions are computed:
- $v_{\text{diff}} = \mu_+^{(l)} - \mu_-^{(l)}$ (difference of means),
- $v_{\text{pca}}$ (top principal component of the contrastive activations),

which are combined into the steering vector $v_{\text{steer}}$ and normalized to unit norm. The optimal layer $l^*$ is selected in the 20%–85% depth range by maximizing the observed behavioral shift.
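A minimal sketch of this extraction step, assuming NumPy arrays of collected layer-$l^*$ activations; the rule used here to combine the two candidates is an illustrative assumption, since the text specifies only that they are combined and normalized:

```python
# Sketch of Section 3.1: build a steering direction from contrastive activations.
# pos_acts / neg_acts: (n, d) arrays of layer-l* hidden states for
# positive / negative contrastive responses.
import numpy as np

def steering_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    v_diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)   # difference of means
    X = np.concatenate([pos_acts, neg_acts], axis=0)
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v_pca = Vt[0]                                            # top principal component
    if np.dot(v_pca, v_diff) < 0:                            # resolve PCA sign ambiguity
        v_pca = -v_pca
    v = v_diff / np.linalg.norm(v_diff) + v_pca              # assumed combination rule
    return v / np.linalg.norm(v)                             # unit-norm steering vector
```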
3.2 Sensitivity Signal Computation
For each token $i$ at the chosen layer $l^*$:
- Compute the local Jacobian $J_i = \partial h_i^{(l^*)} / \partial h_{i-1}^{(l^*)}$ and its top singular vector $v_{\max}$.
- Compute the correlation with the steering direction: $\rho_i = \dfrac{v_{\text{steer}} \cdot v_{\max}}{\|v_{\text{steer}}\|\,\|v_{\max}\|}$.
- Compute the directional gain $g_i = \|J_i v_{\text{steer}}\|$ and its normalized form $\gamma_i = g_i / \big(\sigma_{\max}(J_i)\,\|v_{\text{steer}}\|\big)$.
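These three signals follow directly from the per-token Jacobian; the sketch below assumes PyTorch and a dense $(d, d)$ Jacobian (in practice $J_i$ would be obtained via automatic differentiation, as in the amplification example above):

```python
# Per-token sensitivity signals of Section 3.2. A sketch, not the
# authors' implementation.
import torch

def sensitivity_signals(J_i: torch.Tensor, v_steer: torch.Tensor):
    U, S, Vh = torch.linalg.svd(J_i)
    v_max = Vh[0]                                  # most-amplified input direction
    rho = torch.dot(v_steer, v_max) / (v_steer.norm() * v_max.norm())
    g = (J_i @ v_steer).norm()                     # directional gain g_i
    gamma = g / (S[0] * v_steer.norm())            # normalized gain gamma_i
    return rho.item(), g.item(), gamma.item()
```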
3.3 BOS Anchoring and Adaptive Reinforcement
A seed perturbation is injected at the BOS position ($i = 0$):

$$\delta_0 = \psi(-\rho_0)\,\frac{v_{\text{steer}}}{\|v_{\text{steer}}\|}, \qquad \tilde h_0 = h_0 + \delta_0,$$

where $\psi$ is a smooth sigmoid and $v_{\text{steer}}/\|v_{\text{steer}}\|$ is the normalized steering direction.

For each subsequent token $i$:

$$\delta_i = \psi(\gamma_i - \rho_i)\,\frac{v_{\text{steer}}}{\|v_{\text{steer}}\|}, \qquad \tilde h_i = h_i + \delta_i.$$

Larger micro-injections are thus applied where the model is both highly amplifiable ($\gamma_i$ large) and insufficiently aligned ($\rho_i$ small), constituting adaptive reinforcement.
Pseudocode Summary
```
for each token position i:
    J_i   = ∂h_i^(l*) / ∂h_{i-1}^(l*)               # local Jacobian at layer l*
    v_max = top singular vector of J_i
    ρ_i   = (v_steer · v_max) / (‖v_steer‖ ‖v_max‖) # alignment with steering direction
    g_i   = ‖J_i v_steer‖                           # directional gain
    γ_i   = g_i / (σ_max(J_i) · ‖v_steer‖)          # normalized gain
    if i == 0:                                      # BOS anchoring
        δ_0 = ψ(−ρ_0) · (v_steer / ‖v_steer‖)
        h̃_0 = h_0 + δ_0
    else:                                           # adaptive reinforcement
        δ_i = ψ(γ_i − ρ_i) · (v_steer / ‖v_steer‖)
        h̃_i = h_i + δ_i
    continue decoding with h̃_i fed to layer l*+1
```
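In a real model, the micro-injections can be realized with a forward hook on the chosen layer. The sketch below is a hedged illustration, not the authors' implementation: the module path follows Llama-style Hugging Face models, $\psi$ is assumed to be the standard logistic sigmoid (the text specifies only "a smooth sigmoid"), and the per-token signal computation, which requires the Jacobians above, is left abstract as `signal_fn`.

```python
# Hedged sketch: micro-injections via a forward hook on a decoder layer.
import torch

def make_sss_hook(v_steer: torch.Tensor, signal_fn):
    """signal_fn(hidden) -> (batch, seq) psi arguments (-rho_0 at BOS,
    gamma_i - rho_i afterwards); left abstract here."""
    v_unit = v_steer / v_steer.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        scale = torch.sigmoid(signal_fn(hidden))          # per-token coefficients
        hidden = hidden + scale.unsqueeze(-1) * v_unit    # micro-injections
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: attach at the selected layer l*, generate, detach.
# handle = model.model.layers[l_star].register_forward_hook(make_sss_hook(v, f))
# ...model.generate(...)...
# handle.remove()
```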
4. Perturbation Budget and Efficiency
Each micro-injection has norm $\|\delta_i\| = \psi(\cdot) \le 1$, since the steering direction is normalized to unit length and the sigmoid scaling is bounded. SSS avoids the inefficiencies of fixed-coefficient methods, which can either undershoot (insufficient behavioral effect) or overshoot (loss of coherence and fluency), by focusing the limited perturbation amplitude where the Jacobian-based sensitivity is maximal. This budget-efficient allocation preserves baseline fluency and semantic coherence while enabling robust steering along the designated behavior axis.
5. Empirical Evaluation
5.1 Models and Behavioral Axes
SSS was empirically validated on several open-weight models:
- Qwen3-14B (dense, CoT)
- Llama-3.1-8B-Instruct (dense, non-CoT)
- DeepSeek-R1-7B (dense, CoT)
- GPT-OSS-20B (MoE, CoT)
Behavioral axes evaluated include:
- Evil (Persona Evil)
- Hallucination (TruthfulQA, Persona Hallucination)
- Sycophancy (reward-hacking prompts)
- Beshift (sentiment drift on reviews)
5.2 Metrics and Quantitative Outcomes
Key metrics:
- Behavior Score (0–100; LLM-as-Judge, Gemini 2.5 Flash)
- Directional Projection (DP): per-token inner product of the hidden state with the steering vector
- Turning point: first position at which the 100-token running mean of DP crosses from negative to positive
- Coherence Score (0–100; measures fluency)
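The DP trace and turning point are straightforward to compute from recorded hidden states; a sketch assuming NumPy, with the 100-token window taken from the definition above:

```python
# Computing the DP trace and turning point from recorded layer-l* states.
import numpy as np

def dp_trace(hidden: np.ndarray, v_steer: np.ndarray) -> np.ndarray:
    """Per-token inner product of (seq_len, d) states with the unit steering vector."""
    return hidden @ (v_steer / np.linalg.norm(v_steer))

def turning_point(dp: np.ndarray, window: int = 100):
    """First position where the trailing `window`-token mean DP turns positive."""
    means = np.array([dp[max(0, i - window + 1): i + 1].mean() for i in range(len(dp))])
    crossings = np.where((means[:-1] < 0) & (means[1:] >= 0))[0]
    return int(crossings[0] + 1) if crossings.size else None
```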
Comparison on Qwen3-14B; each cell reports (Behavior Score, Coherence Score), averaged across samples:
| Behavior Axis | Benign | SSS | ActAdd | CAA | Human-Prompt |
|---|---|---|---|---|---|
| Beshift | (2.9, 89.5) | (82.4, 86.1) | (59.7, 79.9) | (77.3, 76.4) | (10.2, 85.5) |
| Evil | (0.0, 88.2) | (73.9, 87.6) | (35.7, 80.6) | (81.4, 73.6) | (0.0, 86.1) |
| Hallucination | (9.9, 84.4) | (54.5, 86.8) | (46.3, 79.8) | (60.2, 82.8) | (18.5, 89.0) |
| Sycophancy | (22.1, 88.9) | (84.7, 88.6) | (78.8, 76.6) | (83.4, 79.6) | (28.3, 88.4) |
SSS demonstrates superior behavioral shift compared to ActAdd and competitive results relative to CAA, with consistently higher coherence (∼86–88 vs. <80).
5.3 Impact on General Capabilities
SSS does not produce significant degradation in unrelated tasks:
- Llama-3.1-8B: MMLU 66.1→65.8 (–0.3), GSM8K 76.8→76.3 (–0.5)
- Qwen3-14B: MMLU 83.7→83.6 (–0.1), GSM8K 88.3→88.5 (+0.2)
This suggests the steering effect can remain behaviorally specific while preserving general model utility.
6. Security Considerations and Defensive Strategies
6.1 Activation-Space as Attack Surface
SSS transforms activation steering from an alignment mechanism into a precise, covert attack vector. The covert drift produced by SSS unfolds gradually: initial tokens are benign, circumventing standard guardrails, which prioritize detection of abrupt, overtly harmful text.
6.2 Guardrail Evasion Rates
On the “Evil” axis for Llama-3.1-8B-Instruct:
- Llama-Guard-3-8B flags 60% of SSS-induced outputs as harmful
- Qwen3Guard-Gen-8B flags 74.5%
- Consequently, 25–40% of SSS-induced outputs evade these mechanisms
6.3 Countermeasures
Defensive proposals include:
- Introspection: training chain-of-thought models for semantic drift self-detection (early evidence, 5% detection rate)
- Activation-space monitoring: tracking directional projections for anomalous drift (see the sketch after this list)
- Robust training: fine-tuning with adversarial activation injections to immunize high-gain (compression-valley) layers
- Hybrid checks: combining prompt-based and hidden-state audit methods
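As a concrete illustration of the activation-space monitoring proposal above, the following sketch flags sustained anomalous drift of directional projections against a benign baseline; the z-score rule, window length, and threshold are illustrative assumptions, not the paper's design:

```python
# Illustrative activation-space monitor: flag generations whose projection
# onto a known-risk direction drifts anomalously versus a benign baseline.
import numpy as np

class DriftMonitor:
    def __init__(self, v_risk: np.ndarray, baseline_dp: np.ndarray, z_thresh: float = 4.0):
        self.v = v_risk / np.linalg.norm(v_risk)
        self.mu = baseline_dp.mean()                 # benign-run DP statistics
        self.sigma = baseline_dp.std() + 1e-8
        self.z_thresh = z_thresh

    def flags_drift(self, hidden: np.ndarray, window: int = 20) -> bool:
        """hidden: (seq_len, d) layer activations; True = sustained anomalous drift."""
        z = (hidden @ self.v - self.mu) / self.sigma
        smooth = np.convolve(z, np.ones(window) / window, mode="valid")
        return bool((smooth > self.z_thresh).any())
```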
A plausible implication is that mitigation of SSS-type attacks necessitates model introspection and proactive monitoring of internal dynamics beyond traditional prompt or weight analysis.
7. Significance and Potential Research Trajectories
Sensitivity-Scaled Steering operationalizes architectural vulnerabilities and local Jacobian amplification to create a highly efficient, adaptive, and covert behavioral manipulation strategy. As demonstrated empirically across multiple models and axes, SSS advances the threat model for LLM security, indicating that activation-space attacks demand robust, dynamic introspective defenses. Further research may focus on scalable detection and immunization methodologies, as well as principled model rewiring of high-gain and anchor regions to reduce attack surface (Xu et al., 21 Nov 2025).