
Refusal Direction in LLM Safety

Updated 8 November 2025
  • The refusal direction is a low-dimensional subspace of a transformer's activation space that encodes and triggers refusal behavior on harmful or unsafe prompts.
  • Manipulating this direction via addition or ablation directly toggles model responses, with effects quantified by scaling laws and cross-modal evaluations.
  • Robust safety strategies leverage multidimensional and affine structures of refusal mechanisms to improve alignment, resist jailbreaks, and preserve utility.

A refusal direction is a vector or low-dimensional subspace in a model's internal activation space that mechanistically mediates refusal behavior—i.e., the model's tendency to decline generating responses to harmful, unsafe, or undesired prompts. This concept is foundational to the mechanistic understanding and controllability of safety features in LLMs, with implications spanning alignment, adversarial robustness, interpretability, and multimodal model safety.

1. Linear Refusal Directions: Definition, Identification, and Causal Role

The refusal direction is operationally defined via the difference-in-means of residual stream (hidden state) activations between refused (e.g., harmful) and non-refused (e.g., harmless) prompts, at a fixed layer $l$ and token position $i$:

$$\mathbf{r}_i^{(l)} = \boldsymbol{\mu}_i^{(l)} - \boldsymbol{\nu}_i^{(l)},$$

where $\boldsymbol{\mu}_i^{(l)}$ and $\boldsymbol{\nu}_i^{(l)}$ are the mean activations over the refused and non-refused datasets, respectively (Arditi et al., 17 Jun 2024).

For the family of Llama models and across other architectures (Qwen, Gemma, Yi, etc.), this direction is reliably identified at an early-to-mid transformer layer and specific post-instruction token positions, corresponding to the locus where refusal behavior is most strongly encoded (Arditi et al., 17 Jun 2024, Ali et al., 15 Jul 2025). Manipulating activations by adding or ablating this direction directly toggles the model's propensity to refuse:

  • Addition: $\mathbf{x}' = \mathbf{x} + \mathbf{r}$ induces refusal even on benign prompts.
  • Ablation: $\mathbf{x}' = \mathbf{x} - \hat{\mathbf{r}}(\hat{\mathbf{r}}^\top \mathbf{x})$ disables refusal on harmful prompts.

This one-dimensional subspace causally mediates refusal: erasing it nearly abolishes refusal, while addition triggers over-refusal, typically with minimal impact on unrelated capabilities (Arditi et al., 17 Jun 2024).
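The sketch below illustrates difference-in-means extraction and the two interventions, assuming activations at the chosen layer and token position have already been collected into tensors; the function names and shapes are illustrative, not the authors' code.

```python
import torch

def refusal_direction(refused_acts: torch.Tensor,
                      complied_acts: torch.Tensor) -> torch.Tensor:
    # r_i^(l) = mu_i^(l) - nu_i^(l): mean refused minus mean non-refused
    # activations at a fixed layer l and token position i.
    return refused_acts.mean(dim=0) - complied_acts.mean(dim=0)

def add_refusal(x: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    # Activation addition: x' = x + r induces refusal on benign prompts.
    return x + r

def ablate_refusal(x: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    # Directional ablation: x' = x - r_hat (r_hat^T x) removes the
    # refusal component, disabling refusal on harmful prompts.
    r_hat = r / r.norm()
    return x - (x @ r_hat).unsqueeze(-1) * r_hat
```

In practice these operations are applied to the residual stream during generation (e.g., via forward hooks), at the layer and position where the direction was extracted.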

2. Geometry, Generalization, and Structure of Refusal Mechanisms

A. Multi-dimensionality and Representational Independence

Contrary to prior single-direction accounts, the "concept cone" framework reveals refusal is often governed by multiple, independently controllable axes in activation space—polyhedral cones—rather than a strict line (Wollschläger et al., 24 Feb 2025). These dimensions are identified by gradient-based optimization to satisfy properties of monotonic scaling (refusal probability increases with vector magnitude) and surgical ablation (enabling or disabling refusal with minimal side effects).

Two vectors are representationally independent not merely when they are orthogonal, but when ablating one does not affect the function of the other under the model's non-linear dynamics. For some models, cones of up to five dimensions exist, each axis corresponding to a functionally independent refusal mechanism.

B. Affine Structure

Affine Concept Editing (ACE) further generalizes this to an affine decomposition. The optimal reference for "no refusal" is not the zero vector but the mean activation of non-refusal states. Letting $v_0$ be this reference, refusal behavior is controlled as

$$v' = v - \operatorname{proj}_{\mathbf{r}}(v) + \operatorname{proj}_{\mathbf{r}}(v_0) + \alpha\,\mathbf{r},$$

where $\alpha$ parameterizes the degree of refusal (Marshall et al., 13 Nov 2024). ACE achieves near-perfect standardization: $\alpha = 1$ always refuses and $\alpha = 0$ always complies, regardless of prompt content.
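A minimal sketch of the ACE update for a single activation vector $v$, assuming $\mathbf{r}$ and the non-refusal reference $v_0$ have been extracted beforehand; where the edit is applied in the forward pass follows the paper.

```python
import torch

def ace_edit(v: torch.Tensor, r: torch.Tensor, v0: torch.Tensor,
             alpha: float) -> torch.Tensor:
    # Replace v's projection onto r with v0's projection, then add
    # alpha * r to dial the degree of refusal (alpha=1 refuse, 0 comply).
    r_hat = r / r.norm()
    proj_v = (v @ r_hat) * r_hat    # proj_r(v)
    proj_v0 = (v0 @ r_hat) * r_hat  # proj_r(v0)
    return v - proj_v + proj_v0 + alpha * r
```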

3. Dynamics under Scaling, Training, and Compression

A. Scaling Laws

Contrastive activation addition (CAA) effectiveness for refusal-direction steering decreases exponentially with model size, as quantified by the empirical scaling law (Ali et al., 15 Jul 2025):

$$y = 0.081 + 2.4\,e^{-0.42x},$$

where $y$ is the peak change in matching fraction and $x$ is the parameter count in billions.
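Evaluating the law at a few common model sizes shows how quickly the steerable effect decays; this is a direct evaluation of the formula above, nothing more.

```python
import math

def peak_matching_change(params_in_billions: float) -> float:
    # Empirical CAA scaling law: y = 0.081 + 2.4 * exp(-0.42 x).
    return 0.081 + 2.4 * math.exp(-0.42 * params_in_billions)

for n in (7, 13, 70):
    print(f"{n:>2}B params -> peak change {peak_matching_change(n):.3f}")
```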

Negative steering (subtracting the refusal direction) is significantly more effective than positive steering due to RLHF-induced saturation: refusal signals in aligned models are largely "maxed out", so increasing them yields diminished returns, while reducing them quickly degrades safety (Ali et al., 15 Jul 2025).

B. Post-Training and Alignment

Post-training fundamentally alters the refusal direction relative to the base model: the cosine similarity between the base-model and post-trained directions is low, and forward transfer of a base-model direction is ineffective (Du et al., 3 Apr 2025). Refusal-vector extraction and interventions must therefore be alignment-stage-specific.

C. Compression Effects

Quantization largely preserves the refusal direction (cosine similarity $> 0.99$), so safety is robust. Pruning, in contrast, shifts or distorts the direction (cosine similarity drops to $0.35$–$0.7$), degrading safety (Chhabra et al., 5 Apr 2025). Mechanistic approaches such as AIRD restore alignment by directly manipulating residual-stream weights to reimpose the trustworthy pre-compression refusal direction.
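A simple diagnostic for this effect is to re-extract the direction after compression and compare it to the original; the helper below is an illustrative sketch.

```python
import torch
import torch.nn.functional as F

def direction_drift(r_before: torch.Tensor, r_after: torch.Tensor) -> float:
    # Cosine similarity between refusal directions extracted before and
    # after compression: ~1.0 under quantization, 0.35-0.7 after pruning.
    return F.cosine_similarity(r_before, r_after, dim=0).item()
```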

4. Refusal Direction across Languages, Modalities, and Domains

A. Cross-Lingual Universality

Refusal vectors learned in one language transfer seamlessly and with high effectiveness to other languages in the same model, due to high parallelism across multilingual activation spaces (Wang et al., 22 May 2025). This explains the ease of cross-lingual jailbreaks and underscores the importance of robust multilingual clustering of refusal-related activations.
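One way to check this parallelism empirically, sketched below under the assumption that per-language refusal directions have already been extracted, is to compute pairwise cosine similarities between them.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity(dirs: dict[str, torch.Tensor]) -> dict:
    # High pairwise cosine similarity between per-language refusal
    # directions indicates the parallel multilingual structure that
    # makes cross-lingual transfer (and jailbreaks) effective.
    langs = list(dirs)
    return {(a, b): F.cosine_similarity(dirs[a], dirs[b], dim=0).item()
            for i, a in enumerate(langs) for b in langs[i + 1:]}
```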

B. Audio-Language and Video Diffusion Models

In large audio-language models (LALMs), the distributional gap between audio and text activations precludes naive steering with audio-paired vectors; instead, text-derived directions, projected orthogonally to the principal safe subspace (via PCA), are used for balanced (helpful but safe) inference-time steering (SARSteer) (Lin et al., 20 Oct 2025). For video diffusion models, concept-specific low-rank refusal vectors are computed via contrastive PCA on multimodal prompt pairs and embedded directly into cross-attention weights, hardening the model against both targeted and surface-level attacks while minimizing collateral semantic loss (Facchiano et al., 9 Jun 2025).
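The projection step can be sketched as follows: remove from the text-derived direction its component inside the principal safe subspace, taken as the top-$k$ PCA components of safe activations. The function name and $k$ are illustrative; SARSteer's exact procedure is specified in the paper.

```python
import torch

def project_off_safe_subspace(r_text: torch.Tensor,
                              safe_acts: torch.Tensor,
                              k: int = 8) -> torch.Tensor:
    # PCA of safe activations (n, d): the top-k right singular vectors
    # of the centered data span the principal safe subspace.
    centered = safe_acts - safe_acts.mean(dim=0)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    basis = vh[:k]                          # (k, d), orthonormal rows
    # Subtract r_text's component inside the safe subspace, leaving a
    # steering direction orthogonal to it.
    return r_text - basis.T @ (basis @ r_text)
```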

5. Alignment, Robustness, and Model Safety Applications

A. Alignment Calibration and Drift

Instruction fine-tuning can cause the refusal direction to "drift", measured as a drop in cosine similarity to the original axis. This drift is rapid initially and correlates with safety degradation (Du et al., 8 Sep 2025). The ProCon method counters this by constraining hidden states so that their projection onto the initial refusal direction stays consistent, especially with a strong early-stage constraint and distributionally broad data (which increases Fisher information along the refusal axis).
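A ProCon-style constraint can be sketched as a regularizer that penalizes drift of the hidden states' projection onto the frozen initial refusal direction during fine-tuning; the exact formulation in the paper may differ.

```python
import torch

def projection_consistency_loss(h: torch.Tensor,
                                h_ref: torch.Tensor,
                                r0: torch.Tensor) -> torch.Tensor:
    # h: current hidden states (batch, d); h_ref: frozen reference
    # model's hidden states; r0: initial (pre-fine-tuning) refusal axis.
    r_hat = r0 / r0.norm()
    return ((h @ r_hat) - (h_ref @ r_hat)).pow(2).mean()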

B. Robustness to Jailbreaks and Internal Mechanism Resilience

DeepRefusal trains models with probabilistic ablation of the refusal direction during fine-tuning, forcing reconstruction of refusal and yielding strong robustness against representation-level (directional ablation, prefilling, transfer) jailbreaks (Xie et al., 18 Sep 2025). This approach reduces attack success rates by up to 95% across several models, with only marginal utility or over-refusal impact.
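The core training-time intervention can be sketched as ablating the refusal direction with some probability on each forward pass; the probability schedule and hook placement here are simplifications of the paper's method.

```python
import torch

def maybe_ablate(x: torch.Tensor, r: torch.Tensor,
                 p: float = 0.5) -> torch.Tensor:
    # With probability p, remove the refusal component so the model
    # must relearn refusal from deeper features; otherwise pass through.
    if torch.rand(()).item() < p:
        r_hat = r / r.norm()
        return x - (x @ r_hat).unsqueeze(-1) * r_hat
    return x
```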

C. Data Filtering and Fine-Tuning

The refusal direction ("refusal feature") enables prompt-level filtering: user prompts with high cosine similarity to the refusal direction are flagged as harmful and excluded from fine-tuning (Ham et al., 9 Jun 2025). This approach drastically reduces harmful output scores and maintains high task performance, even under adversarially poisoned data mixtures.

D. Over-Refusal and Utility Preservation

Methods such as AlphaSteer systematically learn steering transformations with null-space constraints (no steering for benign activations) and linear regression for safety enhancement. This achieves robust jailbreak defense while preserving helpfulness and avoiding over-refusals, outperforming heuristic baselines (Sheng et al., 8 Jun 2025). SafeConstellations, based on trajectory-level constellation patterns, enables task-specific inference-time steering that surgically reduces over-refusal by up to 73% with no utility loss (Maskey et al., 15 Aug 2025). EVOREFUSE reveals that over-refusal is often due to shortcut keyword detection in lower transformer layers and provides data for systemic alignment improvement (Wu et al., 29 May 2025).

6. Conceptual Separation and Latent Structure

Refusal and harmfulness are represented as distinct internal directions: the refusal direction governs the surface refusal behavior, while the harmfulness direction reflects the model's "belief" about the intrinsic harmfulness of an instruction (Zhao et al., 16 Jul 2025). Steering along the refusal direction generates refusals without changing harmfulness judgments; steering harmfulness can invert the model's belief. Jailbreaks often target refusal signals only, leaving latent harmfulness unchanged. Therefore, safety mechanisms leveraging latent harmfulness (e.g., Latent Guard) can be more robust than those relying solely on refusal detections.

Comparison Table: Refusal vs. Harmfulness Direction in LLMs

| Aspect | Refusal Direction | Harmfulness Direction |
|---|---|---|
| Encodes | Surface refusal act | Internal harmfulness |
| Position | Post-instruction token | Last user instruction |
| Robustness | Vulnerable to jailbreaks | Robust to jailbreaks |
| Category specificity | General | Category-specific |
| Alignment modifiability | Easily altered | Stable under fine-tuning |

7. Broader Impact and Limitations

The identification, manipulation, and monitoring of refusal directions provide actionable levers for LLM alignment, robustness, and interpretability across diverse architectures and domains (Arditi et al., 17 Jun 2024, Wollschläger et al., 24 Feb 2025). However, as refusal mechanisms are often shallow, brittle, and encoded in low dimensions, they are susceptible to targeted attacks—weight orthogonalization or probabilistic ablation can erase refusal with minimal effect on other capabilities. Multidimensional and mechanistically independent refusal features, as well as confined, category-specific axes (e.g., for harmfulness), point to the need for more robust, multi-factorial safety alignment strategies (Wollschläger et al., 24 Feb 2025, Zhao et al., 16 Jul 2025). The field is actively exploring trajectory-based and latent-structure-aware supervision, as well as interpretability-driven constraints, to transcend the fragility of current scalar-aligned solutions.
