Refusal Heads in Transformer Models

Updated 11 April 2026

Refusal heads are specialized transformer attention heads that control the model's refusal to generate harmful or policy-violating responses.
Identification methods include causal tracing, refusal direction projections, and activation scaling to measure their impact on output safety.
Ablation experiments show that these sparse heads play a critical role in safety alignment, influencing both refusal success and vulnerability to jailbreak attacks.

Refusal heads are a mechanistically identifiable, sparsely distributed subset of transformer attention heads whose outputs play a critical causal role in enforcing an LLM’s aligned refusal behavior, that is, the suppression of harmful, unsafe, or policy-violating continuations. This concept, established in recent interpretability-driven research, underlies both the vulnerability of current models to jailbreak attacks and emerging strategies for robust safety alignment. Refusal is not a diffuse property of the network: ablating or amplifying these heads’ activations directly changes the likelihood that a model refuses or complies with unsafe requests. Recent literature further generalizes refusal as an emergent geometry (one or more directions or cones in residual activation space) that is often mediated or implemented by these specialized heads.

1. Formal Definition and Mechanistic Characterization

Refusal heads, termed “safety heads” in some works, are empirically defined via two intervention operations (Deng et al., 9 Mar 2026, Huang et al., 27 Aug 2025, Siu et al., 30 May 2025):

Necessity: Setting the activation of an individual attention head $h^*$ to zero at inference time (head ablation) raises the model’s Attack Success Rate (ASR) on harmful prompts by tens of percentage points, indicating that this head normally suppresses harmful outputs.
Sufficiency: Amplifying the same head’s activations (scaling by $w>1$) reduces ASR, suppresses harmful continuations, and boosts refusal response rates.

At the circuit level, each layer’s output to the residual stream is a sum over head outputs and feedforward (MLP) contributions:

$R^{l+1} = R^l + \sum_{h} s_{l,h} + \ldots$

A refusal head $h^*$ adds its output vector $s_{l,h^*}$ to the residual stream $R^l$, shifting representations toward the refusal direction, which is typically associated with refusal phrases such as “I’m sorry, but I can’t help with that.” Removing or downweighting $h^*$ weakens this shift, allowing the intrinsic generative “continuation” heads to drive compliance with harmful requests (Deng et al., 9 Mar 2026).
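The residual update and the two interventions (ablation and amplification) can be illustrated with a toy sketch. This is not the cited authors' code: the three-dimensional vectors, the head names, and the choice of axis 0 as the refusal direction are all assumptions made for illustration, and the MLP term is omitted.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def residual_update(residual, head_outputs, scales=None):
    """Return R^{l+1} = R^l + sum_h w_h * s_{l,h} (MLP term omitted).

    `scales` maps a head name to its weight w_h; heads not listed use 1.0.
    """
    scales = scales or {}
    updated = list(residual)
    for head, vec in head_outputs.items():
        w = scales.get(head, 1.0)
        updated = [u + w * v for u, v in zip(updated, vec)]
    return updated


# Toy setup (all values hypothetical): axis 0 plays the role of the refusal direction.
refusal_dir = [1.0, 0.0, 0.0]
residual = [0.0, 0.5, 0.5]
head_outputs = {
    "refusal_head": [2.0, 0.0, 0.0],        # stands in for h*
    "continuation_head": [-0.5, 1.0, 0.0],  # pushes away from refusal
}

baseline = dot(residual_update(residual, head_outputs), refusal_dir)
ablated = dot(residual_update(residual, head_outputs, {"refusal_head": 0.0}), refusal_dir)
amplified = dot(residual_update(residual, head_outputs, {"refusal_head": 2.0}), refusal_dir)
```

In this toy setting, zeroing the refusal head leaves only the continuation head's contribution along the refusal axis, while doubling it moves the residual further toward refusal.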

2. Methodologies for Identification

Multiple rigorous pipelines have been introduced for the localization of refusal heads:

a. Path Patching and Causal Tracing

A three-phase protocol is used: a clean prompt ($P_{cl}$), a corrupted, continuation-inducing prompt ($P_{cor}$), and a patched run (the clean context, but with a particular head’s activation copied from the $P_{cor}$ run). From these runs, a per-head KL-divergence “patching effect” is computed. Heads with a large positive patching effect are deemed causally critical for safety (Deng et al., 9 Mar 2026).
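A minimal sketch of the per-head patching loop follows. It replaces the real model with a toy linear readout from scalar head activations to two output tokens, so the function names, the readout weights, and the direction of the KL divergence (clean relative to patched) are assumptions rather than details taken from the cited work.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_logits(head_acts, readout):
    """Hypothetical stand-in for a forward pass: logits are a linear readout of head activations."""
    n_tokens = len(next(iter(readout.values())))
    return [sum(head_acts[h] * readout[h][j] for h in head_acts) for j in range(n_tokens)]


def patching_effects(clean_acts, corrupted_acts, readout):
    """For each head, copy its corrupted activation into the clean run and measure the KL shift."""
    clean_probs = softmax(toy_logits(clean_acts, readout))
    effects = {}
    for head in clean_acts:
        patched = dict(clean_acts)
        patched[head] = corrupted_acts[head]
        patched_probs = softmax(toy_logits(patched, readout))
        effects[head] = kl_divergence(clean_probs, patched_probs)
    return effects


# Toy data (hypothetical): output tokens are [refuse, comply].
readout = {"h0": [1.0, 0.0], "h1": [0.0, 1.0], "h2": [0.2, 0.1]}
clean_acts = {"h0": 3.0, "h1": 1.0, "h2": 0.5}
corrupted_acts = {"h0": 0.0, "h1": 1.0, "h2": 0.4}

effects = patching_effects(clean_acts, corrupted_acts, readout)
most_critical = max(effects, key=effects.get)
```

Here the head whose activation differs most between the clean and corrupted runs, and which writes to the refusal token, receives the largest effect; a head with identical activations in both runs receives zero.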

b. Refusal Direction Projections

Refusal heads are frequently identified as those whose output vectors project maximally onto a learned “refusal direction” $\hat{r}$ in the residual stream. This direction is obtained via difference-in-means (harmful vs. harmless prompt activations), gradient-based “Refusal Direction Optimization” (RDO), or the automated COSMIC framework (Siu et al., 30 May 2025, Huang et al., 27 Aug 2025, Chhabra et al., 5 Apr 2025, Wollschläger et al., 24 Feb 2025). The per-head alignment can be written, for example as a mean projection over a prompt set $\mathcal{D}$, as:

$a_{l,h} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \langle s_{l,h}(x), \hat{r} \rangle$

where $s_{l,h}(x)$ is the output of head $h$ in layer $l$ on prompt $x$ (Huang et al., 27 Aug 2025).
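The difference-in-means construction and the per-head projection can be sketched as follows. The two-dimensional activations and head labels are made up for illustration, and averaging the dot product over prompts is one plausible reading of "alignment", not necessarily the exact score used in the cited papers.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(harmful_acts, harmless_acts):
    """Difference-in-means direction: mean(harmful) - mean(harmless), normalized."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return normalize([a - b for a, b in zip(mu_harmful, mu_harmless)])


def head_alignment(outputs_per_prompt, direction):
    """Mean projection of one head's outputs onto the direction, over prompts."""
    scores = [sum(a * b for a, b in zip(out, direction)) for out in outputs_per_prompt]
    return sum(scores) / len(scores)


# Toy residual activations (hypothetical values).
harmful_acts = [[2.0, 0.0], [4.0, 0.0]]
harmless_acts = [[0.0, 0.0], [2.0, 0.0]]
direction = refusal_direction(harmful_acts, harmless_acts)

# Toy per-head outputs on two prompts (hypothetical head labels).
head_outputs = {
    "L10.H3": [[1.5, 0.2], [2.5, -0.2]],
    "L4.H1": [[0.1, 1.0], [-0.1, 1.0]],
}
alignment = {h: head_alignment(outs, direction) for h, outs in head_outputs.items()}
ranked = sorted(alignment, key=alignment.get, reverse=True)
```

Ranking heads by this score puts the head that writes along the refusal direction first; a head that writes mostly orthogonally to it scores near zero.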

c. Activation Scaling

Head-level impact is validated by sweeping a scalar weight $w$:

$\tilde{s}_{l,h^*} = w \cdot s_{l,h^*}$

and capturing the monotonic tradeoff between refusal rate and ASR (Deng et al., 9 Mar 2026).
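A scaling sweep can be illustrated with a toy proxy for ASR. The assumptions here are loud ones: each prompt is reduced to a pair (projection of the rest of the model onto the refusal direction, projection of the refusal head), and an attack is counted as successful when the combined projection is at or below a threshold. Real evaluations would generate text and judge it.

```python
def attack_success_rate(prompts, w, threshold=0.0):
    """Toy ASR: fraction of prompts whose refusal projection stays at or below the threshold.

    Each prompt is a (base_projection, head_projection) pair; the head term is scaled by w.
    """
    successes = sum(1 for base, head in prompts if base + w * head <= threshold)
    return successes / len(prompts)


# Hypothetical harmful prompts: the rest of the model leans toward compliance (negative base),
# and the refusal head contributes +1.0 along the refusal direction.
prompts = [(-0.5, 1.0), (-1.0, 1.0), (-1.5, 1.0), (-2.0, 1.0)]
weights = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]

curve = [(w, attack_success_rate(prompts, w)) for w in weights]
asr_values = [asr for _, asr in curve]
is_monotone = all(a >= b for a, b in zip(asr_values, asr_values[1:]))
```

In this sketch, $w = 0$ corresponds to ablation and $w > 1$ to amplification; the toy ASR never increases as $w$ grows, which is the kind of monotonic tradeoff the sweep is meant to expose.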

d. Head-type Attribution in Chains-of-Thought

For models with reasoning traces, linear probes are trained to score refusal intent at each token. Per-head contributions are then quantified by inserting or ablating head outputs at the critical “cliff” position, directly measuring their effect on the refusal score and ultimately on ASR (Yin et al., 7 Oct 2025).
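Because the residual stream is a sum of head outputs and a linear probe is linear in the residual, each head's contribution to the probe score can be read off directly, and ablating a head changes the score by exactly that amount. The sketch below assumes a probe has already been trained; the probe weights, head names, and vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def residual_at_position(base, head_outputs, ablate=()):
    """Sum the base residual and all head outputs, skipping ablated heads."""
    total = list(base)
    for head, out in head_outputs.items():
        if head in ablate:
            continue
        total = [t + o for t, o in zip(total, out)]
    return total


def probe_score(residual, probe_w, probe_b):
    """Linear refusal-intent score: higher means stronger refusal intent."""
    return dot(residual, probe_w) + probe_b


def head_contributions(head_outputs, probe_w):
    """Per-head share of the probe score (valid because the probe is linear)."""
    return {head: dot(out, probe_w) for head, out in head_outputs.items()}


# Hypothetical probe and activations at the critical token position.
probe_w, probe_b = [1.0, -1.0], 0.0
base = [1.0, 0.0]
head_outputs = {"h_a": [0.5, 0.0], "h_b": [0.0, 2.0], "h_c": [0.2, 0.1]}

full_score = probe_score(residual_at_position(base, head_outputs), probe_w, probe_b)
contributions = head_contributions(head_outputs, probe_w)
suppressors = [h for h, c in contributions.items() if c < 0]
score_without = probe_score(
    residual_at_position(base, head_outputs, ablate=suppressors), probe_w, probe_b
)
```

Heads with negative contributions lower the refusal score at that position; removing them in this toy example raises the score, which mirrors the ablation step described above.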

3. Structural and Functional Diversity

Refusal heads are not an architectural constant: their function and breadth differ across model scale, training regime, and architecture.

In Llama-2, refusal heads encode harmfulness recognition, driving up harmfulness detection rates (HDR) on unsafe prompts when amplified, but are prone to over-rejection of safe instructions at high scaling (Deng et al., 9 Mar 2026).
In Qwen2.5, refusal heads specialize in refusal execution, showing low HDR increases on safe data and a sharp HDR dropoff on harder attacks, indicating a division of labor between harm recognition and refusal actuation (Deng et al., 9 Mar 2026).
COSMIC and geometric analyses reveal that refusal-relevant mechanisms frequently form not a single direction but a multi-dimensional “concept cone” of up to 5 dimensions in large models, the axes of which may be implemented by distinct sets of heads or MLPs. Ablation or suppression along multiple axes is required to neutralize aligned refusals (Wollschläger et al., 24 Feb 2025).
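Why a single-direction ablation can fall short when refusal spans several axes can be shown with a small projection sketch. As a simplification, the cone is treated here as the linear subspace spanned by its axes, and the directions and activation are made-up three-dimensional vectors.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions):
    """Gram-Schmidt: return an orthonormal basis for the span of `directions`."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip directions already in the span
            basis.append([vi / norm for vi in v])
    return basis


def ablate_subspace(x, directions):
    """Remove from x its component inside the span of `directions`."""
    out = list(x)
    for b in orthonormalize(directions):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Hypothetical refusal axes and activation.
d1, d2 = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
x = [2.0, 3.0, 4.0]

after_one = ablate_subspace(x, [d1])      # only the first axis removed
after_both = ablate_subspace(x, [d1, d2])  # both axes removed
residual_on_d2 = dot(after_one, d2)       # refusal signal that survives a single-axis ablation
```

Removing only the first axis leaves a nonzero component along the second, so a readout aligned with that second axis could still recover the refusal signal; removing the whole span leaves none.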

4. Concentration, Sparsity, and Robustness

Refusal heads are typically sparse:

In canonical chat models, ablating 50–100 heads (out of 1024 total, i.e., 5–10%) can drive refusal failure rates (harmfulness) from near 0% to 40–80% under red-teaming attacks (Huang et al., 27 Aug 2025).
In chain-of-thought models, as few as 3% of heads account for most of the “refusal cliff”; these heads suppress refusal, so ablating them restores robust safety even under complex jailbreaks (Yin et al., 7 Oct 2025).
Head-frequency heatmaps confirm a disproportionate functional load on a small subset of heads, often concentrated in mid-to-upper layers.
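The sparsity figures above are easy to sanity-check, and the idea of a small subset carrying most of the load can be made concrete by asking how many top-ranked heads are needed to cover a given share of the total measured effect. The effect values below are synthetic and the function name is hypothetical.

```python
def smallest_covering_set(effects, share=0.5):
    """Return the fewest top-ranked heads whose summed effect reaches `share` of the total."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(effects.values())
    chosen, running = [], 0.0
    for head, effect in ranked:
        chosen.append(head)
        running += effect
        if running >= share * total:
            break
    return chosen


# Arithmetic behind "50-100 heads out of 1024": roughly 4.9% to 9.8% of all heads.
low_fraction = 50 / 1024
high_fraction = 100 / 1024

# Synthetic per-head effects: 3 heads carry most of the load, 97 carry very little.
effects = {f"head_{i}": (10.0 if i < 3 else 0.1) for i in range(100)}
covering = smallest_covering_set(effects, share=0.5)
covering_fraction = len(covering) / len(effects)
```

With these synthetic numbers, two heads out of one hundred already account for half of the total effect, which is the kind of concentration a head-frequency heatmap would make visible.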

However, heavy concentration creates a “single-point-of-failure” risk: adaptive jailbreaks can target these heads for suppression or inversion, bypass