
DeepRefusal: Probabilistic Refusal Ablation

Updated 24 February 2026
  • DeepRefusal is a probabilistic ablation technique that targets low-dimensional refusal directions in LLMs to reduce harmful response triggers.
  • It extends deterministic ablation methods with stochastic scheduling and multi-directional projections to balance safety objectives with model utility.
  • Empirical evaluations indicate that DeepRefusal suppresses refusal behavior while preserving fluency, with minimal impact on overall performance.

Probabilistic Refusal Direction Ablation (DeepRefusal) refers to a class of mechanistic interventions that target the internal representations associated with refusal behavior in LLMs by identifying, and then probabilistically ablating, directions or subspaces in activation space mediating refusals. The core hypothesis is that refusal, as triggered by safety alignment, is encoded in low-dimensional (often near one-dimensional) subspaces; probabilistic ablation of these directions disables the model’s tendency to refuse harmful instructions without large impact on model utility or fluency. DeepRefusal subsumes deterministic “model abliteration” and extends to multi-direction and stochastic schemes, offering both a novel jailbreak methodology and a tool for evaluating safety robustness. This family of techniques has developed rapidly, encompassing single vector, subspace, and concept-orthogonalized approaches (Arditi et al., 2024, Agnihotri et al., 3 Oct 2025, Xie et al., 18 Sep 2025, Piras et al., 11 Nov 2025, Joad et al., 2 Feb 2026, Cristofano, 13 Jan 2026).

1. Mechanistic Foundations: Identification of Refusal Directions

Refusal in LLMs is often encoded as the difference in mean activations at specific layers and tokens when processing “harmful” (refusal-triggering) versus “harmless” prompts. Let $h^{(\ell)}(x) \in \mathbb{R}^{d}$ be the residual stream activation at layer $\ell$ for prompt $x$.

  • Partition prompts into sets $H$ (harmful) and $S$ (harmless).
  • Compute classwise means:

$$\mu_H = \frac{1}{|H|}\sum_{x \in H} h^{(\ell)}(x), \qquad \mu_S = \frac{1}{|S|}\sum_{x \in S} h^{(\ell)}(x)$$

  • The canonical “refusal direction” is given by $v = \mu_H - \mu_S$, unit-normalized.
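The difference-of-means construction above can be illustrated on synthetic data. This is a minimal sketch, not the procedure of any cited paper: the function name `refusal_direction`, the array names, and the Gaussian stand-ins for layer activations are all assumptions made for illustration, and no model is involved.

```python
import numpy as np

def refusal_direction(acts_harmful, acts_harmless):
    """Unit-normalized difference of class means: (mu_H - mu_S) / ||mu_H - mu_S||.

    Both inputs have shape (n_prompts, d). Hypothetical helper for illustration.
    """
    diff = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("class means coincide; the direction is undefined")
    return diff / norm

# Synthetic stand-in for residual-stream activations (assumption, not real data).
rng = np.random.default_rng(0)
d = 16
true_axis = np.zeros(d)
true_axis[0] = 1.0
acts_harmless = rng.normal(size=(200, d))
acts_harmful = rng.normal(size=(200, d)) + 4.0 * true_axis  # shifted along one axis

v = refusal_direction(acts_harmful, acts_harmless)
cosine_with_true_axis = float(v @ true_axis)
```

On this toy data the recovered `v` has unit norm and aligns closely with the axis along which the two classes were separated.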

Empirical analyses have shown that across 13+ models this single axis explains the majority of refusal variance; ablating it removes refusal responses to harmful instructions, while injecting it triggers refusals even on safe prompts (Arditi et al., 2024, Joad et al., 2 Feb 2026). Principal Component Analysis (PCA) or related techniques can also be employed for better denoising or when extending to higher dimensional subspaces (Agnihotri et al., 3 Oct 2025).

The single-direction assumption is challenged by recent work demonstrating a spectrum of geometrically distinct refusal vectors governing various refusal and non-compliance styles, but most practical interventions remain dominated by one leading direction (Joad et al., 2 Feb 2026, Piras et al., 11 Nov 2025).

2. Probabilistic Ablation Protocols

DeepRefusal generalizes deterministic projection to stochastic and multi-directional ablation.

  • Deterministic Ablation:

$$\hat{h}^{(\ell)}(x) = h^{(\ell)}(x) - \alpha \langle h^{(\ell)}(x), v \rangle v$$

where $\alpha$ controls the ablation strength; since $v$ has unit norm, $\alpha = 1$ removes the component along $v$ entirely.

  • Probabilistic Schedules:

Sample binary masks $m_{l,i} \sim \mathrm{Bernoulli}(p_{l,i})$ independently for each layer $l$ and token $i$; ablate only when $m_{l,i}=1$:

$$h'_{l,i} = h_{l,i} - m_{l,i} \langle h_{l,i}, v \rangle v$$

The probability $p_{l,i}$ can be constant, layer-dependent, or token-dependent; $p \approx 0.5$ is reported to balance suppression and utility (Xie et al., 18 Sep 2025).

  • Multi-direction Subspace Ablation:

For the top $k$ refusal vectors $v_1,\dots,v_k$ (e.g., from PCA or Self-Organizing Maps/SOMs), form

$$P = I - \sum_{j=1}^k w_j v_j v_j^\top$$

with $w_j$ sampled from $[0,1]$ (e.g., Dirichlet). Each forward pass applies $P$ to activations, allowing stochastic variation in ablation strength and direction (Agnihotri et al., 3 Oct 2025, Piras et al., 11 Nov 2025).
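The three protocols in this section reduce to a few lines of linear algebra. The sketch below applies them to random vectors only; the helper names (`ablate`, `ablate_bernoulli`, `subspace_projector`), the toy shapes, and the choice of orthonormal directions are assumptions for illustration rather than any cited implementation.

```python
import numpy as np

def ablate(h, v, alpha=1.0):
    """Deterministic: h - alpha * <h, v> v, applied to each row of h (n, d)."""
    return h - alpha * (h @ v)[:, None] * v[None, :]

def ablate_bernoulli(h, v, p, rng):
    """Probabilistic: each row is ablated independently with probability p."""
    m = rng.random(h.shape[0]) < p  # boolean mask, m_i ~ Bernoulli(p)
    return h - (m * (h @ v))[:, None] * v[None, :], m

def subspace_projector(V, w):
    """P = I - sum_j w_j v_j v_j^T for rows v_j of V (k, d) and weights w (k,)."""
    return np.eye(V.shape[1]) - (V.T * w) @ V

rng = np.random.default_rng(0)
n, d = 8, 6
h = rng.normal(size=(n, d))   # toy activations, one row per token
v = np.eye(d)[0]              # toy unit direction

h_det = ablate(h, v)
h_prob, mask = ablate_bernoulli(h, v, p=0.5, rng=rng)

V = np.eye(d)[:2]                   # two orthonormal toy directions
w = rng.dirichlet(np.ones(2))       # weights in [0, 1] that sum to 1
P = subspace_projector(V, w)
h_sub = h @ P                       # P is symmetric, so right-multiplying is fine
```

With orthonormal $v_j$, applying $P$ scales the component along $v_j$ by $(1 - w_j)$, so Dirichlet weights attenuate each direction partially; $P$ is an exact orthogonal projector only when every $w_j$ is 0 or 1.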

Ablation can be applied at runtime (an inference-time edit via forward hooks) or statically, as a low-rank update to the weights (Arditi et al., 2024).
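For the deterministic case with full strength, the runtime and static views coincide for any linear map that writes into the residual stream. The toy check below assumes a single matrix `W` of shape `(d, d_in)` and a random unit vector `v`; both are hypothetical stand-ins, not weights from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in = 6, 4
W = rng.normal(size=(d, d_in))   # toy matrix writing into a d-dimensional stream
v = rng.normal(size=d)
v /= np.linalg.norm(v)           # unit direction to remove

# Static edit: W' = (I - v v^T) W = W - v (v^T W), a rank-one update.
W_edit = W - np.outer(v, v @ W)

x = rng.normal(size=d_in)

# Runtime edit: compute the output first, then project out v.
out = W @ x
runtime = out - (out @ v) * v

# Static edit: the projection is already folded into the weights.
static = W_edit @ x
```

A Bernoulli mask that is resampled on each forward pass cannot be folded into fixed weights this way, so the static form corresponds to the deterministic protocol.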

3. Extensions: Subspace, Spectral Cleaning, and Robustness

Subspace and Manifold Generalization

Emerging research suggests that “refusal” may occupy a low-dimensional manifold rather than a strictly one-dimensional subspace. Self-Organizing Maps (SOMs) extract the centroids of multiple SOM neurons from harmful-prompt activations, giving a collection $\{w_j\}$ (distinct from the sampling weights $w_j$ in Section 2); each $d_j = w_j - \mu_{\text{harmless}}$ is a refusal direction. Ablating $k$ optimally chosen directions via a composed projection operator (e.g., $\Psi = \Pi_{d_1} \circ \dots \circ \Pi_{d_k}$) outperforms single-direction ablation in terms of Attack Success Rate (ASR) on standard harmful behaviors, especially for large or multilingual models (Piras et al., 11 Nov 2025).
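One property of composed projections is worth making concrete: when the directions $d_j$ are not mutually orthogonal, composing single-direction projectors is order-dependent and need not remove the whole span. The sketch below contrasts the composition with a projector built from an orthonormal basis (via QR). The vectors and the helper `proj_out` are illustrative assumptions; the sketch makes no claim about which variant the cited work uses.

```python
import numpy as np

def proj_out(direction):
    """Orthogonal projector that removes the component along `direction`."""
    u = direction / np.linalg.norm(direction)
    return np.eye(u.size) - np.outer(u, u)

rng = np.random.default_rng(0)
dim = 5
d1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
d2 = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # deliberately not orthogonal to d1
h = rng.normal(size=dim)

# Composition Pi_{d1} o Pi_{d2}: remove d2 first, then d1.
composed = proj_out(d1) @ (proj_out(d2) @ h)

# Span removal: orthonormalize {d1, d2} and subtract the projection onto the span.
Q, _ = np.linalg.qr(np.stack([d1, d2], axis=1))   # Q has shape (dim, 2)
span_removed = h - Q @ (Q.T @ h)

leftover_along_d2 = float(composed @ d2)
```

Here `composed` is orthogonal to `d1` but retains a component along `d2`, whereas `span_removed` is orthogonal to both.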

Concept-Guided Spectral Cleaning

Naïve ablation of raw refusal vectors risks collateral damage: the target direction is often polysemantic, entangled with capability or style subspaces. Surgical Refusal Ablation (SRA) defines a registry of “concept atoms” (protected capabilities or confounds), then orthogonalizes the raw refusal vector $\mathbf{R}$ against these atoms via ridge-regularized spectral residualization:

$$\widetilde{\mathbf{R}} = \mathbf{R} - A_{SC} \hat{w}$$

where $A_{SC}$ aggregates all concept atoms and $\hat{w}$ is the ridge solution minimizing $\|\mathbf{R} - A_{SC} w\|^2 + \lambda \|w\|^2$. Projecting along $\widetilde{\mathbf{R}}$ ablates refusal while maintaining distributional integrity and utility (perplexity, KL divergence, GSM8k/MBPP performance) (Cristofano, 13 Jan 2026).
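The ridge objective above has the closed-form solution $\hat{w} = (A_{SC}^\top A_{SC} + \lambda I)^{-1} A_{SC}^\top \mathbf{R}$. The following sketch applies it to random vectors; the function name `spectral_residual`, the dimensions, and the synthetic "concept atoms" are assumptions for illustration, not the SRA registry or code.

```python
import numpy as np

def spectral_residual(R, A, lam):
    """Return (R - A w_hat, w_hat), with w_hat = argmin ||R - A w||^2 + lam ||w||^2.

    A has one concept atom per column, shape (d, k). Hypothetical helper.
    """
    k = A.shape[1]
    w_hat = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ R)
    return R - A @ w_hat, w_hat

rng = np.random.default_rng(0)
d, k = 32, 3
A = rng.normal(size=(d, k))                   # synthetic concept atoms
pure = rng.normal(size=d)                     # synthetic "clean" component
R = pure + A @ np.array([0.8, -0.5, 0.3])     # raw vector entangled with atoms

R_clean, w_hat = spectral_residual(R, A, lam=1e-2)

overlap_before = float(np.linalg.norm(A.T @ R))
overlap_after = float(np.linalg.norm(A.T @ R_clean))
```

With $\lambda > 0$ the residual keeps a small overlap with the atoms (shrinkage), and with $\lambda = 0$ and full-column-rank $A_{SC}$ it is exactly orthogonal to them; how $\lambda$ is chosen in practice is not specified here.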

4. Empirical Evaluations and Quantitative Outcomes

Refusal direction ablation and its probabilistic extensions (DeepRefusal) enable direct measurement of LLM safety robustness and alignment leakage.

Summary of findings:

| Intervention | Main outcome on harmful prompts | Side effects (utility, KL, PPL) | Further notes |
| --- | --- | --- | --- |
| Single-direction ablation (Arditi et al., 2024, Agnihotri et al., 3 Oct 2025) | Drops refusal to 0–5% | Utility loss <1%; KL, PPL mild in some models | Leaves higher-order or style cues intact |
| Probabilistic ablation (DeepRefusal) (Xie et al., 18 Sep 2025) | Linear drop in refusal as $p$ increases; 95% attack success at $p=1$ | Over-refusal remains <5% on benign; minimal GSM8k/MMLU loss | Effective across attacks (GCG, prefill, transfer) |
| SOM/multidirectional (Piras et al., 11 Nov 2025) | Achieves higher ASRs in universal jailbreak | Minimal degradation per judge and classic benchmarks | Captures manifold geometry of refusal |
| SRA (Cristofano, 13 Jan 2026) | Suppression of refusal to 0–2% | $\Delta$PPL $<0.02$, KL $<0.05$ (WT2); minimal GSM8k/MBPP impact | Resolves “ghost noise”/model drift |

Notably, SRA achieves a $0\%$ refusal rate with $\Delta\mathrm{PPL}_{\mathrm{WT2}} \approx 0.028$ and $\mathrm{KL} \approx 0.018$ (Qwen3-VL-2B), compared to standard ablation’s $\Delta\mathrm{PPL} = +1.568$, $\mathrm{KL} = 0.622$ (Cristofano, 13 Jan 2026).
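For readers unfamiliar with the side-effect metrics, the sketch below shows one common way to compute a mean per-token KL divergence and a perplexity difference from next-token logits. The KL direction (base model relative to edited model), the function names, and the random logits are assumptions; the exact evaluation protocol of the cited work may differ.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(logits_base, logits_edit):
    """Mean over positions of KL(base || edited); the direction is an assumption."""
    lp, lq = log_softmax(logits_base), log_softmax(logits_edit)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

def perplexity(logits, targets):
    """exp of the mean negative log-likelihood of the target tokens."""
    lp = log_softmax(logits)
    nll = -lp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

# Random stand-ins for model outputs on T positions with vocabulary size V.
rng = np.random.default_rng(0)
T, V = 10, 50
logits_base = rng.normal(size=(T, V))
logits_edit = logits_base + 0.1 * rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)

kl = mean_kl(logits_base, logits_edit)
delta_ppl = perplexity(logits_edit, targets) - perplexity(logits_base, targets)
```

A KL of zero means the edit left the next-token distributions unchanged on the evaluated text, and `delta_ppl` is positive when the edited model assigns lower likelihood to the reference tokens.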

5. Applications: Safety Evaluation, Robustness, and Jailbreak Analysis

Probabilistic ablation probes the internal fragility of LLM safety mechanisms, exposes single-signal surface alignment, and simulates adversarial conditions:

  • Safety robustness: DeepRefusal quantifies how easily training-derived safety (e.g., refusal on harmful prompts) is erased by post-hoc activation edits. Safety protocols depending on one-dimensional refusal signals exhibit catastrophic collapse after ablation: refusals drop from $\sim 45$ to $\sim 5$ out of 50 (pure refusal models) (Agnihotri et al., 3 Oct 2025).
  • Adversarial analysis: White-box DeepRefusal (projection at all layers; Arditi et al., 2024) produces maximal jailbreak success, outperforming prompt-based and surface jailbreaks.
  • Alignment evaluation: Models with distributed, multi-signal safety (e.g., “Safety Oracle” with rephrasing, metatags, and additional safety tags) retain refusal post-ablation (42/50), showing only partial leakage (Agnihotri et al., 3 Oct 2025).
  • Model monitoring: After directional ablation, models severely under-report self-assessed refusals, invalidating self-monitoring metrics (Agnihotri et al., 3 Oct 2025).
  • Attack/defense benchmarking: DeepRefusal frameworks reduce attack success rates (GCG, prefilling, refusal-transfer) by $\sim 95\%$ with minimal capability drop (Xie et al., 18 Sep 2025).

6. Limitations and Open Research Questions

  • Subspace complexity: Single-direction or limited-k subspace ablation may leave residual “refusal” encoded in higher-order, nonlinear, or style-specific subspaces (Joad et al., 2 Feb 2026, Piras et al., 11 Nov 2025). SOM and multidirectional methods partially address this but do not guarantee exhaustive erasure.
  • Style vs. substance: Across 11+ distinct, style- or genre-conditioned refusal directions, the ablation trade-off curve (refusal vs. over-refusal) is nearly identical regardless of direction; what varies is how the refusal manifests (rhetorical style, anthropomorphization, incompleteness) (Joad et al., 2 Feb 2026).
  • Distributional artifacts: Naïve ablation can introduce “ghost noise” and severe distributional drift (e.g., $\mathrm{KL} > 2.0$), degrading model capabilities; spectral cleaning or iterative “hard-negative” refinement alleviates this (Cristofano, 13 Jan 2026).
  • Modality/scale generalization: Effective identification and ablation of refusal directions in multimodal, extremely large, or heavily style-diverse models remains an open domain (Xie et al., 18 Sep 2025, Joad et al., 2 Feb 2026).
  • Limitations of current probabilistic schemes: Most probabilistic extensions (sampling $\alpha$, direction, or weighted subspace) remain conceptual and have not been exhaustively characterized; practical benefit over deterministic or multi-directional scheduling is an open topic (Agnihotri et al., 3 Oct 2025, Joad et al., 2 Feb 2026).

7. Future Work and Mechanistic Insights

Research continues to explore manifold discovery, probabilistic composition, and robustified safety:

  • Develop methods to learn distributions over refusal-style directions for on-the-fly ablation style sampling (Joad et al., 2 Feb 2026).
  • Formulate non-linear or context-adaptive probabilistic ablation protocols beyond first-order projections (Joad et al., 2 Feb 2026, Piras et al., 11 Nov 2025).
  • Extend concept-orthogonalized ablation to additional safety behaviors beyond refusal (e.g., deception, uncertainty) (Cristofano, 13 Jan 2026).
  • Refine the integration of DeepRefusal into both fine-tuning (forcing distributed safety circuit formation) and inference-time tools for universal and targeted jailbreak defense (Xie et al., 18 Sep 2025).
  • Close the robustness gap for multimodal or multilingual LLMs by identifying analogous “safety concept” directions in joint representation spaces (Xie et al., 18 Sep 2025).

Collectively, Probabilistic Refusal Direction Ablation (DeepRefusal) has established itself as a core tool for mechanistic interpretability and adversarial analysis of LLM safety, providing both analytic diagnostics and practical means to evaluate and attack or defend model refusal mechanisms (Arditi et al., 2024, Agnihotri et al., 3 Oct 2025, Xie et al., 18 Sep 2025, Piras et al., 11 Nov 2025, Joad et al., 2 Feb 2026, Cristofano, 13 Jan 2026).
