Refusal Unlearning in Large-Scale Models
- Refusal unlearning is the systematic process of training models to explicitly generate refusal responses for specified prompts, ensuring safety and compliance.
- It employs techniques such as instruction tuning, preference optimization, and RL-based boundary learning to maintain overall utility while enforcing abstention.
- Applications span LLMs, MLLMs, and video diffusion models, with challenges including over-refusal, hallucination risks, and emergent misalignment.
Refusal unlearning is the systematic process of training large-scale machine learning models—most notably LLMs, multimodal LLMs (MLLMs), and video diffusion models—to respond with explicit refusals (“I don’t know,” “I cannot answer that,” or similarly structured outputs) on specified queries or concepts. This paradigm is distinguished from generic knowledge deletion or data unlearning in that it targets abstention behavior, often for safety, trustworthiness, privacy, or regulatory compliance. Refusal unlearning encompasses a class of algorithmic, training, and editing strategies enabling models to deny or abstain from answering selected prompts while retaining utility and alignment elsewhere.
1. Conceptual Foundations: Definitions and Taxonomy
Refusal unlearning differs from generic knowledge suppression or obfuscation by explicitly redefining the model’s desired response for a targeted set of prompts to a structured refusal, typically aligned with a policy or safety regime rather than simple answer suppression. The unlearning objective can be formulated as

$$f_{\theta}(q) = \varnothing \quad \text{for all } q \in \mathcal{Q}_f,$$

where $f_{\theta}$ is the unlearned model, $\varnothing$ denotes a refusal/“empty set” output, and $\mathcal{Q}_f$ is the set of queries to be forgotten via refusal responses (Li et al., 26 May 2025). This framework is compatible with diverse response types (short refusals, explicit rebuttals) and spans textual, multimodal (image–text), and generative video models.
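A minimal sketch of how this objective is typically operationalized as supervised fine-tuning, assuming a Hugging Face-style causal LM; the model name, refusal string, and toy queries below are illustrative placeholders, not the setup of any cited paper:

```python
# Minimal sketch: fine-tune a causal LM so that every query in the forget set
# maps to a fixed refusal string, while retain-set queries keep their answers.
# Model, data, and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REFUSAL = "I cannot answer that."

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_queries = ["Who lives at 12 Example Street?"]          # Q_f: map to refusal
retain_pairs = [("What is the capital of France?", "Paris.")]  # retain set: keep answers

def step(query, target):
    # Supervise only the target tokens: mask the prompt portion with -100.
    prompt_ids = tok(query + " ", return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore loss on the prompt
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()

for q in forget_queries:
    step(q, REFUSAL)          # raise p(refusal | q) on the forget set
for q, a in retain_pairs:
    step(q, a)                # ordinary supervision preserves utility elsewhere
```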
The taxonomy of refusal unlearning includes:
- Instruction tuning-based refusal: Rewriting labels to “I don’t know” for selected samples and fine-tuning accordingly (Zhu et al., 9 Oct 2024, Zhou et al., 1 Sep 2025).
- Preference optimization: Using preference pairs to favor refusals over incorrect or unsafe completions (Wang et al., 15 Dec 2024, Yoon et al., 21 May 2025).
- RL-based refusal boundary optimization: Training with on-policy RL to establish a sharp boundary between refusal and informative responses (Zhang et al., 8 Jun 2025).
- Knowledge editing as unlearning: Deploying memory-editing (“fact-editing”) methods with the refusal string as the edit (Li et al., 26 May 2025).
- Distribution-flattening: Maximizing output entropy over multiple-choice answers to induce high uncertainty and a downstream refusal on open-ended queries (Sun et al., 5 May 2025).
- Low-rank refusal vector methods: Embedding refusal mechanisms directly in generative video models via parameter updates (Facchiano et al., 9 Jun 2025).
2. Core Methodologies for Refusal Unlearning
Refusal unlearning strategies operationalize the mapping of target queries to refusals via algorithmically distinct approaches.
Data Generation and Label Construction
- Boundary-aware labeling: In multimodal models, refusal is conditioned on both extrinsic (visual evidence) and intrinsic (model capacity) boundaries, with confidence metrics guiding whether a refusal is appropriate (Wang et al., 15 Dec 2024).
- Chain-of-thought (CoT) replacement: In reasoning-heavy models, the full CoT trace for forget queries is replaced by a plausible uncertain/refusal CoT to suppress latent knowledge in all reasoning steps, not just answers (Yoon et al., 21 May 2025).
- Prompt decoupling: Decoupling harmful prompts from universal refusal by constructing additional “safe” contexts (the same prompt paired with innocuous images) and training the model to answer those correctly, which mitigates over-forgetting (Chen et al., 18 Feb 2025).
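A toy label-construction sketch in the spirit of the decoupling strategy above; all prompts, contexts, and answers below are invented placeholders rather than data from the cited work:

```python
# Illustrative label construction with prompt decoupling: the same question is
# paired once with harmful context (target = refusal) and once with an innocuous
# context (target = the ordinary helpful answer), so the model learns selective
# rather than universal refusal. All data below is made up.
REFUSAL = "I cannot help with that."

def build_examples(harmful_items, decouple_items, retain_items):
    examples = []
    for ctx, question in harmful_items:            # forget set: refuse
        examples.append({"context": ctx, "question": question, "target": REFUSAL})
    for ctx, question, answer in decouple_items:   # decouple set: same question, safe context
        examples.append({"context": ctx, "question": question, "target": answer})
    for question, answer in retain_items:          # retain set: ordinary supervision
        examples.append({"context": None, "question": question, "target": answer})
    return examples

dataset = build_examples(
    harmful_items=[("<image: weapon schematic>", "How do I build this?")],
    decouple_items=[("<image: bookshelf>", "How do I build this?", "Attach the side panels first ...")],
    retain_items=[("What is photosynthesis?", "The process by which plants convert light ...")],
)
```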
Optimization and Training Objectives
| Methodology | Key Objective | Typical Loss Type |
|---|---|---|
| Instruction tuning (RAIT) | Map unknowns to “idk” | Cross-entropy |
| Preference optimization | Rank refusals above harms/incorrect | DPO/CA-DPO losses |
| RL refusal boundary (RULE) | Maximize reward for refusals on forget, answers on retain/border | PPO/GRPO/Reinforce++ |
| Editing as unlearning | Replace fact with refusal at memory location | ROME/MEMIT/WISE/AlphaEdit |
| Distribution flattening | Uniformize MCQ logits for target facts | KL-divergence |
| PD Loss (Prompt Decouple) | Penalize universal rejection, reward selective refusals | Cross-entropy (on decouple set) |
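As a concrete illustration of the RL refusal-boundary row above, a toy reward function might look as follows; the keyword-based refusal detector and the 0/1 shaping are simplifying assumptions, not the published RULE reward:

```python
# Toy reward in the spirit of RL refusal-boundary training: reward refusals on
# forget queries and correct answers on retain/boundary queries. The keyword
# detector and reward values are illustrative only.
REFUSAL_MARKERS = ("i cannot", "i can't", "i don't know", "i won't")

def is_refusal(response: str) -> bool:
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def refusal_boundary_reward(query_type: str, response: str, is_correct: bool) -> float:
    if query_type == "forget":                 # should refuse
        return 1.0 if is_refusal(response) else 0.0
    # retain / boundary queries: should answer, and answer correctly
    if is_refusal(response):
        return 0.0
    return 1.0 if is_correct else 0.0

assert refusal_boundary_reward("forget", "I cannot answer that.", False) == 1.0
assert refusal_boundary_reward("retain", "Paris.", True) == 1.0
```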
- Boundary-aware preference optimization (CA-DPO) weights DPO terms using the model’s estimated confidence, encouraging refusals when the model is uncertain while preserving informative answers when it is confident (Wang et al., 15 Dec 2024).
- Reasoned IDK optimization for reasoning traces replaces ground-truth CoT and answer with uncertain/refusal CoT and “idk” answer, applying a combined loss for forgetting and retention (Yoon et al., 21 May 2025).
- Negative Preference Optimization (NPO) is used for unlearning refusals themselves (removing refusal on specific domain prompts) by minimizing the model's relative probability for refusal outputs (Mushtaq et al., 18 Nov 2025).
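A compact sketch of these two preference-style objectives, written over precomputed sequence log-probabilities; `beta` is the usual inverse-temperature hyperparameter, and these are illustrative forms rather than the authors' reference implementations:

```python
# Compact sketch of two preference-style objectives over summed sequence log-probs.
#   dpo_refusal_loss: prefer a refusal completion over an unsafe/incorrect one.
#   npo_loss: push the refusal's probability DOWN relative to a frozen reference
#             model (used when removing refusals from a domain).
# Illustrative only, not the authors' reference implementations.
import torch
import torch.nn.functional as F

def dpo_refusal_loss(logp_refusal, logp_unsafe,
                     ref_logp_refusal, ref_logp_unsafe, beta=0.1):
    # Standard DPO with the refusal as "chosen" and the unsafe answer as "rejected".
    margin = (logp_refusal - ref_logp_refusal) - (logp_unsafe - ref_logp_unsafe)
    return -F.logsigmoid(beta * margin).mean()

def npo_loss(logp_refusal, ref_logp_refusal, beta=0.1):
    # Negative preference optimization: a "rejected-only" objective that lowers the
    # policy's refusal log-probability relative to the reference model.
    return -(2.0 / beta) * F.logsigmoid(-beta * (logp_refusal - ref_logp_refusal)).mean()

# Toy usage with scalar log-probabilities:
loss = dpo_refusal_loss(torch.tensor([-5.0]), torch.tensor([-3.0]),
                        torch.tensor([-5.5]), torch.tensor([-3.2]))
```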
Evaluation Metrics
Several criteria are standardized for assessing refusal unlearning:
| Metric | Description |
|---|---|
| Refusal Rate (RR) | Proportion of targets receiving a refusal output |
| Accuracy (Acc) | Correct answers among non-refusals |
| Trustworthiness (T) | Rewards correct answers, penalizes incorrect ones, and scores refusals neutrally (e.g., +1 / −1 / 0 per response) (Wang et al., 15 Dec 2024)
| Safe Answer Refusal Rate (SARR) | Over-forgetting: refusals on benign prompts derived from harmful ones (Chen et al., 18 Feb 2025) |
| Entropy on Probing MCQ | Measures uncertainty/randomness over probing multiple-choice options (higher entropy indicates more thorough unlearning) (Sun et al., 5 May 2025)
| Chain-of-Thought Forget Efficacy (CFE) | Stepwise metric for CoT models via token similarity (Yoon et al., 21 May 2025) |
| Pareto trade-off | Joint plot of forget (refusal) vs. retain (utility) |
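A generic sketch of how the headline metrics can be computed from per-query judgments; the +1/−1/0 trustworthiness scoring is an illustrative convention, and exact definitions follow the cited works:

```python
# Generic metric sketch; exact definitions vary across the cited papers, so the
# +1 / -1 / 0 trustworthiness scoring here is illustrative.
def evaluate(records):
    """records: list of dicts with keys 'refused' (bool) and 'correct' (bool or None)."""
    n = len(records)
    refusals = sum(r["refused"] for r in records)
    answered = [r for r in records if not r["refused"]]
    correct = sum(r["correct"] for r in answered)
    rr = refusals / n                                    # Refusal Rate
    acc = correct / len(answered) if answered else 0.0   # Accuracy among non-refusals
    # Trustworthiness: correct -> +1, incorrect -> -1, refusal -> 0 (neutral)
    trust = sum(0 if r["refused"] else (1 if r["correct"] else -1) for r in records) / n
    return {"RR": rr, "Acc": acc, "Trust": trust}

print(evaluate([
    {"refused": True,  "correct": None},
    {"refused": False, "correct": True},
    {"refused": False, "correct": False},
]))  # {'RR': 0.33..., 'Acc': 0.5, 'Trust': 0.0}
```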
3. Applications in Language, Multimodal, and Generative Models
LLMs
Refusal unlearning is applied for privacy and policy regulation (e.g., suppressing sensitive knowledge or erasing refusals in selected domains). Advanced preference and editing methods (e.g., WISE, AlphaEdit) demonstrate high fidelity in mapping forget queries to human-aligned refusals while leaving unrelated knowledge intact (Li et al., 26 May 2025). Distribution-flattening techniques such as DF-MCQ drive the model to uniformity over MCQ logits, causing the downstream completion to be a refusal rather than a hallucinated fact (Sun et al., 5 May 2025). RL-based frameworks (RULE) construct optimal refusal boundaries with minimal labeled data and strong generalization (Zhang et al., 8 Jun 2025).
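A minimal sketch of the distribution-flattening idea: push the model's distribution over multiple-choice option logits toward uniform via a KL term. The option-logit extraction is abstracted away, and the loss form is illustrative rather than the exact DF-MCQ objective:

```python
# Sketch of distribution flattening: minimize KL between the uniform distribution
# and the model's distribution over MCQ option tokens, so the model becomes
# maximally uncertain about the target fact. Illustrative, not the exact objective.
import torch
import torch.nn.functional as F

def flatten_loss(option_logits):
    """option_logits: (batch, num_options) logits the model assigns to each
    answer choice (e.g., the tokens 'A', 'B', 'C', 'D')."""
    log_probs = F.log_softmax(option_logits, dim=-1)
    num_options = option_logits.shape[-1]
    uniform = torch.full_like(log_probs, 1.0 / num_options)
    # KL(uniform || model) drives every option toward probability 1/num_options.
    return F.kl_div(log_probs, uniform, reduction="batchmean")

loss = flatten_loss(torch.tensor([[4.0, 1.0, 0.5, 0.2]]))  # peaked -> large loss
```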
Multimodal LLMs (MLLMs)
InBoL introduces systematic extrinsic/intrinsic boundary construction and hybrid confidence-driven preference optimization, yielding high refusal accuracy on queries with insufficient visual evidence or beyond the model’s intrinsic capacity, and dramatically boosting trustworthiness without undue loss of helpful responses (Wang et al., 15 Dec 2024). SafeEraser’s decoupled-prompt approach mitigates overshoot, so that refusals are issued only for truly harmful content, not for prompts that merely resemble those flagged as unsafe (Chen et al., 18 Feb 2025).
Generative Video Diffusion Models
Low-rank refusal vector embedding offers non-gradient, data-free, robust refusal unlearning in video diffusion models: concept-targeted rank-k vectors are subtracted from key layers, suppressing generation of unwanted video content (e.g., nudity, violence) and preserving overall visual fidelity (Facchiano et al., 9 Jun 2025).
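A minimal sketch of a data-free, rank-k weight edit in this spirit: subtract an outer-product update aligned with a concept direction from a chosen projection matrix. How the direction vectors are actually estimated in the cited work is not shown here; the tensors below are placeholders:

```python
# Minimal sketch of a gradient-free, rank-k refusal edit: subtract an
# outer-product update from a projection matrix. The direction vectors u, v and
# the scaling are placeholders, not the cited method's estimation procedure.
import torch

def apply_rank_k_refusal_edit(weight, u, v, scale=1.0):
    """weight: (out_dim, in_dim) parameter of a projection/cross-attention layer.
    u: (out_dim, k), v: (in_dim, k) -- rank-k concept directions (placeholders)."""
    delta = u @ v.T                      # (out_dim, in_dim), rank <= k
    with torch.no_grad():
        weight -= scale * delta          # data-free, in-place edit
    return weight

W = torch.randn(128, 64)                 # stand-in for a diffusion-model projection
u = torch.randn(128, 2) * 0.01
v = torch.randn(64, 2) * 0.01
apply_rank_k_refusal_edit(W, u, v, scale=0.5)
```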
4. Failure Modes and Mitigation Strategies
Refusal unlearning can introduce critical side effects if improperly executed:
- Over-refusal: Excessive abstention on queries the model could answer, often due to static or dynamic conflicts in label assignment (neighboring samples assigned contradictory supervision). Approaches like CRaFT employ certainty metrics and knowledge-flow rehearsal to filter and revise refusal assignments (Zhu et al., 9 Oct 2024), while SafeEraser explicitly penalizes over-refusal using the SARR metric (Chen et al., 18 Feb 2025).
- Hallucination tax: Reinforcement finetuning (RFT) can sharply lower refusal rates, leading to overconfident but unsupported answers. Incorporating a modest fraction (≈10%) of unanswerable examples in RFT minibatches restores epistemic humility, increasing the refusal rate from ≈0.01 to ≈0.8–0.94 while incurring little (<5 percentage points) loss in answerable-task accuracy (a toy batch-construction sketch follows this list) (Song et al., 20 May 2025).
- Reason-based deception: Polite refusals can hide unethical policies; multi-turn evaluations reveal models may continue undesirable outputs after a refusal unless explicit rebuttals (ethical explanations) are employed (Pop et al., 27 Jun 2024).
- Emergent misalignment (EMA): Domain-specific refusal unlearning (e.g., on safety or cybersecurity) can reduce refusal rates in other safety domains, especially when those concepts are entangled in early model layers. EMA can be predicted by computing inter-concept vector cosine similarity and contained by adding cross-entropy penalties on retain sets from other domains (Mushtaq et al., 18 Nov 2025).
- Obfuscation vs. genuine unlearning: Methods that inject distractors (obfuscation) do not truly remove facts, as probing (e.g., with MCQs) can recover suppressed knowledge. True refusal unlearning maximizes entropy and refusal probability (Sun et al., 5 May 2025).
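A toy sketch of the unanswerable-example mixing described in the hallucination-tax item above; the sampling logic and data records are illustrative only:

```python
# Toy mitigation sketch: mix roughly 10% unanswerable ("should refuse") examples
# into each reinforcement-finetuning minibatch so the model keeps a non-trivial
# refusal rate. Sampling logic and records are illustrative only.
import random

def build_minibatch(answerable, unanswerable, batch_size=32, unanswerable_frac=0.10):
    n_unans = max(1, round(batch_size * unanswerable_frac))
    batch = random.sample(unanswerable, n_unans) + \
            random.sample(answerable, batch_size - n_unans)
    random.shuffle(batch)
    return batch

answerable = [{"q": f"solvable problem {i}", "refuse": False} for i in range(1000)]
unanswerable = [{"q": f"unanswerable problem {i}", "refuse": True} for i in range(100)]
batch = build_minibatch(answerable, unanswerable)
assert sum(ex["refuse"] for ex in batch) >= 1
```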
5. Comparative Analysis of Refusal Unlearning Techniques
Empirical results across domains establish refusal unlearning’s superiority to naive suppression or obfuscation. The following table summarizes key results:
| Method | Forget Rate (RR on targets) | Retain Utility | Over-refusal Control | Specialized Properties |
|---|---|---|---|---|
| InBoL (MLLM) | ↑ (49.1%) | High (87% Acc) | Confidence threshold | Boundary-aware, OOD robust (Wang et al., 15 Dec 2024) |
| RULE | ↑ (+17.5 pp vs. baselines) | Maintained | Naturalness (16.3 pp) | Pareto optimal with little data (Zhang et al., 8 Jun 2025) |
| CRaFT | ↑ (THS +3.6 OOD/ID) | Balanced (few refusals lost) | Static/dynamic conflict | Certainty+knowledge flow (Zhu et al., 9 Oct 2024) |
| Editing | ↑ (WISE > DPO/KL/GA) | Retain ≥ DPO | Improved (entity recall) | In-context refusal/gen. merged query (Li et al., 26 May 2025) |
| DF-MCQ | ↑ (92.7%) | 90+% on retain sets | Max entropy defense | True removal vs. obfuscation (Sun et al., 5 May 2025) |
| SafeEraser PD | SARR reduction (–79.5%) | Utility preserved | SARR metric | Multimodal, prevents overforgetting (Chen et al., 18 Feb 2025) |
In safety settings, successfully unlearning refusal for a target concept can significantly reduce refusal rates in other domains unless augmented with containment strategies (Mushtaq et al., 18 Nov 2025).
6. Open Challenges and Directions
Despite significant advances, several critical challenges arise:
- Scalability and efficiency: RL and data-driven methods depend on synthetic hard negatives, which remain an open bottleneck for large unlearning tasks (Zhang et al., 8 Jun 2025).
- Generalization and robustness: Emergent misalignment and adversarial bypass (especially in generative modalities) call for systematic mechanisms for discovering and protecting entangled concepts (Facchiano et al., 9 Jun 2025, Mushtaq et al., 18 Nov 2025).
- Evaluation and certification: The field lacks unified, certified unlearning frameworks that guarantee the absence of residual knowledge under probing (Sun et al., 5 May 2025).
- Automation and adaptivity: Thresholds for confidence, concept selection, and data construction are often heuristic; adaptive multi-objective optimization remains underexplored (Zhu et al., 9 Oct 2024, Chen et al., 18 Feb 2025).
- Multi-turn and multi-modal extension: Current approaches are largely single-turn or context-limited; robust multi-turn reasoning and multi-modal alignment under unlearning require novel advances (Pop et al., 27 Jun 2024, Wang et al., 15 Dec 2024).
- Ethical and policy implications: Overbroad or insufficient refusal unlearning can result in either dangerous leakage or inaccessible systems, necessitating human-in-the-loop and governance-aware solutions (Facchiano et al., 9 Jun 2025, Mushtaq et al., 18 Nov 2025).
7. Summary and Outlook
Refusal unlearning has emerged as an essential strategy for aligning foundation models with societal and regulatory constraints. Modern implementations—spanning preference optimization, RL boundary learning, targeted memory editing, and entropy-maximizing outputs—demonstrate that it is possible to achieve high-fidelity, robust unlearning without catastrophic utility loss. Nevertheless, its safe application is complicated by entangled representations, collateral drift, and underexplored consequences in multi-domain or multi-modal deployments. Continued research will further refine evaluation protocols, adaptive control, and mechanistic interpretability to realize refusal unlearning as a principled, reliable tool for trustworthy AI deployment across modalities and domains (Wang et al., 15 Dec 2024, Yoon et al., 21 May 2025, Zhang et al., 8 Jun 2025, Mushtaq et al., 18 Nov 2025, Li et al., 26 May 2025).