Over-Refusal Phenomenon in AI Systems

Updated 25 June 2026

Over-refusal is a phenomenon where safety-aligned AI models mistakenly refuse benign queries due to shared superficial risky cues.
Research quantifies over-refusal using metrics such as ORR and Trade-off Score to balance safety and utility across various contexts.
Mitigation strategies focus on fine-tuning and adaptive decoding methods to adjust model behavior without compromising essential safety filters.

Over-refusal is the systematic phenomenon in which a safety-aligned LLM, text-to-image (T2I) model, or retrieval-augmented system erroneously refuses to respond to genuinely benign user prompts. This behavior represents a false-positive safety error where harmless content, often sharing superficial cues with truly harmful queries, is incorrectly flagged as unsafe—diminishing utility and frustrating users. Over-refusal is pervasive across a breadth of model architectures, domains (health, legal, biosecurity, psychotherapy), and modalities, and has become a central focus in both the measurement and mitigation of alignment trade-offs.

1. Formal Definitions and Quantitative Metrics

Over-refusal is operationalized as the fraction of benign prompts in a test set for which the model issues a refusal response. In LLMs, a refusal is typically marked by the presence of refusal keywords or phrases (e.g., "I'm sorry, I can't…"), as identified by automated judges or external moderators such as WildGuard. The metrics most commonly used are:

Over-Refusal Rate (ORR):

$\mathrm{ORR} = \frac{\#\{\text{benign prompts refused}\}}{\#\{\text{benign prompts}\}}$

Refusal Rate (RR) on Harmful Prompts:

$\mathrm{RR} = \frac{\#\{\text{harmful prompts refused}\}}{\#\{\text{harmful prompts}\}}$

Trade-off Score (Editor’s term):

$\mathrm{TradeoffScore} = \frac{1}{2} \left(\text{ComplianceRate} + \text{SafetyScore}\right)$

Benchmarks such as OR-Bench (Cui et al., 2024), ORFuzzSet (Zhang et al., 15 Aug 2025), MORBench (Pan et al., 23 May 2025), and Health-ORSC-Bench (Zhang et al., 25 Jan 2026) systematically quantify ORR across rejection categories and model families. In T2I, OVERT measures over-refusal as the rate at which benign image prompts return a refusal or blank output (Cheng et al., 27 May 2025).

Table: Over-Refusal Rate Metrics

Metric	Formula	Typical Context
Over-Refusal Rate (ORR)	$\frac{\#(\text{benign refused})}{\#(\text{benign total})}$	LLM, T2I, Retrieval
Safety Refusal Rate (RR)	$\frac{\#(\text{harmful refused})}{\#(\text{harmful total})}$	LLM, T2I, Retrieval
Trade-off Score	$\frac{1}{2}(\text{Compliance} + \text{Safety})$	LLM alignment evaluation

The robust quantification of ORR is essential for tracking the delicate safety–helpfulness balance in model deployments.

2. Mechanistic Origins: Representation Geometry and Safety Boundaries

Contemporary research consistently locates over-refusal at the representational boundaries between danger and safety in model hidden states. In aligned LLMs, harmful-refusal responses aggregate along a single, global direction in activation space, whereas over-refusal lies in a high-dimensional, task-dependent subspace embedded within benign task clusters (Maskey et al., 29 Mar 2026).

Single-Direction Model Inadequacy: Linear steering along a single refusal direction modulates both RR and ORR but cannot selectively reduce over-refusal without degrading core safety, as all refusal-related directions share a one-dimensional trade-off (Joad et al., 2 Feb 2026). Empirically, $\mathrm{RR}(\alpha)$ and $\mathrm{ORR}(\alpha)$ collapse to a sigmoid as the steering strength $\alpha$ increases.
Safety Decision Boundary: Over-refusal samples reside near the decision boundary of linear or non-linear classifiers in hidden-space (Pan et al., 23 May 2025, Zhang et al., 24 Nov 2025). Small perturbations due to safety fine-tuning may tip benign representations across the threshold.
Refusal Triggers: Over-refusal is often driven by linguistic patterns—"refusal triggers"—shared by both harmful and sanitized prompts, causing models to over-generalize refusal responses to innocuous queries (Xue et al., 12 Mar 2026).
Representation-Space Diagnostics: Sparse autoencoder (SAE) auditing and probe-based geometry reveal that shallow refusals or format-induced boundary crossings (e.g., chat-template tokens) can produce brittle refusal depth, with high divergence between surface labels and latent activations (DeLeeuw, 28 May 2026).

3. Empirical Manifestations: Benchmarks, Modalities, and Trade-offs

Large-scale benchmarks consistently confirm the prevalence and variability of over-refusal:

LLMs: Frontier LLMs (e.g., GPT-5, Llama-4) exhibit ORR exceeding 60%–80% on "hard" benign prompts in health, privacy, and legal contexts, especially for prompts that combine risky surface cues with benign semantics (Zhang et al., 25 Jan 2026, Cui et al., 2024, Zhang et al., 15 Aug 2025).
Text-to-Image Models: T2I systems return blank or masked outputs for up to 74% of benign prompts in NSFW or privacy-adjacent categories, demonstrating alignment-induced utility loss (Cheng et al., 27 May 2025).
Retrieval-Augmented and RAG Models: Over-refusal in RAG pipelines is tightly correlated with contamination of retrieved context, harmful-text density, or domain cues, even against benign queries (Maskey et al., 12 Oct 2025, Zhou et al., 1 Sep 2025).

Notably, performance varies with model size, domain specialization, and architecture (e.g., dense vs. MoE). A recurring empirical pattern is a strong positive correlation between ORR and safety metrics: increased safety coverage almost invariably incurs higher rates of over-refusal (Cui et al., 2024, Zhang et al., 25 Jan 2026).

Table: Representative Over-Refusal Rates Across Domains

Model/Domain	Benchmark	Over-Refusal Rate	Contextual Factor
GPT-5 (health)	Health-ORSC-Bench	66.8%	"Hard-1K" benign prompts
Imagen-3 (T2I)	OVERT	68% (sexual content)	NSFW benign prompts
Llama-3.1-8B-instruct	RagRefuse	53.4% (base) → 4.3%*	Contaminated RAG context; *
Qwen 2.5 1.5B (bio)	BioRefusalAudit	83%	Benign biology queries

*With SafeRAG-Steering applied.

4. Mechanistic and Algorithmic Mitigation Strategies

Efforts to remediate over-refusal target both representation-level mechanisms and inference-time interventions:

Representation-Level Fine-Tuning: Methods such as ACTOR (Dabas et al., 6 Jul 2025), MOSR (Zhang et al., 24 Nov 2025), and trigger-aware supervised fine-tuning (Xue et al., 12 Mar 2026) adjust model activations or data selection to shift only the minimal necessary components of the refusal geometry, reducing ORR without sacrificing safety rates.
Adaptive Decoding and Steering: Inference-time approaches—Adaptive Contrastive Decoding (AdaCD) (Qi et al., 18 Apr 2026), SafeRAG-Steering (Maskey et al., 12 Oct 2025), category-specific steering (Alagharu et al., 9 Mar 2026)—exploit the structure of refusal token distributions or activation subspaces to dynamically boost or suppress refusal logit probabilities and avoid one-size-fits-all suppression.
Dataset Rebalancing and Evolutionary Curation: Benchmark-augmented alignment (e.g., with EvoRefuse-ALIGN (Wu et al., 29 May 2025), ORFuzzSet (Zhang et al., 15 Aug 2025)) fine-tunes models on hard boundary cases, teaching more discriminative safety decision boundaries.
Depth and Fidelity-Aware Auditing: Activation-level metrics (e.g., divergence score $D$ in BioRefusalAudit) identify shallow, "fake" refusals that can vanish under minor format changes, promoting multi-frame robustness audits (DeLeeuw, 28 May 2026).
Clinical/Support Contexts: In psychological support and mental health, over-refusal is mitigated by structured, context-sensitive refusals (e.g., PsychoSafe (Barmina et al., 8 Jun 2026)) and dynamic, multi-phase guidance that prioritizes user needs over hard denials (Tang et al., 2 Feb 2026).

These methods collectively reveal that global, task-agnostic interventions offer coarse trade-off control, but fine-grained, task-conditioned, or feature-targeted approaches are required for selective mitigation of over-refusal without undermining safety performance (Maskey et al., 29 Mar 2026).

5. Practical Implications, Application Domains, and Harms

Utility Degradation: High ORR impairs LLM usability for coding, scientific, legal, medical, or creative tasks where queries may superficially mimic sensitive topics (Cui et al., 2024, Wu et al., 29 May 2025, Zhang et al., 25 Jan 2026).
User Experience: Over-refusal is acutely detrimental in support settings; e.g., in mental health, insensitively broad refusals reduce perceived trust, increase frustration, and can worsen outcomes for vulnerable users (Tang et al., 2 Feb 2026, Barmina et al., 8 Jun 2026).
Safety-Utility Tension: Empirical studies demonstrate Spearman rank correlations $\mathrm{RR} = \frac{\#\{\text{harmful prompts refused}\}}{\#\{\text{harmful prompts}\}}$ 0 between safety rates and ORR, confirming that most models achieve high safety by trading away helpfulness. No evaluated LLM lies in the ideal region of simultaneously high safety and low over-refusal (Zhang et al., 25 Jan 2026, Cui et al., 2024, Zhang et al., 15 Aug 2025).
Domain Bias and Societal Impact: Refusal patterns often track cultural and legal salience rather than true hazard (e.g., psilocybin cultivation or spiritual questions), compounding biases and causing inequitable user experiences (DeLeeuw, 28 May 2026).

Table: Illustrative Model Family Effects

Family/Model	Over-Refusal Rate (ORR)	Toxic Rejection Rate	Characteristic
GPT-5 / Llama-4	Up to 80%	>90%	"Safety-pessimistic"
Qwen-3-Next (MoE)	~0%	Lower	"High-utility, low-safe"
Meditron-7B	<10%	Vulnerable	Compliant, incomplete

6. Diagnostic and Benchmarking Methodologies

A comprehensive suite of benchmarks and testing protocols expose over-refusal:

OR-Bench, EvoRefuse-Test, ORFuzzSet, OVERT: Large-scale, finely stratified datasets of pseudo-malicious prompts challenge models to distinguish genuinely harmful from superficially risky but harmless instructions (Cui et al., 2024, Wu et al., 29 May 2025, Zhang et al., 15 Aug 2025, Cheng et al., 27 May 2025).
Boundary-Proximal Selection: Algorithms such as RASS select boundary-aligned benign prompts, enabling precise probing and targeted calibration (Pan et al., 23 May 2025).
Human-Aligned Judge Models: Automated annotators such as WildGuard, Grok-4, and fine-tuned judges like OR-Judge approximate human perceptions of toxicity and refusal (Zhang et al., 15 Aug 2025, Zhang et al., 25 Jan 2026).
Activation- and Subspace-Based Probing: Sparse autoencoder (SAE) and PCA analyses, linear probes, and task-conditioned ablation illuminate the geometric structure and dimensionality of over-refusal (DeLeeuw, 28 May 2026, Maskey et al., 29 Mar 2026).

Best practices require benchmarking both over-refusal and harmful-prompt rejection in tandem, with per-category and per-intent stratification, multi-lingual representation, and scenario-based user studies.

7. Systemic Recommendations and Open Challenges

Selective Mitigation: Linear steering or ablation alone is insufficient; interventions must respect the multi-dimensionality and task-dependence of over-refusal phenomena (Joad et al., 2 Feb 2026, Maskey et al., 29 Mar 2026).
Boundary-Adaptive Training: Training strategies that target samples near the safety decision boundary—especially in low-resource languages or complex domains—offer more robust safety–utility calibration (Pan et al., 23 May 2025).
Context- and User-Aware Refusals: Adaptive refusal frameworks (e.g., PsychoSafe) and phased, support-preserving refusals can mitigate user-experience harms, especially in sensitive contexts such as mental health (Barmina et al., 8 Jun 2026, Tang et al., 2 Feb 2026).
Continuous Monitoring: Integration of activation-level auditing and over-refusal tracking into risk assessments and deployment pipelines enables early identification of drift or emergent over-refusal modes (DeLeeuw, 28 May 2026).
Limitations: Most current mitigations are focused on single-turn refusal; multi-turn dialogue effects, cross-modality transfer, and adversarial prompt evolution remain insufficiently addressed.

The ongoing challenge in over-refusal research is to reconcile rigorous harm prevention with model utility, transparent reasoning, and user trust. Advances in detailed geometry-aware alignment, domain-cognizant adaptation, and human-centered refusal design are central to the next generation of safe and broadly useful AI systems.