Papers
Topics
Authors
Recent
Search
2000 character limit reached

Over-Refusal Phenomenon in AI Systems

Updated 25 June 2026
  • Over-refusal is a phenomenon where safety-aligned AI models mistakenly refuse benign queries due to shared superficial risky cues.
  • Research quantifies over-refusal using metrics such as ORR and Trade-off Score to balance safety and utility across various contexts.
  • Mitigation strategies focus on fine-tuning and adaptive decoding methods to adjust model behavior without compromising essential safety filters.

Over-refusal is the systematic phenomenon in which a safety-aligned LLM, text-to-image (T2I) model, or retrieval-augmented system erroneously refuses to respond to genuinely benign user prompts. This behavior represents a false-positive safety error where harmless content, often sharing superficial cues with truly harmful queries, is incorrectly flagged as unsafe—diminishing utility and frustrating users. Over-refusal is pervasive across a breadth of model architectures, domains (health, legal, biosecurity, psychotherapy), and modalities, and has become a central focus in both the measurement and mitigation of alignment trade-offs.

1. Formal Definitions and Quantitative Metrics

Over-refusal is operationalized as the fraction of benign prompts in a test set for which the model issues a refusal response. In LLMs, a refusal is typically marked by the presence of refusal keywords or phrases (e.g., "I'm sorry, I can't…"), as identified by automated judges or external moderators such as WildGuard. The metrics most commonly used are:

  • Over-Refusal Rate (ORR):

ORR=#{benign prompts refused}#{benign prompts}\mathrm{ORR} = \frac{\#\{\text{benign prompts refused}\}}{\#\{\text{benign prompts}\}}

RR=#{harmful prompts refused}#{harmful prompts}\mathrm{RR} = \frac{\#\{\text{harmful prompts refused}\}}{\#\{\text{harmful prompts}\}}

  • Trade-off Score (Editor’s term):

TradeoffScore=12(ComplianceRate+SafetyScore)\mathrm{TradeoffScore} = \frac{1}{2} \left(\text{ComplianceRate} + \text{SafetyScore}\right)

Benchmarks such as OR-Bench (Cui et al., 2024), ORFuzzSet (Zhang et al., 15 Aug 2025), MORBench (Pan et al., 23 May 2025), and Health-ORSC-Bench (Zhang et al., 25 Jan 2026) systematically quantify ORR across rejection categories and model families. In T2I, OVERT measures over-refusal as the rate at which benign image prompts return a refusal or blank output (Cheng et al., 27 May 2025).

Table: Over-Refusal Rate Metrics

Metric Formula Typical Context
Over-Refusal Rate (ORR) #(benign refused)#(benign total)\frac{\#(\text{benign refused})}{\#(\text{benign total})} LLM, T2I, Retrieval
Safety Refusal Rate (RR) #(harmful refused)#(harmful total)\frac{\#(\text{harmful refused})}{\#(\text{harmful total})} LLM, T2I, Retrieval
Trade-off Score 12(Compliance+Safety)\frac{1}{2}(\text{Compliance} + \text{Safety}) LLM alignment evaluation

The robust quantification of ORR is essential for tracking the delicate safety–helpfulness balance in model deployments.

2. Mechanistic Origins: Representation Geometry and Safety Boundaries

Contemporary research consistently locates over-refusal at the representational boundaries between danger and safety in model hidden states. In aligned LLMs, harmful-refusal responses aggregate along a single, global direction in activation space, whereas over-refusal lies in a high-dimensional, task-dependent subspace embedded within benign task clusters (Maskey et al., 29 Mar 2026).

  • Single-Direction Model Inadequacy: Linear steering along a single refusal direction modulates both RR and ORR but cannot selectively reduce over-refusal without degrading core safety, as all refusal-related directions share a one-dimensional trade-off (Joad et al., 2 Feb 2026). Empirically, RR(α)\mathrm{RR}(\alpha) and ORR(α)\mathrm{ORR}(\alpha) collapse to a sigmoid as the steering strength α\alpha increases.
  • Safety Decision Boundary: Over-refusal samples reside near the decision boundary of linear or non-linear classifiers in hidden-space (Pan et al., 23 May 2025, Zhang et al., 24 Nov 2025). Small perturbations due to safety fine-tuning may tip benign representations across the threshold.
  • Refusal Triggers: Over-refusal is often driven by linguistic patterns—"refusal triggers"—shared by both harmful and sanitized prompts, causing models to over-generalize refusal responses to innocuous queries (Xue et al., 12 Mar 2026).
  • Representation-Space Diagnostics: Sparse autoencoder (SAE) auditing and probe-based geometry reveal that shallow refusals or format-induced boundary crossings (e.g., chat-template tokens) can produce brittle refusal depth, with high divergence between surface labels and latent activations (DeLeeuw, 28 May 2026).

3. Empirical Manifestations: Benchmarks, Modalities, and Trade-offs

Large-scale benchmarks consistently confirm the prevalence and variability of over-refusal:

Notably, performance varies with model size, domain specialization, and architecture (e.g., dense vs. MoE). A recurring empirical pattern is a strong positive correlation between ORR and safety metrics: increased safety coverage almost invariably incurs higher rates of over-refusal (Cui et al., 2024, Zhang et al., 25 Jan 2026).

Table: Representative Over-Refusal Rates Across Domains

Model/Domain Benchmark Over-Refusal Rate Contextual Factor
GPT-5 (health) Health-ORSC-Bench 66.8% "Hard-1K" benign prompts
Imagen-3 (T2I) OVERT 68% (sexual content) NSFW benign prompts
Llama-3.1-8B-instruct RagRefuse 53.4% (base) → 4.3%* Contaminated RAG context; *
Qwen 2.5 1.5B (bio) BioRefusalAudit 83% Benign biology queries

*With SafeRAG-Steering applied.

4. Mechanistic and Algorithmic Mitigation Strategies

Efforts to remediate over-refusal target both representation-level mechanisms and inference-time interventions:

These methods collectively reveal that global, task-agnostic interventions offer coarse trade-off control, but fine-grained, task-conditioned, or feature-targeted approaches are required for selective mitigation of over-refusal without undermining safety performance (Maskey et al., 29 Mar 2026).

5. Practical Implications, Application Domains, and Harms

  • Utility Degradation: High ORR impairs LLM usability for coding, scientific, legal, medical, or creative tasks where queries may superficially mimic sensitive topics (Cui et al., 2024, Wu et al., 29 May 2025, Zhang et al., 25 Jan 2026).
  • User Experience: Over-refusal is acutely detrimental in support settings; e.g., in mental health, insensitively broad refusals reduce perceived trust, increase frustration, and can worsen outcomes for vulnerable users (Tang et al., 2 Feb 2026, Barmina et al., 8 Jun 2026).
  • Safety-Utility Tension: Empirical studies demonstrate Spearman rank correlations RR=#{harmful prompts refused}#{harmful prompts}\mathrm{RR} = \frac{\#\{\text{harmful prompts refused}\}}{\#\{\text{harmful prompts}\}}0 between safety rates and ORR, confirming that most models achieve high safety by trading away helpfulness. No evaluated LLM lies in the ideal region of simultaneously high safety and low over-refusal (Zhang et al., 25 Jan 2026, Cui et al., 2024, Zhang et al., 15 Aug 2025).
  • Domain Bias and Societal Impact: Refusal patterns often track cultural and legal salience rather than true hazard (e.g., psilocybin cultivation or spiritual questions), compounding biases and causing inequitable user experiences (DeLeeuw, 28 May 2026).

Table: Illustrative Model Family Effects

Family/Model Over-Refusal Rate (ORR) Toxic Rejection Rate Characteristic
GPT-5 / Llama-4 Up to 80% >90% "Safety-pessimistic"
Qwen-3-Next (MoE) ~0% Lower "High-utility, low-safe"
Meditron-7B <10% Vulnerable Compliant, incomplete

6. Diagnostic and Benchmarking Methodologies

A comprehensive suite of benchmarks and testing protocols expose over-refusal:

Best practices require benchmarking both over-refusal and harmful-prompt rejection in tandem, with per-category and per-intent stratification, multi-lingual representation, and scenario-based user studies.

7. Systemic Recommendations and Open Challenges

  • Selective Mitigation: Linear steering or ablation alone is insufficient; interventions must respect the multi-dimensionality and task-dependence of over-refusal phenomena (Joad et al., 2 Feb 2026, Maskey et al., 29 Mar 2026).
  • Boundary-Adaptive Training: Training strategies that target samples near the safety decision boundary—especially in low-resource languages or complex domains—offer more robust safety–utility calibration (Pan et al., 23 May 2025).
  • Context- and User-Aware Refusals: Adaptive refusal frameworks (e.g., PsychoSafe) and phased, support-preserving refusals can mitigate user-experience harms, especially in sensitive contexts such as mental health (Barmina et al., 8 Jun 2026, Tang et al., 2 Feb 2026).
  • Continuous Monitoring: Integration of activation-level auditing and over-refusal tracking into risk assessments and deployment pipelines enables early identification of drift or emergent over-refusal modes (DeLeeuw, 28 May 2026).
  • Limitations: Most current mitigations are focused on single-turn refusal; multi-turn dialogue effects, cross-modality transfer, and adversarial prompt evolution remain insufficiently addressed.

The ongoing challenge in over-refusal research is to reconcile rigorous harm prevention with model utility, transparent reasoning, and user trust. Advances in detailed geometry-aware alignment, domain-cognizant adaptation, and human-centered refusal design are central to the next generation of safe and broadly useful AI systems.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (18)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Over-refusal Phenomenon.