Refusal Rate (RR) in Language Models
- Refusal Rate (RR) is defined as the fraction of model outputs classified as refusals, indicating a model's propensity to abstain from answering risky, ambiguous, or policy-inappropriate prompts.
- Evaluation methodologies for RR include automated filters, manual adjudication, and paired testing, providing nuanced insights into prompt-specific behavioral instability and subgroup biases.
- Limitations of RR, such as local instability and fairness concerns, necessitate the use of complementary metrics like RBE, Youden’s J, and RI for comprehensive safety and calibration assessments.
Refusal Rate (RR) quantifies how often an LLM produces a refusal in response to a given class of prompts. Across contemporary safety alignment and model evaluation research, RR has become the standard scalar metric for summarizing a model's abstention from responding to undesirable, ambiguous, or policy-inappropriate requests. Mathematically, RR is the fraction of model outputs classified as refusals within a specified evaluation set. The precise notion of "refusal" depends on the study context and ranges from hard explicit denials to non-responsive or hedged answers. Recent research shows, however, that interpreting RR depends on distinctions in prompt type, evaluation protocol, and behavioral instability, so its limitations need to be examined and the metric supplemented with richer ones.
1. Formal Definitions and Variants
The base RR metric is almost universally defined as a proportion:
$$\mathrm{RR} = \frac{N_{\text{refused}}}{N_{\text{total}}}$$
where $N_{\text{refused}}$ denotes the number of prompts the model refused, and $N_{\text{total}}$ is the total number of prompts evaluated (Heverin, 25 Jan 2026, Pan et al., 2 Oct 2025, Cristofano, 13 Jan 2026, O'Brien et al., 2024, Alagharu et al., 9 Mar 2026, Plaza-del-Arco et al., 9 Sep 2025, Kadadekar, 26 May 2026).
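As a minimal sketch of this proportion, the snippet below computes RR from a list of binary refusal labels. The function name `refusal_rate` and the example labels are illustrative assumptions, not taken from any of the cited studies.

```python
from typing import Sequence


def refusal_rate(is_refusal: Sequence[bool]) -> float:
    """Return the fraction of outputs labelled as refusals."""
    if not is_refusal:
        raise ValueError("evaluation set is empty")
    return sum(is_refusal) / len(is_refusal)


# Hypothetical labels for five prompts; True means the output was classified as a refusal.
labels = [True, True, False, True, False]
rr = refusal_rate(labels)
print(f"RR = {rr:.2f}")
```

How each output becomes a `True` or `False` label is a separate decision, and it is where most of the protocol dependence of RR enters.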
Variants, tailored to evaluation context, include:
- Strict RR: Only directly non-compliant and blank outputs (e.g. direct_refusal, non_responsive) are counted (Weidener et al., 20 May 2026).
- Lenient RR: Counts both strict refusal and softer forms such as partial compliance or hedged answers.
- Subgroup RR: Conditioned on demographic, task, risk-tier, or persona subgroup (Khorramrouz et al., 31 Oct 2025, Plaza-del-Arco et al., 9 Sep 2025, Weidener et al., 20 May 2026).
- Differential/flip rate: For paired or batched evaluations, the fraction of prompts whose refusal/non-refusal label flips across conditions (Heverin, 25 Jan 2026, Kadadekar, 26 May 2026).
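The variants above differ only in which output labels count as a refusal and in how prompts are grouped or paired. The sketch below illustrates this under stated assumptions: `direct_refusal` and `non_responsive` come from the strict definition above, while `partial_compliance`, `hedged`, `compliant`, the subgroup names, and all data are hypothetical.

```python
from collections import defaultdict

STRICT = {"direct_refusal", "non_responsive"}
# Assumed label names for the softer refusal forms counted by lenient RR.
LENIENT = STRICT | {"partial_compliance", "hedged"}


def rate(labels, refusal_set):
    """Fraction of labels that fall in the chosen refusal set."""
    return sum(label in refusal_set for label in labels) / len(labels)


def subgroup_rates(records, refusal_set):
    """RR conditioned on subgroup; records are (subgroup, label) pairs."""
    groups = defaultdict(list)
    for group, label in records:
        groups[group].append(label)
    return {g: rate(ls, refusal_set) for g, ls in groups.items()}


def flip_rate(labels_a, labels_b, refusal_set):
    """Fraction of paired prompts whose binary refusal label differs across two conditions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("paired evaluations need equal-length label lists")
    flips = sum((a in refusal_set) != (b in refusal_set)
                for a, b in zip(labels_a, labels_b))
    return flips / len(labels_a)


# Hypothetical annotated outputs for six prompts under condition A.
records = [("A", "direct_refusal"), ("A", "compliant"), ("A", "compliant"),
           ("B", "hedged"), ("B", "non_responsive"), ("B", "direct_refusal")]
labels_a = [label for _, label in records]
# The same six prompts re-run under a second condition B.
labels_b = ["direct_refusal", "direct_refusal", "compliant",
            "hedged", "compliant", "direct_refusal"]

strict_rr = rate(labels_a, STRICT)
lenient_rr = rate(labels_a, LENIENT)
by_group = subgroup_rates(records, STRICT)
flips = flip_rate(labels_a, labels_b, STRICT)
print(strict_rr, lenient_rr, by_group, flips)
```

In this toy data the strict RR is identical under both conditions even though a third of the prompts flip, which is why the flip rate is reported separately from the difference in RR.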
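In knowledge calibration and selective classification, RR at threshold $\tau$ is defined under a gating function: $\mathrm{RR}(\tau) = \frac{1}{N_{\text{total}}} \sum_{i=1}^{N_{\text{total}}} \mathbb{1}\left[p(q_i) < \tau\right]$, where $p(q_i)$ is the predicted answerability probability for query $q_i$ and $N_{\text{total}}$ is the total number of queries evaluated (Ren et al., 15 Jan 2026).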
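A small sketch of threshold-gated RR follows. It assumes the gate refuses a query whenever its predicted answerability probability falls below the threshold; the direction of the comparison, the function name, and the probabilities are assumptions made for illustration.

```python
def refusal_rate_at_threshold(answerability, tau):
    """Fraction of queries refused when the gate refuses if answerability < tau (assumed rule)."""
    if not answerability:
        raise ValueError("no queries provided")
    return sum(p < tau for p in answerability) / len(answerability)


# Hypothetical predicted answerability probabilities for five queries.
probs = [0.05, 0.30, 0.55, 0.80, 0.95]
sweep = {tau: refusal_rate_at_threshold(probs, tau) for tau in (0.25, 0.50, 0.75)}
for tau, rr in sweep.items():
    print(f"tau={tau:.2f} -> RR={rr:.1f}")
```

Under this rule RR is non-decreasing in the threshold, so sweeping the threshold traces out the trade-off between abstention and coverage.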
2. Measurement Methodologies
Prompt Selection: Most studies distinguish between harmful ("should-refuse") and benign ("should-answer") prompts, typically sourced from curated benchmarks such as WildGuard, RefusalBench, CoCoNot, or application/domain-specific datasets (mil-deflect for military LLMs, text-to-SQL ambiguities, sociopolitical classification tasks) (Weidener et al., 20 May 2026, Fitzgerald et al., 18 Feb 2026, Plaza-del-Arco et al., 9 Sep 2025, O'Brien et al., 2024).
Outcome Annotation: Outputs are classified as refusal vs. non-refusal via:
- Automated keyword filters (e.g. “I’m sorry, I can’t assist”) supplemented by LLM-based classification (Khorramrouz et al., 31 Oct 2025, Weidener et al., 20 May 2026).
- LLM-as-a-judge or fine-tuned classifier models, typically running post-hoc over generated outputs (Alagharu et al., 9 Mar 2026, O'Brien et al., 2024).
- Manual adjudication to correct scorer mistakes, with binomial confidence intervals reported on the resulting rates (Kadadekar, 26 May 2026, Weidener et al., 20 May 2026).
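The annotation steps above can be sketched as a first-pass keyword filter followed by an interval estimate on the resulting rate. The patterns are illustrative assumptions rather than any study's actual filter, and the Wilson score interval is used here as one common choice of binomial confidence interval; the cited papers may use a different one.

```python
import math
import re

# Hypothetical refusal phrases; real filters are longer and backed by an LLM classifier.
REFUSAL_PATTERNS = [
    r"\bI['’]m sorry\b",
    r"\bI can['’]?t (?:assist|help)\b",
    r"\bI cannot (?:assist|help)\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(text: str) -> bool:
    """First-pass keyword check; ambiguous cases still need a second scorer."""
    return bool(_REFUSAL_RE.search(text))


def wilson_interval(k: int, n: int, z: float = 1.96):
    """Wilson score interval for k refusals out of n outputs (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


outputs = [
    "I’m sorry, I can’t assist with that.",
    "I cannot help with this request.",
    "Sure, here is an overview.",
]
flags = [looks_like_refusal(o) for o in outputs]
low, high = wilson_interval(47, 50)  # hypothetical count: 47 refusals in 50 outputs
print(flags, round(low, 3), round(high, 3))
```

The interval matters most for low-rate events, where a handful of mislabelled outputs can move the point estimate by a large relative amount.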
Batched and Paired Evaluations: Recent protocols emphasize evaluating RR under realistic serving batch conditions with paired safety/capability controls to isolate safety-specific boundary shifts from infrastructure-induced output instability (Kadadekar, 26 May 2026).
Monte Carlo Sampling: When controlling for large factorials (e.g. model × task × persona × prompt), RR is estimated as a mean over sampled combinations for sample efficiency (Plaza-del-Arco et al., 9 Sep 2025).
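A minimal sketch of this sampling idea: instead of running every cell of the factorial, draw combinations uniformly at random and average the refusal indicator. The factor levels and the `toy_is_refusal` stand-in, which replaces the real step of generating and classifying an output, are hypothetical.

```python
import random


def estimate_rr(factors, is_refusal, n_samples, seed=0):
    """Estimate RR as the mean refusal indicator over uniformly sampled factor combinations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Pick one level per factor independently, i.e. a uniform draw from the full factorial.
        combo = tuple(rng.choice(levels) for levels in factors.values())
        hits += bool(is_refusal(*combo))
    return hits / n_samples


# Hypothetical factor levels: 3 x 2 x 4 x 500 = 12,000 possible combinations.
factors = {
    "model": ["m0", "m1", "m2"],
    "task": ["t0", "t1"],
    "persona": ["p0", "p1", "p2", "p3"],
    "prompt": [f"q{i}" for i in range(500)],
}


def toy_is_refusal(model, task, persona, prompt):
    # Stand-in for generating and classifying an output; refuses only for persona p0.
    return persona == "p0"


estimate = estimate_rr(factors, toy_is_refusal, n_samples=2000)
print(f"estimated RR = {estimate:.3f} (exhaustive value for this toy setup is 0.25)")
```

The estimate's standard error shrinks with the square root of the sample count, so the sampling budget can be set from the precision required rather than from the size of the factorial.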
3. Context-Conditioned and Artifact-Dependent Instability
Empirical research demonstrates that RR, while high in aggregate for safety-aligned models (often >90% refusal on harmful prompts), exhibits substantial local instability under prompt perturbation, demographic targeting, and infrastructure variation:
- Prompt Injection and Perturbation: Even with aggregate RR >94%, 27.7–31.8% of prompts exhibited at least one "refusal escape" in their perturbation neighborhood, with local "flip rates" exceeding 20% for certain artifact types (e.g. ransomware text), but 0% for executable malware (Heverin, 25 Jan 2026).
- Demographic and Persona Effects: Subgroup RR varies across national, gender, religious, and sexual-orientation identities by up to 1.5×, and intersectional groups can amplify refusal disparity (Khorramrouz et al., 31 Oct 2025, Plaza-del-Arco et al., 9 Sep 2025).
- Serving Stack and Batch Instability: Paired testing under different batch/microbatch schedulers yields a non-trivial differential RR (median corrected flip-rate ≈0.16%), with most flips traceable to general output instability rather than to safety-specific boundary shifts (Kadadekar, 26 May 2026).
- Safety–Capability Interference: Activation- or feature-level refusal interventions (e.g. ablation, SAE, vector steering) can sharply reduce RR but often induce distribution drift or collateral performance regressions if not properly disentangled (Cristofano, 13 Jan 2026, O'Brien et al., 2024, Alagharu et al., 9 Mar 2026).
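The gap between aggregate RR and local stability can be made concrete with a small audit over perturbation neighborhoods. In this sketch a "refusal escape" is assumed to mean at least one non-refused variant in a base prompt's neighborhood; the function name and the data are hypothetical.

```python
def audit_neighborhoods(neighborhoods):
    """Summarize refusal behavior over perturbation neighborhoods.

    neighborhoods maps each base prompt to a list of booleans,
    one per perturbed variant, where True means the variant was refused.
    """
    total = sum(len(v) for v in neighborhoods.values())
    refused = sum(sum(v) for v in neighborhoods.values())
    # Local flip rate: share of variants in a neighborhood that were not refused.
    local_flip = {base: (len(v) - sum(v)) / len(v) for base, v in neighborhoods.items()}
    escaped = sum(rate > 0 for rate in local_flip.values())
    return {
        "aggregate_rr": refused / total,
        "escape_fraction": escaped / len(neighborhoods),
        "local_flip_rate": local_flip,
    }


# Hypothetical results: four base prompts with five perturbed variants each.
neighborhoods = {
    "base_1": [True, True, True, True, True],
    "base_2": [True, True, True, True, True],
    "base_3": [True, True, False, True, True],  # one refusal escape
    "base_4": [True, True, True, True, True],
}
report = audit_neighborhoods(neighborhoods)
print(report["aggregate_rr"], report["escape_fraction"])
```

Here a single escaping variant leaves aggregate RR at 95% while a quarter of the base prompts are locally unstable, the same pattern reported for the perturbation studies above.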
4. Statistical Analyses and Supplementary Metrics
Regression and Hypothesis Testing:
- RR is often analyzed with chi-square tests (e.g. outcome vs. perturbation family or demographic group), binary/multinomial logistic regression, and mixed-effects or GEE models to account for prompt clustering (Heverin, 25 Jan 2026, Khorramrouz et al., 31 Oct 2025).
- Provider-level odds ratios (OR) quantify the effect of vendor/API on refusal propensity (e.g. Anthropic API: OR=21.03 for strict refusal) (Weidener et al., 20 May 2026).
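For intuition about the tests listed above, the sketch below computes an unadjusted odds ratio and a Pearson chi-square statistic from a 2×2 table of refusal counts. The counts are hypothetical, and this is a simplification: the cited odds ratios come from regression models, which adjust for covariates and prompt clustering that a raw 2×2 table ignores.

```python
def odds_ratio(a, b, c, d):
    """Unadjusted odds ratio for the table [[a, b], [c, d]] (rows: provider, cols: refused/answered)."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: provider X refuses 40 of 100 prompts, provider Y refuses 10 of 100.
x_refused, x_answered = 40, 60
y_refused, y_answered = 10, 90

or_xy = odds_ratio(x_refused, x_answered, y_refused, y_answered)
chi2 = chi_square_2x2(x_refused, x_answered, y_refused, y_answered)
print(f"OR = {or_xy:.1f}, chi-square = {chi2:.1f}")
```

A chi-square value this large on one degree of freedom is far beyond the 3.84 critical value at the 5% level, but it assumes independent prompts, which is why clustered designs use mixed-effects or GEE models instead.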
Complementary Metrics:
- Refusal Boundary Entropy (RBE): Shannon entropy over local refusal/compliance outcomes within a perturbation set, quantifying instability of the refusal boundary (Heverin, 25 Jan 2026).
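- Youden’s J: $J = \text{sensitivity} + \text{specificity} - 1 = \mathrm{RR}_{\text{should-refuse}} - \mathrm{RR}_{\text{benign}}$, i.e. the refusal rate on prompts that should be refused minus the refusal rate on benign prompts, operationalizing risk-tier discrimination (Weidener et al., 20 May 2026).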
- Refusal Index (RI): Spearman’s rank correlation between refusal probability and error probability, decoupling blanket RR from knowledge-aware or risk-aligned refusal (Pan et al., 2 Oct 2025).
- Disparity Ratio/Absolute Difference: Ratios or absolute differences between subgroup RRs for measuring selective refusal bias (Khorramrouz et al., 31 Oct 2025).
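The complementary metrics above can be sketched in a few lines of plain Python. This is a simplified reading of each definition, not the cited authors' implementations: RBE is taken as the Shannon entropy in bits of the outcome labels within one perturbation set, Youden's J as should-refuse RR minus benign RR, RI as Spearman's correlation computed with average ranks for ties, and the disparity ratio as the largest subgroup RR over the smallest. All example values except the two rates passed to `youdens_j` are hypothetical.

```python
import math
from collections import Counter


def refusal_boundary_entropy(outcomes):
    """Shannon entropy (bits) of outcome labels within one perturbation set."""
    n = len(outcomes)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(outcomes).values())


def youdens_j(rr_should_refuse, rr_benign):
    """Refusal rate on should-refuse prompts minus refusal rate on benign prompts."""
    return rr_should_refuse - rr_benign


def disparity_ratio(subgroup_rr):
    """Largest subgroup RR divided by the smallest (infinite if the smallest is zero)."""
    lo, hi = min(subgroup_rr.values()), max(subgroup_rr.values())
    return math.inf if lo == 0 else hi / lo


def _ranks(xs):
    """Average ranks (1-based), sharing the mean rank among ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var)


rbe = refusal_boundary_entropy(["refuse", "refuse", "refuse", "comply"])
j = youdens_j(0.817, 0.03)  # rates reported for Grok 4.20 (bio) in this article
ratio = disparity_ratio({"group_a": 0.12, "group_b": 0.18})
ri = spearman([0.9, 0.7, 0.2, 0.1], [0.8, 0.6, 0.3, 0.05])
print(round(rbe, 3), round(j, 3), round(ratio, 2), round(ri, 2))
```

With the two rates reported for Grok 4.20, the difference reproduces the J of 0.787 quoted in this article, which is a useful sanity check on the sign convention.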
5. Quantitative Results Across Domains
| Model/Context | Harmful RR (%) | Benign RR (%) | Partial Compliance (%) | Soft Deflection / Other | Notes |
|---|---|---|---|---|---|
| GPT-4o (Heverin, 25 Jan 2026) | 95.6 | — | 0.98 | — | 27.7% instability on base prompts; RBE=0.293 bits (norm. 0.185) |
| GPT-4.1 (Heverin, 25 Jan 2026) | 94.7 | — | 1.70 | — | 31.8% instability; RBE=0.346 bits (norm. 0.218) |
| min(Llama3.2-3B, Qwen2.5) | ~0 | ~0 | — | — | Near-zero RR for latest models in classification tasks (Plaza-del-Arco et al., 9 Sep 2025) |
| Nova 2 Lite (mil. gold) | 98.2 | — | — | — | Max hard refusal on mil-deflect-gold-alpha (Fitzgerald et al., 18 Feb 2026) |
| Deepseek R1 (mil. gold) | 25.8 | — | — | — | Best answerer in mil domain, low RR (Fitzgerald et al., 18 Feb 2026) |
| Grok 4.20 (bio) | 81.7 (dual) | 3 | — | — | Youden’s J=0.787 (best risk discrimination) (Weidener et al., 20 May 2026) |
| Claude Opus 4.7 (bio) | 100 (dual) | 76.6 | — | — | J=0.234; over-refusal |
| Qwen3-VL-8B SRA | 2.0 | — | — | — | SRA: RR drops from 93.8→2.0% without drift (Cristofano, 13 Jan 2026) |
Concrete RR numbers are highly protocol- and domain-dependent. For example, in military QA, refusal rates span 1.5%–98.2% across different public models, with even military-tuned models showing high pre-ablation RR (Fitzgerald et al., 18 Feb 2026). In sociopolitical tasks, RR on offensive classification can range from 0–87% (by model generation) (Plaza-del-Arco et al., 9 Sep 2025).
6. Limitations, Biases, and Misranking
- Binary RR metrics can mis-rank models on calibration and safety objectives. For instance, a model may achieve a high RR by over-refusing even benign prompts, yielding low risk-tier discrimination (Youden’s J) and poor utility (Weidener et al., 20 May 2026).
- Selective refusal bias: Unequal RR across demographic subgroups creates fairness concerns and loopholes in safety systems; intersectional identities can exacerbate or invert group disparities (Khorramrouz et al., 31 Oct 2025, Plaza-del-Arco et al., 9 Sep 2025).
- Task framing, output format, and prompt wording impact RR: Forced-choice answer formats and stricter prompts typically depress RR, sometimes by dozens of percentage points, illustrating the interaction between system design and measured safety (Plaza-del-Arco et al., 9 Sep 2025, Weidener et al., 20 May 2026).
- Batch serving and system infrastructure can cause low-rate, but critical, refusal boundary flips. Directional flips (e.g. from refusal to non-refusal) may occur due to scheduler, micro-batching, kernel settings, or hardware-specific effects, with real safety implications (Kadadekar, 26 May 2026).
7. Operational Recommendations and Future Directions
- Always report RR alongside calibration and discrimination metrics: Standalone RR should be complemented by measures such as Youden’s J, RBE, RI, and breakdowns by partial or soft compliance (Pan et al., 2 Oct 2025, Weidener et al., 20 May 2026, Heverin, 25 Jan 2026).
- Stratify RR reporting by subgroup and risk tier: Ensure fairness and safety by conditioning RR on demographic, domain, and artifact-type axes (Khorramrouz et al., 31 Oct 2025, Plaza-del-Arco et al., 9 Sep 2025, Weidener et al., 20 May 2026).
- Incorporate robust, paired evaluation protocols: Compare RR under multiple serving configurations, paired with capability and benign controls, and use manual adjudication for low-rate events (Kadadekar, 26 May 2026).
- Use activation-geometry-guided interventions: To drive target RR reductions without collateral damage, methods like spectral residualization (SRA) or decorrelated residual steering can be preferred over naïve vector ablation (Cristofano, 13 Jan 2026, Alagharu et al., 9 Mar 2026).
- Refusal stability and local decision-boundary behavior should be explicitly measured: Audit RR distributions within perturbation neighborhoods or local prompt clusters to surface "refusal escapes" and compliance flips (Heverin, 25 Jan 2026).
- Promote mechanistic interpretability: Identification and disentanglement of refusal-mediating features (e.g. categorical steering, SAE features) are critical to avoid tradeoffs with general capabilities (O'Brien et al., 2024, Alagharu et al., 9 Mar 2026).
In summary, while refusal rate is a foundational metric for LLM safety and policy evaluation, its utility as a sole or ranking criterion is limited by artifact-dependence, local instability, fairness concerns, and infrastructural confounds. A comprehensive safety assessment requires RR to be embedded within a battery of analytic, stratified, and stability-sensitive tests, as established in leading safety research from the 2024–2026 era (Heverin, 25 Jan 2026, Pan et al., 2 Oct 2025, Weidener et al., 20 May 2026).