RagRefuse Benchmark: Assessing Over-Refusal in RAG
- RagRefuse is a benchmark that systematically quantifies over-refusal in RAG systems by manipulating harmful context contamination and user intent.
- It employs a factorial design across six high-safety domains with varying contamination patterns (e.g., BBB, HBH) to evaluate model robustness.
- Empirical findings demonstrate that interventions like SafeRAG-Steering significantly reduce over-refusal rates, guiding refined safety filter calibration.
RagRefuse is a domain-stratified benchmark specifically designed to quantify and analyze the phenomenon of over-refusal in retrieval-augmented generation (RAG) systems. Over-refusal describes the tendency of LLMs, particularly those aggressively aligned for safety, to decline benign user requests due to triggered safety filters—especially when harmful content contaminates the retrieval context, regardless of innocent user intent. The RagRefuse benchmark systematically manipulates both query intent and context contamination, providing a controlled setting for evaluating model robustness, refusal calibration, and mitigation strategies within RAG pipelines (Maskey et al., 12 Oct 2025).
1. Conceptual Foundation and Motivation
RagRefuse targets the intersection of safety alignment and retrieval bias in RAG. The safety alignment of LLMs is known to cause “over-refusals,” where benign prompts are inappropriately refused, often due to overly sensitive or context-insensitive safety filters. RAG systems exacerbate this because retrieved external content, which is sourced independently of the user’s original intent, may inject harmful passages (hereafter “context contamination”) that directly influence model outputs. This motivates a benchmark capable of disentangling the interacting roles of user intent, context contamination arrangement, and harmful-content density (denoted $\rho$) in triggering refusal behaviors.
Key terminology includes:
- Over-refusal: Refusal issued (direct or indirect) by the model for a benign query.
- Context contamination: The presence of harmful context(s) among otherwise benign ones, operationalized through mixture patterns (e.g., BBB, HBH for $k = 3$).
- Harmful-text density ($\rho$): Fraction of harmful contexts in the retrieved bundle, i.e., $\rho = h/k$, with $h$ the number of harmful contexts and $k$ the total number of retrieved passages.
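As a small illustration of the density definition, the sketch below computes the harmful fraction of a bundle directly from its B/H pattern string. The function name `harmful_density` is an illustrative choice, not a name taken from the benchmark's code.

```python
from fractions import Fraction


def harmful_density(pattern: str) -> Fraction:
    """Fraction of harmful ('H') slots in a B/H pattern string such as 'HBH'."""
    if not pattern or set(pattern) - {"B", "H"}:
        raise ValueError(f"pattern must be a non-empty string of 'B'/'H': {pattern!r}")
    # Number of harmful contexts divided by the bundle size.
    return Fraction(pattern.count("H"), len(pattern))


print(harmful_density("BBB"))  # 0
print(harmful_density("BHB"))  # 1/3
print(harmful_density("HBH"))  # 2/3
```

Exact fractions are used instead of floats so that bundles of different sizes with the same density compare equal without rounding issues.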
2. Benchmark Construction and Data Structure
RagRefuse comprises 2,970 samples (2,475 train, 495 test), stratified across six domains of high safety relevance: Medical, Chemical, Cybersecurity, Legal, Financial, and Other (general harmful content). Harmful queries and answers are sourced from LLM-LAT and AdvBench; benign counterparts are LLM-generated, stepwise safe explanations in the same domains.
For each query (classified as “benign” or “harmful”), the $k$ nearest contexts are retrieved using Sentence-BERT similarity. Context bundles are formed by enumerating all Boolean strings of length $k$ indicating benign (B) or harmful (H) membership, generating 15 distinct contamination patterns. This factorial design enables controlled variation of contamination ratio $\rho$ and harmful-text arrangement (e.g., BHB vs. BBH). Within the test set, each $k$ and pattern is approximately balanced.
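The pattern-driven bundle construction can be sketched as follows. This is a minimal illustration, assuming the benign and harmful candidate passages have already been ranked by similarity to the query (the Sentence-BERT retrieval step is not shown); `all_patterns` and `build_bundle` are hypothetical helper names. Full enumeration yields $2^k$ strings, whereas the benchmark reports 15 patterns, and the selection rule is not given here, so the sketch makes no attempt to reproduce it.

```python
from itertools import product


def all_patterns(k: int) -> list[str]:
    """All 2**k strings of length k over {B, H}, in lexicographic order."""
    return ["".join(p) for p in product("BH", repeat=k)]


def build_bundle(pattern: str, benign: list[str], harmful: list[str]) -> list[str]:
    """Fill each slot of the pattern with the next-ranked passage from the matching pool."""
    pools = {"B": iter(benign), "H": iter(harmful)}
    bundle = []
    for slot in pattern:
        passage = next(pools[slot], None)
        if passage is None:
            raise ValueError(f"pool {slot!r} has too few passages for pattern {pattern!r}")
        bundle.append(passage)
    return bundle


patterns = all_patterns(3)
print(len(patterns))  # 8
print(build_bundle("BHB", ["b1", "b2", "b3"], ["h1", "h2", "h3"]))  # ['b1', 'h1', 'b2']
```

Keeping the pattern as an explicit string makes the arrangement (BHB vs. BBH) a first-class experimental factor rather than a side effect of retrieval order.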
| Domain | Test Bundles | Benign Queries | Harmful Queries |
|---|---|---|---|
| Cybersecurity | 90 | – | – |
| Chemical | 90 | – | – |
| Financial | 90 | – | – |
| Legal | 75 | – | – |
| Other | 75 | – | – |
| Medical | 75 | – | – |
Test split: 263 benign, 232 harmful queries; 165 bundles per $k$; 33 bundles per pattern/$k$ combination, so that the 15 patterns give $15 \times 33 = 495$ test bundles.
3. Evaluation Protocol and Metrics
Each RAG model output $y$ is assessed by a high-capacity LLM and classified as one of:
- Direct answer
- Direct refusal
- Indirect refusal
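A binary indicator for over-refusal is defined as
$$r(x, y) = \begin{cases} 1 & \text{if } y \text{ is a direct or indirect refusal,} \\ 0 & \text{if } y \text{ is a direct answer,} \end{cases}$$
where $x$ is the prompt and $y$ is the model output for it. The core metrics are:
- Over-refusal rate (ORR) on benign prompts $\mathcal{Q}_B$:
$$\mathrm{ORR} = \frac{1}{|\mathcal{Q}_B|} \sum_{x \in \mathcal{Q}_B} r(x, y)$$
- Refusal rate (RR) on harmful prompts $\mathcal{Q}_H$: Analogous formula over the harmful prompt set.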
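Both rates amount to averaging a 0/1 refusal flag over the relevant prompt set. The sketch below shows this under assumed conventions: the label strings (`direct_answer`, `direct_refusal`, `indirect_refusal`) and the record fields `intent` and `label` are hypothetical and may differ from the released evaluation scripts.

```python
REFUSAL_LABELS = {"direct_refusal", "indirect_refusal"}


def refusal_indicator(label: str) -> int:
    """1 if the judged output is a direct or indirect refusal, else 0."""
    return 1 if label in REFUSAL_LABELS else 0


def refusal_rate(records: list[dict], intent: str) -> float:
    """Mean refusal indicator over records whose query intent matches `intent`."""
    flags = [refusal_indicator(r["label"]) for r in records if r["intent"] == intent]
    if not flags:
        raise ValueError(f"no records with intent {intent!r}")
    return sum(flags) / len(flags)


records = [
    {"intent": "benign", "label": "direct_answer"},
    {"intent": "benign", "label": "indirect_refusal"},
    {"intent": "benign", "label": "direct_answer"},
    {"intent": "benign", "label": "direct_refusal"},
    {"intent": "harmful", "label": "direct_refusal"},
    {"intent": "harmful", "label": "direct_answer"},
]

orr = refusal_rate(records, "benign")   # over-refusal rate on benign prompts
rr = refusal_rate(records, "harmful")   # refusal rate on harmful prompts
print(orr, rr)  # 0.5 0.5
```

A low ORR is desirable, while a high RR on harmful prompts indicates that safety behavior is preserved, so the two should be read together.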
Experimental axes include contamination pattern (specific B/H arrangement), contamination density $\rho$, context length $k$, and domain. This protocol enables fine-grained analysis of how refusal probability depends jointly on these variables.
Users are instructed to:
- Load RagRefuse prompts and context bundles.
- For each prompt and its context bundle, invoke the target RAG pipeline.
- Classify outputs into {direct answer, direct refusal, indirect refusal} using the LLM-based judge.
- Compute ORR and RR conditioned on experiment factors.
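The steps above can be sketched as a single evaluation loop. Everything named here is an assumption for illustration: the sample fields (`query`, `contexts`, `intent`, `pattern`), the label strings, and the `rag_pipeline` and `judge` callables, which are replaced by toy stand-ins so the example runs without a model. A real run would load the released data and call an actual RAG system and LLM judge.

```python
from collections import defaultdict

REFUSAL_LABELS = {"direct_refusal", "indirect_refusal"}


def evaluate(samples, rag_pipeline, judge, factor="pattern"):
    """Return {factor value: {"ORR": ..., "RR": ...}}; a rate is None if its set is empty."""
    # For each factor value, track [refusals, total] separately per query intent.
    counts = defaultdict(lambda: {"benign": [0, 0], "harmful": [0, 0]})
    for s in samples:
        output = rag_pipeline(s["query"], s["contexts"])
        label = judge(s["query"], output)
        cell = counts[s[factor]][s["intent"]]
        cell[0] += int(label in REFUSAL_LABELS)
        cell[1] += 1
    results = {}
    for key, c in counts.items():
        results[key] = {
            "ORR": c["benign"][0] / c["benign"][1] if c["benign"][1] else None,
            "RR": c["harmful"][0] / c["harmful"][1] if c["harmful"][1] else None,
        }
    return results


def toy_pipeline(query, contexts):
    # Stand-in model: refuses whenever any context is tagged as harmful.
    if any(c.startswith("H:") for c in contexts):
        return "I can't help with that."
    return "Here is an answer."


def toy_judge(query, output):
    # Stand-in judge: string match instead of an LLM call.
    return "direct_refusal" if output.startswith("I can't") else "direct_answer"


samples = [
    {"query": "q1", "contexts": ["B:a", "B:b", "B:c"], "intent": "benign", "pattern": "BBB"},
    {"query": "q2", "contexts": ["B:a", "H:x", "B:c"], "intent": "benign", "pattern": "BHB"},
    {"query": "q3", "contexts": ["B:a", "H:x", "B:c"], "intent": "harmful", "pattern": "BHB"},
]

results = evaluate(samples, toy_pipeline, toy_judge)
print(results)
```

Passing a different `factor` (for example a `domain` field, if present in the samples) yields the same table stratified along another experimental axis.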
4. Empirical Findings and Domain/Model Sensitivities
RagRefuse reveals several distinctive trends in over-refusal phenomena:
- Contamination Density Effect: On Llama-3.1-8B-Instruct, ORR for uncontaminated bundles (BBB; $\rho = 0$) is near zero. Introduction of a single harmful context sharply increases ORR; e.g., BHB ($\rho = 1/3$) doubles ORR compared to cases with the harmful context at the end. Scaling $k$ from 3 to 7, while holding contamination pattern fixed, increases ORR by approximately 20–25 percentage points. This demonstrates that both contamination density and bundle size amplify over-refusal.
- Domain Sensitivity: Chemical and Medical queries induce the highest over-refusal rates, followed by Legal, Financial, Other, and Cybersecurity. Thus, domain characteristics mediate susceptibility to over-refusal in the presence of contamination.
- Model-Specific Behavior: Llama-3.1-8B-Instruct exhibits a high base ORR (53.4%) on benign queries with context contamination, whereas Qwen-1.5-7B-Instruct’s ORR is substantially lower (4.7%) and nearly insensitive to contamination level or pattern.
- Mitigation Results (SafeRAG-Steering): A model-centric embedding intervention (“SafeRAG-Steering”) constrains embeddings at inference toward empirically safe, non-refusing regions. Application to Llama reduces ORR from 53.4% to 4.3%; on Qwen, the residual 4.7% is eliminated (0%). Improvements are most prominent for mixed contamination patterns and those domains with higher baseline ORR (Chemical, Medical, Legal).
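The source describes SafeRAG-Steering only at a high level, so the following is not its implementation. It is a generic mean-difference steering sketch, shown to make the idea of nudging representations toward a non-refusing region concrete. It assumes hidden representations of answered and refused benign prompts are available as plain vectors; `steering_direction`, `steer`, `alpha`, and the toy vectors are all hypothetical.

```python
def mean(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_direction(answered, refused):
    """Vector pointing from the centroid of refused examples to that of answered ones."""
    mu_answered, mu_refused = mean(answered), mean(refused)
    return [a - r for a, r in zip(mu_answered, mu_refused)]


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden vector along the steering direction with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


answered = [[1.0, 0.0], [3.0, 2.0]]    # toy vectors for answered benign prompts
refused = [[-1.0, 0.0], [-1.0, 2.0]]   # toy vectors for refused benign prompts

direction = steering_direction(answered, refused)
steered = steer([-1.0, 0.5], direction, alpha=0.5)
print(direction)  # [3.0, 0.0]
print(steered)    # [0.5, 0.5]
```

In a real model the vectors would be layer activations and the shift would be applied during the forward pass; the choice of layer and of `alpha` would need tuning so that refusals on genuinely harmful prompts are preserved.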
5. Benchmark Usage and Reproducibility
RagRefuse is publicly available:
- Data: Hugging Face dataset repository
- Code and evaluation scripts: GitHub repository
Adhering to the prescribed benchmarking steps ensures reproducibility and direct comparability:
- Output tables: ORR stratified by contamination pattern, domain, and $k$.
- Plots: ORR as a function of $\rho$.
- Quantitative summaries: Model robustness to context contamination; data to inform safety filter calibration.
The protocol enables systematic probing for domain and density vulnerabilities, as well as direct evaluation of mitigation strategies (e.g., SafeRAG-Steering), within controlled and well-documented experimental settings.
6. Context within the RAG Safety and Refusal Benchmark Landscape
While RagRefuse addresses over-refusal via domain and contamination manipulations, complementary RAG refusal benchmarks exist. For example, RefusalBench (Muhamed et al., 12 Oct 2025) employs generative perturbation (over 176 linguistic “levers”) to introduce controlled uncertainty, evaluating selective refusal detection and categorization across uncertainty types and intensity tiers—distinct from RagRefuse’s static, context contamination-driven design. GaRAGe (Sorodoc et al., 9 Jun 2025) provides grounding and deflection annotations to analyze both answer grounding fidelity and safe deflection when retrieval fails, with metrics such as relevance-aware factuality and true/false positive refusal rates.
RagRefuse’s systematic contamination patterns and comprehensive domain coverage make it particularly well suited for dissecting the failure modes and sensitivity of RAG systems to over-refusal. The deployment of SafeRAG-Steering, and the demonstration of both domain and model-specific effects, highlight new directions for mitigation and tailored safety alignment in live RAG pipelines.
7. Limitations and Practical Considerations
RagRefuse, by design, is focused on English prompts and a fixed set of domains, so results may not generalize to other languages or out-of-domain content. Evaluation is predicated on a single high-capacity LLM judge, potentially introducing subtle labeling bias. The data structure explicitly controls contamination level and arrangement, so transferability to real-world, less structured retrieval settings should be evaluated separately.
The benchmark’s value lies in its capacity to dissect over-refusal as a joint function of context arrangement, density, domain, and alignment regimen, supporting both diagnostic studies of model safety and comparative assessment of refusal-mitigation interventions (Maskey et al., 12 Oct 2025).