MIRAGE-Bench: RAG Performance Benchmark

Updated 15 June 2026

MIRAGE-Bench is a metric-intensive benchmark designed to evaluate and diagnose RAG system performance by disentangling the roles of retrieval and generation.
The dataset comprises 7,560 QA pairs linked to 37,800 document chunks and employs a multi-layered filtering process with both automated and human validations.
It introduces four adaptability metrics—Noise Vulnerability, Context Acceptability, Context Insensitivity, and Context Misinterpretation—to pinpoint specific retrieval-LLM integration failures and strengths.

The name MIRAGE-Bench is shared by several technically distinct benchmarks in recent literature on natural language processing, vision-language intelligence, image editing, quantum benchmarking, and secure mobile agent evaluation. This article covers the retrieval-augmented generation (RAG) benchmark introduced in “MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation” (Park et al., 23 Apr 2025), describing its dataset construction, adaptability metrics, evaluation methodology, experimental findings, and extensibility.

1. Motivation and Benchmark Definition

MIRAGE-Bench is designed to address deficiencies in RAG evaluation by providing a compact, rigorously curated question-answering (QA) dataset that enables component-level and joint evaluation of retrieval and generation in LLM systems augmented with external knowledge. Standard RAG benchmarks often conflate retrieval and generation quality or inadequately assess context sensitivity and noise robustness, limiting insight into failure modes or retriever–LLM alignment. MIRAGE-Bench targets this gap with a metric-intensive, diagnosis-oriented approach focused on context interaction, oracle-based context provision, and component-response stratification (Park et al., 23 Apr 2025).

2. Dataset Construction Protocol

MIRAGE-Bench consists of 7,560 QA pairs aligned to a retrieval pool of 37,800 document chunks (≈5 chunks per query). The construction pipeline has the following features:

Source Selection and Mapping: Queries are sampled from five Wikipedia-based QA corpora—PopQA, Natural Questions, TriviaQA, IfQA, DROP—filtered down from an initial 500K+ pool. Each QA pair is mapped back to its Wikipedia source article via Elasticsearch over the 2024-09-01 English Wikipedia dump. Items unmappable or with duplicate sources are discarded.
Chunking: Full articles are split into 330-token (sentence-preserving) chunks (totaling ≈16.5 million). For each query, the top-5 title-matched chunks are retrieved; further filtering is applied.
Multi-Layered Filtering:
- Support Labeling: A command-r model labels chunks for sufficient contextual support.
- Inference Validation: Llama-3.1-8B filters further by retaining only chunks that are empirically helpful to the LLM in answering the query; queries solvable without any context are removed to increase difficulty.
- Title Verification: Checks that at least one positive chunk per query originates from the expected Wikipedia page.
- Human Validation: Annotators check 100 queries/500 chunks, confirming 95% label accuracy (Krippendorff’s α=0.85).
Final Composition: Every instance has ≥1 positive and several negative chunks; the dataset is single-hop and is not partitioned into train/val/test splits, leaving the choice of splits to researchers.
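The sentence-preserving chunking step described above can be illustrated with a minimal sketch. It is not the benchmark's actual implementation: the regex sentence splitter, the whitespace-based token count, and the function names `split_sentences` and `chunk_article` are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_article(text, max_tokens=330):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    Tokens are approximated by whitespace-separated words (an assumption;
    the tokenizer used by the benchmark is not specified here). A single
    sentence longer than max_tokens is kept whole as its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n_tokens = len(sentence.split())
        # Close the current chunk if adding this sentence would overflow it.
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because sentences are never split, chunk lengths vary and stay at or below the limit except for unusually long single sentences.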

3. RAG Adaptability Metrics

A central innovation of MIRAGE-Bench is its four “RAG Adaptability” metrics, which partition the dataset into subgroups according to binary model outcomes in three context settings:

$b = \text{Ans}_B(d)$ : No context (“base”)
$o = \text{Ans}_O(d)$ : Oracle context (one correct chunk)
$m = \text{Ans}_M(d)$ : Mixed (one oracle chunk + noise)
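As a rough illustration of the three settings, the context passed to the LLM for one query could be assembled as follows. The function name `build_contexts`, the choice of the first positive chunk as the oracle, and the seeded shuffle of the mixed context are assumptions, not details taken from the benchmark.

```python
import random


def build_contexts(positive_chunks, negative_chunks, seed=0):
    """Return the chunk lists for the base, oracle, and mixed settings.

    Assumes positive_chunks is non-empty; the first positive chunk is
    used as the oracle chunk (an illustrative choice).
    """
    rng = random.Random(seed)
    oracle_chunk = positive_chunks[0]
    # Mixed context: the oracle chunk plus noise chunks, in shuffled order
    # so the correct chunk does not always appear first.
    mixed = [oracle_chunk] + list(negative_chunks)
    rng.shuffle(mixed)
    return {"base": [], "oracle": [oracle_chunk], "mixed": mixed}
```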

Let $G(b,o,m) = \{ d \in D \mid \text{Ans}_B(d)=b \wedge \text{Ans}_O(d)=o \wedge \text{Ans}_M(d)=m \}$ , with $|D|=7,560$ .

Metrics:

Metric	Formula	Interpretation
Noise Vulnerability	$(|G(0,1,0)| + |G(1,1,0)|) / |D|$	Correct with oracle context but fails in the mixed (noisy) context
Context Acceptability	$(|G(0,1,1)| + |G(1,1,1)|) / |D|$	Succeeds in both oracle and mixed contexts
Context Insensitivity	$(|G(0,0,0)| + |G(0,0,1)|) / |D|$	Incorrect without context and still incorrect with oracle context
Context Misinterpretation	$(|G(1,0,0)| + |G(1,0,1)|) / |D|$	Correct without context but incorrect with oracle context; the model misreads the context and hallucinates

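The four metrics sum to $1$ by construction, because together they cover each of the eight possible $(b,o,m)$ outcome combinations exactly once. This decomposition allows researchers to pinpoint specific adaptation failures or robustness patterns across retriever–LLM configurations.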
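The grouping in the table above can be computed directly from per-query binary outcomes. The following is a minimal sketch that assumes the outcomes are available as `(b, o, m)` tuples; the function name `adaptability_metrics` and the dictionary keys are illustrative and not taken from the released code.

```python
from collections import Counter


def adaptability_metrics(outcomes):
    """Compute the four adaptability metrics.

    outcomes: iterable of (b, o, m) tuples with 0/1 entries, one per query,
    for the base, oracle, and mixed settings respectively.
    """
    counts = Counter((int(b), int(o), int(m)) for b, o, m in outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes given")

    def share(*groups):
        # Fraction of all queries that fall into the listed G(b, o, m) groups.
        return sum(counts[g] for g in groups) / total

    return {
        "noise_vulnerability": share((0, 1, 0), (1, 1, 0)),
        "context_acceptability": share((0, 1, 1), (1, 1, 1)),
        "context_insensitivity": share((0, 0, 0), (0, 0, 1)),
        "context_misinterpretation": share((1, 0, 0), (1, 0, 1)),
    }
```

One consequence of these definitions is that Noise Vulnerability plus Context Acceptability equals the oracle-context accuracy, since both groups contain exactly the queries with $o=1$.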

4. Evaluation Methodology

MIRAGE-Bench is evaluated in both component-wise and end-to-end modes:

Retriever Evaluation: Measured using F1, Precision, Recall, NDCG on the 37,800-chunk pool.
LLM-Only Evaluation: Assessed by exact-match accuracy under three context regimes (Base, Mixed, Oracle).
RAG Profile: For each retriever × LLM pair and each top-$k$ setting, all four adaptability metrics are computed, forming a multivariate “adaptability profile.”
Validation: In addition to automatic support labeling and inference checking, human spot-checks confirm label quality at 95% accuracy.
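The retriever metrics listed above can be sketched for a single query with binary relevance labels. This is a standard textbook formulation; the exact definitions and averaging used by the benchmark may differ, and the function name `retrieval_metrics` is an assumption. The ranked list is assumed to contain no duplicate chunk IDs.

```python
import math


def retrieval_metrics(ranked_ids, positive_ids, k):
    """Precision, recall, F1, and NDCG at cutoff k for one query."""
    top_k = ranked_ids[:k]
    positives = set(positive_ids)
    hits = [1 if chunk_id in positives else 0 for chunk_id in top_k]
    n_hits = sum(hits)

    precision = n_hits / len(top_k) if top_k else 0.0
    recall = n_hits / len(positives) if positives else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)

    # DCG with binary gains; position 0 is discounted by log2(2) = 1.
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    # Ideal DCG: all positives ranked first, capped at the cutoff k.
    ideal_hits = min(len(positives), k)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "ndcg": ndcg}
```

Per-query values would then be averaged over all queries to obtain dataset-level scores.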

5. Empirical Results and Insights

Experimental analysis in MIRAGE-Bench reveals several quantitative and qualitative trends:

Best Performance: GPT-4o with the nv-embed-v2 retriever reaches up to 80.5% Context Acceptability (top-5), with Noise Vulnerability as low as 10.6%.
Worst Performance: Llama-2-7B + Contriever shows 47% Noise Vulnerability and 35% Context Acceptability.
Retriever-Only: nv-embed-v2 achieves F1≈73.9%, NDCG@1≈79.4%; Contriever is substantially lower.
LLM-Only: GPT-4o base accuracy is ≈45.8%, improving to ≈91.1% with oracle context; Llama-2-7B reaches only 6.6% base accuracy and at most 83% with oracle context.
Trends:
- Noise Vulnerability decreases and Context Acceptability increases with retriever–LLM alignment and better retrieval.
- Context Insensitivity and Misinterpretation are largely LLM-bound; improved retrieval does not mitigate these.
- Top-3 retrieval is typically the best balance; top-5 introduces detrimental noise for smaller LLMs but benefits the strongest models.

6. Practical Usage and Extensibility

MIRAGE-Bench is released with open data and code [https://github.com/nlpai-lab/MIRAGE], enabling rapid integration and further extension:

Format: JSONL, including per-query: question, answer, positive/negative chunk IDs, chunk texts, support labels.
Integration: Scripts to build Elasticsearch indices, vLLM-based inference utilities for arbitrary HuggingFace-compatible LLMs, and code for reproducible evaluation.
Customization: Retrievers and LLMs are swappable (retrieval.py, prompt templates), and the adaptability metric framework generalizes to richer evaluation settings (multi-stage reasoning, chain-of-thought) via subgroup patterning.
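Reading a JSONL release of this kind takes only a few lines. The sketch below is hedged: the file path and the field names used in the usage note are hypothetical placeholders, and the actual schema should be checked against the repository.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def mean_list_length(records, field):
    """Average length of a list-valued field, skipping records that lack it."""
    lengths = [len(r[field]) for r in records if field in r]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

For example, with a hypothetical file `mirage.jsonl` whose records carry a hypothetical `positive_chunk_ids` field, `mean_list_length(load_jsonl("mirage.jsonl"), "positive_chunk_ids")` would give the average number of positive chunks per query.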

7. Significance and Research Impact

MIRAGE-Bench establishes a new standard for fine-grained RAG evaluation by disentangling the subtle interplay between retrieval and generation, with explicit focus on context adaptation, robustness to noise, and failure detection. Its interpretable metrics, tight dataset design, and extensible software have made it a reference point for: (1) benchmarking retriever–LLM integration; (2) charting progress in LLM context-sensitivity; (3) ablation-based studies of retrieval and generation roles in knowledge-intensive language tasks; and (4) rapid prototyping and iteration on new retrieval/generation architectures (Park et al., 23 Apr 2025).

References (1)

MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation (2025)
