MRBench Dataset: A Multimodal Evaluation Benchmark
- MRBench is a large-scale, human-annotated benchmark designed to evaluate multimodal retrieval-augmented generation systems that synthesize interleaved text and image outputs.
- It comprises 4,346 documents, 14,190 images, and 4,800 QA pairs across web, academic, and lifestyle domains, testing complex multi-image, multi-step queries.
- The benchmark employs rigorous statistical and LLM-based metrics to assess retrieval accuracy, image grounding, and answer ordering in real-world scenarios.
MRBench is a large-scale, human-annotated benchmark specifically designed to evaluate Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) systems. It addresses the need for rigorous, domain-diverse, and challenging evaluation of generative models that synthesize interleaved text and image outputs, leveraging both retrieval and multimodal generation in open-domain, academic, and lifestyle contexts. Unlike previous benchmarks, MRBench systematically measures the ability of models to retrieve, ground, and accurately compose both textual and visual information in complex, multi-step QA scenarios, encompassing a wide range of real-world use cases (Yu et al., 6 Feb 2025).
1. Benchmark Composition and Dataset Structure
MRBench comprises 4,346 documents, 14,190 images, and 4,800 QA pairs, spanning three principal domains (Web, Academia, Lifestyle) and three ascending difficulty levels:
| Domain | Subset | #Docs | #Images | #QA Pairs | Mean Images/Doc |
|---|---|---|---|---|---|
| Web | MRAMG-Wit | 639 | 639 | 600 | 1.0 |
| Web | MRAMG-Wiki | 538 | 538 | 500 | 1.0 |
| Web | MRAMG-Web | 1,500 | 1,500 | 750 | 1.0 |
| Academia | MRAMG-Arxiv | 101 | 337 | 200 | 3.3 |
| Lifestyle | MRAMG-Recipe | 1,528 | 8,569 | 2,360 | 5.6 |
| Lifestyle | MRAMG-Manual | 40 | 2,607 | 390 | 65.2 |
Multi-image QA instances are pervasive: 948 questions reference exactly two images (e.g., schematic comparison), and 862 questions require synthesis across three or more images (e.g., recipe step sequences, technical multi-panel figures). With an overall average of 3.27 images per document, the benchmark uniquely tests integration and ordering of visual evidence in answers (Yu et al., 6 Feb 2025).
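To illustrate how these corpus statistics translate into practice, here is a minimal sketch of loading and inspecting one subset. The Hugging Face repository id, config name, and field names (`MRAMG/MRAMG-Bench`, `MRAMG-Recipe`, `gold_image_ids`) are illustrative assumptions, not confirmed identifiers; check https://huggingface.co/MRAMG for the actual layout.

```python
# Hypothetical sketch: inspecting one MRAMG-Bench subset.
# Dataset id, config name, and field names are assumptions for illustration.
from datasets import load_dataset

dataset = load_dataset("MRAMG/MRAMG-Bench", name="MRAMG-Recipe", split="test")  # assumed id/config

multi_image = 0
for qa in dataset:
    # Each QA pair is assumed to carry the list of gold image ids it references.
    gold_images = qa.get("gold_image_ids", [])
    if len(gold_images) >= 2:
        multi_image += 1

print(f"{multi_image} of {len(dataset)} questions reference two or more images")
```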
2. Data Sources, Annotation, and QA Construction
MRBench aggregates and refines data from curated sources:
- Level 1 (Web): Draws on Wit (Wikipedia-based image-text), WikiWeb2M, and WebQA img-posFacts resources, with additional context expansion through GPT-4o to ensure caption-image alignment.
- Level 2 (Academic): 110 manually filtered arXiv LaTeX/PDF papers (2023–2024); technical questions and answers authored by graduate-level annotators.
- Level 3 (Lifestyle): RecipeQA evaluation set (step-by-step with images), and manuals from ManualsLib and Kaggle, parsed and cleaned via MinerU and human curation.
Annotation follows a semi-automated, multi-stage process:
- Question Generation:
- GPT-4o is used for initial QA construction (Web, Recipe), with expert questions for Arxiv and Manual.
- Answer Creation:
- Chain-of-Thought (CoT) GPT-4o prompting for draft generation, image placeholder assignment.
- Manual answer curation and ordering for technical and multi-image settings.
- Quality Control:
- Three-stage review combining annotator, GPT-4o, and senior expert intervention.
- Cross-annotator agreement approaches 100% on question validity, answer groundedness, and image selection.
QA pairs are encoded as interleaved lists of text and image blocks, preserving the order and semantics required for faithful multimodal answer generation (Yu et al., 6 Feb 2025).
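To make the interleaved encoding concrete, the sketch below shows one plausible representation of a QA pair as an ordered list of text and image blocks; the field names are illustrative assumptions rather than the released schema.

```python
# Illustrative (assumed) representation of an interleaved multimodal answer.
# Field names are hypothetical; the released files may use a different schema.
qa_pair = {
    "question": "How do I assemble the shelf from the manual?",
    "answer_blocks": [
        {"type": "text",  "content": "First, attach the side panels to the base."},
        {"type": "image", "image_id": "manual_017_fig_03"},
        {"type": "text",  "content": "Then secure the back board with the provided screws."},
        {"type": "image", "image_id": "manual_017_fig_04"},
    ],
}

# The block order is part of the ground truth: metrics such as the image
# ordering score compare a predicted image sequence against this order.
gold_image_sequence = [b["image_id"] for b in qa_pair["answer_blocks"] if b["type"] == "image"]
```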
3. Formal Task Definition
The MRAMG task requires, given a natural language query and a large corpus (each document consisting of text and image fragments), to:
- Retrieve the top-$k$ relevant documents
- Generate an answer as an ordered sequence of text and image references
Formally, given a query $q$ and a corpus $\mathcal{D}$, the retriever returns the top-$k$ documents $\mathcal{D}_q \subseteq \mathcal{D}$, and the answer is produced as $\mathcal{A} = \mathcal{F}(q, \mathcal{D}_q; \mathcal{M})$, where $\mathcal{F}$ is the generation framework and $\mathcal{M}$ a text or multimodal model (LLM or MLLM).
For probabilistic models, $\mathcal{A}^{*} = \arg\max_{\mathcal{A}} P(\mathcal{A} \mid q, \mathcal{D}_q; \mathcal{M})$.
This structure supports both pipeline (retrieval → generation) and fully end-to-end approaches (Yu et al., 6 Feb 2025).
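A minimal pipeline skeleton corresponding to this formulation might look as follows; the function names and the retriever/generator interfaces are placeholders standing in for any concrete embedding model or (M)LLM, not the benchmark's reference implementation.

```python
# Minimal sketch of the pipeline formulation A = F(q, D_q; M).
from typing import Callable

def answer_query(
    query: str,
    corpus: list[dict],                       # each doc: {"text": ..., "images": [...]}
    retrieve: Callable[[str, list[dict], int], list[dict]],
    generate: Callable[[str, list[dict]], list[dict]],
    k: int = 10,
) -> list[dict]:
    """Return an interleaved answer: an ordered list of text/image blocks."""
    top_docs = retrieve(query, corpus, k)     # D_q: top-k retrieved documents
    return generate(query, top_docs)          # A = F(q, D_q; M)
```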
4. Evaluation Metrics and Protocols
Evaluation in MRBench leverages both statistical and LLM-based metrics, targeting both retrieval and generation components.
Retrieval Metrics:
- Context Recall@k: Fraction of queries for which all gold document text is contained in the top-$k$ retrieved results.
- Image Recall@k: Fraction of gold-referenced images that appear among the top-$k$ retrieved images.
Generation Metrics:
- Image Precision, Recall, F₁: computed over the predicted and gold image sets (a sketch follows at the end of this section).
- Image Ordering Score: Weighted edit distance between predicted and ground-truth image sequences.
- ROUGE-L: LCS-based overlap for textual segments [Lin 2004].
- BERTScore: Contextualized semantic similarity [Zhang et al. 2019].
- LLM-based metrics (scored 1–5 by GPT-4o):
- Image Relevance
- Image Effectiveness
- Image Position Score
- Comprehensive Multimodal Answer Quality
BLEU-1 and further metrics are available for fine-grained analysis, with evaluation code distributed for reproducibility (Yu et al., 6 Feb 2025).
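As a concrete reference for the statistical generation metrics, the following is a minimal sketch of image precision/recall/F₁ and a simplified ordering score; it uses a plain normalized edit distance in place of the benchmark's weighted variant, whose exact weighting is not reproduced here.

```python
# Sketch of the statistical generation metrics over image sets/sequences.
# The ordering score below is a simplified, unweighted stand-in.

def image_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def edit_distance(a: list[str], b: list[str]) -> int:
    # Standard Levenshtein distance over image-id sequences.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def ordering_score(predicted: list[str], gold: list[str]) -> float:
    # 1.0 = identical sequence, 0.0 = maximally different.
    if not predicted and not gold:
        return 1.0
    return 1.0 - edit_distance(predicted, gold) / max(len(predicted), len(gold))

p, r, f1 = image_prf(["img_1", "img_3"], ["img_1", "img_2", "img_3"])
print(p, r, f1, ordering_score(["img_3", "img_1"], ["img_1", "img_3"]))
```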
5. Baseline Systems and Comparative Results
MRBench reports performance across 11 models under three answer generation paradigms:
- A) Rule-based insertion: e.g., DeepSeek-V3, which demonstrates competitive performance on Web data (Comp. = 80.18).
- B) Direct MLLM: Models such as Gemini-1.5-Pro (Web Comp. = 92.13, Paper = 85.90).
- C) LLM + placeholders: Combining retrieved evidence with LLM generation.
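For paradigm C, the post-processing step can be pictured as a simple placeholder substitution: the LLM emits text containing `<PIC>` markers (the convention noted in the usage section below), which are then replaced by retrieved images. The sequential-filling rule in this sketch is an illustrative assumption, not the paper's exact procedure.

```python
import re

def insert_images(llm_output: str, image_ids: list[str]) -> list[dict]:
    """Turn LLM text containing <PIC> placeholders into interleaved blocks.

    Placeholders are filled in order from the retrieved image list
    (an assumed convention for illustration).
    """
    blocks, images = [], iter(image_ids)
    for piece in re.split(r"(<PIC>)", llm_output):
        if piece == "<PIC>":
            image_id = next(images, None)
            if image_id is not None:
                blocks.append({"type": "image", "image_id": image_id})
        elif piece.strip():
            blocks.append({"type": "text", "content": piece.strip()})
    return blocks

print(insert_images("Whisk the eggs. <PIC> Fold in the flour. <PIC>", ["step1.jpg", "step2.jpg"]))
```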
Closed-source models (GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro) systematically outperform open-source alternatives, especially on complex, multi-image answers. For example, Gemini-1.5-Pro achieves comprehensive scores of 94.35 (Web), 87.60 (Paper), and 83.10 (Lifestyle), yielding an overall score of ≈88.7 (Yu et al., 6 Feb 2025). Precise multi-image ordering remains challenging (GPT-4o leads with a Web ordering score of 43.5). Rule-based methods are viable for simple cases but degrade as question and context complexity increases.
6. Usage Considerations, Best Practices, and Pitfalls
Recommended pipeline:
- Retrieval: Embed with BGE-M3; retrieve top-10 by cosine similarity, aggregate associated images.
- Preprocessing: Chunk using SentenceSplitter (256 tokens); use standardized `<PIC>` placeholders for image blocks in LLM settings.
- Generation: In LLM modes, include surrounding context sentences; for MLLMs, use CLIP-based ranking to select the images to insert. Tune BLEU/BGE thresholds in rule-based settings.
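The retrieval and chunking steps of this recipe can be sketched as below. Loading BGE-M3 through the `sentence-transformers` package and chunking with LlamaIndex's `SentenceSplitter` are illustrative choices made for this sketch, not necessarily the tooling used by the benchmark authors.

```python
# Sketch of the recommended retrieval setup: BGE-M3 embeddings + cosine top-10.
import numpy as np
from sentence_transformers import SentenceTransformer
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=256, chunk_overlap=0)
embedder = SentenceTransformer("BAAI/bge-m3")

def build_index(documents: list[str]) -> tuple[list[str], np.ndarray]:
    chunks = [chunk for doc in documents for chunk in splitter.split_text(doc)]
    # normalize_embeddings=True lets the dot product act as cosine similarity
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 10) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(vectors @ q)[::-1][:k]
    return [chunks[i] for i in top]
```

Images attached to the source documents of the retrieved chunks are then aggregated and either passed to an MLLM directly or referenced via `<PIC>` placeholders in LLM prompts.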
Pitfalls include over-insertion of irrelevant images (monitor Image Precision), image misordering (train or instruct with ordering score), and context window truncation (chunking or longer-context models). Text-only QA is rare and only present in certain Lifestyle datasets. All data and code are released under CC BY 4.0 (https://huggingface.co/MRAMG), facilitating adoption and benchmarking (Yu et al., 6 Feb 2025).
7. Significance and Impact on Multimodal Retrieval-Augmented Generation
MRBench establishes the foundational evaluation benchmark for the MRAMG task: fully integrated text + image question answering grounded in retrieval from large, diverse domains. It differs from prior benchmarks by combining:
- high multimodal density,
- diverse multi-image and multi-step scenarios,
- rigorous statistical and LLM-based evaluation,
- broad coverage (Web, scientific, real-world “how-to”).
This design enables robust tracking of MRAG and MLLM advances on fine-grained, realistic multimodal synthesis tasks. The benchmark exposes current model deficiencies—such as failure on answer grounding, image selection, and document-order tracking—and provides the structure required for future methodological progress in retrieval-augmented multimodal generative systems (Yu et al., 6 Feb 2025).