MRBench Dataset: A Multimodal Evaluation Benchmark
- MRBench is a large-scale, human-annotated benchmark designed to evaluate multimodal retrieval-augmented generation systems that synthesize interleaved text and image outputs.
- It comprises 4,346 documents, 14,190 images, and 4,800 QA pairs across web, academic, and lifestyle domains, testing complex multi-image, multi-step queries.
- The benchmark employs rigorous statistical and LLM-based metrics to assess retrieval accuracy, image grounding, and answer ordering in real-world scenarios.
MRBench is a large-scale, human-annotated benchmark specifically designed to evaluate Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) systems. It addresses the need for rigorous, domain-diverse, and challenging evaluation of generative models that synthesize interleaved text and image outputs, leveraging both retrieval and multimodal generation in open-domain, academic, and lifestyle contexts. Unlike previous benchmarks, MRBench systematically measures the ability of models to retrieve, ground, and accurately compose both textual and visual information in complex, multi-step QA scenarios, encompassing a wide range of real-world use cases (Yu et al., 6 Feb 2025).
1. Benchmark Composition and Dataset Structure
MRBench comprises 4,346 documents, 14,190 images, and 4,800 QA pairs, spanning three principal domains (Web, Academia, Lifestyle) and three ascending difficulty levels:
| Domain | Subset | #Docs | #Images | #QA Pairs | Mean Images/Doc |
|---|---|---|---|---|---|
| Web | MRAMG-Wit | 639 | 639 | 600 | 1.0 |
| Web | MRAMG-Wiki | 538 | 538 | 500 | 1.0 |
| Web | MRAMG-Web | 1,500 | 1,500 | 750 | 1.0 |
| Academia | MRAMG-Arxiv | 101 | 337 | 200 | 3.3 |
| Lifestyle | MRAMG-Recipe | 1,528 | 8,569 | 2,360 | 5.6 |
| Lifestyle | MRAMG-Manual | 40 | 2,607 | 390 | 65.2 |
Multi-image QA instances are pervasive: 948 questions reference exactly two images (e.g., schematic comparison), and 862 questions require synthesis across three or more images (e.g., recipe step sequences, technical multi-panel figures). With an overall average of 3.27 images per document, the benchmark uniquely tests integration and ordering of visual evidence in answers (Yu et al., 6 Feb 2025).
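To illustrate how these corpus statistics translate into practice, here is a minimal sketch of loading and inspecting one subset. The Hugging Face repository id, config name, and field names (`MRAMG/MRAMG-Bench`, `MRAMG-Recipe`, `gold_image_ids`) are illustrative assumptions, not confirmed identifiers; check https://huggingface.co/MRAMG for the actual layout.

```python
# Hypothetical sketch: inspecting one MRAMG-Bench subset.
# Dataset id, config name, and field names are assumptions for illustration.
from datasets import load_dataset

dataset = load_dataset("MRAMG/MRAMG-Bench", name="MRAMG-Recipe", split="test")  # assumed id/config

multi_image = 0
for qa in dataset:
    # Each QA pair is assumed to carry the list of gold image ids it references.
    gold_images = qa.get("gold_image_ids", [])
    if len(gold_images) >= 2:
        multi_image += 1

print(f"{multi_image} of {len(dataset)} questions reference two or more images")
```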
2. Data Sources, Annotation, and QA Construction
MRBench aggregates and refines data from curated sources:
- Level 1 (Web): Draws on Wit (Wikipedia-based image-text), WikiWeb2M, and WebQA img-posFacts resources, with additional context expansion through GPT-4o to ensure caption-image alignment.
- Level 2 (Academic): 110 manually filtered arXiv LaTeX/PDF papers (2023–2024); technical questions and answers authored by graduate-level annotators.
- Level 3 (Lifestyle): RecipeQA evaluation set (step-by-step with images), and manuals from ManualsLib and Kaggle, parsed and cleaned via MinerU and human curation.
Annotation follows a semi-automated, multi-stage process:
- Question Generation:
- GPT-4o is used for initial QA construction (Web, Recipe), with expert questions for Arxiv and Manual.
- Answer Creation:
- Chain-of-Thought (CoT) GPT-4o prompting for draft generation, image placeholder assignment.
- Manual answer curation and ordering for technical and multi-image settings.
- Quality Control:
- Three-stage review combining annotator, GPT-4o, and senior expert intervention.
- Cross-annotator agreement approaches 100% on question validity, answer groundedness, and image selection.
QA pairs are encoded as interleaved lists of text and image blocks, preserving the order and semantics required for faithful multimodal answer generation (Yu et al., 6 Feb 2025).
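To make the interleaved encoding concrete, the sketch below shows one plausible representation of a QA pair as an ordered list of text and image blocks; the field names are illustrative assumptions rather than the released schema.

```python
# Illustrative (assumed) representation of an interleaved multimodal answer.
# Field names are hypothetical; the released files may use a different schema.
qa_pair = {
    "question": "How do I assemble the shelf from the manual?",
    "answer_blocks": [
        {"type": "text",  "content": "First, attach the side panels to the base."},
        {"type": "image", "image_id": "manual_017_fig_03"},
        {"type": "text",  "content": "Then secure the back board with the provided screws."},
        {"type": "image", "image_id": "manual_017_fig_04"},
    ],
}

# The block order is part of the ground truth: metrics such as the image
# ordering score compare a predicted image sequence against this order.
gold_image_sequence = [b["image_id"] for b in qa_pair["answer_blocks"] if b["type"] == "image"]
```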
3. Formal Task Definition
The MRAMG task requires, given a natural language query and a large corpus (each document consisting of text and image fragments), to:
- Retrieve the top-$k$ relevant documents
- Generate an answer as an ordered sequence of text and image references
Formally, given a query $q$ and a corpus $\mathcal{D}$, the retriever returns the top-$k$ documents $\mathcal{D}_q \subseteq \mathcal{D}$, and the answer is produced as $\mathcal{A} = \mathcal{F}(q, \mathcal{D}_q; \mathcal{M})$, where $\mathcal{F}$ is the generation framework and $\mathcal{M}$ a text or multimodal model (LLM or MLLM).
For probabilistic models, $\mathcal{A}^{*} = \arg\max_{\mathcal{A}} P(\mathcal{A} \mid q, \mathcal{D}_q; \mathcal{M})$.
This structure supports both pipeline (retrieval → generation) and fully end-to-end approaches (Yu et al., 6 Feb 2025).
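A minimal pipeline skeleton corresponding to this formulation might look as follows; the function names and the retriever/generator interfaces are placeholders standing in for any concrete embedding model or (M)LLM, not the benchmark's reference implementation.

```python
# Minimal sketch of the pipeline formulation A = F(q, D_q; M).
from typing import Callable

def answer_query(
    query: str,
    corpus: list[dict],                       # each doc: {"text": ..., "images": [...]}
    retrieve: Callable[[str, list[dict], int], list[dict]],
    generate: Callable[[str, list[dict]], list[dict]],
    k: int = 10,
) -> list[dict]:
    """Return an interleaved answer: an ordered list of text/image blocks."""
    top_docs = retrieve(query, corpus, k)     # D_q: top-k retrieved documents
    return generate(query, top_docs)          # A = F(q, D_q; M)
```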
4. Evaluation Metrics and Protocols
Evaluation in MRBench leverages both statistical and LLM-based metrics, targeting both retrieval and generation components.
Retrieval Metrics:
- Context Recall@k: Fraction of queries for which all gold document text is contained in the top-$k$ retrieved results.
- Image Recall@k: Fraction of gold-referenced images that appear among the top-$k$ retrieved images.
Generation Metrics:
- Image Precision, Recall, F₁: computed over the predicted and gold image sets (a sketch follows at the end of this section).
- Image Ordering Score: Weighted edit distance between predicted and ground-truth image sequences.
- ROUGE-L: LCS-based overlap for textual segments [Lin 2004].
- BERTScore: Contextualized semantic similarity [Zhang et al. 2019].
- LLM-based metrics (scored 1–5 by GPT-4o):
- Image Relevance
- Image Effectiveness
- Image Position Score
- Comprehensive Multimodal Answer Quality
BLEU-1 and further metrics are available for fine-grained analysis, with evaluation code distributed for reproducibility (Yu et al., 6 Feb 2025).
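As a concrete reference for the statistical generation metrics, the following is a minimal sketch of image precision/recall/F₁ and a simplified ordering score; it uses a plain normalized edit distance in place of the benchmark's weighted variant, whose exact weighting is not reproduced here.

```python
# Sketch of the statistical generation metrics over image sets/sequences.
# The ordering score below is a simplified, unweighted stand-in.

def image_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def edit_distance(a: list[str], b: list[str]) -> int:
    # Standard Levenshtein distance over image-id sequences.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def ordering_score(predicted: list[str], gold: list[str]) -> float:
    # 1.0 = identical sequence, 0.0 = maximally different.
    if not predicted and not gold:
        return 1.0
    return 1.0 - edit_distance(predicted, gold) / max(len(predicted), len(gold))

p, r, f1 = image_prf(["img_1", "img_3"], ["img_1", "img_2", "img_3"])
print(p, r, f1, ordering_score(["img_3", "img_1"], ["img_1", "img_3"]))
```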
5. Baseline Systems and Comparative Results
MRBench reports performance across 11 models under three answer generation paradigms:
- A) Rule-based insertion: e.g., DeepSeek-V3, which demonstrates competitive performance on Web data (Comp. = 80.18).
- B) Direct MLLM: Models such as Gemini-1.5-Pro (Web Comp. = 92.13, Paper = 85.90).
- C) LLM + placeholders: Combining retrieved evidence with LLM generation.
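For paradigm C, the post-processing step can be pictured as a simple placeholder substitution: the LLM emits text containing `<PIC>` markers (the convention noted in the usage section below), which are then replaced by retrieved images. The sequential-filling rule in this sketch is an illustrative assumption, not the paper's exact procedure.

```python
import re

def insert_images(llm_output: str, image_ids: list[str]) -> list[dict]:
    """Turn LLM text containing <PIC> placeholders into interleaved blocks.

    Placeholders are filled in order from the retrieved image list
    (an assumed convention for illustration).
    """
    blocks, images = [], iter(image_ids)
    for piece in re.split(r"(<PIC>)", llm_output):
        if piece == "<PIC>":
            image_id = next(images, None)
            if image_id is not None:
                blocks.append({"type": "image", "image_id": image_id})
        elif piece.strip():
            blocks.append({"type": "text", "content": piece.strip()})
    return blocks

print(insert_images("Whisk the eggs. <PIC> Fold in the flour. <PIC>", ["step1.jpg", "step2.jpg"]))
```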
Closed-source models (GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro) systematically outperform open-source alternatives, especially on complex, multi-image answers. For example, Gemini-1.5-Pro achieves comprehensive scores of 94.35 (Web), 87.60 (Paper), and 83.10 (Lifestyle), yielding an overall score of ≈88.7 (Yu et al., 6 Feb 2025). Precise multi-image ordering remains challenging (GPT-4o leads with a Web ordering score of 43.5). Rule-based methods are viable for simple cases but degrade as question and context complexity increases.
6. Usage Considerations, Best Practices, and Pitfalls
Recommended pipeline:
- Retrieval: Embed with BGE-M3; retrieve top-10 by cosine similarity, aggregate associated images.
- Preprocessing: Chunk using SentenceSplitter (256 tokens); use standardized `<PIC>` placeholders for image blocks in LLM settings.
- Generation: In LLM modes, include surrounding context sentences; for MLLMs, use CLIP-based ranking to select the images to insert. Tune BLEU/BGE thresholds in rule-based settings.
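The retrieval and chunking steps of this recipe can be sketched as below. Loading BGE-M3 through the `sentence-transformers` package and chunking with LlamaIndex's `SentenceSplitter` are illustrative choices made for this sketch, not necessarily the tooling used by the benchmark authors.

```python
# Sketch of the recommended retrieval setup: BGE-M3 embeddings + cosine top-10.
import numpy as np
from sentence_transformers import SentenceTransformer
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=256, chunk_overlap=0)
embedder = SentenceTransformer("BAAI/bge-m3")

def build_index(documents: list[str]) -> tuple[list[str], np.ndarray]:
    chunks = [chunk for doc in documents for chunk in splitter.split_text(doc)]
    # normalize_embeddings=True lets the dot product act as cosine similarity
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 10) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(vectors @ q)[::-1][:k]
    return [chunks[i] for i in top]
```

Images attached to the source documents of the retrieved chunks are then aggregated and either passed to an MLLM directly or referenced via `<PIC>` placeholders in LLM prompts.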
Pitfalls include over-insertion of irrelevant images (monitor Image Precision), image misordering (train or instruct with ordering score), and context window truncation (chunking or longer-context models). Text-only QA is rare and only present in certain Lifestyle datasets. All data and code are released under CC BY 4.0 (https://huggingface.co/MRAMG), facilitating adoption and benchmarking (Yu et al., 6 Feb 2025).
7. Significance and Impact on Multimodal Retrieval-Augmented Generation
MRBench establishes the foundational evaluation benchmark for the MRAMG task: fully integrated text + image question answering grounded in retrieval from large, diverse domains. It differs from prior benchmarks by combining:
- high multimodal density,
- diverse multi-image and multi-step scenarios,
- rigorous statistical and LLM-based evaluation,
- broad coverage (Web, scientific, real-world “how-to”).
This design enables robust tracking of MRAG and MLLM advances on fine-grained, realistic multimodal synthesis tasks. The benchmark exposes current model deficiencies—such as failure on answer grounding, image selection, and document-order tracking—and provides the structure required for future methodological progress in retrieval-augmented multimodal generative systems (Yu et al., 6 Feb 2025).