MERRIN: Multimodal Search Benchmark

Updated 4 July 2026

MERRIN is a human-annotated benchmark for search-augmented agents that evaluates modality selection, evidence retrieval, and reasoning under noisy conditions.
The benchmark uses plain natural language queries without explicit modality cues and integrates text, image, video, and table sources to test multi-hop reasoning.
It distinguishes errors in retrieval, modality selection, and reasoning, highlighting the challenges of effective multimodal evidence acquisition in real-world settings.

MERRIN, short for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments, is a human-annotated benchmark for search-augmented agents designed to evaluate whether an AI system can infer the needed modality, retrieve relevant multimodal evidence, and reason over noisy, heterogeneous, and often conflicting web sources. Its defining premise is that realistic web search is both underspecified and multi-hop: queries are presented in plain natural language without explicit modality cues, and correct resolution requires engagement with non-text evidence rather than text-only shortcuts. MERRIN is positioned as a benchmark for open-web search settings in which text, image, video, and table sources interact under conflict, incompleteness, and retrieval noise (Wang et al., 15 Apr 2026).

1. Motivation and problem formulation

MERRIN was introduced in response to three limitations in prior multimodal search benchmarks. First, earlier benchmarks often include explicit modality cues in the question, effectively telling the model whether to inspect an image, video, or other medium. Second, they tend to emphasize text and image, while leaving video and audio underexplored. Third, they do not adequately model the noise, conflict, and incompleteness characteristic of real web search results (Wang et al., 15 Apr 2026).

The benchmark therefore targets a more difficult and operationally relevant regime. Questions are intentionally written in plain natural language with no explicit modality cues, so modality selection is itself part of the task. Every question is manually verified to require non-text evidence, and each has a single short, unambiguous answer. This design makes the benchmark a test of retrieval, modality inference, and reasoning simultaneously, rather than a conventional fact lookup task.

A plausible implication is that MERRIN is less a benchmark of static multimodal QA than of search-conditioned evidential decision making. The benchmark does not merely ask whether a model can read a modality once retrieved; it asks whether the model can navigate the open web, identify the useful medium, and ground a final answer under source-level ambiguity.

2. Task structure, modalities, and annotation schema

MERRIN organizes its questions along two annotation axes. The first is reasoning type. A question may require multi-hop reasoning, meaning that the answer depends on combining information across multiple steps or sources; multimodal conflict resolution, meaning that the agent must reconcile conflicting evidence returned by the web; or both. The second axis is the role of the non-text modality. Non-text evidence may be the answer source, where the final answer is directly contained in that modality, or a reasoning component, where it supplies an intermediate fact needed to derive the answer. Some questions instantiate both roles (Wang et al., 15 Apr 2026).

The benchmark covers four source types: Text, Image, Video, and Table. At the same time, the benchmark emphasizes that video and audio are especially underexplored in prior work, and that in MERRIN, video sources incorporate both visual and audio modalities. This yields a slightly broader operational modality space than a simple four-way source taxonomy might suggest.

MERRIN also identifies three core failure modes. A retrieval error occurs when the agent chooses the correct modality but the wrong source. A modality error occurs when the agent uses the wrong modality, often defaulting to text. A reasoning error occurs when the right source is retrieved but the model grounds or combines the evidence incorrectly. This taxonomy is significant because it separates failures of search policy from failures of evidence use. It also undercuts the common simplification that multimodal benchmarking can be reduced to a single end-to-end accuracy number.
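The three failure modes above can be read as an ordered decision rule. The following is a minimal sketch of that reading, not the paper's annotation procedure: the function name, the boolean inputs, and the order of the checks are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    MODALITY = "modality error"    # wrong modality, often a default to text
    RETRIEVAL = "retrieval error"  # correct modality, wrong source
    REASONING = "reasoning error"  # right source, evidence misused


def classify_failure(
    answer_correct: bool,
    used_gold_modality: bool,
    used_gold_source: bool,
) -> Optional[FailureMode]:
    """Assign one failure mode to a wrong answer (hypothetical helper).

    The checks run from search policy to evidence use, so a reasoning
    error is only reported once modality and source are both right.
    """
    if answer_correct:
        return None
    if not used_gold_modality:
        return FailureMode.MODALITY
    if not used_gold_source:
        return FailureMode.RETRIEVAL
    return FailureMode.REASONING
```

Ordering the checks this way keeps the categories mutually exclusive, which is what allows failures of search policy to be counted separately from failures of evidence use.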

3. Dataset construction and quality control

MERRIN contains 162 questions in total. Of these, 120 were created from scratch, 37 were adapted from SealQA, and 5 were adapted from ChartMuseum. For adapted items, the original QA pair was used as one hop and then extended with additional evidence to form a new multi-hop question. Annotators record the ground-truth answer, a step-by-step reasoning explanation, source URLs, source type for each resource, whether non-text evidence is the answer source or a reasoning component, whether the question is multi-hop and/or involves multimodal conflict, and the origin of the question (Wang et al., 15 Apr 2026).
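The annotated fields listed above map naturally onto a small record type. The sketch below is illustrative only: the class names, field names, and string labels are assumptions, and the released dataset may use a different schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    url: str
    source_type: str  # assumed labels: "text", "image", "video", "table"


@dataclass
class MerrinItem:
    question: str
    answer: str                           # single short ground-truth answer
    explanation: str                      # step-by-step reasoning
    resources: List[Resource]             # gold source URLs with their types
    nontext_is_answer_source: bool
    nontext_is_reasoning_component: bool
    multi_hop: bool
    multimodal_conflict: bool
    origin: str                           # "scratch", "SealQA", or "ChartMuseum"

    def has_non_text_evidence(self) -> bool:
        """True if at least one gold resource is not plain text."""
        return any(r.source_type != "text" for r in self.resources)


# Composition reported for the benchmark: the three origins sum to 162.
ORIGIN_COUNTS = {"scratch": 120, "SealQA": 37, "ChartMuseum": 5}
TOTAL_QUESTIONS = sum(ORIGIN_COUNTS.values())
```

Here `has_non_text_evidence` is a convenience check that mirrors the requirement that every question depends on non-text evidence; it does not replace the manual verification described in the text.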

The quality-control pipeline is unusually strict. A second annotator checks answer correctness, clarity, difficulty, and whether non-text evidence is truly required. More importantly, non-text necessity is verified through a two-pass protocol: a Standard text-only search pass and an Adversarial search pass, where the known answer is inserted into the query to test whether a text-only shortcut exists. A question is retained only if at least one sub-question survives both passes without becoming solvable from text alone. Rejection was common: about 39.5% of candidates were rejected in the first round, and 45.3% of those rejected were revised and accepted later.
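The retention rule can be stated compactly in code. This is a sketch under stated assumptions: the class and function names are hypothetical, and the outcome of each text-only search pass is represented as a boolean that an annotator would supply.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SubQuestionCheck:
    solved_by_standard_text_search: bool     # Standard pass
    solved_by_adversarial_text_search: bool  # pass with the answer in the query

    def survives(self) -> bool:
        """A sub-question survives if neither pass finds a text-only shortcut."""
        return not (
            self.solved_by_standard_text_search
            or self.solved_by_adversarial_text_search
        )


def retain_question(checks: Iterable[SubQuestionCheck]) -> bool:
    """Keep a question if at least one sub-question survives both passes."""
    return any(check.survives() for check in checks)


# If both reported percentages refer to the same candidate pool, the share of
# all candidates that were rejected first and accepted after revision is:
recovered_share = round(39.5 * 45.3 / 100, 1)  # about 17.9 (percent)
```

The adversarial pass is the stricter of the two, since inserting the known answer into the query gives a text-only search its best chance of finding a shortcut.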

The dataset statistics indicate both modality diversity and substantial reasoning complexity. By source modality, the distribution is 31.4% text, 35.9% image, and 28.8% video/audio; the remainder (about 3.9%) is attributed to tables, with further distribution details given in the paper's appendix. In addition, 73.5% of questions require both multi-hop reasoning and multimodal conflict resolution, and the average number of gold resources is about 2.0 per question. MERRIN also includes temporal annotations in the form of Effective Year and Freshness, the latter categorized as never-, slow-, or fast-changing.

This construction methodology suggests that MERRIN is designed to minimize benchmark artifacts, particularly hidden text-only shortcuts. That design choice is central to its role as a stress test for genuine multimodal search competence.

4. Evaluation protocol and benchmark settings

MERRIN is evaluated using answer accuracy, defined as whether the predicted answer matches the gold answer. The benchmark uses an LLM-as-judge following BrowseComp-style evaluation. Because answers are designed to be precise and unambiguous, this procedure is described as usually close to exact match after normalization; the judging prompt extracts the final answer and accepts it if it matches the gold answer or falls within a small numerical tolerance (Wang et al., 15 Apr 2026).
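Since the judging step is described as close to exact match after normalization, with a small numerical tolerance, a rule-based approximation is easy to sketch. The code below is not the benchmark's LLM judge: the normalization steps, the 1% relative tolerance, and all function names are assumptions chosen for illustration.

```python
import math
import re
import unicodedata
from typing import Optional

NUMBER_PATTERN = re.compile(r"[-+]?\d+(\.\d+)?")


def normalize(text: str) -> str:
    """Lowercase, strip punctuation (except '.' and '-'), collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s.\-]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def to_number(text: str) -> Optional[float]:
    """Parse plain decimal numbers such as '1,000' or '-3.5'; otherwise None."""
    cleaned = text.replace(",", "").strip()
    if NUMBER_PATTERN.fullmatch(cleaned):
        return float(cleaned)
    return None


def answers_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Numeric answers match within rel_tol; all others must match exactly
    after normalization. The tolerance value is an assumption."""
    p, g = to_number(pred), to_number(gold)
    if p is not None and g is not None:
        return math.isclose(p, g, rel_tol=rel_tol)
    return normalize(pred) == normalize(gold)
```

A rule of this kind would miss paraphrased but correct answers, which is one reason an LLM judge can be preferable even when gold answers are short and unambiguous.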

The benchmark does not define a conventional train/dev/test partition, because it is presented as a benchmark, not a training dataset. Instead, it reports results for the full 162-question benchmark, a 50-question subset for human evaluation, and a 50-example subset for two-step failure analysis.

The experimental design compares 10 models across 3 search settings. The models include seven closed-source systems (GPT-5.4-nano, GPT-5.4-mini, Gemini 3 Flash, Gemini 3 Pro, Gemini 3.1 Lite, Gemini 3.1 Pro, and Gemini Deep Research Agent) and three open-weight systems (Qwen3-4B, Qwen3-30B, and Qwen3-235B). The three search settings are:

No Search: no tools, only parametric knowledge.
Native Search: model-provided tools; for GPT, web_search; for Gemini, Google Search + URL Context; not applicable to Qwen models.
Agentic Multimodal Search: a custom smolagents agent with visit_webpage and watch_video tools.
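The three settings can be summarized as a lookup from model family and setting to the tools available. This is a minimal sketch: the dictionary keys and the function name are hypothetical, the tool strings are labels taken from the description above rather than verified API identifiers, and the agentic agent may expose tools beyond the two named here.

```python
from typing import Dict, List

# Tools named for Native Search, by model family (Qwen has no native setting).
NATIVE_TOOLS: Dict[str, List[str]] = {
    "gpt": ["web_search"],
    "gemini": ["Google Search", "URL Context"],
}

# Tools named for the custom smolagents agent.
AGENTIC_TOOLS: List[str] = ["visit_webpage", "watch_video"]


def tools_for(family: str, setting: str) -> List[str]:
    """Return the tool labels for a (model family, search setting) pair."""
    if setting == "no_search":
        return []  # parametric knowledge only
    if setting == "native_search":
        if family not in NATIVE_TOOLS:
            raise ValueError(f"Native Search is not applicable to {family!r}")
        return list(NATIVE_TOOLS[family])
    if setting == "agentic_multimodal_search":
        return list(AGENTIC_TOOLS)
    raise ValueError(f"unknown setting: {setting!r}")
```

For example, `tools_for("gemini", "native_search")` returns the two Gemini tool labels, while `tools_for("qwen", "native_search")` raises an error because that combination is not evaluated.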

A key operational distinction is that Native Search often cannot process video/audio well, whereas the agentic configuration is explicitly designed to handle text, images, and video more flexibly. This makes MERRIN not only a benchmark of reasoning quality but also a benchmark of tool-interface adequacy.

5. Quantitative results and empirical behavior

MERRIN is reported as highly challenging. The average accuracy across all runs is 22.3%, and the best-performing agent reaches only 40.1%. When aggregated by search setting, the average is around 17.3% for No Search, 23.1% for Native Search, and 33.7% for Agentic Multimodal Search. The best individual result is Gemini 3.1 Pro + Agentic Multimodal Search at 40.1%. Gemini Deep Research Agent with Native Search reaches 33.3%. GPT-based systems generally lag Gemini systems under native search, and open-weight Qwen models improve less from search than closed models (Wang et al., 15 Apr 2026).

Setting or system	Accuracy
Average across all runs	22.3%
No Search average	17.3%
Native Search average	23.1%
Agentic Multimodal Search average	33.7%
Gemini 3.1 Pro + Agentic Multimodal Search	40.1%
Gemini Deep Research Agent + Native Search	33.3%

The reported results also challenge a common assumption that more search activity necessarily improves outcomes. The paper states that the number of search queries or pages visited does not correlate strongly with accuracy. Stronger systems often over-search rather than search better. The benchmark therefore distinguishes between raw search effort and effective evidence acquisition.

An additional intervention supports the importance of video. When a video tool is added to Native Search, performance improves by an average of 5.7% across four Gemini models. This suggests that the benchmark’s multimodal difficulty is not merely nominal; video access materially changes the attainable performance ceiling.

6. Failure modes, human comparison, and benchmark significance

The failure analysis identifies three salient patterns. First, agents show an over-reliance on text. For the best agent, retrieved evidence is 87.7% text, 6.8% image, and 5.5% video/audio combined, even though the benchmark itself is much more balanced across modalities. Second, stronger agents often exhibit over-exploration. Gemini Deep Research times out on 33.1% of questions, and Gemini Pro Native Search hits “Too_Many_Tool_Calls” on 12.7%. Third, agents show inefficient source selection: human URL precision is 38.1%, whereas the agentic system’s URL precision is 1.8% (Wang et al., 15 Apr 2026).
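URL precision can be made concrete with a set-based computation. The text does not spell out how the paper matches URLs, so the sketch below assumes exact string matches between visited and gold URLs; the function name and example URLs are hypothetical.

```python
from typing import Iterable, Tuple


def url_precision_recall_f1(
    visited: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float, float]:
    """Compare the set of visited URLs against the annotated gold URLs."""
    visited_set, gold_set = set(visited), set(gold)
    hits = len(visited_set & gold_set)
    precision = hits / len(visited_set) if visited_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# An agent that visits many pages but finds only one gold source
# has low precision even when its recall is moderate.
p, r, f1 = url_precision_recall_f1(
    visited=["https://a.example", "https://b.example",
             "https://c.example", "https://d.example"],
    gold=["https://a.example", "https://e.example"],
)
```

Under this definition, visiting more pages can only lower precision unless the extra pages are gold sources, which is consistent with the over-exploration pattern described above.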

The paper also argues that reasoning is the bigger bottleneck. For Gemini 3.1 Pro, moving from Agentic Multimodal Search (40.1%) to the gold-source prompting upper bound (47.7%) yields a gain of only 7.6 points. Removing search noise therefore helps, but not enough to close the performance gap; reasoning over evidence remains the primary limitation.

Human comparison reinforces that conclusion. On a 50-question subset, human accuracy is 71.4%, compared with 40.1% for Agentic Multimodal Search (Gemini 3.1 Pro) and 30.9% for Native Search (Gemini 3.1 Pro). Humans also use fewer resources: 2.9 searches on average versus 9.1 for the agentic system, and 2.9 webpage visits versus 3.5. Human URL selection is markedly better, with 38.1% precision, 48.9% recall, and 42.8% F1. Extra time benefits humans substantially more: they improve from 59.2% at 5 minutes to 71.4% overall, whereas agents improve only from 29.6% to 30.9% in native mode and from 34.0% to 40.1% in agentic mode.
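The figures in this comparison can be checked with simple arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative, and the F1 check assumes F1 is the harmonic mean of the reported precision and recall, which need not hold exactly if the paper averages F1 per question.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs in percent)."""
    return 2 * precision * recall / (precision + recall)


# Human URL selection: 38.1% precision and 48.9% recall.
human_f1 = round(harmonic_f1(38.1, 48.9), 1)

# Accuracy gain in percentage points from the 5-minute mark to the final result.
extra_time_gain = {
    "human": round(71.4 - 59.2, 1),
    "native_search": round(30.9 - 29.6, 1),
    "agentic_multimodal_search": round(40.1 - 34.0, 1),
}

print(human_f1)         # 42.8
print(extra_time_gain)  # human 12.2, native 1.3, agentic 6.1
```

The harmonic mean reproduces the reported 42.8% F1, and the gains show humans improving by 12.2 points with extra time, compared with 1.3 points for native search and 6.1 points for the agentic setting.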

Human error patterns are also diagnostically different. Mistakes are often small extraction errors rather than failures of source discovery: wrong count accounts for 43%, right source, wrong detail for 29%, partial/imprecise answer for 14%, and other for 14%. This suggests that humans generally solve the search and modality-selection problem more effectively, while remaining fallible at fine-grained extraction.

Taken together, these results establish MERRIN as a benchmark centered on three intertwined capacities: modality inference, source selection under web noise, and evidence-grounded multi-hop reasoning. A plausible implication is that progress on MERRIN will require not only better multimodal encoders or better search tools in isolation, but tighter coupling between search policy, modality-aware evidence acquisition, and robust reasoning under conflict.

References

Wang et al. (15 Apr 2026). MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments.
