MM-BrowseComp: Multimodal Browsing Benchmark
- MM-BrowseComp is a benchmark that evaluates AI agents' retrieval and reasoning by integrating multimodal cues from text, images, and videos.
- It enforces irreducible checklists and mandatory multimodal dependency to mimic real-world, cross-modal information searches.
- Using detailed metrics, the benchmark highlights challenges such as visual hallucination and tool-execution failures in current AI solutions.
MM-BrowseComp is a benchmark specifically designed to evaluate the retrieval and reasoning capabilities of AI browsing agents in multimodal web environments where essential information is distributed across text, images, and videos. Unlike previous benchmarks that are text-centric, MM-BrowseComp enforces scenarios where agents must integrate non-textual clues to achieve high accuracy, thereby mirroring the complexity of real-world web search and research tasks.
1. Motivation and Distinctive Aims
MM-BrowseComp was created to address fundamental limitations in the evaluation of web browsing agents, particularly the text-centric bias of benchmarks such as BrowseComp. While prior datasets pushed deep-search and persistence to the limit, they assumed all essential clues were either embedded in page text or retrievable through textual queries. However, real-world information seeking requires fluid transitions between reading text, extracting facts from images (e.g., icons, charts, diagrams), and parsing embedded videos or screenshots. MM-BrowseComp operationalizes the following key design principles (Li et al., 14 Aug 2025):
- Mandatory multimodal dependency: Each question embeds crucial clues within images or videos, precluding “text-only” shortcuts.
- Irreducible reasoning checklists: Human-annotated stepwise trajectories define the minimal evidence chain—omission of any step renders the instance unsolvable except by chance.
- High adversarial difficulty: No item can be solved by leading vision-LLMs (VLMs; e.g., GPT-4o, Gemini-2.5-Pro) in a single pass, and questions are hand-tuned so that their answers cannot be surfaced by conventional search.
2. Benchmark Construction and Dataset Properties
MM-BrowseComp comprises 224 hand-crafted questions, partitioned into 22 subtasks spanning five high-level domains: Media, Technology, Society, Geography, and Academics. The construction pipeline incorporates multiple validation and filtering stages:
- Expert Authorship and Calibration: More than 20 AI researchers each contributed questions in their area of specialty, initially submitting small pilot sets for method calibration.
- Core construction criteria:
- Multimodal necessity: Each answer requires “seeing”—e.g., reading a visual icon, decoding a diagram—or parsing a short video segment lacking textual equivalents.
- Irreducible checklist: Each instance is annotated with a minimal list of 3–5 sequential reasoning steps, encompassing both tool invocations (e.g., reverse image search, table extraction) and intermediate inference hops; a data-structure sketch follows this list.
- Adversarial validation: Items must defeat both top-tier models and human solvers given up to five minutes in pilot testing; ambiguous or trivially solvable items are removed.
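Conceptually, each item couples a question with its irreducible checklist. The following is a minimal, hypothetical sketch of how such an annotation might be represented; the field names, tool label, and example question are illustrative assumptions, not the benchmark's release format.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Modality = Literal["text", "image", "video"]

@dataclass
class ChecklistStep:
    """One irreducible step in the annotated evidence chain."""
    description: str               # e.g., "reverse-image-search the poster photo"
    modality: Modality             # modality the step depends on
    tool: Optional[str] = None     # tool invocation, if any

@dataclass
class BenchmarkInstance:
    question: str
    start_modality: Modality       # image prompt (~57% of items) or text prompt
    answer: str                    # short, verifiable final answer
    checklist: list[ChecklistStep] = field(default_factory=list)  # 3-5 sequential steps

# Invented example: a text question whose trajectory still forces an image step.
example = BenchmarkInstance(
    question="Which award did the speaker pictured on the workshop poster win in 2019?",
    start_modality="text",
    answer="Turing Award",
    checklist=[
        ChecklistStep("Locate the workshop page and its poster image", "text"),
        ChecklistStep("Identify the pictured speaker", "image", tool="reverse_image_search"),
        ChecklistStep("Retrieve the speaker's 2019 award from a biography page", "text"),
    ],
)
```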
The finalized set has these empirical features:
| Start Modality | Percentage of Items |
|---|---|
| Image prompt | 57% |
| Text prompt | 43% (a multimodal step is still required later in the trajectory) |
No item is answerable by reading text alone; cross-modality retrieval and chained reasoning are enforced at each stage (Li et al., 14 Aug 2025).
3. Evaluation Protocols and Metrics
To assess performance, MM-BrowseComp uses a multi-metric “Pass@1” protocol:
- Overall Accuracy (OA): Fraction of instances with correct final answer, regardless of reasoning path.
- Strict Accuracy (SA): Fraction with both correct answer and complete, annotated checklist traversal—SA reflects cases where the model robustly follows the minimal evidence chain.
- Average Checklist Score (AVG CS): For each instance, the fraction of checklist steps achieved; averaged across the test set, this reflects partial progress on the required trajectory.
Formally:
- OA = $\frac{\#\text{correct final answers}}{\#\text{questions}} \times 100\%$
- SA = $\frac{\#\text{correct final answers with all checklist steps completed}}{\#\text{questions}} \times 100\%$
This protocol enables separation of answer-level correctness from genuine multi-hop, multimodal reasoning fidelity.
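A minimal sketch of how these three metrics could be computed from per-instance judgments; the record fields (`answer_correct`, `checklist_hits`) are assumed names for illustration, not the official evaluation harness.

```python
def score(records):
    """Compute OA, SA, and AVG CS (all in %) over per-instance judgments.

    Each record is assumed to hold:
      - answer_correct: bool, final answer matches the reference
      - checklist_hits: list[bool], one flag per annotated checklist step
    """
    n = len(records)
    oa = sum(r["answer_correct"] for r in records) / n
    sa = sum(r["answer_correct"] and all(r["checklist_hits"]) for r in records) / n
    avg_cs = sum(
        sum(r["checklist_hits"]) / len(r["checklist_hits"]) for r in records
    ) / n
    return {"OA": 100 * oa, "SA": 100 * sa, "AVG_CS": 100 * avg_cs}

# Toy usage: one fully solved item, one lucky guess with a missed checklist step.
print(score([
    {"answer_correct": True, "checklist_hits": [True, True, True]},
    {"answer_correct": True, "checklist_hits": [True, False, True]},
]))  # OA 100%, SA 50%, AVG CS ~83.3%
```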
4. Experimental Results and Failure Analysis
Benchmarking under this regime exposes pronounced limitations in existing agents:
| Model Category | Model | OA (%) | SA (%) | AVG CS (%) |
|---|---|---|---|---|
| Tool-Free VLMs | GPT-4.1 | 7.6 | 5.4 | 14.7 |
| Tool-Free VLMs | Gemini-2.5-Pro | 6.3 | 4.5 | 11.6 |
| Tool-Augmented VLMs | OpenAI o3 (+ tools) | 29.0 | 19.6 | 36.5 |
| Tool-Augmented VLMs | Gemini-2.5-Pro (+ tools) | 7.1 | 3.6 | 15.2 |
| Open-Source Agents | Agent-R1 | 5.6 | 3.7 | 11.0 |
| Open-Source Agents | OWL | 5.6 | 0.0 | 7.1 |
- Even with tool augmentation, the strongest system tested (OpenAI o3 with tools) achieves only 29.0% OA and 19.6% SA (Li et al., 14 Aug 2025).
- The disparity between OA and SA signifies that many correct answers are obtained through speculative guessing or incomplete trajectories.
- AVG CS analysis indicates that completion rates drop once checklist steps depend on images or video: OpenAI o3 completes 62.1% of text-based checklist steps but only 52.7% of image/video-based ones, and other models exhibit even sharper modality drop-offs.
Error taxonomy for open-source agents reveals major contributions from:
- INCORRECT_REASONING (34–43%)
- VISUAL_HALLUCINATION (9–23%)
- TOOL_EXECUTION_FAILURE (5–19%)
5. Multimodal Reasoning Dependencies and Failure Modes
MM-BrowseComp explicitly enforces irreducible multimodal dependencies:
- ~57% of items require agents to process an initial image prompt, while 43% begin as textual queries but demand multimodal steps later in the trajectory.
- Successful completion is contingent on accurate visual processing (image/video understanding, OCR, logo/icon recognition), precise chaining across modalities, and reliable cross-source synthesis.
- Failure arises when agents either hallucinate ungrounded facts, lose track of checklist steps following ambiguous visual cues, or are unable to robustly extract information from image or embedded media.
A salient finding is that models relying solely on captioning tools are heavily bottlenecked by information loss; models that natively re-ingest full images (as OpenAI o3 does) show significant improvement but remain limited (Li et al., 14 Aug 2025). Test-time ensembling (running multiple attempts and aggregating their answers) boosts OA only via chance coverage and fails to improve checklist-completed (SA) performance, highlighting the centrality of robust reasoning over mere answer enumeration.
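The ensembling observation can be made concrete with a toy simulation. It assumes an item whose multimodal step the agent cannot execute, a 10% per-run guess rate, and a simple best-of-n aggregation rule; none of these numbers or rules come from the paper.

```python
import random

def run_agent_on_hard_item():
    """Toy model of one run on an item whose multimodal step the agent cannot execute:
    the checklist is never completed, but the final answer is sometimes guessed."""
    guessed_correctly = random.random() < 0.10   # assumed per-run guess rate
    return guessed_correctly, False              # (answer_correct, checklist_complete)

def aggregate(n_runs):
    """Best-of-n answer aggregation (an assumed rule): OA counts if any run's answer
    is right; SA additionally requires that run to have completed the full checklist."""
    runs = [run_agent_on_hard_item() for _ in range(n_runs)]
    oa = any(correct for correct, _ in runs)
    sa = any(correct and chain for correct, chain in runs)
    return oa, sa

random.seed(0)
for n in (1, 4, 16):
    trials = [aggregate(n) for _ in range(20_000)]
    oa_rate = sum(o for o, _ in trials) / len(trials)
    sa_rate = sum(s for _, s in trials) / len(trials)
    print(f"n={n:>2}  OA~{oa_rate:.2f}  SA~{sa_rate:.2f}")
# OA climbs with more runs purely through chance coverage; SA stays at zero because
# answer-level aggregation cannot supply the missing multimodal evidence chain.
```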
6. Comparative Perspective and Research Directions
MM-BrowseComp is positioned as the first adversarially validated, multimodal browsing benchmark with comprehensively documented and irreducible reasoning chains (Li et al., 14 Aug 2025). Compared to BrowseComp (text-only), Video-BrowseComp (video-centric) (Liang et al., 28 Dec 2025), MMSearch-Plus (micro-cue multimodality) (Tao et al., 29 Aug 2025), and BrowseComp-V³ (subgoal annotated, multi-hop, cross-modal) (Zhang et al., 13 Feb 2026), MM-BrowseComp distinguishes itself by:
- Strictly requiring image/video evidence at the core of each question.
- Employing detailed, stepwise checklists for path analysis.
- Delivering direct measurement of modality-specific agent performance.
Future development recommended by the authors includes:
- Pretraining vision–LLMs on authentic browsing trajectories (combining multi-turn image and text actions).
- Integration of specialized visual tools (e.g., chart readers, video frame analyzers) into agent pipelines to reduce information loss in image-based tasks.
- Use of irreducible checklists as dense supervision targets in reinforcement learning, promoting robust multi-hop, cross-modal reasoning (a reward-shaping sketch follows this list).
- Enhancement of memory-augmented browsing architectures to support long, branched search sequences.
- Dynamic identification of subtask structure and modality focus, enabling adaptive browsing strategies.
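One way the checklist-as-dense-supervision idea could be realized is sketched below. The step-matching judge, function names, and weights are illustrative assumptions, not a method described in the paper; a real system would likely use a rule-based or LLM-based judge in place of the placeholder matcher.

```python
def matches(step: str, checklist_item: str) -> bool:
    """Placeholder judge: a step satisfies an item if it mentions the item's key phrase."""
    return checklist_item.lower() in step.lower()

def checklist_reward(trajectory_steps: list[str], checklist: list[str],
                     answer_correct: bool,
                     step_weight: float = 0.5, answer_weight: float = 0.5) -> float:
    """Dense reward sketch: partial credit per satisfied checklist step plus a terminal
    answer bonus, instead of a single sparse end-of-episode signal."""
    if not checklist:
        return answer_weight * float(answer_correct)
    hits = sum(any(matches(s, item) for s in trajectory_steps) for item in checklist)
    return step_weight * hits / len(checklist) + answer_weight * float(answer_correct)

# Toy usage: two of three checklist steps matched, wrong final answer -> reward ~0.33.
print(checklist_reward(
    ["searched the event page", "ran reverse image search on the poster"],
    ["event page", "reverse image search", "2019 award"],
    answer_correct=False,
))
```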
A plausible implication is that only agents combining strong, integrated multimodal backbones with sophisticated tool-use and planning architectures will be able to reliably solve MM-BrowseComp-style tasks. The benchmark thus delineates a new frontier for open-world AI research agents capable of “reading, seeing, clicking, and watching” with fluid, verifiable, human-level reasoning.
References:
- "MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents" (Li et al., 14 Aug 2025)
- Related multimodal and video-browser benchmarks: (Liang et al., 28 Dec 2025, Zhang et al., 13 Feb 2026, Tao et al., 29 Aug 2025)