HR-MMSearch: Multimodal Benchmark

Updated 2 January 2026
  • HR-MMSearch is a high-resolution benchmark that evaluates multimodal agentic reasoning and deep tool use in vision-language models.
  • It integrates pixel-level image analysis with search-driven external knowledge retrieval to overcome limitations of shallow retrieval approaches.
  • Rigorous evaluation protocols and failure mode analyses highlight the need for precise tool orchestration and multi-hop planning improvements.

HR-MMSearch (sometimes referenced as HERB or MMSearch-Plus) is a high-resolution, knowledge-intensive benchmark crafted to evaluate agentic reasoning, multimodal retrieval, and deep multimodal tool use in large vision-language and multimodal models. It is distinct in requiring the seamless interleaving of search-driven external knowledge retrieval and pixel-level image analysis, with robust evaluation of multi-hop reasoning across image and web modalities (Tao et al., 29 Aug 2025).

1. Motivation and Benchmarking Gaps

The design of HR-MMSearch responds to limitations in prior retrieval-augmented generation (RAG) and multimodal agent benchmarks. Existing datasets typically permit strong models to succeed with shallow workflows based solely on high-recall text/image retrieval or direct visual question answering (VQA), failing to capture the local and long-range dependencies necessary for complex, real-world information-seeking scenarios:

  • Enterprise-Scale Search Deficiency: Most enterprise and multimodal question answering tasks blend structured metadata and unstructured interaction logs (e.g., chat, documents, code reviews), involve realistic noise, and demand systematic cross-source, multi-hop reasoning (Choubey et al., 29 Jun 2025).
  • Shallow Vision-Language Integration: Early benchmarks (e.g., MMSearch) allow models to answer by exploiting salient visual cues and retrieving page text with minimal cross-validation or fine-grained spatial analysis (Jiang et al., 2024).
  • Insufficient Tool Reasoning: Pure VQA or browsing benchmarks neglect the coordinated, iterative use of tools such as image search, web text search, and region cropping (Chng et al., 30 Dec 2025, Tao et al., 29 Aug 2025).
  • Real-World Complexity: True events, small-object or micro-text localization, and ambiguous visual contexts are underrepresented in static VQA settings.

In response to these gaps, HR-MMSearch was constructed to require fine-grained, tool-integrated spatial-temporal extrapolation and persistent multi-hop planning.

2. Dataset Structure and Construction

HR-MMSearch is designed as a hybrid between enterprise search and multimodal browsing agent evaluation, combining several rigorous selection and annotation strategies (a schematic item record is sketched after the list below):

  • Image-Question Pairs: 305–311 distinct 4K-resolution images, each paired with a knowledge-intensive, search-driven question. Images are sourced from late-cutoff (post-training) news agency photos (Reuters, AP, CNBC, etc.), ensuring minimal data leakage and high visual fidelity (Chng et al., 30 Dec 2025, Tao et al., 29 Aug 2025).
  • Domains: Tasks span eight primary domains—Geography, Sports, Academic Research, Film & TV, Technology, Video Games, Vlogs, and Music—to ensure domain balance and coverage.
  • Item Construction: Each example requires identification and reasoning about key visual subjects that typically occupy <5% of the image area (e.g., a tiny sign, a logo, or distant textual overlay). Questions are constructed so that answers cannot be found within the image directly and instead mandate compounding web retrieval and/or cropping operations.
  • Annotation and Verification: Questions are authored and answers cross-verified by multiple annotators (bachelor- and master-level) for clarity, tool-use adherence, and correctness.
  • Difficulty Splits: Each data item is tagged "hard" if eight independent agent rollouts (Qwen2.5-VL-7B-Instruct) fail to find the answer, leading to a substantial set of adversarially constructed samples (Chng et al., 30 Dec 2025).
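
For concreteness, the sketch below shows one hypothetical shape for a single benchmark item together with the pass@8-based difficulty tagging described above. The field names (image_path, key_region, is_hard) and the exact-match normalization are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HRMMSearchItem:
    """Illustrative record for one HR-MMSearch example (field names are assumptions)."""
    image_path: str                            # 4K news-agency photo
    question: str                              # knowledge-intensive, search-driven question
    answer: str                                # cross-verified ground-truth answer
    domain: str                                # one of the eight domains, e.g. "Sports"
    key_region: Optional[List[float]] = None   # normalized [x1, y1, x2, y2] of the key subject (<5% of image area)
    is_hard: bool = False                      # True if all eight baseline rollouts failed (pass@8 = 0)


def tag_difficulty(item: HRMMSearchItem, rollout_answers: List[str]) -> HRMMSearchItem:
    """Mark an item 'hard' when none of the eight independent agent rollouts matches the ground truth."""
    assert len(rollout_answers) == 8, "difficulty split uses eight rollouts"
    item.is_hard = not any(
        a.strip().lower() == item.answer.strip().lower() for a in rollout_answers
    )
    return item
```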

3. Task Design and Tool-Use Protocol

The HR-MMSearch task is formatted as an interactive, multi-turn search and reasoning protocol, emphasizing stepwise planning and effective tool invocation (a minimal sketch of this loop follows the list below):

  • Agentic Trajectory: At each reasoning turn, the agent produces an internal rationale, selects exactly one action (text search, image search, image crop, or answer emission), and receives an observation (e.g., snippet, image, crop) before updating its context.
  • Available Tools:
    • text_search(query): Yields top web snippets.
    • image_search(index): Returns reverse-image search titles and thumbnails.
    • image_crop([x₁, y₁, x₂, y₂], index): Crops a normalized image region.
  • Workflow Continuation: The sequence continues until an answer is emitted or the maximum interaction length (typically 10 turns) is reached.
  • Spatial-Temporal Extrapolation: Answering frequently requires sequentially cropping visual cues, performing web or image search, extracting and cross-validating retrieved facts, and inferring off-image events (e.g., inferring event dates from signage, match venues from overlays) (Tao et al., 29 Aug 2025).
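
The following is a minimal sketch of the agent loop described above, assuming a generic model.step interface that returns a rationale and exactly one action per turn. The tool functions are stubs standing in for real web-search, reverse-image-search, and cropping backends; none of this is the benchmark's reference implementation.

```python
from typing import Any, Dict, List

MAX_TURNS = 10  # maximum interaction length reported for the benchmark


# Stub tools (real versions would call search / vision backends).
def text_search(query: str) -> Dict[str, Any]:
    return {"snippets": []}                 # top web snippets


def image_search(index: int) -> Dict[str, Any]:
    return {"titles": [], "thumbnails": []} # reverse-image search results


def image_crop(bbox: List[float], index: int) -> Dict[str, Any]:
    return {"crop": None, "bbox": bbox}     # normalized [x1, y1, x2, y2] region


def run_episode(model, item) -> str:
    """Minimal agentic rollout: one action per turn until an answer is emitted or the budget is spent."""
    context: List[Dict[str, Any]] = [
        {"role": "user", "image": item.image_path, "question": item.question}
    ]
    for _ in range(MAX_TURNS):
        # Assumed interface: the model returns an internal rationale plus exactly one action.
        rationale, action = model.step(context)
        context.append({"role": "assistant", "rationale": rationale, "action": action})

        if action["name"] == "answer":
            return action["text"]
        elif action["name"] == "text_search":
            obs = text_search(action["query"])
        elif action["name"] == "image_search":
            obs = image_search(action["index"])
        elif action["name"] == "image_crop":
            obs = image_crop(action["bbox"], action["index"])
        else:
            obs = {"error": f"unknown action {action['name']}"}

        context.append({"role": "tool", "observation": obs})
    return ""  # budget exhausted without an answer
```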

4. Evaluation Metrics and Difficulty Analysis

Rigorous evaluation protocols ensure that only high-fidelity, multi-hop strategies can attain high accuracy (a small scoring sketch follows the list below):

  • Primary Metric: Pass@1 accuracy, the percentage of examples for which the agent’s output exactly matches the ground-truth answer:

\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left(\hat{a}_i = a_i^{\mathrm{gt}}\right)

  • Hard vs. Easy Tasks: Items are assigned 'hard' status if pass@8 = 0 (failure on all eight independent rollouts).
  • Bounding-Box Quality: Cropping correctness is evaluated using recall at IoU ≥ 0.5.
  • Recall@K for Cropped-Image Retrieval: Measures the fraction of tasks where the top-K retrieved results include the correct page (Tao et al., 29 Aug 2025).
  • Long-Horizon Planning: The effect of search rounds on accuracy is tracked; performance often improves initially but can regress with poor planning policies.
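
The sketch below illustrates how these metrics could be computed (exact-match Pass@1 and crop recall at IoU ≥ 0.5). The string normalization and box conventions are assumptions and may differ from the official evaluation script.

```python
from typing import List, Sequence


def exact_match(pred: str, gold: str) -> bool:
    """Assumed normalization: case-insensitive, whitespace-stripped exact match."""
    return pred.strip().lower() == gold.strip().lower()


def pass_at_1(preds: List[str], golds: List[str]) -> float:
    """Acc = (1/N) * sum over i of 1(pred_i == gold_i)."""
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two normalized [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def crop_recall(pred_boxes: List[Sequence[float]], gold_box: Sequence[float], thresh: float = 0.5) -> bool:
    """A gold region counts as recalled if any predicted crop reaches IoU >= thresh."""
    return any(iou(b, gold_box) >= thresh for b in pred_boxes)
```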

5. Baseline Performance and Comparative Results

A diverse array of closed- and open-source models as well as bespoke agentic and RAG approaches have been evaluated:

| Agent / Model | Direct Answer (%) | Agentic Zero-Shot (%) | Agentic Fine-Tuned (%) |
|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 0.58 | 19.34 | – |
| Qwen2.5-VL-32B-Instruct | 3.93 | 33.44 | – |
| Qwen3-VL-8B-Instruct | 12.13 | 27.87 | – |
| GPT-4o | 13.11 | 30.16 | – |
| Gemini-3-Flash | 21.97 | 41.64 | – |
| GPT-5 | 22.62 | 38.36 | – |
| SenseNova-MARS-8B | – | – | 41.64 |

  • Standard inference without external tools yields near-zero accuracy, especially for open-source models.
  • Restricting tools to text/image search or cropping alone is insufficient; hard questions typically require at least three tool invocations, and some cases involve coordinated use of all operations (Chng et al., 30 Dec 2025).
  • SenseNova-MARS, a reinforcement learning framework integrating hybrid search and perception RL, achieves 41.64% (fine-tuned) on HR-MMSearch, matching or surpassing proprietary baselines (e.g., Gemini-3-Flash, GPT-5) (Chng et al., 30 Dec 2025).

6. Failure Modes and Remaining Challenges

Analysis of agentic rollouts across models and inference paradigms reveals persistent obstacles:

  • Partial Context Retrieval: Agents often reason over single top-K retrievals and fail to integrate multiple weak signals, leading to missed evidence.
  • Tool Misuse: Incorrect selection or chaining of tools (e.g., miscropping, unnecessary search invocations).
  • Fine Region Localization: Queries involving micro-text or objects occupying <2% of the image area often fail due to poor localization or recognition capacity.
  • Retrieval Noise: Irrelevant or semantically close but incorrect search snippets induce hallucinations or fact misattribution.
  • Over- and Under-Cropping: When agents deviate from optimal cropping policies, crucial spatial cues may be omitted or drowned in extraneous context (Chng et al., 30 Dec 2025, Tao et al., 29 Aug 2025).

7. Broader Significance and Future Directions

HR-MMSearch has established itself as a rigorous diagnostic for multimodal agentic intelligence:

  • Unified Evaluation Across Modalities: Unlike classical VQA or browsing datasets, it forces spatial, temporal, and information retrieval competencies within a single benchmark.
  • Bottleneck Identification: Pass@1 ≪ 50% for all current models demonstrates persistent gaps not in model parametric knowledge, but in deep search, source-aware recall, and dynamic, tool-adaptive planning (Choubey et al., 29 Jun 2025, Tao et al., 29 Aug 2025).
  • Methodological Implications: RL-trained hybrid perception+search policies exhibit best performance; this suggests future progress will require learning fine-grained, uncertainty-aware retrieval and cropping policies, as well as explicit sub-query routing and provenance verification.
  • Next Steps: Proposed directions include expanding synthetic cross-domain pipelines (finance, healthcare), improving unanswerability detection, and integrating robust multi-hop agentic tool orchestration (Choubey et al., 29 Jun 2025, Chng et al., 30 Dec 2025).

HR-MMSearch thus stands as a key multimodal and enterprise deep search benchmark, demanding and revealing the next generation of retrieval-augmented, unified tool-use, and fine-grained multimodal reasoning capabilities.
