DeepMMSearch-R1: Multimodal Web Search Agent
- The paper introduces DeepMMSearch-R1, a multimodal LLM system that leverages dynamic tool selection and multi-turn reasoning to enhance search accuracy by over 20 percentage points compared to RAG baselines.
- DeepMMSearch-R1 is built on Qwen2.5-VL-7B-Instruct and integrates explicit tool-use annotation to seamlessly coordinate full-image, cropped-image, and text searches within structured dialogues.
- Its two-stage training pipeline—combining supervised fine-tuning and Group-Relative Policy Optimization—refines tool invocation, reduces redundant search calls, and enables iterative self-correction for complex queries.
DeepMMSearch-R1 is a multimodal LLM (MLLM) system engineered to empower web-scale search with dynamic, on-demand interaction over both textual and visual modalities (Narayan et al., 14 Oct 2025). It is designed to overcome the rigidity of standard retrieval-augmented generation (RAG) and search-agent pipelines by offering multi-turn planning, dynamic tool selection, fine-grained web search invocation, and iterative self-correction across image and text search domains.
1. System Architecture and Multimodal Tool Integration
DeepMMSearch-R1 is architected atop Qwen2.5-VL-7B-Instruct, with a unified framework that coordinates internal language/vision reasoning, web-scale search tool access, and structured response annotation. The system incorporates three core tool interfaces:
- Text Search Tool: Retrieves current factual information via web-based APIs given a natural language query.
- Image Search Tool: Accepts either the full image or a semantically relevant visual crop, supports retrieval of visually similar images and associated metadata from the web.
- Grounding Tool: Utilizes Grounding DINO to produce referring expressions and spatial regions within the input image, precisely identifying and cropping the content most pertinent to the user’s query.
This modular assembly enables selective invocation of tool calls, with the model dynamically deciding among (a) direct answering, (b) whole-image search, (c) cropped-region image search, and (d) text search. All tool-use is explicitly annotated via structured conversation formatting (e.g., <img_search>, <text_search>, <answer>), allowing downstream tracking and supervision.
To isolate the most relevant visual semantics, DeepMMSearch-R1 leverages the grounding tool to produce a referring expression and identify the spatial coordinates for region-of-interest cropping before submitting the image search query. This reduces extraneous background and amplifies the signal-to-noise ratio for entity-centric retrieval tasks.
2. Search Mechanism and Multi-turn Reasoning
A central capability of DeepMMSearch-R1 is multi-turn search and reasoning. On receipt of a multimodal input (question + image), the model first assesses knowledge sufficiency based on internal parametric memory. If uncertain, it can:
- Trigger a visual search using either the full image or a semantically precise crop.
- Parse web search responses and summarize retrieved content in natural language, feeding these as context for subsequent steps.
- Iteratively formulate more specific or corrective text queries for subsequent web searches if needed ("self-reflection" and "self-correction").
This iterative loop is managed so that retrieved content and prior web interaction history are available in the conversational state, enabling the model to adaptively refine the query and aggregation strategy. Tool-use is strictly controlled via structured output tokens, with each search decision embedded as an explicit action tag in the conversational rollout.
3. Training Pipeline: Supervised and Reinforcement Learning
Training DeepMMSearch-R1 follows a two-phase process:
- Cold Start Supervised Fine-tuning (SFT): The LLM is fine-tuned on the DeepMMSearchVQA dataset—composed of multi-turn conversations annotated with tool action tags—using LoRA adapters (rank r = 8) and a fixed vision encoder. This phase instills initial tool-use patterns, annotation syntax, and reasoning strategies over retrieved content. The loss function is standard Causal Language Modeling, but only tokens for reasoning traces and structured tags are optimized (retrieved web text is masked during loss computation):
- Group-Relative Policy Optimization (GRPO) Reinforcement Learning: The model subsequently undergoes online RL wherein complete reasoning trajectories are sampled, each assigned a task reward (semantic correctness, tool-use efficiency). Group-level averaging is used to emphasize relative reward, and the policy is updated via a clipped surrogate loss. This phase refines tool-call decisions, minimizes redundant web search, and aligns search invocation frequency with actual knowledge gaps.
The GRPO-based RL objective considers the relative quality of a batch of sampled outputs and can optimize for both answer accuracy and minimal, necessary search calls, allowing the model to distinguish when it can confidently answer without external retrieval.
4. DeepMMSearchVQA Dataset and Instruction Paradigm
DeepMMSearchVQA is a curated instruction-tuning dataset central to supervised training. It is constructed by:
- Sampling multi-hop, knowledge-intensive queries targeting both visual and textual domains.
- Using Gemini-2.5-Pro to generate step-wise reasoning traces including action tags for (a) whole image search, (b) cropped image search (with referring expression and coordinates), and (c) text search.
- Balancing the dataset to include both search-required (where external retrieval is indispensable) and search-free instances (answerable from model internals).
- Filtering and taxonomy balancing to cover diverse knowledge domains and multi-modal information needs.
Multi-turn dialogues in DeepMMSearchVQA are structured so that tool-invocation, search results integration, and stepwise reasoning are all made explicit. This provides the inductive bias for dynamic, adaptive search behavior in downstream deployment.
5. Experimental Performance and Empirical Insights
DeepMMSearch-R1 is benchmarked on InfoSeek, Enc-VQA, SimpleVQA, DynVQA, OKVQA, and A-OKVQA, using an LLM-as-judge evaluation (GPT-5-chat-latest or GPT-4o). Key findings include:
- After RL refinement, DeepMMSearch-R1 consistently outperforms RAG and prompt-based agent search baselines (by over 20 percentage points in average accuracy in some setups).
- Allowing for self-reflection—enabling multiple, adaptive text search rounds—yields measurable gains in factual correctness and robustness across benchmarks.
- Ablations reveal that a balanced mix of search-required and search-free training samples is essential: an under-representation of search-free examples results in systematic overuse of the search function.
- The combination of initial SFT and RL not only improves answer accuracy but also reduces redundant or unnecessary search tool invocation.
Table 1: Performance Summary (Excerpt)
System | Avg. Accuracy Δ | Search Calls Δ |
---|---|---|
DeepMMSearch-R1 (RL) | +20% | – (optimized) |
SFT-only | Baseline | +30% |
RAG Agent | –20% | Uncontrolled |
Accuracy and search call statistics are relative to standard RAG and agent-based baselines (Narayan et al., 14 Oct 2025).
6. Key Methodological Features and Advanced Capabilities
DeepMMSearch-R1 introduces several methodological advances distinguishing it from prior work:
- Dynamic Cropped-Image Search: The integration of a grounding module enables selective cropping of semantically relevant regions, thereby reducing background noise in visual search and improving entity-level retrieval accuracy.
- Iterative Query Refinement: Multi-turn, tool-augmented dialogue empowers the model not just to retrieve but to self-correct its queries based on web results, an essential feature for open-world and dynamic factual tasks.
- Explicit Tool-use Annotation: Action-tagged annotation in both dataset and model output facilitates downstream evaluation, error analysis, and further proxy reward engineering in RL.
- Hybrid Two-Stage Training: Combination of explicit instruction-following from SFT and outcome-driven adaptive reasoning from RL delivers both strong inductive bias and practical efficiency at test time.
7. Implications, Limitations, and Prospective Research
DeepMMSearch-R1 provides a foundation for knowledge-intensive, real-world multimodal search agents. Its core contributions are:
- Architectural flexibility for tight integration of external knowledge via dynamic web tools.
- Demonstrated benefit of multi-stage training (SFT+RL) in aligning tool-use with knowledge gaps, avoiding excessive and inefficient search.
- Empirical validation on public benchmarks showing increased factuality, reduced search redundancy, and improved multi-hop reasoning.
A limitation noted is the reliance on access to high-quality search engines and web tools; the system may be affected by changing web interfaces or content noise. Furthermore, while the paradigm of explicit region grounding and action annotation boosts visual search, extension to more complex or ambiguous visual queries (e.g., those lacking discrete entities) may require new grounding strategies.
Future research directions include:
- Expansion of tool diversity to include additional modalities (e.g., audio, structured data) or reasoning forms (e.g., code execution, tabular summarization).
- Improvements to long-context reasoning and tool-planning over extended multi-hop web information trails.
- Scale-up to address domain-specific or multilingual search tasks, leveraging adaptive training data construction and reward shaping.
DeepMMSearch-R1 thus represents a significant step toward practical, end-to-end, tool-augmented multimodal search, with a methodological blueprint for integrating structured web interaction into MLLM architectures (Narayan et al., 14 Oct 2025).