DeepMMSearchVQA: Multimodal VQA with Dynamic Search

Updated 15 October 2025
  • The paper introduces DeepMMSearchVQA, an automatically constructed multimodal dataset whose multi-turn conversations are annotated with structured tool calls to teach search-augmented visual question answering.
  • The methodology employs multi-hop and multi-modal query design, integrating whole image, cropped image, and text searches to iteratively refine responses.
  • DeepMMSearch-R1 leverages a two-stage training regime combining supervised finetuning and reinforcement learning to dynamically integrate external evidence and boost accuracy.

DeepMMSearchVQA refers to a dataset, methodology, and multimodal LLM training paradigm for information-seeking and knowledge-intensive Visual Question Answering (VQA) where the model dynamically determines when to invoke external web search tools and how to reason over their retrieved information in multi-turn interactions. Central contributions include the automated construction of the DeepMMSearchVQA dataset, multi-hop and multi-modal annotation conventions, and a novel model architecture (DeepMMSearch-R1) that integrates reinforcement learning for adaptive search behavior and response accuracy in practical web environments (Narayan et al., 14 Oct 2025).

1. Automated Construction of DeepMMSearchVQA Dataset

The DeepMMSearchVQA dataset is generated by augmenting an existing knowledge-intensive VQA corpus (InfoSeek training set) through an automated pipeline that passes each question–image pair to a multimodal reasoning model (Gemini-2.5-Pro). This model simulates realistic, multi-turn conversations annotated with structured tool-call tags for whole image search, cropped image search (focused via a referring expression and region proposal), and text web search (<img_search>, <text_search> blocks).

Crucial quality control is performed at multiple stages: samples are discarded if they fail any of three quality checks, and agreement with InfoSeek ground-truth answers is required. Web information retrieved during the simulated interaction is integrated in <information> blocks, thereby teaching the model both to consume and to reason over external evidence. The resulting dataset contains diverse, multi-hop examples across the knowledge taxonomy, incorporating both search-required and search-free cases. Annotation captures reasoning steps, dynamic tool selection, and refined query generation, including crop-based image search for improved relevance in cluttered visual scenes.
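
To make the annotation conventions concrete, the following is a minimal, hypothetical sketch of what one annotated sample could look like; the field names, tag syntax, and example content are illustrative assumptions rather than the released schema.

```python
# Hypothetical structure of one DeepMMSearchVQA training sample.
# Field names, tag syntax, and content are illustrative, not the released schema.
sample = {
    "image": "infoseek_000123.jpg",            # source image from the InfoSeek training set
    "question": "In which year was the museum shown in this photo founded?",
    "conversation": [
        # Turn 1: the model decides a cropped image search is needed.
        {"role": "assistant",
         "content": "<img_search>crop: the domed building behind the canal</img_search>"},
        # Retrieved web evidence is wrapped in an <information> block.
        {"role": "tool",
         "content": "<information>Top visual match: Rijksmuseum, Amsterdam ...</information>"},
        # Turn 2: a refined text search based on the intermediate evidence.
        {"role": "assistant",
         "content": "<text_search>Rijksmuseum founding year</text_search>"},
        {"role": "tool",
         "content": "<information>The Rijksmuseum was founded in 1798 ...</information>"},
        # Final turn: the model answers from the accumulated evidence.
        {"role": "assistant", "content": "The museum was founded in 1798."},
    ],
    "answer": "1798",                          # InfoSeek ground truth used for quality control
    "requires_search": True,                   # search-free samples would answer directly
}
```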

2. Multi-hop and Multi-modal Query Design

Each DeepMMSearchVQA sample is a multi-turn dialogue instantiating complex reasoning flows. Samples may involve iterative text search—where the model updates queries based on intermediate retrieved evidence—and cropped image search, guided by referring expressions that identify critical visual entities. This approach allows the training data to teach selective tool invocation, stepwise query refinement, and background suppression (by cropping images to exclude irrelevant context).

Samples cover the full breadth of information-seeking VQA: some can be answered directly (search-free), while others necessitate sophisticated, multi-hop retrieval and evidence synthesis. This diversity enables training of robust web-search agents capable of adapting to varied knowledge demands.
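
A rough sketch of the control flow these samples are meant to instill is shown below; the model interface and the search callables are hypothetical placeholders for illustration, not an API described in the paper.

```python
# Minimal sketch of the multi-turn, tool-augmented reasoning loop demonstrated by
# DeepMMSearchVQA samples. `model`, `text_search`, and `image_search` are
# caller-supplied objects; their interfaces here are assumptions.

def answer_with_dynamic_search(model, image, question, text_search, image_search, max_turns=4):
    history = [{"image": image, "question": question}]
    for _ in range(max_turns):
        step = model.generate_step(history)              # emits a tool call or a final answer
        if step.tool == "text_search":
            evidence = text_search(step.query)           # query refined from prior evidence
        elif step.tool == "img_search":
            evidence = image_search(image, crop=step.referring_expression)
        else:                                            # search-free case: answer directly
            return step.answer
        history.append({"information": evidence})        # fed back as an <information> block
    return model.generate_step(history).answer           # force a final answer after max_turns
```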

3. Two-stage Model Training Regime

DeepMMSearch-R1 is trained with a two-stage pipeline:

  • Supervised Finetuning (SFT): The base MLLM (Qwen2.5-VL-7B-Instruct) is adapted with LoRA adapters using annotated multi-turn conversations from DeepMMSearchVQA. This teaches the model explicit conventions for tool invocation (generating <img_search> and <text_search> tags), dynamic selection of search type (whole vs. cropped image), and sophisticated integration of retrieved evidence.
  • Online Reinforcement Learning (RL): A subsequent stage applies GRPO (Group Relative Policy Optimization), using multi-sample rollouts to optimize for factual accuracy and prompt adherence. A composite reward evaluates both correctness (via exact match or semantic assessment) and format compliance, further incentivizing efficient, adaptive search strategies; a sketch of such a reward follows this list. Iterative text search and self-reflection behaviors are reinforced, enabling multi-turn refinement of queries and answers.
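
A composite reward of this kind might be sketched as follows; the weights, the tag-based format check, and the semantic_match hook are assumptions for illustration, not the paper's exact reward specification.

```python
import re

def composite_reward(response, gold_answer, semantic_match=None,
                     w_correct=0.9, w_format=0.1):
    """Correctness plus format compliance; weights and helpers are illustrative."""
    # Correctness: exact match on the text after the last evidence block,
    # optionally relaxed with a caller-supplied semantic judge (e.g., an LLM grader).
    final_answer = response.split("</information>")[-1].strip()
    correct = float(final_answer.lower() == gold_answer.strip().lower())
    if not correct and semantic_match is not None:
        correct = float(semantic_match(final_answer, gold_answer))
    # Format compliance: every tool/evidence tag that is opened is also closed.
    tags = ("img_search", "text_search", "information")
    well_formed = all(
        len(re.findall(f"<{t}>", response)) == len(re.findall(f"</{t}>", response))
        for t in tags
    )
    return w_correct * correct + w_format * float(well_formed)
```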

The supervised objective uses Causal Language Modeling:

$$L_{\text{SFT}} = -\sum_{t=1}^{T} \log \pi_{\theta}\left(y_t^* \mid x, I, y_{<t}^*\right)$$

with $T$ as the target length and $\pi_{\theta}$ as the model's predictive distribution.
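
A minimal PyTorch sketch of this objective, assuming the image and prompt have already been encoded into the model's inputs and that a mask marks the target tokens $y^*$:

```python
import torch.nn.functional as F

def sft_loss(logits, input_ids, target_mask):
    """Causal LM loss over the annotated target tokens only.

    logits: (B, T, V) model outputs; input_ids: (B, T) token ids;
    target_mask: (B, T) with 1 where the token belongs to the target y*.
    """
    # Shift so that position t predicts token t+1; mask out non-target positions.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].masked_fill(target_mask[:, 1:] == 0, -100)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,          # prompt, image, and <information> tokens are ignored
    )
```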

The RL objective under GRPO:

$$L_{\text{GRPO}} = \mathbb{E}_{i,t}\left[ \min\left(\rho_t^{(i)} A^{(i)},\ \text{clip}\left(\rho_t^{(i)}, 1-\epsilon, 1+\epsilon\right) A^{(i)}\right) \right] - \beta\, \text{KL}\left(\pi_{\theta}\,\|\,\pi_{\text{ref}}\right)$$

where $\rho_t^{(i)}$ is the token-level probability ratio, $\epsilon$ the clipping parameter, and $\beta$ the KL scaling factor.
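
A hedged PyTorch sketch of the clipped objective above; the group-relative advantage computation and the KL estimator are common choices assumed here, not necessarily the paper's exact implementation:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """Clipped GRPO surrogate for one group of rollouts of the same prompt.

    logp_new/logp_old/logp_ref: (G, T) per-token log-probs under the current,
    behavior, and reference policies; rewards: (G,) scalar reward per rollout.
    """
    # Group-relative advantage A^(i): normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)                      # rho_t^(i)
    unclipped = ratio * adv[:, None]
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv[:, None]
    surrogate = torch.minimum(unclipped, clipped).mean()
    # Estimator of KL(pi_theta || pi_ref) from per-token log-probs.
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1.0).mean()
    return -(surrogate - beta * kl)                             # negate to minimize
```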

4. Model Capabilities and Reasoning Behavior

DeepMMSearch-R1 is designed to support real-world information-seeking behavior:

  • On-demand Multi-turn Web Search: The model dynamically determines when a question requires external evidence, initiating image or text searches only as necessary.
  • Query Crafting and Refinement: Text search is iteratively refined, with the model using retrieved information for self-correction and enhanced precision.
  • Cropped Image Search: For visually complex scenes, the model produces a referring expression that is grounded to a region proposal (using, e.g., Grounding DINO), crops the image accordingly, and then submits the crop for search, reducing the impact of irrelevant background and improving answer accuracy; a sketch of this step follows the list.
  • Tool Utilization and Reasoning: The model learns to reason over retrieved evidence, decide whether to answer or continue searching, and balance internal knowledge with external query calls.
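
The cropped-image-search step referenced above can be sketched as follows; the grounding and search callables are hypothetical stand-ins (a Grounding DINO wrapper and a reverse image search client supplied by the caller), not a specific library API.

```python
from PIL import Image

def cropped_image_search(image_path, referring_expression, ground, image_search):
    """Ground a referring expression to a box, crop the image, and search with the crop.

    `ground` and `image_search` are caller-supplied callables: `ground` returns a
    (left, top, right, bottom) box for the phrase (e.g., via Grounding DINO), and
    `image_search` runs reverse image search on a PIL image.
    """
    image = Image.open(image_path).convert("RGB")
    box = ground(image, referring_expression)        # region proposal for the phrase
    if box is None:                                   # grounding failed: fall back to whole image
        return image_search(image)
    crop = image.crop(box)                            # suppress irrelevant background
    return image_search(crop)
```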

5. Empirical Outcomes and Benchmark Performance

Extensive evaluation is reported on benchmarks including InfoSeek, Enc-VQA, SimpleVQA, DynVQA, OKVQA, and A-OKVQA. DeepMMSearch-R1 consistently surpasses baseline methods using fixed retrieval-augmented generation (RAG) and prompt-based agents. Ablation studies confirm that self-reflection (iterative text search) and cropped image search yield tangible gains—for example, improved performance by reducing noise from irrelevant visual content.

Generalization is strongest when the fine-tuning set is balanced near 50% search-required and 50% search-free samples across knowledge categories, aligning answer accuracy with real-world demands that span both directly answerable and retrieval-dependent questions.

6. Significance and Applications

DeepMMSearchVQA marks a paradigm shift in multimodal VQA, showing that integrating multi-turn search and dynamic query adaptation is essential for knowledge-intensive applications. The dataset and methods teach models to recognize the boundaries of their own knowledge and to invoke external search judiciously, rather than performing excessive retrievals or issuing rigid queries.

Such design enables deployment in dynamic environments, such as digital assistants, educational platforms, or complex information retrieval systems that require up-to-date, accurate synthesis of visual and textual information from the web. The iterative, reasoning-driven process underlying DeepMMSearch-R1 represents a foundational methodology for next-generation multimodal search and question answering agents.

7. Outlook and Future Research Directions

Areas identified for future advancement include refining the image and text search pipelines for greater stability, developing reward models with more flexible semantic assessment, expanding the approach to more open-ended and complex QA tasks, and extending DeepMMSearchVQA toward interactive, trustworthy multimodal agents capable of verification, summarization, and attribution of retrieved content. The technical integration of referring-expression generation, multi-hop reasoning, and adaptive search scheduling remains an open avenue for empirical and algorithmic research in multimodal knowledge systems.
