DeepMMSearchVQA Dataset
- DeepMMSearchVQA is a multimodal VQA dataset that integrates images and text with explicit tool annotations to simulate multi-turn, real-world search interactions.
- It employs an automated pipeline with Gemini-2.5-Pro and GroundingDINO for iterative reasoning, dynamic search, and precise image processing.
- The dataset supports both supervised finetuning and reinforcement learning (GRPO) to train models for adaptive, evidence-based multimodal reasoning.
DeepMMSearchVQA is a multimodal visual question answering (VQA) dataset designed for training and evaluating multimodal large language models (MLLMs) in real-world web search and information-seeking scenarios. Its primary role is to provide data for both supervised and reinforcement learning paradigms, targeting knowledge-intensive tasks that require dynamic integration of, and reasoning over, visual and textual modalities together with external information retrieval. The dataset is purpose-built to simulate multi-turn interactions that exemplify how MLLMs should coordinate internal reasoning, tool invocation, and iterative query refinement for complex fact-finding tasks.
1. Dataset Composition and Structure
DeepMMSearchVQA comprises multimodal question–image pairs embedded within multi-turn conversational traces. Each example includes:
- Question and image inputs: Typically sourced from the InfoSeek training set.
- Structured annotations: Explicit tags such as `<reason>` (internal reasoning), `<img_search>` (image search, either on the whole image or a salient crop as determined by grounding models), `<text_search>` (text search invocation), and `<answer>` (final model output). These tags clarify tool activation points and delineate internal reasoning states throughout multi-turn dialogue.
- Web-retrieved information: Summarized content from top-k search results, integrated into the flow with annotations that mimic authentic user–search agent interactions.
The dataset’s format enforces a tight coupling between raw modalities and reasoning/tool use. Its annotation sequence enables detailed tracing of not only answer generation but also the model’s decision process for evidence gathering.
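A hypothetical, abbreviated serialization of one such trace is shown below; the field names and content are illustrative stand-ins, not the released schema.

```python
# Hypothetical serialization of one DeepMMSearchVQA-style training example.
# Field names and values are illustrative; the released schema may differ.
example = {
    "image": "infoseek/000123.jpg",  # question-image pair sourced from InfoSeek
    "question": "In which year was the bridge shown in the photo completed?",
    "turns": [
        {"role": "assistant", "content":
            "<reason>The bridge cannot be identified from internal knowledge alone; "
            "an image search on the salient structure should help.</reason>"
            "<img_search>crop: bridge</img_search>"},
        {"role": "tool", "content":
            "Summarized top-k image-search results: the structure matches "
            "the Example Bridge in City X."},
        {"role": "assistant", "content":
            "<reason>The entity is now known; its completion year still needs "
            "verification via text search.</reason>"
            "<text_search>Example Bridge City X completion year</text_search>"},
        {"role": "tool", "content":
            "Summarized top-k web results: completed in 1931."},
        {"role": "assistant", "content": "<answer>1931</answer>"},
    ],
}
```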
2. Automated Generation Pipeline
The DeepMMSearchVQA pipeline is fully automated. Its stages include:
- Initialization: Question–image pairs are presented to Gemini-2.5-Pro, which drafts both reasoning traces and candidate tool activations.
- Quality Control: Each candidate interaction passes through multiple filtration checks (Checks A, B, C), removing instances with inconsistencies or inadequate reasoning.
- Tool Invocation and Data Augmentation: Upon deciding a search is necessary, text and image search tools are triggered. Text search employs an in-house API to return top‑k web results, which are then distilled and supplied to the agent for further reasoning and possible query refinement. Ambiguous visual questions prompt GroundingDINO-based cropping, improving relevance of image search.
- Iterative Feedback: Retrieved content is looped into the conversation, enabling recursive refinement—self-reflection on what was learned, why further search might be needed, and how to adjust queries or crop regions for subsequent retrieval.
This pipeline yields realistic, multi-turn interaction data with explicit tool calls and justification tags, accurately reflecting how a model should operate when lacking internal knowledge or clarifying ambiguous visual queries.
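A minimal sketch of this generation loop is given below; all helper functions (`draft_step`, `passes_quality_checks`, `run_text_search`, `run_image_search`, `ground_and_crop`, `summarize`) are hypothetical stand-ins for the Gemini-2.5-Pro drafting step, the filtration checks, the in-house search APIs, and the GroundingDINO cropper, not the authors' released code.

```python
# Sketch of the automated trace-generation loop (helper functions are hypothetical).
def generate_trace(question, image, max_turns=5):
    """Build one multi-turn DeepMMSearchVQA-style trace, or return None if filtered."""
    context = [{"question": question, "image": image}]
    for _ in range(max_turns):
        step = draft_step(context)           # drafts <reason> plus a candidate tool call
        if not passes_quality_checks(step):  # filtration checks (Checks A, B, C)
            return None                      # discard inconsistent or weakly reasoned steps
        context.append(step)
        if step["action"] == "answer":
            return context                   # trace ends with an <answer> tag
        if step["action"] == "text_search":
            results = run_text_search(step["query"])            # in-house API, top-k results
        else:                                                    # image search
            region = ground_and_crop(image, step.get("target"))  # optional GroundingDINO crop
            results = run_image_search(region)
        context.append({"retrieved": summarize(results)})        # summary fed back for refinement
    return None  # no answer reached within the turn budget
```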
3. Core Features and Diversity
DeepMMSearchVQA is notable for the following attributes:
- Balanced question taxonomy: The dataset contains both search-required and search-free samples, with explicit reasoning as to why a search action is or is not necessary (a search-free illustration appears at the end of this section).
- Multi-hop, multi-turn interactions: Many instances demand reasoning steps that span repeated evidence gathering, self-reflection, and self-correction—foundational for training agents to manage uncertainty and query adaptation.
- Structured conversational annotation: Each example logs not only answers but also intermediate reasoning, crop instructions, and summarized search results, producing a high-fidelity transcript of model–tool workflows.
This diversity across both modalities and reasoning styles ensures broad coverage of information-seeking needs, ranging from simple direct queries to multi-stage, cross-modal investigations.
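To make the taxonomy concrete, a search-free item might contain only a reasoning tag justifying why no tool call is needed, followed directly by the answer. The snippet below is a hypothetical illustration, not a verbatim dataset record.

```python
# Hypothetical search-free turn: the model justifies answering from internal
# knowledge instead of invoking a tool. Content is illustrative only.
search_free_turn = (
    "<reason>The flag in the image is unambiguous and widely known, so no "
    "external search is required.</reason>"
    "<answer>Japan</answer>"
)
```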
4. Function in Supervised and RL Training
The dataset plays crucial roles in two distinct training strategies:
- Supervised Finetuning (SFT): DeepMMSearchVQA is used to teach the base model (Qwen2.5-VL-7B-Instruct, updated via LoRA adapters) explicit protocols for deciding when to search, which tool to use, and how to parse or integrate retrieved context. Its annotated conversations allow for direct learning of structured tool calls and reasoning patterns.
- Reinforcement Learning (RL) with GRPO: During online RL optimization, the structure of DeepMMSearchVQA enables models to experiment with decision-making—exploring trade-offs between internal knowledge use and external search actions, refining queries iteratively in multi-turn interactions. This feedback is leveraged in the Group Relative Policy Optimization (GRPO) algorithm, which rewards not only correct answers but also judicious (efficient) tool use.
In both regimes, the dataset’s multi-turn, annotated nature is key for exposing models to the complex reasoning and evidence-integration required for high-accuracy, low-latency multimodal VQA.
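As an assumed setup for the SFT stage, the snippet below attaches LoRA adapters to the Qwen2.5-VL-7B-Instruct backbone using the Hugging Face transformers and peft libraries (a release with Qwen2.5-VL support is assumed); the rank, alpha, and target modules are illustrative rather than the paper's reported configuration.

```python
# Assumed SFT setup: LoRA adapters on Qwen2.5-VL-7B-Instruct via transformers + peft.
# Rank, alpha, dropout, and target modules are illustrative choices.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

lora_cfg = LoraConfig(
    r=16,                       # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)   # only the adapter weights are trainable
model.print_trainable_parameters()
```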
5. Integration of Real-world Search and Visual Reasoning
DeepMMSearchVQA embeds real-world data into the learning process by:
- Authentic search session simulation: Web search APIs provide actual external information for the model to integrate, with search results (summarized) fed back to the model during ongoing conversation.
- Dynamic image region selection: For ambiguous or composite visual queries, GroundingDINO is used to crop relevant regions, making image search actions context-sensitive and precise (a cropping sketch follows at the end of this section).
- Iterative error correction: Instances where initial queries or region selections fail to yield sufficient information prompt further reasoning or query adjustment, teaching models to engage in realistic self-correction.
This suggests that models trained on DeepMMSearchVQA can generalize from static knowledge retrieval to adaptive, real-time web search and visual interpretation—skills required for deployment in real-world, dynamic scenarios.
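As a concrete illustration of the grounding-based cropping above, the sketch below crops the highest-scoring region returned by a detector and pads it slightly before image search; `detect_with_grounding_dino` is a hypothetical wrapper around a GroundingDINO checkpoint, not an actual library call, and only the PIL cropping logic is concrete.

```python
# Sketch of grounding-based cropping for ambiguous visual queries.
# `detect_with_grounding_dino` is a hypothetical detector wrapper returning
# (x_min, y_min, x_max, y_max) boxes for a text prompt, best box first.
from PIL import Image

def crop_salient_region(image_path: str, prompt: str, margin: float = 0.05) -> Image.Image:
    """Crop the region most relevant to `prompt`, padded by a small margin."""
    image = Image.open(image_path).convert("RGB")
    boxes = detect_with_grounding_dino(image, prompt)   # hypothetical detector call
    if not boxes:
        return image                                    # fall back to whole-image search
    x0, y0, x1, y1 = boxes[0]                           # highest-scoring box
    pad_w, pad_h = margin * image.width, margin * image.height
    return image.crop((
        max(0, x0 - pad_w), max(0, y0 - pad_h),
        min(image.width, x1 + pad_w), min(image.height, y1 + pad_h),
    ))

# The resulting crop is sent to the image-search tool instead of the full frame.
```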
6. Mathematical Objectives and Training Losses
Relevant mathematical formulations used during training include:
- Causal Language Modeling objective (SFT):

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\left(y_t \mid y_{<t},\, x\right)$$

Here, $T$ is the target sequence length; $x$ denotes the multimodal inputs; $y = (y_1, \dots, y_T)$ denotes the target token sequence (reasoning and tool calls); and $\pi_\theta$ is the conditional model distribution.
- Group Relative Policy Optimization (GRPO, RL stage):

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(r_i(\theta)\,A_i,\ \operatorname{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,A_i\Big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big)\right]$$

where $r_i(\theta) = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ is the probability ratio for rollout $i$ within a group of $G$ rollouts on query $q$, $A_i$ is the group-relative advantage, and the term $\beta\, D_{\mathrm{KL}}(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}})$ regularizes divergence from the reference (supervised) distribution.
This mathematical design both incentivizes correct answers and penalizes excessive deviation from efficient, structured search and reasoning patterns.
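The sketch below shows how these two objectives are commonly instantiated in PyTorch: masked next-token cross-entropy for the SFT loss, and group-normalized advantages with a clipped probability ratio for the GRPO surrogate. Reward design, KL estimation, and hyperparameters here are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative PyTorch sketch of the SFT and GRPO terms above.
# Reward shaping, KL handling, and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over target tokens only.

    logits: (B, T, V); labels: (B, T) with -100 on prompt/padding positions.
    """
    shift_logits = logits[:, :-1, :]          # predict token t+1 from prefix up to t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,                    # excludes masked (non-target) positions
    )

def grpo_surrogate(logp_new: torch.Tensor,
                   logp_old: torch.Tensor,
                   rewards: torch.Tensor,
                   clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy term for one group of G rollouts (KL penalty added separately)."""
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative A_i
    ratio = torch.exp(logp_new - logp_old)                            # r_i(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()                       # maximize, minus beta * KL
```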
7. Comparison and Context within Multimodal VQA Research
In comparison to datasets such as MultiModalQA (Talmor et al., 2021) and FactualVQA from MMSearch-R1 (Wu et al., 25 Jun 2025), DeepMMSearchVQA distinguishes itself via:
- Automated pipeline with real web search integration: It does not rely primarily on pre-existing knowledge corpora or handcrafted modalities.
- Explicit tool call and reasoning annotation: The annotation scheme mirrors practical deployment needs, where a model must decide dynamically which external evidence to recruit.
- Support for self-reflection and self-correction: Beyond static reasoning or single-hop search, the dataset enforces iterative, multi-hop, and adaptive information gathering.
A plausible implication is that models trained on DeepMMSearchVQA will be better equipped for dynamic, open-world knowledge tasks, with efficient tool use and reasoned decision-making—addressing the rigidities and inefficiencies observed in previous RAG and search agent approaches.
DeepMMSearchVQA is a targeted dataset for advancing multimodal VQA systems, offering a comprehensive, structured, and dynamic resource for training models in on-demand, real-world web search and multimodal reasoning contexts. Its richness in multi-turn dialogue and explicit reasoning, coupled with authentic data retrieval workflows, makes it integral to state-of-the-art multimodal agent design and optimization as demonstrated in DeepMMSearch-R1 (Narayan et al., 14 Oct 2025).