Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Abstract: Multimodal LLMs (MLLMs) have achieved remarkable success across a broad range of vision tasks. Constrained by the capacity of their internal world knowledge, however, prior work has proposed augmenting MLLMs with ``reasoning-then-tool-call'' access to visual and textual search engines, yielding substantial gains on tasks that require extensive factual information. These approaches typically define multimodal search in a naive setting, assuming that a single full-image or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. Building on these observations, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs, as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
Explain it Like I'm 14
Overview
This paper is about teaching AI systems that can see images and read text (called multimodal LLMs, or MLLMs) to do “deep research” on the web. The authors show that today’s AIs often fail when they try one simple search and stop. Their new approach, called Vision-DeepResearch, helps the AI search in many steps, look at different parts of an image, try multiple queries, and gather trustworthy evidence before answering a question.
Key Questions the Paper Tries to Answer
- Why do current AIs miss the right information when searching with images and text?
- How can we make an AI search more like a careful human—step by step, across different sources, and focusing on the right details?
- Can training the AI with better data and practice (including rewards for correct answers) make it consistently better at fact-based questions about images?
How They Did It (Methods Explained Simply)
Think of the AI like a detective:
- It doesn’t just look at the whole picture once. It zooms into different parts (like a person’s face, a logo, or a sign) at different sizes. This is “multi-entity, multi-scale visual cropping.”
- It tries many searches (both image-based and text-based), visits webpages, and summarizes what it finds.
- It keeps going until a “judge” model decides there’s enough proof to move on.
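The "zoom into different parts at different sizes" idea can be sketched as plain coordinate arithmetic. This is an illustrative sketch only, not the paper's implementation: the function name, default scale factors, and box format are assumptions.

```python
def multi_scale_crop_boxes(img_w, img_h, boxes, scales=(1.0, 1.5, 2.5)):
    """For each proposed entity box (x0, y0, x1, y1), produce crop boxes at
    several scales by expanding the box around its center, clamped to the
    image bounds. Each resulting crop can then be sent to a visual search
    engine as a separate query."""
    crops = []
    for (x0, y0, x1, y1) in boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2          # box center
        bw, bh = x1 - x0, y1 - y0                      # box width/height
        for s in scales:
            crops.append((
                max(0.0, cx - bw * s / 2),
                max(0.0, cy - bh * s / 2),
                min(float(img_w), cx + bw * s / 2),
                min(float(img_h), cy + bh * s / 2),
            ))
    return crops
```

With one entity box and three scales, this yields three candidate crops, so an image with several proposed entities fans out into many parallel search queries.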
Here’s how the training process works:
- Building better practice questions:
- The team creates tricky visual questions (VQA) that need real-world facts. Instead of easy questions, they add “obfuscation” (making the question less obvious) so the AI has to connect multiple clues—like a real multi-hop puzzle.
- Example: Instead of “What’s the cat’s name?” it becomes “The cat’s owner works at A, and the owner’s daughter studies at B. What is the cat’s name?”
- Bridging images to text:
- The AI describes the image in words and then uses a strong text-based deep-research model to continue long, careful search steps. This helps transfer “long-horizon” skills (many steps of think–search–check) from text to vision.
- Two-stage training:
- Supervised Fine-Tuning (SFT): The AI reads lots of example “trajectories” (step-by-step search sessions) and learns the pattern: think → choose tools (search, visit pages, summarize) → collect evidence → answer.
- Reinforcement Learning (RL): The AI practices in a real online search environment. If its final answer is correct, it gets a reward; if not, it doesn’t. This encourages smarter, longer (but efficient) searching over time.
- Practical tricks to keep training stable:
- Stop loops if the AI repeats itself.
- Ignore broken or very messy runs so they don’t poison learning.
- Use fast, parallel tool calls so trying many crops/queries doesn’t slow everything down.
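One of the stability tricks above, stopping loops when the AI repeats itself, can be sketched as a simple repetition detector over the sequence of tool calls. The function name and threshold are hypothetical; the paper does not specify its exact detector.

```python
def is_looping(tool_calls, max_repeats=3):
    """Flag a rollout for early termination when the agent issues the same
    tool call max_repeats times in a row, a crude sign it is stuck in a
    loop. Flagged (and otherwise broken) runs can then be masked out of
    the training loss so they do not poison learning."""
    run = 1
    for prev, cur in zip(tool_calls, tool_calls[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeats:
            return True
    return False
```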
Helpful terms in everyday language:
- “Hit-rate problem”: When your search doesn’t find the right webpage or image, even if you tried a reasonable query. The paper fixes this by trying more focused, varied searches.
- “ReAct (reasoning-then-tool-call)”: The AI thinks first, then uses tools (like search engines), then thinks again based on what it finds, and repeats.
- “Long-horizon”: The AI can perform dozens of steps and hundreds of tool uses, like a thorough investigator, instead of giving up after a few tries.
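The "ReAct" term above describes a control loop: think, call a tool, read the result, repeat until ready to answer. A minimal sketch, assuming a hypothetical `model.generate` that returns a thought plus an action, and a `tools` dict of callables; the real system's interfaces differ.

```python
def react_loop(model, tools, question, max_steps=40):
    """Think -> tool call -> observe, repeated until the model emits a
    final answer or the step budget runs out."""
    context = [("user", question)]
    for _ in range(max_steps):
        thought, action, args = model.generate(context)
        context.append(("assistant", thought))
        if action == "answer":
            return args                      # final answer string
        observation = tools[action](args)    # e.g. a text or visual search
        context.append(("tool", observation))
    return None                              # gave up within the budget
```

Long-horizon agents simply run this loop for many more steps (dozens) with many more tool calls (hundreds) than earlier single-shot search pipelines.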
Main Findings and Why They’re Important
The new Vision-DeepResearch system consistently beats other open and closed AI systems on six tough benchmarks that require accurate facts and careful visual understanding. It performs well even with smaller models (8B parameters) and gets even better when scaled to 30B.
Key takeaways:
- Looking at different parts of an image (cropping) plus doing text searches works best. Cropping gives precise visual clues; text search brings in broader facts.
- Moving from simple training (SFT) to practice with rewards (RL) further improves accuracy.
- The model handles “noisy” real-world search better—where the first try often fails—by trying many smart steps instead of stopping early.
In short, Vision-DeepResearch makes AI better at answering complex, fact-heavy questions about images by acting more like a patient, methodical researcher.
What This Means for the Future
This research could lead to:
- Smarter assistants that can verify facts and interpret images from the web (news photos, event posters, product pages).
- Better tools for students and teachers to explore visual information with trustworthy sources.
- More reliable AI agents for browsing, shopping, and visual tasks (like recognizing items or reading complex charts).
The team plans to release code, and they expect even more gains by scaling RL training further. Overall, Vision-DeepResearch shows a strong path toward AI that can truly “do its homework” on the visual web: searching widely, thinking deeply, and backing answers with solid evidence.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, framed concretely for future investigation:
- Quantitative characterization of the “hit-rate problem”: no systematic measurement of retrieval success vs. crop scale, entity granularity, and engine type; absent per-query hit-rate metrics and variance analyses across visual/text search.
- Reproducibility of the search stack: the specific visual and textual search engines, their configurations (APIs, settings, rate limits), and any pre/post-processing are not documented, hindering replication and controlled comparisons.
- Unspecified “text-based DeepResearch foundation LLM”: the identity, capabilities, and version of the foundation LLM used for trajectory bridging are not disclosed, limiting reproducibility and ablation of its contribution.
- Reliance on ground-truth-conditioned judging in the data pipeline: the visual search termination judge consults the correct answer, which is infeasible at inference time; a practical, evidence-based termination criterion is not proposed or evaluated.
- Lack of provenance and grounding metrics: evaluation focuses on accuracy, with no measures of evidence attribution (e.g., citation correctness, page/source reliability) or visual-text alignment fidelity, especially on provenance-aware benchmarks.
- Fidelity of image-to-text “bridging” descriptions: the accuracy and completeness of generated descriptions (and their impact on downstream text-only deep-research quality) are not quantified; robustness to description errors is unexplored.
- Box proposal quality: bounding boxes are generated by an MLLM without benchmarking against classical detectors; the trade-off between proposal quality, number of crops, and retrieval performance is not analyzed.
- Search efficiency and user-facing cost: no reporting of wall-clock time, API/tool-call counts, and monetary cost per question at inference; latency and throughput impacts of “hundreds of engine interactions” remain unquantified.
- RL scaling laws and budgets: RL is acknowledged as under-scaled; the effects of longer horizons, larger datasets, and different reward schedules on performance and stability are not studied, nor are compute budgets reported.
- Reward shaping limitations: an accuracy-only terminal reward may incentivize shortcutting and inefficient tool use; no exploration of dense/intermediate rewards (e.g., evidence retrieval success, provenance compliance, tool efficiency).
- Reliability and bias of LLM-as-Judge: the correctness, calibration, and failure modes of the judge model (both for trajectory selection and RL rewards) are not validated; human evaluation or multi-judge consensus is absent.
- Robustness to adversarial or misleading content: behavior under deceptive webpages, near-duplicate images, SEO spam, and dynamic content changes is not stress-tested; error recovery strategies are unspecified.
- Generalization to multi-image and multi-modal inputs: the method is evaluated primarily on single-image tasks; extensions to multi-image, video, audio, or GUI environments are untested.
- Multilingual search and cross-lingual grounding: performance and tooling in non-English settings (queries, webpages, OCR) are not addressed; language-specific search engine behaviors remain unexplored.
- Dataset selection bias and coverage: use of partial splits (e.g., VDR test-mini, MMSearch-Plus single-image subset) and limited random samples (e.g., 300 per benchmark) may not reflect full difficulty or diversity; no stratified analyses.
- Domain shift from synthetic “fuzzy multi-hop” to real user queries: the realism of obfuscated/constructed questions vs. organic queries is not validated; potential templating artifacts or shortcut pathways are not measured.
- Evidence of overfitting or leakage: interactions with real search during RL raise risks of memorizing benchmark-specific pages; no safeguards or analyses of overlap between training interactions and test sources.
- Failure analysis and error taxonomy: the paper lacks a breakdown of common failure modes (e.g., proposal mislocalization, wrong entity linking, stale/incorrect sources, formatting/tool-call errors) and targeted mitigations.
- Comparative fairness of baselines: “same agentic-reasoning setting” is asserted, but differences in tool ecosystems, prompt formats, context windows, and API constraints across closed/open models are not audited.
- Security and safety of agentic behaviors: website visitation and Python execution are mentioned without sandboxing, content filtering, or security policies; potential misuse, data exfiltration, and harmful actions are not addressed.
- Ethical and legal considerations: scraping/summarizing web content raises copyright, privacy, and attribution concerns; compliance strategies (robots.txt, rate-limiting, source licensing) are not discussed.
- Calibration of visual search depth: the “obfuscated termination” strategy and judge criteria are not ablated; how to set and adapt depth thresholds (turns, crop counts) for different tasks and engines remains open.
- Trade-off policies for multi-entity/multi-scale exploration: no policy learning analysis for choosing crop sizes/entities under resource constraints; optimal scheduling of breadth vs. depth is unstudied.
- Extensibility across base models and sizes: RL was applied to the 30B model but not the 8B; how benefits transfer across architectures, parameter scales, and context length capabilities is unclear.
Glossary
- Agentic workflows: Iterative reasoning pipelines where models plan, act with tools, and reflect over multiple steps. "compared to prior multimodal deep-research MLLMs and agentic workflows."
- Answer obfuscation: A technique that increases reasoning depth by hiding or chaining relationships around the true answer. "Answer obfuscation, which increases the required reasoning depth by chaining relations around the answer (e.g., "What is the name of the teacher of the cat owner's daughter?")."
- Asynchronous rollout pipeline: A high-throughput, multi-threaded RL execution pipeline that returns tool observations asynchronously to avoid blocking. "we design a high-throughput multi-threaded asynchronous rollout pipeline building on rLLM framework (Tan et al., 2025)."
- Autoregressive cross-entropy loss: The standard next-token prediction objective used to train sequence models. "The model is trained by minimizing the standard autoregressive cross-entropy loss (CE loss)."
- Autoregressive supervision: Training that supervises the model token-by-token across reasoning, tool calls, and final answers. "Autoregressive supervision is applied at each step of every trajectory, covering the reasoning, <tool_call> actions, and the final <answer>."
- BF16 vs. FP16: Two floating-point formats with different numerical properties used to train large models. "3. BF16 vs. FP16."
- Cold-start supervision: Initial supervised training to instill desired behaviors before RL fine-tuning. "internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training,"
- Context window: The maximum sequence length a model can process at once. "passing it directly to the MLLM can easily exceed the context window."
- Deep-research foundation LLM: A text-only LLM specialized in long-horizon ReAct-style search and reasoning. "To leverage the strong ReAct-style capability of the text-based deep-research foundation LLM, we bridge the above visual trajectory Cvision to the text-only context."
- Event loop: The control flow that orchestrates asynchronous tasks; can stall under synchronous rollouts. "naive synchronous rollouts can severely stall the event loop."
- Fuzzy Multi-hop VQA Synthesis: A pipeline to create complex, multi-hop visual QA problems via entity and answer obfuscation. "Fuzzy Multi-hop VQA Synthesis."
- Group Relative Policy Optimization (GRPO): An RL algorithm variant that optimizes policies using group-relative advantages. "we apply RL training with Group Relative Policy Optimization (GRPO) (Shao et al., 2024; Guo et al., 2025)"
- Hit-rate problem: The challenge that queries often fail to retrieve required evidence due to noise and variability. "This overlooks a critical challenge in realistic, noisy search engines: the hit-rate problem."
- Image-description-based context window sharing: Bridging visual and text trajectories by replacing images with detailed textual descriptions within the same context. "seamlessly transfer their long-horizon ReAct-style reasoning to the visual domain via image-description-based context window sharing."
- Leave-One-Out trick: A stabilization technique in RL that removes one sample when computing advantages to reduce variance. "with Leave-One-Out trick (Ahmadian et al., 2024; Luo; Chen et al., 2025)"
- LLMs-as-Judge: Using an LLM to evaluate answers and provide reward signals during RL training. "For reward design, we adopt LLMs-as-Judge paradigm,"
- Long-Horizon ReAct-Style Trajectory: Extended sequences of planning, acting, and reflecting with many tool calls. "1.1. Long-Horizon ReAct-Style Trajectory"
- Multi-entity and Multi-scale Visual Cropping and Search: A retrieval strategy that probes multiple regions and scales to improve evidence hit-rate. "Multi-entity and Multi-scale Visual Cropping and Search."
- Obfuscated termination strategy: A method that controls visual retrieval depth by masking cues about when to stop. "We also introduce an obfuscated termination strategy to control the depth of visual retrieval."
- Queued scheduler: A task dispatcher that enqueues rollout jobs to maximize throughput. "It dispatches tasks via a queued scheduler"
- rLLM framework: A framework for post-training language agents supporting RL rollouts and tool integration. "building on rLLM framework (Tan et al., 2025)."
- ReAct: A paradigm that synergizes reasoning and acting via iterative tool use. "ReAct (Yao et al., 2022)"
- Rejection sampling: Selecting only trajectories whose final outputs match ground truth for training. "We then apply rejection sampling to select trajectories Cmultimodal for cold-start training,"
- Reinforcement learning (RL): Training via interaction and rewards to refine long-horizon decision-making. "reinforcement learning (RL)."
- Retrieval-augmented generation (RAG) workflows: Pipelines that augment generation with retrieved evidence but may be naive for multimodal tasks. "while naive RAG workflows provide limited gains on the reported settings."
- Tool pool: A managed set of tools that can be invoked concurrently within a single action. "and maintains a tool pool to support concurrent multi-tool calls within a single action"
- Trial-and-error process: An iterative query refinement approach to handle noisy search environments. "it should be modeled as a trial-and-error process"
- Vision-Pipeline: The visual toolchain that executes image search, page visits, and summarization. "O = Vision-Pipeline(Atv), (1)"
- Visual region proposals: Candidate bounding boxes generated to localize entities for search. "generate multi-entity, multi-scale visual region proposals to efficiently probe visual search engines"
- Visual search tool: A tool that takes cropped image regions and returns matched webpage URLs. "The visual search tool takes a cropped image region as input and returns the matched webpage URL,"
- Website summary tool: A tool that condenses fetched page content and verifies image correspondences. "the visual search tool, website visit tool and website summary tool."
- Website visit tool: A tool that fetches page content in markdown for analysis. "the visual search tool, website visit tool and website summary tool."
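Several glossary entries above (GRPO, the Leave-One-Out trick, LLMs-as-Judge) combine into one small computation: turning a group of per-rollout judge rewards into group-relative advantages. A minimal sketch of the leave-one-out baseline under the assumption of scalar rewards per rollout group; this illustrates the idea, not the paper's training code.

```python
def leave_one_out_advantages(rewards):
    """Group-relative advantages with a leave-one-out baseline: each
    rollout's reward is compared against the mean reward of the *other*
    rollouts in the same group, which reduces variance without biasing
    the gradient estimate."""
    G = len(rewards)
    if G < 2:
        raise ValueError("need at least two rollouts per group")
    total = sum(rewards)
    return [r - (total - r) / (G - 1) for r in rewards]
```

For example, with binary judge rewards [1, 0, 0, 0] the correct rollout gets a positive advantage (1.0) while each incorrect one gets a small negative advantage (-1/3), so the policy is pushed toward the trajectory the judge accepted.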
Practical Applications
Immediate Applications
The following applications can be deployed now by integrating the paper’s multi-entity, multi-scale image search, long-horizon ReAct-style agentic reasoning, and the provided training and rollout recipes into existing systems.
- Newsroom image-centric fact-checking and provenance verification — Sectors: media, policy. Enables editors to crop multiple regions from an image (e.g., people, logos, venues), run parallel reverse image searches, and bridge verified visual hits into textual web evidence for claim checking. Tools/products/workflows: “Box Image Search” UI, parallel crop-to-search executor, web visit and summarization pipeline, LLM-as-Judge for answer verification. Assumptions/dependencies: reverse image search APIs and web crawling allowed by ToS; human-in-the-loop for editorial standards; logging for auditability.
- Social media trust & safety triage for multimodal misinformation — Sectors: platform trust & safety. Automates triage by isolating salient entities in user-shared photos and iteratively searching to detect reused or doctored images and mismatched contexts. Tools/products/workflows: moderation agent workflow with multi-scale cropping, evidence aggregation, provenance report. Assumptions/dependencies: scalable API access; privacy compliance; rate-limit management; robust false-positive mitigation.
- E-commerce product identification and attribute reconciliation from user-uploaded photos — Sectors: retail, marketplaces. Matches partial product views (tags, labels, packaging features) to catalog entries and long-tail specs using iterative crop-to-search plus text search. Tools/products/workflows: attribute resolver service, catalog linker, parallel crop search engine connector. Assumptions/dependencies: high-quality catalog metadata; handling of visually similar SKUs; brand/IP compliance.
- Brand and influencer monitoring via visual cues — Sectors: marketing, PR. Detects brand presence (logos, uniforms, event setups) from images and links them to campaigns, events, or mentions across web pages. Tools/products/workflows: brand monitoring agent with multi-entity cropping, web visit summarizer, alerting dashboard. Assumptions/dependencies: coverage of relevant sites; legal review for surveillance boundaries; model guardrails for ambiguity.
- Customer support assistants resolving issues from screenshots/photos — Sectors: software/IT support. Understands GUI screenshots, isolates UI components, searches docs/forums, and returns grounded troubleshooting steps. Tools/products/workflows: GUI-aware agent with crop-to-search over UI elements, text bridging to knowledge bases, step-by-step ReAct reasoning. Assumptions/dependencies: access to product documentation; sensitive-data redaction in screenshots; permissioned browsing.
- Enterprise digital asset management with grounded tagging — Sectors: enterprise software, DAM. Generates semantically rich, provenance-grounded tags by linking cropped entities to verified sources (e.g., people, places, artifacts). Tools/products/workflows: image description and text-bridging pipeline, evidence-backed tagger, audit trail. Assumptions/dependencies: content licensing; scalable indexing; downstream search compatibility.
- Scientific figure and diagram reference finder — Sectors: academia, education. Crops panels, plots, or labels from figures to find source papers/web entries and compile citations. Tools/products/workflows: figure cropper, reverse image search connector, citation generator with LLM-as-Judge validation. Assumptions/dependencies: paywalled content handling; citation style compliance; accurate figure-to-source matching.
- Accessibility alt-text generation with provenance grounding — Sectors: accessibility, public sector. Produces descriptive alt-text linked to verified public references (museum pages, event archives), reducing hallucinations. Tools/products/workflows: web visit summarizer conditioned on visual hits, provenance-attached alt-text generator. Assumptions/dependencies: careful summarization to avoid PII; content licenses; human QA for critical pages.
- Open-source adoption of the training and rollout recipes — Sectors: AI research, ML engineering. Reuses the authors’ code to add multi-turn tool-use (SFT) and stabilize long-horizon RL with asynchronous rollouts and LLM-as-Judge. Tools/products/workflows: Vision-DeepResearch codebase, rLLM-based asynchronous rollouts, GRPO training with safeguards (format error handling, repetition detectors, masked trajectories). Assumptions/dependencies: compute (long contexts up to 64K tokens), reliable tool APIs, reproducible data pipelines.
Long-Term Applications
The following applications require further research, productization, scaling, or domain adaptation (e.g., stronger safety, reliability, and compliance), but are directly enabled by the paper’s long-horizon multimodal deep-research paradigm.
- General multimodal research assistants for investigative journalism — Sectors: media, policy. Autonomous agents that iteratively crop, search, cross-verify, and compile evidence chains for complex investigations. Tools/products/workflows: end-to-end agent with hundreds of tool interactions, provenance graph builder, editorial review interface. Assumptions/dependencies: robust reliability metrics, legal-safe crawling, standardized evidence logging.
- Autonomous web agents for GUI navigation and task completion using screenshots — Sectors: RPA, enterprise IT. Agents that understand screenshots, identify UI affordances, and execute multi-step workflows with web search and documentation grounding. Tools/products/workflows: GUI agent combining vision grounding with ReAct planning, tool ecosystem for secure actions. Assumptions/dependencies: strong safety/alignment; permissioned automation; resistance to anti-bot measures.
- Cross-modal search engine redesign — Sectors: search, software. Native support for multi-entity, multi-scale image queries blended with multi-hop text search and provenance-aware ranking. Tools/products/workflows: “Box Search” API, parallel crop indexing, multimodal ranking with evidence aggregation. Assumptions/dependencies: search infra overhaul; new UX patterns; user education on multi-crop querying.
- Real-time video deep-research (live streams, sports, security) — Sectors: security, sports analytics, broadcasting. Multi-frame cropping and iterative search to identify people, events, venues, and context during live content. Tools/products/workflows: streaming crop-to-search engine, low-latency summarization, alerting. Assumptions/dependencies: compute budgets; latency constraints; privacy and legal compliance for live analysis.
- Robotics perception-to-knowledge retrieval — Sectors: robotics, manufacturing. Robots use crop-to-search to map unknown objects to external knowledge (manuals, specs, safety guidance), enabling adaptive behaviors. Tools/products/workflows: on-device visual entity localization, networked knowledge retriever, action planner. Assumptions/dependencies: network connectivity, safety-critical validation, domain-specific ontologies.
- Healthcare clinical decision support grounded in medical imagery — Sectors: healthcare. From imaging snapshots (e.g., device labels, radiology figures), agents link to vetted clinical guidelines or device recall notices. Tools/products/workflows: regulated domain search connectors, provenance logs, clinician-in-the-loop interfaces. Assumptions/dependencies: HIPAA/GDPR compliance; domain adaptation for medical content; liability frameworks; rigorous evaluation.
- Legal-grade evidence chains with provenance and citations — Sectors: legal, compliance. Standardized workflows that log visual crops, search queries, page visits, summaries, and model decisions to produce auditable chains of custody. Tools/products/workflows: immutable evidence ledger, watermarking, citation generator with judge models. Assumptions/dependencies: standards bodies, admissibility criteria, tamper-evident logging.
- Education platforms teaching iterative multimodal search literacy — Sectors: education. Curricula and tools that train students to refine image queries, perform multi-hop search, and synthesize cross-modal evidence. Tools/products/workflows: sandboxed agents with transparent trajectories, rubric-aligned evaluation. Assumptions/dependencies: classroom integration; guardrails for safe web use; robust assessment design.
- Multimodal e-discovery and corporate investigations — Sectors: compliance, finance. Large-scale ingestion of images, PDFs, screenshots, and linked web content to find relevant evidence and relationships. Tools/products/workflows: scalable crop-to-search pipelines, provenance-aware search, investigator dashboards. Assumptions/dependencies: privacy controls; legal oversight; high-throughput tooling.
- Open benchmarks and standards for agentic multimodal deep-research — Sectors: academia, policy. Community-maintained datasets, metrics (hit-rate under noise, trajectory efficiency), and protocols for long-horizon multimodal agents. Tools/products/workflows: VDR-Bench expansion, provenance-aware evaluation, safety test suites. Assumptions/dependencies: consortium governance; dataset licensing; reproducibility infrastructure.
- Consumer “photo memory” assistants with verified context — Sectors: consumer software. Personal agents that link users’ photos to events, places, and people with grounded references and privacy-preserving summaries. Tools/products/workflows: on-device description + selective web bridging, user consent flows, privacy controls. Assumptions/dependencies: strong privacy; opt-in data handling; efficient on-device inference.