Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Published 11 Jan 2026 in cs.CV and cs.AI | (2601.06943v1)

Abstract: In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal LLMs under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

Summary

The paper introduces VideoDR, a benchmark that uniquely integrates multi-frame visual anchors and open-web search for comprehensive agentic video QA evaluation.
The benchmark enforces multi-hop reasoning that demands joint analysis of video and web evidence, exposing challenges like goal drift and long-term state maintenance.
Comparative experiments reveal that high-capacity closed-source models outperform open-source ones, underscoring persistent issues in multimodal evidence integration.

A Benchmark for Agentic Video Deep Research: VideoDR

Introduction

VideoDR targets a high-fidelity evaluation of agentic, open-domain video question answering where models must synthesize information from both dynamic video content and the open web. Unlike prior closed-evidence video QA or deep research benchmarks that use text-only queries, VideoDR uniquely integrates multi-frame visual anchor extraction, interactive web retrieval, and multi-hop evidence-based reasoning. The benchmark, annotated through rigorous multi-stage quality control, systematically excludes instances solvable with either video or web content alone, enforcing tasks that demand joint video-web evidence integration. This approach specifically exposes major, unsolved challenges in agentic multimodal reasoning, including persistent bottlenecks in goal drift and long-horizon consistency in agentic architectures.

Figure 1: Overview of the VideoDR construction pipeline.

Problem Formulation and Benchmark Construction

VideoDR formalizes task instances as $(V, Q; S) \to A$, where, given a video $V$, a natural language question $Q$, and a browser-based search tool $S$, agents must extract cross-temporal visual anchors, iteratively interact with the open web, and construct a verifiable answer $A$. The construction pipeline involves stratified video selection from diverse sources and semantic domains, aggressive filtering to exclude trivial or non-verifiable facts, and expert-written multi-hop questions that strictly require multi-frame reasoning and interleaved search.
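The mapping $(V, Q; S) \to A$ can be made concrete with a small type sketch. This is only an illustration under assumptions: the names `TaskInstance`, `SearchTool`, `Agent`, and `run_instance` are hypothetical and do not come from the paper, and the search tool is reduced to a function from a query string to a list of text snippets.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases, introduced only for illustration.
SearchTool = Callable[[str], List[str]]        # S: query -> result snippets
Agent = Callable[[str, str, SearchTool], str]  # (V, Q, S) -> A


@dataclass(frozen=True)
class TaskInstance:
    video_path: str   # V: the input video
    question: str     # Q: the natural language question
    gold_answer: str  # reference answer against which A is checked


def run_instance(task: TaskInstance, agent: Agent, search: SearchTool) -> str:
    """Apply the mapping (V, Q; S) -> A to a single instance."""
    return agent(task.video_path, task.question, search)
```

Passing the search tool in as an argument, rather than hard-coding it, mirrors the formulation in which $S$ is a separate resource that the agent may call as often as it needs.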

An example task illustrates the required capabilities: recognizing a museum from video cues, then finding and outputting a specific exhibit's accession number via multi-hop web search. The design expressly precludes single-frame sufficiency and ensures web search alone is insufficient, demanding joint spatiotemporal analysis and text-based retrieval.

Figure 2: An example of the VideoDR task, highlighting cross-modal, multi-hop reasoning anchored by multi-frame video cues and open-web search.

Data Statistics and Structural Properties

The final dataset comprises 100 samples balanced across six domains: Daily Life, Economics, Technology, Culture, History, and Geography. Questions are concise, averaging 25.54 tokens, which keeps input complexity low and places the emphasis on reasoning over multimodal evidence. Video durations follow a long-tailed distribution, supporting evaluation of both short- and long-horizon anchor localization.

Figure 3: Data statistics of VideoDR, including domain balance, question length distribution centered at 25 tokens, and a long-tailed video duration profile.

Experimental Paradigms and Baselines

VideoDR supports comparison between two distinct agent architectures:

Workflow paradigm: A two-stage system first extracts structured multi-frame video cues, then passes these as input for search and reasoning, externalizing visual cues to a stable textual intermediate representation.
Agentic paradigm: An end-to-end agent receives video and question as input, autonomously performing perception, query generation, search, evidence integration, and answer synthesis in a single execution loop, without explicit intermediate state persistence.
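The structural difference between the two paradigms can be sketched as two control loops. This is a simplified sketch under assumptions, not the paper's implementation: `extract_cues`, `reason`, and `step` are hypothetical callables standing in for model calls, and the Workflow stage is reduced to one search per cue.

```python
from typing import Callable, List, Optional, Tuple

SearchTool = Callable[[str], List[str]]  # hypothetical: query -> snippets


def run_workflow(video: str, question: str,
                 extract_cues: Callable[[str, str], List[str]],
                 search: SearchTool,
                 reason: Callable[[str, List[str], List[str]], str]) -> str:
    # Stage 1: visual cues are externalized once as text and then stay fixed.
    cues = extract_cues(video, question)
    # Stage 2: search and reasoning operate on the stable textual cues.
    evidence: List[str] = []
    for cue in cues:
        evidence.extend(search(cue))
    return reason(question, cues, evidence)


def run_agentic(video: str, question: str,
                step: Callable[[str, str, list], Tuple[str, str]],
                search: SearchTool,
                max_steps: int = 8) -> Optional[str]:
    # Single loop: the model decides at every step whether to search or answer.
    history: list = []
    for _ in range(max_steps):
        action, payload = step(video, question, history)
        if action == "answer":
            return payload
        history.append((payload, search(payload)))
    return None  # step budget exhausted without an answer
```

In the Workflow sketch the cues are computed once and passed to every later stage, whereas in the Agentic sketch the only record of the video anchors is whatever the model carries forward through `history`, which is where drift can enter.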

Mainstream MLLMs are benchmarked under both paradigms, spanning closed-source (GPT-4o, Gemini-3-pro-preview, GPT-5.2) and open-source (MiniCPM-V 4.5, Qwen3-Omni-30B-A3B, InternVL3.5-14B) models. An LLM-as-judge protocol (DeepSeek-V3) is employed to ensure robust, semantically-aligned evaluation.
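An LLM-as-judge protocol of this kind can be sketched as follows. The prompt wording, the CORRECT/INCORRECT verdict format, and the `judge_fn` callable are all assumptions made for illustration; the paper's actual judge prompt and the DeepSeek-V3 API call are not reproduced here.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical prompt template; the benchmark's real judge prompt may differ.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {gold}\n"
    "Model answer: {pred}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def judge_answer(question: str, gold: str, pred: str,
                 judge_fn: Callable[[str], str]) -> bool:
    """Ask the judge model whether pred matches gold semantically."""
    prompt = JUDGE_TEMPLATE.format(question=question, gold=gold, pred=pred)
    verdict = judge_fn(prompt).strip().upper()
    # startswith avoids matching the substring "CORRECT" inside "INCORRECT".
    return verdict.startswith("CORRECT")


def accuracy(records: Iterable[Tuple[str, str, str]],
             judge_fn: Callable[[str], str]) -> float:
    """Fraction of (question, gold, pred) triples judged correct."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(judge_answer(q, g, p, judge_fn) for q, g, p in records)
    return hits / len(records)
```

Here `judge_fn` would wrap whatever client is used to query the judge model; keeping it as a plain callable makes the scoring logic testable without network access.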

Main Results: Performance Stratification and Bottleneck Analysis

The results expose a clear stratification of model capabilities. The top closed-source models (Gemini-3-pro-preview, GPT-5.2) reach peak accuracies of 76% and 69%, respectively, considerably outperforming open-source systems, which peak at 37%. Human upper-bound accuracy is estimated at 50.4%, underscoring task challenge and annotation fidelity.

Difficulty, video duration, and semantic domain stratifications provide insight into core agentic reasoning barriers:

Difficulty: As question difficulty (measured by human success rate) increases, all models and humans exhibit sharply declining performance, reflecting the fragility of extended evidence chains.
Agentic vs Workflow: The agentic setting yields significant gains only for high-capacity models on more complex samples, but can degrade mid-tier model performance on long or difficult tasks due to goal drift, that is, the inability to reliably maintain and exploit video anchors across multi-hop retrieval.
Durational effects: For longer videos, agentic models require robust long-term state maintenance. Gemini-3-pro-preview and GPT-5.2 leverage initial cue retention to excel, whereas lower-performing models suffer from state drift and increased error propagation.
Domain effects: The agentic advantage is most pronounced for Technology, where precise multimodal-to-query translation is critical. In more ambiguous domains (e.g., Geography), agentic search increases drift and degrades results unless visual anchors are highly distinctive.

Figure 4: Human solvability rates across difficulty levels, supporting stratified analysis of agentic performance bottlenecks.
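The stratified analysis described above amounts to grouping per-sample outcomes by an attribute (difficulty level, duration bucket, or domain) and computing accuracy within each group. A minimal sketch follows, assuming each record is a mapping with a boolean `correct` field plus the grouping field; the record layout and field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def stratified_accuracy(records: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Accuracy per group, where groups are the distinct values of records[key]."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(bool(record["correct"]))
    return {group: hits[group] / totals[group] for group in totals}
```

Calling `stratified_accuracy(results, "domain")` once per paradigm and comparing the two dictionaries gives the kind of per-domain Agentic versus Workflow comparison discussed in the list above.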

Tool-Use and Error Taxonomy

Tool usage analysis indicates that the number of search/think calls alone does not account for outcome variance; rather, tool-use effectiveness depends on evidence path quality and anchor retention. For example, Gemini-3-pro-preview's heavier search/think usage corresponds to more reliable multimodal evidence integration, whereas for mid- and low-tier models additional tool use often amplifies drift and error rates.

Error analysis confirms that Categorical Error (incorrect anchor categorization and alignment) dominates, increasing in agentic settings without persistent intermediate cues. Numerical errors remain a distinct, persistent challenge across all models, highlighting ongoing limitations in fine-grained information extraction from imperfect, multi-modal web evidence.
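Comparing error categories across paradigms reduces to a per-paradigm tally normalized into shares. The sketch below assumes failed samples have already been labeled as (paradigm, category) pairs; the label strings and the function name `error_shares` are hypothetical, and no benchmark numbers are reproduced.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def error_shares(errors: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """For each paradigm, the fraction of its errors falling in each category."""
    per_paradigm: Dict[str, Counter] = defaultdict(Counter)
    for paradigm, category in errors:
        per_paradigm[paradigm][category] += 1
    return {
        paradigm: {cat: n / sum(counts.values()) for cat, n in counts.items()}
        for paradigm, counts in per_paradigm.items()
    }
```

Normalizing within each paradigm, rather than across all errors, keeps the comparison meaningful when one paradigm simply fails more often than the other.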

Implications and Future Directions

Practically, VideoDR exposes the limits of current agentic MLLMs in long-horizon, information-seeking tasks where precise spatiotemporal grounding and stable query propagation are necessary. The strong performance of Gemini-3-pro-preview and GPT-5.2 underlines the promise of closed-source, large-scale models, but no current architecture reliably solves the state maintenance and goal alignment problems highlighted by the benchmark.

Theoretically, these findings frame open questions in the development of next-generation video agents:

How can agentic systems maintain robust, persistent visual-textual state for long-horizon, multi-modal tasks without drift?
What forms of externalized memory or anchor-persistent intermediate representations most effectively support high-yield, error-resistant multi-hop search?
How can benchmark design further isolate compositional and grounding errors for more targeted architectural innovation?

Conclusion

VideoDR establishes a challenging, systematic benchmark for agentic video deep research, incorporating diverse video categories, cross-frame anchor extraction, open-web retrieval, and multi-hop reasoning. Comprehensive benchmarking and stratified analysis reveal critical inadequacies in current multimodal LLM architectures, especially with regard to anchor propagation and long-horizon reasoning consistency. Progress on VideoDR will require research into state-persistent agentic systems, memory-augmented architectures, and enhanced visual-textual grounding for verifiable, real-world QA.