- The paper introduces an agentic framework that decomposes long video tasks by orchestrating a text-only reasoning model with a suite of targeted multi-modal tools.
- The paper demonstrates significant performance improvements over traditional MLLMs, achieving gains of up to 9.6 percentage points on long video benchmarks while using fewer visual tokens.
- The paper highlights the practical impact of modular integration, enabling cost-effective, scalable, and flexible long video understanding in real-world applications.
Agentic Systems for Long Video Understanding: A Critical Analysis of VideoDeepResearch
VideoDeepResearch proposes a modular, agentic framework for Long Video Understanding (LVU) that departs from the prevailing emphasis on large, monolithic MLLMs with ever-increasing context windows. Instead, the framework orchestrates a text-only large reasoning model (LRM) with a suite of specialized, readily available multi-modal tools, enabling dynamic and selective access to video content. This architectural choice addresses both efficiency and generalization issues inherent in current approaches to LVU.
Conceptual Framework and Methodological Innovations
The core components of VideoDeepResearch include:
- Text-only LRM: Serves as the reasoning engine, decomposing complex tasks, planning tool invocation, and synthesizing final answers.
- Modular Tool Suite (a minimal interface sketch follows this list):
  - Video Clip Retriever: Efficiently locates temporally relevant clips using pre-segmentation and retrieval methods, supporting both textual and multimodal queries.
  - Subtitle Retriever/Extractor: Provides access to temporally localized, dialogue-dependent segments for audio-centric queries.
  - Visual Perceiver: Delegates fine-grained visual understanding of localized clips to established vision-LLMs, restricting their use to precise segments.
  - Video Browser: Performs high-level, global analysis through sparse sampling, akin to rapid human inspection.
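To make the tool abstraction concrete, a minimal interface sketch is shown below; the class names, fields, and call signature are illustrative assumptions rather than the paper's released API.

```python
# Minimal sketch of a common interface the text-only LRM controller could use
# to call any of the four tools uniformly. All names and fields here are
# illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ToolResult:
    tool: str                     # which tool produced the result
    content: str                  # text the LRM can read: clip descriptions, subtitles, answers
    clip_ids: list[str] = field(default_factory=list)  # pre-segmented clips the result refers to


class VideoTool(Protocol):
    """Each tool (clip retriever, subtitle extractor, perceiver, browser) exposes one call."""
    name: str

    def __call__(self, **kwargs) -> ToolResult: ...
```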
A key aspect is the system's iterative interaction: the LRM reasons, invokes tools, folds their results back into its working context, and progressively narrows what information it still needs until the query is resolved. This echoes agentic paradigms seen in recent research on tool-using LLMs, now extended to the LVU domain, where context windows are a primary bottleneck.
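As a rough illustration, that control flow can be written as a plain reason, act, observe loop. Everything in the sketch below is an assumption made for illustration (the `lrm_step` callable, the tool registry, the JSON tool-call format, and the `FINAL:` convention); it is not the released implementation.

```python
# Illustrative controller loop for the reason -> call tool -> observe cycle.
# `lrm_step`, the tool registry, the JSON tool-call format, and the FINAL:
# convention are assumptions of this sketch, not the paper's released code.
import json


def run_agent(question: str, tools: dict, lrm_step, max_turns: int = 10) -> str:
    """Iteratively query the text-only LRM, executing the tool calls it requests."""
    context = [
        {"role": "system", "content": "Answer the question about the video. "
                                      f"Available tools: {sorted(tools)}."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_turns):
        reply = lrm_step(context)              # text-only reasoning model proposes the next step
        context.append({"role": "assistant", "content": reply})

        if reply.startswith("FINAL:"):         # the LRM decides it has gathered enough evidence
            return reply.removeprefix("FINAL:").strip()

        call = json.loads(reply)               # e.g. {"tool": "clip_retriever", "args": {"query": "..."}}
        result = tools[call["tool"]](**call["args"])
        context.append({"role": "user",        # observation fed back into the LRM's context
                        "content": f"[{call['tool']}] {result.content}"})
    return "No final answer within the turn budget."
```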
Implementation Details and Practicality
The authors implement VideoDeepResearch using:
- Pre-segmented 10-second video clips for efficient retrieval (the retrieval path is sketched after this list).
- LanguageBind-large as the retriever backbone and DeepSeek-R1-0528 as the primary LRM.
- Visual perceivers such as Qwen2.5VL-7B and Seed1.5VL-Pro, each processing up to 32 high-resolution frames per segment.
- Subtitles, when available, to enhance both retrieval precision and comprehension for audio queries.
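The retrieval path can be sketched as fixed-length segmentation followed by cosine-similarity search over clip embeddings. In the sketch below, `embed_text` and the pre-computed `clip_embs` stand in for a LanguageBind-style dual encoder and are assumptions of this sketch, not the framework's actual functions.

```python
# Sketch of clip pre-segmentation and text-to-clip retrieval. `embed_text` and
# the pre-computed `clip_embs` stand in for a LanguageBind-style dual encoder;
# both are assumptions of this sketch, not the framework's actual functions.
import numpy as np


def segment_video(duration_s: float, clip_len_s: float = 10.0) -> list[tuple[float, float]]:
    """Cut a video timeline into fixed-length clips (10-second segments by default)."""
    starts = np.arange(0.0, duration_s, clip_len_s)
    return [(float(s), float(min(s + clip_len_s, duration_s))) for s in starts]


def retrieve_clips(query: str, clip_embs: np.ndarray, embed_text, top_k: int = 5) -> list[int]:
    """Return indices of the top-k clips by cosine similarity to the query embedding."""
    q = embed_text(query)                                             # (d,)
    q = q / np.linalg.norm(q)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)  # (n_clips, d)
    scores = c @ q
    return [int(i) for i in np.argsort(-scores)[:top_k]]
```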
The framework is publicly available and fully compatible with open-source and proprietary LLM tooling. Tool interfaces are clearly abstracted, allowing easy substitution of underlying retrieval, perception, or reasoning models as these components advance independently.
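As a hypothetical illustration of that substitution (the factory functions and callable contract below are assumptions, not the framework's code), switching from Qwen2.5VL-7B to Seed1.5VL-Pro only changes which perceiver object is handed to the controller.

```python
# Hypothetical illustration of backend substitution: both perceivers satisfy the
# same callable contract, so the controller loop never changes.
from typing import Callable

# (question, clip_ids) -> answer text produced by a vision-LLM over the clips
PerceiverFn = Callable[[str, list[str]], str]


def make_qwen_perceiver(frames_per_clip: int = 32) -> PerceiverFn:
    def perceive(question: str, clip_ids: list[str]) -> str:
        # ...call Qwen2.5VL-7B on up to `frames_per_clip` frames from each clip...
        return f"[Qwen2.5VL-7B answer over clips {clip_ids}]"
    return perceive


def make_seed_perceiver(frames_per_clip: int = 32) -> PerceiverFn:
    def perceive(question: str, clip_ids: list[str]) -> str:
        # ...call Seed1.5VL-Pro instead; nothing else in the system changes...
        return f"[Seed1.5VL-Pro answer over clips {clip_ids}]"
    return perceive
```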
Empirical Results and Comparative Analysis
VideoDeepResearch demonstrates noteworthy numerical improvements over both proprietary and open-source MLLMs:
| Model | #Frames | MLVU | LVBench | VideoMME (Long) | LongVideoBench | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5VL-7B (baseline) | 128 | 47.4 | 44.8 | 57.6 | 58.0 | 52.0 |
| Qwen2.5VL-7B + RAG | 128 | 48.9 | 47.2 | 58.2 | 58.4 | 53.2 |
| VideoDeepResearch (Qwen2.5VL) | 32 | 55.9 | 50.7 | 72.4 | 64.1 | 60.8 |
| GPT-4o | 384 | 54.9 | 48.9 | 72.1 | 66.7 | 60.6 |
| VideoDeepResearch (Seed1.5VL) | 32 | 64.5 | 55.5 | 76.3 | 70.6 | 66.7 |
Performance claims include:
- Gains of 9.6, 6.6, and 3.9 percentage points over the strongest reported baselines on MLVU, LVBench, and LongVideoBench, respectively.
- With only 32 frames per retrieved clip, the Qwen2.5VL-based variant edges out GPT-4o (384 frames) on average, while the Seed1.5VL-based variant surpasses GPT-4o on every reported benchmark.
- Markedly better efficiency: the system uses 17–25% fewer visual tokens than the baselines, keeping inference costs suitable for real-world deployment.
Ablation studies show that gains are especially strong on tasks requiring fine-grained retrieval and multi-step reasoning (NeedleQA, action-sequence tasks, anomaly detection), where brute-force context extension and naive RAG are ineffective. The method is less effective in domains such as EgoQA and SportsQA, where the retrieval module struggles to locate pertinent segments; this highlights the system's dependence on retrieval quality and points to an avenue for future research.
Theoretical and Practical Implications
Theoretical Implications:
- The work demonstrates the sufficiency of agentic orchestration for LVU, challenging the conventional wisdom that only larger and broader-context MLLMs can solve complex, real-world LVU tasks.
- The dynamic, stepwise reasoning and tool-using approach aligns with current trends in combining LLMs with discrete, specialized modules, reinforcing the view that composite systems may supersede end-to-end monoliths in multi-modal reasoning domains.
Practical Implications:
- The modularity enables flexible and cost-effective deployment across heterogeneous environments, including those with hardware or latency constraints.
- The agentic pattern scales well with video length, since performance depends on the informativeness of the selected clips rather than on total video length or context window size.
- Integration of new tools (retrievers, perceivers) or upgrades to backend models requires no retraining or redesign of the overall controller.
Future Developments and Open Questions
- Improvements in retrieval accuracy will directly impact overall system robustness, particularly in tasks with diffuse or ambiguous video evidence.
- Advancements in visual perceivers and domain-adaptive retrievers will likely increase both the breadth and depth of supported video understanding tasks.
- The agentic pattern described here could generalize to other multi-modal domains, such as document-level vision or spatio-temporal sensor fusion, suggesting a paradigm shift toward modular, LLM-orchestrated AI systems.
Overall, VideoDeepResearch provides compelling empirical evidence and a practical blueprint for agentic, tool-using modular architectures in LVU, marking a shift away from the scaling limitations of monolithic MLLMs. Future multi-modal applications may benefit from adopting similar system-level perspectives, emphasizing reasoning, selectivity, and modular integration over sheer modeling scale.