Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding (2505.18079v2)

Published 23 May 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While LLMs have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.

Summary

Deep Video Discovery: Enhancing Long-form Video Understanding through Autonomous Agentic Search

The paper, "Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding," addresses the challenge of understanding long-form video. The task is hard because of the extensive temporal-spatial complexity of such videos and the difficulty of answering questions over extended contexts. To work around the limitations LLMs show on information-dense, hour-long videos, the authors propose the Deep Video Discovery (DVD) agent, which applies an agentic search strategy to segmented video clips. Unlike earlier video agents built around manually designed, rigid workflows, the approach emphasizes the autonomy of the agent in deciding how to search.

Methodology

The core of the DVD agent's approach is a search-centric toolkit tailored to long videos. The toolkit operates over a multi-granular video database, which lets the agent apply the advanced reasoning capabilities of LLMs to hour-long video sequences. The toolkit comprises three tools:

Global Browse: Summarizes and indexes the global subjects and context of the video, giving a high-level overview of the whole recording.
Clip Search: Performs semantic retrieval over the segmented clips, pinpointing the clips most relevant to a query.
Frame Inspect: Extracts fine-grained, pixel-level details from a specified temporal range when a precise look at the frames is needed.
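The three tools above can be pictured as functions over a small database that stores the video at several granularities. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Clip`, `VideoDB`, `global_browse`, `clip_search`, and `frame_inspect`, along with their signatures, are hypothetical. Word overlap stands in for the semantic retrieval a real system would do with embeddings, and stored captions stand in for the frame-level inspection a vision model would perform.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start: float   # clip start time in seconds
    end: float     # clip end time in seconds
    caption: str   # text description of the clip (assumed precomputed)


@dataclass
class VideoDB:
    summary: str   # video-level description (coarsest granularity)
    clips: list    # list of Clip objects (finer granularity)


def global_browse(db):
    """Return the video-level overview."""
    return db.summary


def _tokens(text):
    return set(text.lower().split())


def clip_search(db, query, top_k=2):
    """Rank clips by word overlap with the query (toy stand-in for
    embedding-based semantic retrieval) and return the top matches."""
    q = _tokens(query)
    scored = [(len(q & _tokens(c.caption)), c) for c in db.clips]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda sc: -sc[0])  # stable sort: ties keep time order
    return [c for _, c in scored[:top_k]]


def frame_inspect(db, start, end):
    """Return the clips overlapping [start, end]. A real tool would look
    at the frames themselves; captions are used here as a placeholder."""
    return [c for c in db.clips if c.start < end and c.end > start]


# Hypothetical example data, invented for illustration only.
db = VideoDB(
    summary="a man leaves his house by car",
    clips=[
        Clip(0, 10, "a man enters the kitchen"),
        Clip(10, 20, "a red car parks outside"),
        Clip(20, 30, "the man drives the red car away"),
    ],
)
```

Calling `clip_search(db, "red car")` on this toy data returns the two clips whose captions mention a red car, and `frame_inspect(db, 12, 18)` narrows the view to the single clip covering that range.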

The DVD agent decides autonomously which tool to use based on its evolving observation state, relying on the reasoning ability of the underlying LLM. At each step it plans from its current observations, selects a tool, formulates the parameters for the call, and refines its reasoning in light of the information gathered. This lets the agent adapt its search to diverse questions that span complex temporal and spatial dimensions.
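The iterative behavior described here follows the general observe, plan, act pattern of tool-using agents. The sketch below illustrates that loop only; it is not the paper's code. The function `run_agent`, the action dictionary format, and the stub tools are all assumptions. In particular, `stub_plan` is a hand-written rule that walks through the tools in a fixed order purely so the example runs without a model; in the system the paper describes, an LLM would occupy that slot and choose tools and parameters freely.

```python
def run_agent(question, tools, plan, max_steps=5):
    """Repeatedly ask the planner for an action given the history so far,
    execute the chosen tool, and record the observation."""
    history = []  # entries are (tool_name, args, observation)
    for _ in range(max_steps):
        action = plan(question, history)
        if action["tool"] == "answer":
            return action["args"]["text"], history
        observation = tools[action["tool"]](**action["args"])
        history.append((action["tool"], action["args"], observation))
    return None, history  # step budget exhausted without an answer


def stub_plan(question, history):
    """Placeholder for the LLM planner (assumption): a fixed rule set."""
    if not history:
        return {"tool": "global_browse", "args": {}}
    last_tool, _, last_obs = history[-1]
    if last_tool == "global_browse":
        return {"tool": "clip_search", "args": {"query": question}}
    if last_tool == "clip_search":
        start, end = last_obs
        return {"tool": "frame_inspect", "args": {"start": start, "end": end}}
    return {"tool": "answer", "args": {"text": last_obs}}


# Stub tools returning canned values (hypothetical, for illustration).
stub_tools = {
    "global_browse": lambda: "cooking show with two hosts",
    "clip_search": lambda query: (120.0, 135.0),
    "frame_inspect": lambda start, end: f"detail seen between {start}s and {end}s",
}

answer, trace = run_agent("what is on the table?", stub_tools, stub_plan)
print(answer)
for tool_name, args, _ in trace:
    print(tool_name, args)
```

The design point the sketch tries to capture is that the loop itself contains no task-specific logic: everything about which tool to call next lives in the planner, which is what allows a reasoning model to replace the fixed rules.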

Experimental Evaluation

The proposed DVD framework is evaluated on several long video understanding benchmarks. On the LVBench dataset it reaches state-of-the-art accuracy of 71.9%, a substantial improvement over prior work. Incorporating transcript data raises this accuracy to 74.1%. The accompanying ablation studies support the effectiveness of the DVD toolset.

Implications and Future Directions

The research has both practical and theoretical implications. Practically, the approach could speed up the processing and comprehension of long-form video, which is increasingly common in entertainment, education, and corporate settings. Theoretically, the findings underscore the potential of coupling autonomous agent reasoning with tool use for complex tasks of this kind.

Future work may extend autonomous agents with more sophisticated reasoning and exploration strategies, and may apply similar frameworks to other domains that require nuanced understanding of large, complex bodies of information. The research opens pathways both for improving how LLMs handle extensive data and for better integrating multi-modal information processing into autonomous systems.
