VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos (2405.19209v2)
Abstract: Long-form video understanding is challenging due to the high redundancy of video data and the abundance of query-irrelevant information. To tackle this challenge, we propose VideoTree, a training-free framework that builds a query-adaptive, hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through an iterative process, progressively refining the selection of keyframes based on their relevance to the query. Furthermore, VideoTree leverages the inherent hierarchical structure of long video data, which existing LLM-based methods often overlook. Specifically, we incorporate multi-granularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner, so the model can handle video queries with varying levels of detail. Finally, VideoTree aggregates the hierarchical query-relevant information within the tree structure and feeds it to an LLM reasoning model to answer the query. Our experiments show that this training-free method improves both reasoning accuracy and efficiency over existing approaches: VideoTree outperforms prior training-free approaches on the popular EgoSchema and NExT-QA benchmarks with lower inference time, achieving 61.1% and 75.6% accuracy on their respective test sets without any additional video-specific training. Moreover, on the long split of the Video-MME benchmark (44 minutes on average), the training-free VideoTree framework outperforms the strong proprietary GPT-4V model as well as other MLLMs extensively trained on video data.
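To make the coarse-to-fine selection concrete, here is a minimal sketch of the pipeline the abstract describes, built on standard k-means clustering (MacQueen, 1967; cited below). The `caption_frame` and `llm_relevance` stubs, the cluster-width schedule, and the relevance threshold are illustrative assumptions standing in for the paper's VLM captioner, LLM relevance scorer, and expansion criteria, not the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def caption_frame(frame_feature: np.ndarray) -> str:
    # Placeholder: stands in for a VLM captioner (assumption, not the paper's API).
    return "a person chops vegetables at a counter"

def llm_relevance(caption: str, query: str) -> int:
    # Placeholder: stands in for an LLM that rates a keyframe caption's
    # relevance to the query on a 1-3 scale (assumption).
    return 3 if any(w in caption for w in query.lower().split()) else 1

def cluster_keyframes(features: np.ndarray, k: int):
    """Run k-means over frame features and pick, per cluster, the frame
    closest to the centroid as that cluster's keyframe."""
    k = min(k, len(features))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return reps, km.labels_

def videotree_select(features: np.ndarray, query: str,
                     widths=(4, 8, 16), sub_k=2, rel_thresh=3):
    """Coarse-to-fine keyframe selection: widen the top level of the tree
    until some cluster looks query-relevant (breadth expansion), then
    re-cluster the relevant branches for finer detail (depth expansion)."""
    for k in widths:                                   # breadth expansion
        reps, labels = cluster_keyframes(features, k)
        scores = [llm_relevance(caption_frame(features[r]), query) for r in reps]
        if max(scores) >= rel_thresh or k == widths[-1]:
            break
    keyframes = set(reps)
    for c, score in enumerate(scores):                 # depth expansion
        if score >= rel_thresh:
            members = np.where(labels == c)[0]
            sub_reps, _ = cluster_keyframes(features[members], sub_k)
            keyframes.update(int(members[r]) for r in sub_reps)
    return sorted(keyframes)

# Toy usage: 1,000 CLIP-style frame features; the selected keyframes would be
# captioned in temporal order and the captions passed to an LLM with the query.
feats = np.random.rand(1000, 512).astype(np.float32)
print(videotree_select(feats, "What is the person cooking?"))
```

In the full system, the captions of the selected keyframes are aggregated in temporal order and given to the LLM together with the question, which then produces the answer.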
- HierVL: Learning hierarchical video-language embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23066–23078, 2023.
- Memory consolidation enables long-context video understanding. arXiv preprint arXiv:2402.05861, 2024.
- Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning (ICML), July 2021.
- Revisiting the “Video” in Video-Language Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Video ChatCaptioner: Towards enriched spatiotemporal descriptions, 2023.
- F. Cheng and G. Bertasius. TALLFormer: Temporal action localization with a long-memory transformer, 2022.
- VindLU: A recipe for effective video-and-language pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10739–10750, June 2023.
- DAM: Dynamic adapter merging for continual video QA learning, 2024.
- Zero-shot video question answering with procedural programs. arXiv preprint arXiv:2312.00937, 2023.
- J. Chung and Y. Yu. Long Story Short: a summarize-then-search method for long video question answering, 2023.
- VideoAgent: A memory-augmented multimodal agent for video understanding. arXiv preprint arXiv:2403.11481, 2024.
- MA-LMM: Memory-augmented large multimodal model for long-term video understanding, 2024.
- CogAgent: A visual language model for GUI agents, 2023.
- VTimeLLM: Empower LLM to grasp video moments, 2023.
- EVEREST: Efficient masked video autoencoder by removing redundant spatiotemporal tokens. In International Conference on Machine Learning, 2024.
- Video ReCap: Recursive captioning of hour-long videos. arXiv preprint arXiv:2402.13250, 2024.
- TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2758–2766, 2017.
- Chat-UniVi: Unified visual representation empowers large language models with image and video understanding, 2024.
- Language repository for long video understanding. arXiv preprint arXiv:2403.14622, 2024.
- An image grid can be worth a video: Zero-shot video question answering using a VLM. arXiv preprint arXiv:2403.18406, 2024.
- Large language models are temporal and causal reasoners for video question answering, 2023.
- Text-conditioned resampler for long form video understanding, 2024.
- Revealing single frame bias for video-and-language learning, 2022.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
- IntentQA: Context-aware video intent reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11963–11974, 2023.
- VideoChat: Chat-centric video understanding, 2024.
- VideoMamba: State space model for efficient video understanding. arXiv preprint arXiv:2403.06977, 2024.
- MVBench: A comprehensive multi-modal video understanding benchmark, 2024.
- HERO: Hierarchical encoder for video+language omni-representation pre-training, 2020.
- LLMs meet long video: Advancing long video comprehension with an interactive visual adapter in LLMs, 2024.
- Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
- TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019.
- MM-VID: Advancing video understanding with GPT-4V(ision), 2023.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration, 2023.
- Vista-LLaMA: Reliable video narrator via equal distance to visual tokens, 2023.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024.
- J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
- EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36, 2024.
- MoReVQA: Exploring modular reasoning models for video question answering. arXiv preprint arXiv:2404.06511, 2024.
- PG-Video-LLaVA: Pixel grounding large video-language models. arXiv preprint arXiv:2311.13435, 2023.
- OpenAI. GPT-4 technical report, 2023.
- A simple recipe for contrastively pre-training video-first encoders beyond 16 frames, 2023.
- Momentor: Advancing video large language model with fine-grained temporal reasoning, 2024.
- Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency, 2022.
- Understanding long videos in one multimodal language model pass, 2024.
- TimeChat: A time-sensitive multimodal large language model for long video understanding, 2024.
- TV-TREES: Multimodal entailment trees for neuro-symbolic video reasoning, 2024.
- MovieChat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.
- EVA-CLIP-18B: Scaling CLIP to 18 billion parameters. arXiv preprint arXiv:2402.04252, 2024.
- ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
- Koala: Key frame-conditioned long video-LLM, 2024.
- SOK-Bench: A situated video reasoning benchmark with aligned open-world knowledge. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Long-short temporal contrastive learning of video transformers, 2022.
- ChatVideo: A tracklet-centric multimodal and versatile video understanding system, 2023.
- OmniVid: A generative framework for universal video understanding, 2024.
- Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6312–6322, 2023.
- Vamos: Versatile action models for video understanding, 2023.
- ViLA: Efficient video-language alignment for video question answering, 2024.
- VideoAgent: Long-form video understanding with large language model as agent. arXiv preprint arXiv:2403.10517, 2024.
- InternVideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
- InternVideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
- LSTP: Language-guided spatial-temporal prompt learning for long-form video-text understanding, 2024.
- LifelongMemory: Leveraging LLMs for answering queries in long-form egocentric videos, 2024.
- Unified coarse-to-fine alignment for video-text retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2816–2827, October 2023.
- GPT4Video: A unified multimodal large language model for instruction-followed understanding and safety-aware generation, 2023.
- LongVLM: Efficient long video understanding via large language models, 2024.
- STAR: A benchmark for situated reasoning in real-world videos. arXiv preprint arXiv:2405.09711, 2024.
- MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition, 2022.
- Hierarchical self-supervised representation learning for movie understanding, 2022.
- NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786, 2021.
- Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1645–1653, 2017.
- Retrieval-based video language model for efficient long video question answering. arXiv preprint arXiv:2312.04931, 2023.
- Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
- Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning, 2023.
- DoraemonGPT: Toward understanding dynamic scenes with large language models (exemplified as a video agent), 2024.
- Self-chained image-language model for video localization and question answering. Advances in Neural Information Processing Systems, 36, 2024.
- CREMA: Multimodal compositional video reasoning via efficient modular adaptation and fusion. arXiv preprint arXiv:2402.05889, 2024.
- Hierarchical video-moment retrieval and step-captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23056–23065, June 2023.
- Cross-modal and hierarchical modeling of video and text. In Proceedings of the European Conference on Computer Vision (ECCV), pages 374–390, 2018.
- A simple LLM framework for long-range video question-answering. arXiv preprint arXiv:2312.17235, 2023.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding, 2023.
- Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597, 2023.
Authors: Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal