VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos (2405.19209v2)
Abstract: Long-form video understanding is challenging due to the high redundancy of video data and the abundance of query-irrelevant information. To tackle this challenge, we propose VideoTree, a training-free framework that builds a query-adaptive, hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through an iterative process, progressively refining the selection of keyframes based on their relevance to the query. Furthermore, VideoTree leverages the inherent hierarchical structure of long video data, which existing LLM-based methods often overlook. Specifically, we incorporate multi-granularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner, so the model can handle video queries with varying levels of detail. Finally, VideoTree aggregates the hierarchical query-relevant information within the tree structure and feeds it to an LLM reasoning model to answer the query. Our experiments show that this training-free method improves both reasoning accuracy and efficiency over existing approaches: VideoTree outperforms prior training-free approaches on the popular EgoSchema and NExT-QA benchmarks with lower inference time, achieving 61.1% and 75.6% accuracy on their respective test sets without any additional video-specific training. Moreover, on the long split of the Video-MME benchmark (44 minutes on average), the training-free VideoTree framework outperforms the strong proprietary GPT-4V model as well as other MLLMs extensively trained on video data.
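To make the coarse-to-fine selection concrete, here is a minimal sketch of the pipeline the abstract describes, built on standard k-means clustering (MacQueen, 1967; cited below). The `caption_frame` and `llm_relevance` stubs, the cluster-width schedule, and the relevance threshold are illustrative assumptions standing in for the paper's VLM captioner, LLM relevance scorer, and expansion criteria, not the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def caption_frame(frame_feature: np.ndarray) -> str:
    # Placeholder: stands in for a VLM captioner (assumption, not the paper's API).
    return "a person chops vegetables at a counter"

def llm_relevance(caption: str, query: str) -> int:
    # Placeholder: stands in for an LLM that rates a keyframe caption's
    # relevance to the query on a 1-3 scale (assumption).
    return 3 if any(w in caption for w in query.lower().split()) else 1

def cluster_keyframes(features: np.ndarray, k: int):
    """Run k-means over frame features and pick, per cluster, the frame
    closest to the centroid as that cluster's keyframe."""
    k = min(k, len(features))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return reps, km.labels_

def videotree_select(features: np.ndarray, query: str,
                     widths=(4, 8, 16), sub_k=2, rel_thresh=3):
    """Coarse-to-fine keyframe selection: widen the top level of the tree
    until some cluster looks query-relevant (breadth expansion), then
    re-cluster the relevant branches for finer detail (depth expansion)."""
    for k in widths:                                   # breadth expansion
        reps, labels = cluster_keyframes(features, k)
        scores = [llm_relevance(caption_frame(features[r]), query) for r in reps]
        if max(scores) >= rel_thresh or k == widths[-1]:
            break
    keyframes = set(reps)
    for c, score in enumerate(scores):                 # depth expansion
        if score >= rel_thresh:
            members = np.where(labels == c)[0]
            sub_reps, _ = cluster_keyframes(features[members], sub_k)
            keyframes.update(int(members[r]) for r in sub_reps)
    return sorted(keyframes)

# Toy usage: 1,000 CLIP-style frame features; the selected keyframes would be
# captioned in temporal order and the captions passed to an LLM with the query.
feats = np.random.rand(1000, 512).astype(np.float32)
print(videotree_select(feats, "What is the person cooking?"))
```

In the full system, the captions of the selected keyframes are aggregated in temporal order and given to the LLM together with the question, which then produces the answer.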
- HierVL: Learning hierarchical video-language embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23066–23078, 2023.
- Memory consolidation enables long-context video understanding. arXiv preprint arXiv:2402.05861, 2024.
- Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning (ICML), July 2021.
- Revisiting the “Video” in Video-Language Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Video ChatCaptioner: Towards enriched spatiotemporal descriptions, 2023.
- F. Cheng and G. Bertasius. TALLFormer: Temporal action localization with a long-memory transformer, 2022.
- VindLU: A recipe for effective video-and-language pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10739–10750, June 2023.
- DAM: Dynamic adapter merging for continual video QA learning, 2024.
- Zero-shot video question answering with procedural programs. arXiv preprint arXiv:2312.00937, 2023.
- J. Chung and Y. Yu. Long Story Short: a summarize-then-search method for long video question answering, 2023.
- VideoAgent: A memory-augmented multimodal agent for video understanding. arXiv preprint arXiv:2403.11481, 2024.
- MA-LMM: Memory-augmented large multimodal model for long-term video understanding, 2024.
- CogAgent: A visual language model for GUI agents, 2023.
- VTimeLLM: Empower LLM to grasp video moments, 2023.
- EVEREST: Efficient masked video autoencoder by removing redundant spatiotemporal tokens. In International Conference on Machine Learning, 2024.
- Video ReCap: Recursive captioning of hour-long videos. arXiv preprint arXiv:2402.13250, 2024.
- TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2758–2766, 2017.
- Chat-UniVi: Unified visual representation empowers large language models with image and video understanding, 2024.
- Language repository for long video understanding. arXiv preprint arXiv:2403.14622, 2024.
- An image grid can be worth a video: Zero-shot video question answering using a VLM. arXiv preprint arXiv:2403.18406, 2024.
- Large language models are temporal and causal reasoners for video question answering, 2023.
- Text-conditioned resampler for long form video understanding, 2024.
- Revealing single frame bias for video-and-language learning, 2022.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
- IntentQA: Context-aware video intent reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11963–11974, 2023.
- VideoChat: Chat-centric video understanding, 2024.
- VideoMamba: State space model for efficient video understanding. arXiv preprint arXiv:2403.06977, 2024.
- MVBench: A comprehensive multi-modal video understanding benchmark, 2024.
- HERO: Hierarchical encoder for video+language omni-representation pre-training, 2020.
- LLMs meet long video: Advancing long video comprehension with an interactive visual adapter in LLMs, 2024.
- Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
- TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019.
- MM-VID: Advancing video understanding with GPT-4V(ision), 2023.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration, 2023.
- Vista-LLaMA: Reliable video narrator via equal distance to visual tokens, 2023.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024.
- J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
- EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36, 2024.
- MoReVQA: Exploring modular reasoning models for video question answering. arXiv preprint arXiv:2404.06511, 2024.
- PG-Video-LLaVA: Pixel grounding large video-language models. arXiv preprint arXiv:2311.13435, 2023.
- OpenAI. GPT-4 technical report, 2023.
- A simple recipe for contrastively pre-training video-first encoders beyond 16 frames, 2023.
- Momentor: Advancing video large language model with fine-grained temporal reasoning, 2024.
- Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency, 2022.
- Understanding long videos in one multimodal language model pass, 2024.
- TimeChat: A time-sensitive multimodal large language model for long video understanding, 2024.
- TV-TREES: Multimodal entailment trees for neuro-symbolic video reasoning, 2024.
- MovieChat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.
- EVA-CLIP-18B: Scaling CLIP to 18 billion parameters. arXiv preprint arXiv:2402.04252, 2024.
- ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
- Koala: Key frame-conditioned long video-LLM, 2024.
- SOK-Bench: A situated video reasoning benchmark with aligned open-world knowledge. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Long-short temporal contrastive learning of video transformers, 2022.
- ChatVideo: A tracklet-centric multimodal and versatile video understanding system, 2023.
- OmniVid: A generative framework for universal video understanding, 2024.
- Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6312–6322, 2023.
- Vamos: Versatile action models for video understanding, 2023.
- ViLA: Efficient video-language alignment for video question answering, 2024.
- VideoAgent: Long-form video understanding with large language model as agent. arXiv preprint arXiv:2403.10517, 2024.
- InternVideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
- InternVideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
- LSTP: Language-guided spatial-temporal prompt learning for long-form video-text understanding, 2024.
- LifelongMemory: Leveraging LLMs for answering queries in long-form egocentric videos, 2024.
- Unified coarse-to-fine alignment for video-text retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2816–2827, October 2023.
- GPT4Video: A unified multimodal large language model for instruction-followed understanding and safety-aware generation, 2023.
- LongVLM: Efficient long video understanding via large language models, 2024.
- STAR: A benchmark for situated reasoning in real-world videos. arXiv preprint arXiv:2405.09711, 2024.
- MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition, 2022.
- Hierarchical self-supervised representation learning for movie understanding, 2022.
- NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786, 2021.
- Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1645–1653, 2017.
- Retrieval-based video language model for efficient long video question answering. arXiv preprint arXiv:2312.04931, 2023.
- Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
- Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning, 2023.
- DoraemonGPT: Toward understanding dynamic scenes with large language models (exemplified as a video agent), 2024.
- Self-chained image-language model for video localization and question answering. Advances in Neural Information Processing Systems, 36, 2024.
- CREMA: Multimodal compositional video reasoning via efficient modular adaptation and fusion. arXiv preprint arXiv:2402.05889, 2024.
- Hierarchical video-moment retrieval and step-captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23056–23065, June 2023.
- Cross-modal and hierarchical modeling of video and text. In Proceedings of the European Conference on Computer Vision (ECCV), pages 374–390, 2018.
- A simple LLM framework for long-range video question-answering. arXiv preprint arXiv:2312.17235, 2023.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding, 2023.
- Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597, 2023.
Authors: Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal