Overview of "HourVideo: 1-Hour Video-Language Understanding"
The paper "HourVideo: 1-Hour Video-Language Understanding" introduces a benchmark designed to evaluate the capability of multimodal models in processing and understanding long, egocentric video streams. The benchmark is a significant advancement in the field of artificial intelligence, aiming to bridge the current gap between human-level and machine-level comprehension of extended visual stimuli in naturalistic settings.
Dataset and Task Suite
HourVideo is built on a carefully curated subset of the Ego4D dataset: 500 egocentric videos ranging from 20 to 120 minutes in length, with an average duration of 45.7 minutes. This selection deliberately pushes multimodal models beyond short-form content, which typically spans seconds to a few minutes of footage. The benchmark comprises 12,976 five-way multiple-choice questions that probe different dimensions of long-video understanding.
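To make the question format concrete, each item pairs an hour-scale video with a natural-language question and five answer candidates. The sketch below shows one plausible record layout; the class and field names are hypothetical, chosen for exposition rather than taken from the benchmark's release.

```python
# Purely illustrative: one plausible way to represent a single HourVideo
# question. The field names are assumptions, not the benchmark's schema.
from dataclasses import dataclass

@dataclass
class MCQItem:
    video_id: str        # source Ego4D video (20-120 minutes long)
    task: str            # e.g. "perception" or "visual_reasoning"
    question: str
    options: list[str]   # five answer candidates
    answer_index: int    # position of the correct option (0-4)
```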
The task suite covers four categories, sketched in code after the list:
- Summarization: condensing key events and sequences from the lengthy videos.
- Perception: factual recall, sequence recall, and tracking tasks that require identifying and remembering details over time.
- Visual Reasoning: spatial, temporal, predictive, causal, and counterfactual reasoning about complex interactions within the video.
- Navigation: room-to-room movement and object retrieval tasks that mirror practical autonomous navigation.
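Read together, the suite is a two-level hierarchy of tasks and sub-tasks. The snippet below encodes the list above as a plain dictionary; the structure and identifiers paraphrase this overview and are not code shipped with HourVideo.

```python
# Illustrative grouping of the task suite summarized above; labels paraphrase
# this overview rather than quoting the benchmark's exact identifiers.
TASK_SUITE: dict[str, list[str]] = {
    "summarization":    ["key_event_summarization"],
    "perception":       ["factual_recall", "sequence_recall", "tracking"],
    "visual_reasoning": ["spatial", "temporal", "predictive",
                         "causal", "counterfactual"],
    "navigation":       ["room_to_room", "object_retrieval"],
}

# Example use: report how many sub-tasks each category contributes.
for category, subtasks in TASK_SUITE.items():
    print(f"{category}: {len(subtasks)} sub-task(s)")
```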
Benchmark Results
The paper evaluates state-of-the-art models including GPT-4, LLaVA-NeXT, and Gemini Pro 1.5, and finds that they achieve only marginal improvements over the 20% random-chance baseline for five-way questions. The gap to human experts is stark: humans reach 85.0% accuracy, compared with 37.3% for Gemini Pro 1.5, the strongest of the tested models. This gap underscores how far current models remain from human-level comprehension of long-form, complex video content.
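Scoring on the benchmark reduces to multiple-choice accuracy, which is why random guessing sits near 20% on five-way questions. The following sketch shows that computation; it is not the authors' evaluation code.

```python
from typing import Sequence

def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions where the chosen option matches the correct one."""
    assert len(predicted) == len(gold) and len(gold) > 0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Example: 3 of 5 answers correct -> 0.6. Against a 0.2 chance baseline,
# the paper reports 0.373 for Gemini Pro 1.5 and 0.850 for human experts.
print(accuracy([0, 3, 2, 4, 1], [0, 3, 2, 1, 0]))  # 0.6
```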
Implications and Future Directions
The implications of this research are twofold. Practically, models that can understand extended video could enable more capable autonomous systems: navigating complex environments, interpreting human activities at a nuanced level, and assisting people in applications ranging from augmented reality to surveillance. Theoretically, the work underscores the need for models that integrate and synthesize information over prolonged temporal sequences, and it highlights that current architectures often lack the long-term memory and reasoning capabilities this requires.
Future developments in AI could focus on integrating better memory architectures and more advanced reasoning capabilities. The pursuit of improved multimodal understanding might also encourage research into cross-modal learning, in which models leverage complementary information from multiple sensory inputs, much as humans do naturally.
Conclusion
The HourVideo benchmark sets a new standard for evaluating the video comprehension capacity of multimodal models. It highlights existing deficiencies in model capabilities and provides a focused pathway for future research efforts aimed at achieving human-like understanding of long-stream visual data. Through this benchmark, the authors encourage further exploration into the complexities of long-form video comprehension, with the goal of narrowing the gap between human and machine intelligence in this domain.