The paper "E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding" introduces a benchmark for evaluating Video-LLMs at the fine-grained event level, going beyond conventional video-level assessments. E.T. Bench organizes 12 distinct tasks under a three-level task taxonomy and draws on a diverse video collection totaling 251.4 hours sourced from 8 domains.
Key insights from the benchmark:
- Scope and Structure of the Benchmark:
  - Events and Domains: E.T. Bench comprises 7.3K samples over 7K videos spanning 8 distinct domains.
  - Tasks and Capabilities: The 12 tasks cover 4 core capabilities required for precise, time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding.
- Evaluation and Baseline Model:
  - The benchmark shows that existing state-of-the-art Video-LLMs, despite strong video-level understanding, struggle significantly on tasks requiring fine-grained temporal and event-level localization, owing to issues such as insufficient video context length and ineffective timestamp representations.
  - To address these issues, the authors propose E.T. Chat, a strong baseline model, together with E.T. Instruct 164K, an instruction-tuning dataset tailored for event-level video understanding. The dataset trains models to handle fine-grained, multi-event, and time-sensitive tasks.
- Quantitative Performance and Benchmarks:
  - The evaluation covers 20 models: 8 Image-LLMs and 12 Video-LLMs. The results indicate that Video-LLMs need improved modeling, notably in timestamp comprehension and multi-event training, to become proficient at fine-grained video tasks.
  - The weak performance of leading models on grounding, dense captioning, and complex understanding tasks points to flaws in the discrete token prediction paradigm adopted by many MLLMs, along with training data typically restricted to short, simple video content.
- Proposed Solutions:
  - E.T. Chat reformulates timestamp prediction as an embedding matching problem rather than a direct prediction task, a choice supported by its superior performance across the diverse settings in E.T. Bench (see the sketch after this list).
  - E.T. Instruct 164K, the accompanying instruction-tuning dataset, provides a large collection of samples crafted to equip Video-LLMs with richer multi-event and temporal reasoning capabilities.
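The embedding matching formulation can be illustrated with a short sketch. This is not the paper's implementation; it only assumes that the LLM emits an embedding for a special matching token and that per-frame visual features are available (both names below are hypothetical). The predicted timestamp is simply the frame whose feature is most similar to that token embedding.

```python
# Illustrative sketch only (not E.T. Chat's actual code): timestamp
# prediction as embedding matching. Rather than decoding a timestamp as
# discrete text tokens, the model emits an embedding for a special
# "match" token, which is compared against per-frame features; the
# best-matching frame index is mapped back to a time in seconds.
import torch
import torch.nn.functional as F

def predict_timestamp(match_token_embedding: torch.Tensor,
                      frame_features: torch.Tensor,
                      fps: float) -> float:
    """
    match_token_embedding: (d,)   hypothetical embedding the LLM produces
                                  for a special matching token
    frame_features:        (T, d) per-frame visual features from the encoder
    fps:                   frame sampling rate of the video
    """
    # Cosine similarity between the token embedding and every frame feature.
    sims = F.cosine_similarity(match_token_embedding.unsqueeze(0),
                               frame_features, dim=-1)          # shape (T,)
    # Take the most similar frame as the predicted moment.
    best_frame = int(sims.argmax().item())
    return best_frame / fps

# Toy usage with random tensors standing in for real model outputs.
d, num_frames = 256, 120
token_emb = torch.randn(d)
frames = torch.randn(num_frames, d)
print(f"predicted timestamp: {predict_timestamp(token_emb, frames, fps=1.0):.1f}s")
```

Framing localization as matching sidesteps the need for the LLM to spell out numeric timestamps as text, which is one of the weaknesses of the discrete token prediction paradigm noted above.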
Overall, the proposed E.T. Bench framework and the complementary E.T. Chat model represent a significant step toward overcoming current limitations of Video-LLMs, providing a foundation for further progress in time-sensitive and multi-event video understanding.