The paper "E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding" introduces a benchmark for evaluating Video-LLMs at the fine-grained event level, going beyond conventional video-level assessments. E.T. Bench organizes 12 distinct tasks under a three-level task taxonomy and draws on a diverse video collection totaling 251.4 hours sourced from 8 domains.
Key insights from the benchmark:
- Scope and Structure of the Benchmark:
  - Events and Domains: E.T. Bench comprises 7.3K samples over 7K videos spanning 8 distinct domains.
  - Tasks and Capabilities: The 12 tasks cover 4 core capabilities required for precise, time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding.
- Evaluation and Baseline Model:
  - The benchmark shows that existing state-of-the-art Video-LLMs, despite strong video-level understanding, struggle significantly on tasks requiring fine-grained temporal and event-level localization, owing to issues such as insufficient video context length and ineffective timestamp representations.
  - To address these issues, the authors propose E.T. Chat, a strong baseline model, together with E.T. Instruct 164K, an instruction-tuning dataset tailored for event-level video understanding. The dataset trains models to handle fine-grained, multi-event, and time-sensitive tasks.
- Quantitative Performance and Benchmarks:
  - The evaluation covers 20 models: 8 Image-LLMs and 12 Video-LLMs. The results indicate that Video-LLMs need improved modeling, notably in timestamp comprehension and multi-event training, to become proficient at fine-grained video tasks.
  - The weak performance of leading models on grounding, dense captioning, and complex understanding tasks points to flaws in the discrete token prediction paradigm adopted by many MLLMs, along with training data typically restricted to short, simple video content.
- Proposed Solutions:
  - E.T. Chat reformulates timestamp prediction as an embedding matching problem rather than a direct prediction task, a choice supported by its superior performance across the diverse settings in E.T. Bench (see the sketch after this list).
  - E.T. Instruct 164K, the accompanying instruction-tuning dataset, provides a large collection of samples crafted to equip Video-LLMs with richer multi-event and temporal reasoning capabilities.
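The embedding matching formulation can be illustrated with a short sketch. This is not the paper's implementation; it only assumes that the LLM emits an embedding for a special matching token and that per-frame visual features are available (both names below are hypothetical). The predicted timestamp is simply the frame whose feature is most similar to that token embedding.

```python
# Illustrative sketch only (not E.T. Chat's actual code): timestamp
# prediction as embedding matching. Rather than decoding a timestamp as
# discrete text tokens, the model emits an embedding for a special
# "match" token, which is compared against per-frame features; the
# best-matching frame index is mapped back to a time in seconds.
import torch
import torch.nn.functional as F

def predict_timestamp(match_token_embedding: torch.Tensor,
                      frame_features: torch.Tensor,
                      fps: float) -> float:
    """
    match_token_embedding: (d,)   hypothetical embedding the LLM produces
                                  for a special matching token
    frame_features:        (T, d) per-frame visual features from the encoder
    fps:                   frame sampling rate of the video
    """
    # Cosine similarity between the token embedding and every frame feature.
    sims = F.cosine_similarity(match_token_embedding.unsqueeze(0),
                               frame_features, dim=-1)          # shape (T,)
    # Take the most similar frame as the predicted moment.
    best_frame = int(sims.argmax().item())
    return best_frame / fps

# Toy usage with random tensors standing in for real model outputs.
d, num_frames = 256, 120
token_emb = torch.randn(d)
frames = torch.randn(num_frames, d)
print(f"predicted timestamp: {predict_timestamp(token_emb, frames, fps=1.0):.1f}s")
```

Framing localization as matching sidesteps the need for the LLM to spell out numeric timestamps as text, which is one of the weaknesses of the discrete token prediction paradigm noted above.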
Overall, the proposed E.T. Bench framework and the complementary E.T. Chat model represent a significant step toward overcoming current limitations of Video-LLMs, providing a foundation for further progress in time-sensitive and multi-event video understanding.