TempCompass: Do Video LLMs Really Understand Videos? (2403.00476v3)
Abstract: Recently, there has been a surge of interest in Video Large Language Models (Video LLMs). However, existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them cannot distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect the nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different types of tasks. Motivated by these two problems, we propose the TempCompass benchmark, which introduces a diversity of temporal aspects and task formats. To collect high-quality test data, we devise two novel strategies: (1) in video collection, we construct conflicting videos that share the same static content but differ in a specific temporal aspect, which prevents Video LLMs from exploiting single-frame bias or language priors; (2) to collect task instructions, we propose a paradigm in which humans first annotate meta-information for a video and an LLM then generates the instruction. We also design an LLM-based approach to automatically and accurately evaluate the responses from Video LLMs. Based on TempCompass, we comprehensively evaluate 8 state-of-the-art (SOTA) Video LLMs and 3 Image LLMs, and reveal the disconcerting fact that these models exhibit notably poor temporal perception ability. Our data will be available at https://github.com/llyx97/TempCompass.
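The "conflicting videos" idea can be illustrated with a minimal sketch. Assuming OpenCV is available, the snippet below reverses a clip's frame order to produce a pair that contains exactly the same frames (same static content) but differs in the temporal aspect of direction. The function name and file paths are illustrative and not the authors' actual data pipeline.

```python
# Minimal sketch (not the authors' pipeline): build a "conflicting" video pair that
# shares static content but differs in playback direction, by reversing frame order.
import cv2

def reverse_video(src_path: str, dst_path: str) -> None:
    """Write a copy of the video at src_path with its frames in reverse order."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)

    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    if not frames:
        raise ValueError(f"No frames decoded from {src_path}")

    height, width = frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(dst_path, fourcc, fps, (width, height))
    for frame in reversed(frames):
        writer.write(frame)
    writer.release()

if __name__ == "__main__":
    # The original and reversed clips contain identical frames, so single-frame bias
    # or language priors cannot separate them; only genuine temporal perception can.
    reverse_video("original.mp4", "reversed.mp4")
```

The LLM-based evaluation can likewise be sketched as an LLM-as-judge call: given a question, the ground-truth option, and a Video LLM's free-form response, a judge model decides whether the response is correct. The prompt wording, model name, and `judge` helper below are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal sketch of LLM-based answer evaluation (assumed prompt and model name).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are evaluating a model's answer to a multi-choice question about a video.\n"
    "Question: {question}\n"
    "Ground-truth option: {answer}\n"
    "Model response: {response}\n"
    "Reply with exactly one word: 'correct' or 'incorrect'."
)

def judge(question: str, answer: str, response: str) -> bool:
    """Return True if the judge LLM deems the response correct."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer, response=response)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```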