
Lost in Time: A New Temporal Benchmark for VideoLLMs

Published 10 Oct 2024 in cs.CV | arXiv:2410.07752v3

Abstract: LLMs have demonstrated impressive performance when integrated with vision models, even enabling video understanding. However, evaluating video models presents its own unique challenges, for which several benchmarks have been proposed. In this paper, we show that the most widely used video-language benchmarks can be solved without requiring much temporal reasoning. We identified three main issues in existing datasets: (i) static information from single frames is often sufficient to solve the tasks; (ii) the text of the questions and candidate answers is overly informative, allowing models to answer correctly without relying on any visual input; and (iii) world knowledge alone can answer many of the questions, making the benchmarks a test of knowledge replication rather than video reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues, while the automatic evaluation process with LLMs is unreliable, making it an unsuitable alternative. As a solution, we propose TVBench, a novel open-source video multiple-choice question-answering benchmark, and demonstrate through extensive evaluations that it requires a high level of temporal understanding. Surprisingly, we find that most recent state-of-the-art video-LLMs perform similarly to random performance on TVBench, with only a few models, such as Qwen2-VL and Tarsier, clearly surpassing this baseline.

Summary

  • The paper introduces TVBench, a benchmark emphasizing hard temporal constraints to rigorously test video-language models’ temporal reasoning.
  • It employs a template-based question design and balanced candidate answers to eliminate static and textual biases.
  • Experimental results reveal that state-of-the-art models perform near random on TVBench, highlighting the critical need for genuine temporal understanding.

A New Temporal Benchmark for VideoLLMs: TVBench

Abstract

The paper "Lost in Time: A New Temporal Benchmark for VideoLLMs" introduces TVBench, a novel open-source benchmark designed to evaluate the temporal understanding capabilities of video-LLMs. This benchmark addresses significant shortcomings in existing video-language benchmarks, particularly MVBench, by emphasizing the necessity of temporal reasoning to solve video-related tasks. Through extensive evaluations, the authors highlight that current state-of-the-art models perform close to random on TVBench, underscoring the benchmark's ability to differentiate between models with genuine temporal reasoning capability.

Introduction

Video-LLMs have gained prominence by effectively leveraging advancements in both NLP and vision models to understand video content. However, evaluating these models is challenging, as many existing benchmarks fail to adequately test temporal reasoning. MVBench, a widely used benchmark, has been shown to contain significant biases that allow tasks to be solved from static information or textual cues, without temporal understanding. These issues compromise its reliability in measuring genuine video comprehension and temporal reasoning.

Limitations of Existing Benchmarks

Existing video-language benchmarks suffer from several problems:

  1. Static Information Sufficiency: Tasks can often be solved using information from a single frame, rather than requiring analysis of the entire video sequence.
  2. Textual and World Knowledge Bias: Overly informative question and answer text allows models to answer correctly without relying on visual content, and prior world knowledge often compensates for the lack of video analysis (a text-only probe sketch follows this list).
  3. Unreliability in Open-Ended QA: Automatic evaluation using LLMs such as GPT-3.5 is prone to inconsistencies and hallucinations, leading to unreliable assessment of open-ended video QA tasks.
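
The textual-bias issue (point 2) can be checked without any video model at all. Below is a minimal Python sketch of such a probe, assuming a simple item format with question, options, and answer fields; ask_text_only is a hypothetical stand-in for a real text-only LLM call (here it guesses at random, the blind baseline). A real text-only model scoring well above 1/len(options) would indicate that the text alone leaks the answer.

```python
import random

def ask_text_only(question: str, options: list[str]) -> str:
    """Stand-in for a text-only LLM call (swap in a real client).
    This stub guesses uniformly at random, i.e. the blind baseline."""
    return random.choice(options)

def textual_bias_probe(items: list[dict]) -> float:
    """Accuracy of a 'model' that never sees the video.
    Scores well above 1/len(options) suggest the question text
    alone leaks the answer."""
    correct = sum(
        ask_text_only(item["question"], item["options"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

# Toy item in the assumed format: question, options, ground-truth answer.
toy = [{
    "question": "What happens after the cup is dropped?",
    "options": ["It shatters", "It floats upward"],
    "answer": "It shatters",
}]
print(f"text-only accuracy: {textual_bias_probe(toy):.2f}")
```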

TVBench Design Principles

TVBench was developed to address these issues:

  • Hard Temporal Constraints: The benchmark includes tasks with temporally challenging aspects, ensuring that answers cannot be deduced without analyzing the temporal sequence (Figure 1).

    Figure 1: TVBench, a temporal video-language benchmark. In TVBench, state-of-the-art text-only, image-based, and most video-LLMs perform close to random chance, with only the latest strong temporal models, such as Tarsier, outperforming the random baseline. In contrast to MVBench, the performance of these temporal models drops significantly when videos are reversed.

  • Question and Candidate Design: Questions are generated from templates to eliminate textual bias, and candidate answers are balanced so that no option can be favored a priori (a generation sketch follows this list).
  • Minimal World Knowledge Reliance: Tasks are designed so that answers rely solely on video content, not on external knowledge.
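
To make the template-plus-balancing idea concrete, here is a minimal sketch of how such an item could be generated. The annotation format (a temporally ordered action list per clip) and the template wording are illustrative assumptions, not the authors' actual pipeline:

```python
import random

# Assumed annotation: the temporally ordered actions in one clip.
TEMPLATE = "What did the person do before they {anchor}?"

def make_item(actions: list[str], rng: random.Random) -> dict:
    """Build one multiple-choice item from an ordered action list.
    The fixed template keeps the wording uninformative, and the
    distractor is another action from the same clip, so both
    candidates are visually plausible."""
    i = rng.randrange(1, len(actions))           # pick an anchor action
    answer = actions[i - 1]                      # what truly came before
    pool = [a for a in actions if a not in (answer, actions[i])]
    options = [answer, rng.choice(pool)]
    rng.shuffle(options)                         # balance answer position
    return {"question": TEMPLATE.format(anchor=actions[i]),
            "options": options, "answer": answer}

rng = random.Random(0)
print(make_item(["enter room", "open fridge", "pour milk", "sit down"], rng))
```

Because the distractor comes from the same clip and the answer position is shuffled, neither a single frame nor the wording alone distinguishes the candidates; only their temporal order does.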

Evaluation and Results

The evaluation on TVBench reveals that many current video-LLMs, despite their state-of-the-art status in other benchmarks, perform at random chance level. Notably, the temporal models Tarsier and Gemini 1.5 Pro demonstrated superior performance, clearly surpassing random baselines due to their temporal reasoning capabilities.

  • Text-Only and Image-Model Performance: These models perform at random chance level on TVBench, indicating that the benchmark cannot be solved from static information or textual cues alone.
  • Impact of Video Shuffling and Reversing: Unlike on MVBench, models suffer a significant accuracy drop when videos are shuffled or reversed on TVBench, confirming the benchmark's effectiveness in temporal evaluation (Figures 2 and 3; an ablation sketch follows the figure captions).

    Figure 2: Spatial bias of the MVBench video-language benchmark. We show different MVBench tasks and observe that the questions can be answered without requiring any temporal understanding.

    Figure 3: Textual bias of the MVBench video-language benchmark. We show different MVBench tasks and find that questions can be answered without taking the visual input into account.
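
This shuffle/reverse ablation is straightforward to reproduce for any model. The sketch below assumes items carry a frames list alongside the question fields; stub_model is a hypothetical placeholder for a real video-LLM wrapper that returns one of the candidate strings:

```python
import random
from typing import Callable, Sequence

def accuracy(model: Callable, items: Sequence[dict], order: str) -> float:
    """Score the same items under different frame orders. A benchmark
    probes temporal reasoning only if accuracy drops noticeably when
    frames are shuffled or reversed."""
    rng = random.Random(0)
    correct = 0
    for item in items:
        frames = list(item["frames"])
        if order == "reversed":
            frames.reverse()
        elif order == "shuffled":
            rng.shuffle(frames)
        pred = model(frames, item["question"], item["options"])
        correct += pred == item["answer"]
    return correct / len(items)

def stub_model(frames, question, options):
    """Trivial stand-in: always picks the first option. Replace with
    a real video-LLM wrapper returning one of the candidate strings."""
    return options[0]

toy = [{"frames": ["f0", "f1", "f2"], "question": "q",
        "options": ["a", "b"], "answer": "a"}]
for order in ("normal", "shuffled", "reversed"):
    print(order, accuracy(stub_model, toy, order))
```

On a genuinely temporal benchmark, the shuffled and reversed accuracies should fall well below the normal ordering, which is the behavior the paper reports for TVBench but not for MVBench.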

Discussion

TVBench effectively highlights the limitations of current models and benchmarks in evaluating the temporal aspect of video understanding. The stark performance drop on TVBench compared to MVBench underscores that MVBench poses inadequate temporal challenges. TVBench serves as a robust tool for future advancements in temporal video-LLM evaluation.

Conclusion

TVBench addresses critical shortcomings in existing benchmarks by focusing on temporal reasoning, providing a necessary tool for progressing video-LLM assessment. As video understanding models advance, TVBench can guide researchers in developing models that genuinely understand and reason through temporal video sequences, enhancing the field's evaluation standards (Figure 4).

Figure 4: Unreliability of open-ended video-language benchmarks. GPT-3.5 is commonly used to evaluate open-ended responses; here, Llama 3 generates answers in a text-only setting. GPT-3.5 assigns inconsistent accuracies and scores, and the emoji in the figure mark whether its evaluation was truthful or unreliable.
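
For context, the LLM-as-judge protocol under scrutiny looks roughly like the sketch below. The prompt wording and the judge function are illustrative assumptions; the stub's always-"yes" behavior mimics the leniency that can let a blind, text-only answer generator score well:

```python
# Sketch of an LLM-as-judge evaluation loop; prompt wording is assumed.
JUDGE_PROMPT = (
    "Question: {q}\nGround truth: {gt}\nPredicted answer: {pred}\n"
    "Reply with 'yes' or 'no': is the prediction correct?"
)

def judge(prompt: str) -> str:
    """Stand-in for the evaluator LLM (e.g. a GPT-3.5 API call).
    Always answering 'yes' illustrates how a lenient judge inflates
    accuracy even for blind, text-only answer generators."""
    return "yes"

def judged_accuracy(items: list[dict]) -> float:
    """Fraction of predictions the judge accepts. With an unreliable
    judge, this number says little about actual video understanding."""
    hits = sum(
        judge(JUDGE_PROMPT.format(q=i["question"], gt=i["answer"],
                                  pred=i["prediction"])) == "yes"
        for i in items
    )
    return hits / len(items)

toy = [{"question": "What is the man doing?", "answer": "cooking",
        "prediction": "He is cooking."}]
print(f"judged accuracy: {judged_accuracy(toy):.2f}")
```

Since the judge's verdicts are themselves model outputs, scores inherit its inconsistencies, which is why the paper favors multiple-choice evaluation instead.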
