Validity of automatically constructed video benchmarks for video-essential properties

Determine whether evaluation datasets produced by automatic question-generation pipelines, which rely on selected keyframes or on large language models applied to video transcripts, genuinely evaluate video-essential properties such as temporal continuity, causal interaction, and multi-event narratives.

Background

To reduce construction complexity, many video QA benchmarks are built using automatic strategies such as keyframe-based question generation or transcript-driven LLM generation. These shortcuts may inadvertently emphasize text-like reasoning and neglect core video-specific phenomena.
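To make the concern concrete, the following is a minimal, hypothetical sketch of a transcript-driven generation strategy (not the paper's actual pipeline). It turns each transcript segment into a cloze-style question in isolation: because it never consults pixel data, frame ordering, or relations between segments, the resulting questions exercise text-like reasoning rather than temporal or causal video understanding.

```python
# Hypothetical sketch of transcript-driven question generation; the
# function and data below are illustrative, not from the paper.

def generate_qa_from_transcript(transcript_segments):
    """Turn each transcript segment into a naive cloze-style QA pair.

    `transcript_segments` is a list of (timestamp, text) tuples. The
    timestamps are carried along but never constrain the question, so
    no temporal or cross-segment structure is encoded.
    """
    qa_pairs = []
    for ts, text in transcript_segments:
        words = text.split()
        if len(words) < 3:
            continue  # too short to blank out a word meaningfully
        answer = words[-1]
        question = " ".join(words[:-1]) + " ____?"
        qa_pairs.append({"timestamp": ts, "question": question, "answer": answer})
    return qa_pairs

segments = [
    (12.0, "The chef slices the onion"),
    (34.5, "The pan catches fire"),
]
qa = generate_qa_from_transcript(segments)
# Each QA pair is derived from a single segment in isolation, so even a
# causally linked event sequence yields purely text-answerable questions.
```

A keyframe-based strategy has the analogous failure mode: each question is grounded in one static frame, so multi-event narratives and causal interactions between frames go untested.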

The paper raises the concern that such pipelines might not enforce the temporal and causal dependencies intrinsic to video understanding, motivating a diagnostic re-examination via Video-Oasis.

References

The cited passage reads: "Consequently, dataset construction pipelines frequently rely on automatic strategies, such as generating questions from selected keyframes or using LLMs to produce questions based on video transcripts. In this process, it becomes unclear whether the resulting benchmarks truly evaluate video-essential properties that distinguish the modality from others, such as temporal continuity, causal interaction, and multi-event narratives."

Video-Oasis: Rethinking Evaluation of Video Understanding (2603.29616, Lim et al., 31 Mar 2026), Section 2.2, Challenges in Video Benchmark Construction.