Validity of automatically constructed video benchmarks for video-essential properties
Determine whether evaluation datasets produced by automatic question-generation pipelines, which rely on selected keyframes or on large language models prompted with video transcripts, genuinely evaluate video-essential properties: temporal continuity, causal interaction, and multi-event narratives.
References
Consequently, dataset construction pipelines frequently rely on automatic strategies, such as generating questions from selected keyframes or using LLMs to produce questions based on video transcripts. In this process, it becomes unclear whether the resulting benchmarks truly evaluate video-essential properties that distinguish the modality from others, such as temporal continuity, causal interaction, and multi-event narratives.
— Video-Oasis: Rethinking Evaluation of Video Understanding
(2603.29616 - Lim et al., 31 Mar 2026) in Section 2.2, Challenges in Video Benchmark Construction
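The concern above can be made concrete with a toy sketch (entirely hypothetical, not from the paper): if a benchmark question is generated from a single selected keyframe, a degenerate model that inspects only that one frame can answer it, so the item never tests temporal continuity across frames.

```python
# Toy illustration (hypothetical): a "video" is a list of frame descriptions.
# A keyframe-based pipeline builds a QA pair from one frame, so answering it
# requires no information from any other frame.

def generate_question_from_keyframe(frames, index):
    """Build a (question, answer) pair from one selected keyframe."""
    obj = frames[index]["object"]
    return ("What object appears in the scene?", obj)

def single_frame_model(frame, question):
    """A degenerate 'model' that sees only the one keyframe."""
    return frame["object"]

frames = [
    {"t": 0, "object": "ball"},
    {"t": 1, "object": "ball"},
    {"t": 2, "object": "dog"},  # the temporal event: a dog enters later
]

q, a = generate_question_from_keyframe(frames, index=0)
# The single-frame model answers correctly without ever seeing frames 1-2,
# so this benchmark item does not require temporal understanding.
assert single_frame_model(frames[0], q) == a
```

A question that genuinely probed the modality (e.g., "What appears after the ball?") would be unanswerable from frame 0 alone, which is exactly the distinction the research question asks automatic pipelines to certify.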