VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation (2505.14640v1)

Published 20 May 2025 in cs.CV

Abstract: Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer; Second, a significant portion of questions in these benchmarks have strong priors to allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing questions with open-ended short-answer, which truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we conclude the following findings: (1) video LMMs show drastic performance (>25%) drops on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.

Summary

  • The paper introduces VideoEval-Pro, a novel benchmark with open-ended questions designed to provide a more robust and realistic evaluation of long video understanding models than traditional multiple-choice methods.
  • Experimental results on VideoEval-Pro show a significant performance drop (>25%) for LMMs compared to MCQ benchmarks and reveal that higher MCQ scores do not correlate with better open-ended performance.
  • VideoEval-Pro highlights the current limitations of LMMs in long video understanding, suggesting implications for real-world applications like surveillance and autonomous driving while guiding future model development.

Critical Evaluation of Long Video Understanding Benchmarks: An Analysis of VideoEval-Pro

Amid ongoing advances in large multimodal models (LMMs) for long video understanding (LVU), the paper "VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation" presents a compelling analysis of, and a potential solution to, the challenges facing existing benchmarks. The authors argue that current LVU benchmarks, which rely predominantly on multiple-choice questions (MCQs), produce inflated results that do not accurately reflect an LMM's ability to understand video content in its entirety. The paper is particularly relevant to researchers working on video understanding, as it proposes a new benchmark, VideoEval-Pro, that addresses these shortcomings with open-ended short-answer questions.

Key Findings and Contributions

The paper's pivotal claim is that existing LVU benchmarks are inflated, largely because MCQs simplify the problem space and allow models to reach high accuracy through random guessing or by exploiting priors in the questions, without genuine video understanding. For instance, Gemini-1.5-Pro reportedly achieves over 50% accuracy on the Video-MME benchmark when given only a single random frame from each long video. Counterintuitively, the paper also observes that increasing the number of input frames does not necessarily improve performance on these traditional benchmarks, which undermines their robustness and validity as evaluative standards.
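
The single-frame prior test can be made concrete with a short script. The sketch below is illustrative rather than the authors' evaluation code: it samples one random frame per video with OpenCV and queries a frame-capable model through a hypothetical `answer_mcq` callable.

```python
# Illustrative sketch of a single-frame prior check: how much of an MCQ
# benchmark can be answered from one random frame, without watching the video?
# `answer_mcq` is a hypothetical placeholder for any image-capable LMM call.
import random

import cv2  # pip install opencv-python


def sample_random_frame(video_path: str):
    """Return one randomly chosen frame (BGR numpy array) from the video."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, random.randrange(max(n_frames, 1)))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return frame


def single_frame_accuracy(examples, answer_mcq):
    """examples: dicts with 'video', 'question', 'options', 'answer' keys.
    answer_mcq(frame, question, options) -> predicted option letter (placeholder)."""
    correct = 0
    for ex in examples:
        frame = sample_random_frame(ex["video"])
        pred = answer_mcq(frame, ex["question"], ex["options"])
        correct += int(pred.strip().upper() == ex["answer"].strip().upper())
    return correct / len(examples)
```

A benchmark on which this probe already scores well above the random-guess baseline is, by the paper's argument, partially solvable without any long-video understanding.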

To address these limitations, VideoEval-Pro replaces MCQs with open-ended short-answer questions that demand comprehensive video understanding, covering both segment-level and full-video comprehension through perception and reasoning tasks. Experimental evaluations on VideoEval-Pro reveal three key insights:

  1. There exists a stark performance drop (>25%) for LMMs when transitioning from MCQs to open-ended questions.
  2. Higher scores on MCQs do not translate into better performance on open-ended questions, indicating that the two formats measure different capabilities.
  3. Accuracy on VideoEval-Pro continues to improve as more input frames are provided, highlighting its reliance on genuinely temporal information; a minimal frame-budget sweep illustrating this setup is sketched after this list.
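
The sweep below is a minimal sketch under stated assumptions, not the paper's evaluation harness: `answer_open_ended` and `is_correct` stand in for whichever model call and answer-matching scheme (for example, an LLM-based judge) are actually used, while the frame sampling itself uses standard OpenCV calls.

```python
# Hedged sketch of a frame-count ablation: uniformly sample N frames per video,
# answer the open-ended question, and track accuracy as N grows.
# `answer_open_ended` and `is_correct` are hypothetical placeholders.
import cv2  # pip install opencv-python
import numpy as np


def sample_uniform_frames(video_path: str, n_frames: int):
    """Uniformly sample n_frames frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num=n_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


def frame_budget_sweep(examples, answer_open_ended, is_correct,
                       budgets=(8, 32, 64, 128)):
    """Return {n_frames: accuracy} for each frame budget."""
    results = {}
    for n in budgets:
        correct = 0
        for ex in examples:
            frames = sample_uniform_frames(ex["video"], n)
            pred = answer_open_ended(frames, ex["question"])
            correct += int(is_correct(pred, ex["answer"]))
        results[n] = correct / len(examples)
    return results
```

On a benchmark that truly requires long-video understanding, the resulting accuracy curve should rise with the frame budget, which is the behavior the paper reports for VideoEval-Pro but not for several MCQ benchmarks.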

Benchmark Design and Evaluation

The construction of VideoEval-Pro involves rigorous data filtering and methodological adaptations. The authors implement multiple stages of filtering, from duration assessment to answerability checks, ensuring the relevance and challenge of the questions posed. This systematic approach results in a dataset comprising 1,289 high-quality question-answer pairs from varied video sources, with an average video duration of 38 minutes. The evaluation results provide a nuanced understanding of performance across various task categories—local perception, local reasoning, holistic perception, and holistic reasoning.
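
As an illustration only, the following sketch shows what such a filtering pipeline might look like in code. The 10-minute duration cutoff and the two answerability predicates are assumptions for this sketch, not the authors' exact criteria.

```python
# Hypothetical sketch of the described filtering stages: keep only questions
# from sufficiently long videos that cannot be answered from priors alone.
# The threshold and both predicates are illustrative assumptions.
MIN_DURATION_SECONDS = 10 * 60  # assumed "long video" cutoff for this sketch


def filter_qa_pairs(candidates, answerable_without_video, answerable_from_frame):
    """candidates: dicts with 'video', 'duration', 'question', 'answer' keys.
    The two predicates flag questions with strong priors; such questions are dropped."""
    kept = []
    for qa in candidates:
        if qa["duration"] < MIN_DURATION_SECONDS:
            continue  # duration filter: discard short videos
        if answerable_without_video(qa["question"]):
            continue  # prior filter: answerable from the question text alone
        if answerable_from_frame(qa["video"], qa["question"]):
            continue  # prior filter: answerable from a single random frame
        kept.append(qa)
    return kept
```

Structuring the filters as independent predicates makes it easy to audit how many candidate questions each stage removes.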

Significantly, the paper highlights a performance disparity between proprietary and open-source models on VideoEval-Pro, suggesting that open-source models may be brittle on complex LVU tasks despite their strong results on existing MCQ benchmarks. This distinction underscores the need for evaluation frameworks that map accurately to real-world comprehension capabilities.

Implications and Future Research Directions

The implications of this research are profound in the domains of video surveillance, autonomous driving, and instructional video summarization—all areas where reliable long video understanding is crucial. By presenting a more realistic and challenging benchmark, VideoEval-Pro sets the stage for future developments in LVU technologies and stimulates discussions on improving model robustness and fidelity.

Moving forward, research may pivot toward fine-tuning LMMs to better process extended temporal sequences and toward reinforcement learning techniques tailored for LVU. There is also potential for expanding the benchmark to cover broader video genres and applications, securing its relevance across an increasingly diverse array of use cases.

In conclusion, the VideoEval-Pro benchmark represents a significant step forward in LVU evaluation, providing a rigorous, realistic measure that seeks to faithfully evaluate the comprehension capabilities of current and future video LMMs. The paper’s methodological rigor and critical insights present substantial groundwork for subsequent research and technological evolution within the LVU space.