MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (2311.17005v4)

Published 28 Nov 2023 in cs.CV

Abstract: With the rapid development of Multi-modal LLMs (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.

Insights into MVBench: A Benchmark for Multi-Modal Video Understanding

The paper entitled "MVBench: A Comprehensive Multi-modal Video Understanding Benchmark" presents an in-depth exploration of the limitations of current multi-modal LLMs (MLLMs) and proposes an innovative benchmark designed to address existing deficiencies in video understanding tasks.

Overview of MVBench

The development of MVBench is driven by the realization that current diagnostic benchmarks fall short in evaluating the temporal comprehension capabilities of MLLMs. Whereas traditional benchmarks concentrate predominantly on static image-based tasks, MVBench transitions these tasks into a dynamic video context. This shift introduces 20 video tasks spanning a broad range of temporal skills, from perception to cognition. A distinctive static-to-dynamic method enables the systematic transformation of image tasks into video tasks, yielding challenges that cannot be solved from a single frame.
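To make the static-to-dynamic idea concrete, the sketch below pairs a few static image skills with temporal counterparts that cannot be answered from one frame. The pairings and names are illustrative assumptions for exposition, not MVBench's actual 20 task definitions.

```python
# Illustrative sketch of the static-to-dynamic task definition.
# The pairings below are assumptions for exposition, not MVBench's actual 20 tasks.

STATIC_TO_DYNAMIC = {
    # static (single-frame) skill     -> temporal skill that needs multiple frames
    "object existence":                 "moving object existence across frames",
    "object counting":                  "counting actions or moving objects over time",
    "attribute recognition":            "state change of an attribute over time",
    "scene recognition":                "scene transition between shots",
    "action recognition (single pose)": "action sequence and action prediction",
}

for static_skill, dynamic_task in STATIC_TO_DYNAMIC.items():
    print(f"{static_skill:34s} -> {dynamic_task}")
```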

Automatic QA Conversion and Evaluation Paradigm

A key aspect of the MVBench methodology is the automated conversion of existing public video annotations into a multiple-choice question-answering format. This automation minimizes manual intervention and grounds the evaluation in ground-truth video annotations, avoiding the biased scoring that can arise when LLMs act as judges. A carefully designed system prompt, paired with a simplified answer prompt, further constrains responses to a single option and keeps scoring objective and fair.
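The following is a minimal sketch of how such a conversion might look: a ground-truth annotation becomes a multiple-choice item whose correct answer is shuffled among distractors, with a system prompt and a short answer prompt attached. Field names and prompt wording are assumptions for illustration, not the benchmark's exact templates.

```python
import random

# Minimal sketch of converting a ground-truth video annotation into a
# multiple-choice QA item. Field names and prompt wording are assumptions,
# not MVBench's exact templates.

SYSTEM_PROMPT = (
    "Carefully watch the video and pay attention to the order of events. "
    "Then answer the question by choosing one of the given options."
)
ANSWER_PROMPT = "Best option: ("   # nudges the model to emit a single option letter

def build_mcqa(question: str, correct: str, distractors: list[str], seed: int = 0) -> dict:
    """Shuffle the ground-truth answer among distractors and record the correct letter."""
    rng = random.Random(seed)
    options = distractors + [correct]
    rng.shuffle(options)
    letters = "ABCD"[: len(options)]
    return {
        "system": SYSTEM_PROMPT,
        "question": question,
        "options": [f"({l}) {o}" for l, o in zip(letters, options)],
        "answer": letters[options.index(correct)],
        "answer_prompt": ANSWER_PROMPT,
    }

item = build_mcqa(
    question="What happened after the person picked up the cup?",
    correct="They put it on the table.",
    distractors=["They threw it away.", "They washed it.", "They drank from it."],
)
print(item["options"], "->", item["answer"])
```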

Video MLLM Baseline: VideoChat2

Given the observed inadequacies of existing MLLMs, particularly in temporal understanding, the paper introduces VideoChat2 as a stronger baseline. The model is built with progressive multi-modal training on diverse instruction-tuning data, and its architecture connects a vision encoder to the LLM through a streamlined QFormer that compresses video tokens before they reach the language model. On MVBench, VideoChat2 surpasses the leading models by over 15%.
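The sketch below illustrates the general shape of such a pipeline in PyTorch: a set of learnable queries cross-attends to frame tokens from a vision encoder, and a linear projection maps the compressed queries into the LLM's embedding space. All dimensions and module choices are placeholders, not the actual VideoChat2 configuration.

```python
import torch
import torch.nn as nn

# Conceptual sketch of a VideoChat2-style bridge: frame tokens from a vision
# encoder are compressed by a small query transformer ("QFormer") and projected
# into the LLM's embedding space. Dimensions and module choices are placeholders.

class SimpleQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, visual_tokens):                  # (B, T*P, dim) frame patch tokens
        q = self.queries.expand(visual_tokens.size(0), -1, -1)
        for layer in self.layers:                      # queries cross-attend to visual tokens
            q = layer(q, visual_tokens)
        return q                                       # (B, num_queries, dim)

class VideoToLLMBridge(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=4096):
        super().__init__()
        self.qformer = SimpleQFormer(dim=vis_dim)
        self.proj = nn.Linear(vis_dim, llm_dim)        # aligns query tokens with LLM embeddings

    def forward(self, visual_tokens):
        return self.proj(self.qformer(visual_tokens))  # (B, num_queries, llm_dim)

bridge = VideoToLLMBridge()
fake_frames = torch.randn(2, 8 * 196, 768)             # 8 frames x 196 patch tokens each
llm_inputs = bridge(fake_frames)
print(llm_inputs.shape)                                 # torch.Size([2, 32, 4096])
```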

Results and Implications

The evaluations conducted on MVBench reveal crucial insights into the current state of video MLLMs: many otherwise strong models lag considerably on tasks that require temporal reasoning. VideoChat2 narrows this gap with a substantial leap in performance, particularly on action, object, scene, pose, and attribute tasks, although challenges remain on position, count, and character-related tasks.
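For reference, scoring such a multiple-choice benchmark reduces to exact matching of the predicted option letter against the ground truth, aggregated per task and averaged across tasks. A minimal sketch follows; the record fields are assumptions for illustration.

```python
from collections import defaultdict

# Sketch of scoring multiple-choice predictions per task and averaging over tasks.
# Record fields are assumptions; MVBench reports accuracy per task plus an
# overall average across its 20 tasks.

def score(records):
    """records: iterable of dicts with 'task', 'prediction', 'answer' (option letters)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["prediction"].strip("() ").upper() == r["answer"])
    per_task = {t: hits[t] / totals[t] for t in totals}
    return per_task, sum(per_task.values()) / len(per_task)

per_task, avg = score([
    {"task": "Action Sequence", "prediction": "(B)", "answer": "B"},
    {"task": "Action Sequence", "prediction": "(C)", "answer": "A"},
    {"task": "Moving Count",    "prediction": "(D)", "answer": "D"},
])
print(per_task, f"avg={avg:.2f}")
```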

The findings carry substantial implications for future MLLM development. Attention should be directed toward strengthening grounding and reasoning capabilities and toward deeper integration of multi-modal data. The results also suggest that incorporating additional modalities, such as depth and audio, could further augment video comprehension.

Future Directions

The framework provided by MVBench paves the way for comprehensive evaluation and development of MLLMs capable of nuanced temporal understanding. There remains a vast potential for innovation in video-based AI models, including refining data annotations and extending the range of evaluation strategies. As research continues, MVBench will likely play an essential role in the progression toward more sophisticated, generalized video understanding models.

Overall, the paper provides a foundational contribution to the field of multi-modal AI by realigning evaluation benchmarks with the dynamic realities of video content. As AI continues to evolve, benchmarks like MVBench will be critical in guiding the design and training of next-generation video understanding models.

Authors (12)
  1. Yali Wang (78 papers)
  2. Yinan He (34 papers)
  3. Yizhuo Li (21 papers)
  4. Yi Wang (1038 papers)
  5. Yi Liu (543 papers)
  6. Zun Wang (42 papers)
  7. Jilan Xu (32 papers)
  8. Guo Chen (107 papers)
  9. Ping Luo (340 papers)
  10. Limin Wang (221 papers)
  11. Yu Qiao (563 papers)
  12. KunChang Li (43 papers)
Citations (198)