An Expert Overview of MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Multimodal Foundation Models (MFMs) have advanced rapidly in recent years, yet comprehensive evaluation of expert-level reasoning over video remains largely underexplored. The paper "MMVU: Measuring Expert-Level Multi-Discipline Video Understanding" introduces MMVU, a benchmark designed to close this gap by evaluating MFMs across specialized domains. The authors describe a structured annotation process for building a dataset that tests whether MFMs can understand and reason about videos at an expert level across multiple disciplines.
MMVU distinguishes itself from existing benchmarks in three main ways. First, it demands domain-specific knowledge and expert-level reasoning beyond basic visual perception, requiring models to interpret specialized, knowledge-intensive content. Second, each example is annotated from scratch by human experts under rigorous quality-control practices, ensuring high quality and reliability. Finally, each example is accompanied by an expert-annotated reasoning rationale and the relevant domain knowledge, which enables in-depth performance analysis and insightful error analysis.
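To make this annotation format concrete, the sketch below shows one plausible way such an example could be represented; the field names are illustrative assumptions rather than the dataset's official schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MMVUExample:
    """One expert-annotated example (illustrative schema, not the official one)."""
    video_path: str        # Creative Commons licensed video clip
    discipline: str        # e.g., "Science", "Healthcare"
    subject: str           # one of the 27 subjects
    question: str          # expert-written, knowledge-intensive question
    choices: List[str]     # answer options for the multiple-choice subset
    answer: str            # gold answer
    rationale: str         # expert-annotated step-by-step reasoning
    domain_knowledge: List[str] = field(default_factory=list)  # referenced concepts
```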
The MMVU benchmark includes 3,000 expert-annotated questions across 27 subjects within four core disciplines: Science, Healthcare, Humanities and Social Sciences, and Engineering. These questions are sourced from 1,529 videos covering a diverse range of topics. Through a textbook-guided protocol, expert annotators identify key concepts that require dynamic visual illustration, curate relevant video clips from Creative Commons licensed content, and formulate questions that demand the integration of domain knowledge. MMVU is distinctive in requiring models to reason over both the temporal dynamics and the complex procedural knowledge inherent in video content.
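Reusing the illustrative schema above, a small helper like the following could reproduce this coverage breakdown from a loaded copy of the benchmark; with the full dataset it should report 3,000 questions spread over 27 subjects and four disciplines.

```python
from collections import Counter, defaultdict

def summarize_coverage(examples):
    """Tally questions per discipline and count distinct subjects in each."""
    per_discipline = Counter(ex.discipline for ex in examples)
    subjects_by_discipline = defaultdict(set)
    for ex in examples:
        subjects_by_discipline[ex.discipline].add(ex.subject)
    for discipline, n_questions in per_discipline.most_common():
        n_subjects = len(subjects_by_discipline[discipline])
        print(f"{discipline}: {n_questions} questions across {n_subjects} subjects")
```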
The paper presents a detailed evaluation of 32 state-of-the-art multimodal foundation models. Models with System-2-style reasoning capabilities, such as o1 and Gemini 2.0 Flash Thinking, performed best among them. Even these advanced systems, however, fall short of human expert-level performance, underscoring the challenges MFMs still face in comprehensive video understanding and reasoning. For instance, GPT-4o scored 66.7% in the open-book setting, well below the human benchmark of 86.8%.
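For the multiple-choice portion, headline numbers like these come down to answer accuracy. The helper below is a simplified stand-in for the paper's full evaluation protocol, which also covers open-ended answers and more careful answer matching.

```python
def accuracy(predictions, references):
    """Exact-match accuracy over predicted vs. gold answers (simplified)."""
    assert len(predictions) == len(references), "prediction/reference count mismatch"
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# e.g., accuracy(["B", "c", "A"], ["B", "C", "D"]) -> 0.666...
```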
The paper also investigates Chain-of-Thought (CoT) reasoning, showing that prompting models to reason step by step before committing to an answer generally improves performance. This suggests that future research should focus on refining reasoning strategies that make effective use of multimodal inputs.
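The contrast between direct answering and CoT prompting can be sketched as two prompt templates; the exact wording used in the paper differs, so treat these as illustrative.

```python
def direct_prompt(question, choices):
    """Direct-answer prompt: ask the model to pick an option immediately."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"Question: {question}\n{options}\n"
        "Answer with the letter of the correct option only."
    )

def cot_prompt(question, choices):
    """Chain-of-Thought prompt: elicit step-by-step reasoning before the answer."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"Question: {question}\n{options}\n"
        "Think step by step about what the video shows and which domain knowledge "
        "applies, then conclude with 'Final answer: <letter>'."
    )

# Either prompt would be sent to the model under test together with sampled video
# frames; the video handling and API call are model-specific and omitted here.
```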
This research has broad implications for progress toward AI that reasons at an expert level in specialized fields. The insights gained from evaluating MFMs on MMVU can inform the development of systems with a deeper, more comprehensive understanding of complex, dynamic environments. Guided by MMVU's findings, future work may focus on strengthening CoT strategies and on better integrating diverse data modalities for more accurate, context-aware reasoning.
Through detailed error analyses and case studies, the authors provide actionable insights into the persistent challenges of video-based expert reasoning. By offering a granular evaluation framework, MMVU sets a new bar for evaluating multimodal models and enables targeted improvements. The paper serves as both a benchmark of current capabilities and a catalyst for ongoing research in knowledge-intensive video understanding within specialized domains.