The Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark represents a significant step forward in evaluating progress toward expert-level AGI, specifically in multimodal understanding and reasoning. MMMU challenges models with comprehensive, college-exam-style questions spanning six broad disciplines, 30 college subjects, and 183 subfields. The disciplines are Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
What sets MMMU apart is its collection of 11.5K questions meticulously gathered from college exams, quizzes, and textbooks. The 30 highly heterogeneous image types, ranging from engineering blueprints to histological slides, reflect the benchmark's emphasis on deep multimodal engagement. Images are interleaved with text, so models must analyze and reason over combined visual and textual cues that demand domain expertise.
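For readers who want to inspect the data directly, here is a minimal sketch of loading one MMMU subject with the Hugging Face `datasets` library; the repository id, subject config name, split name, and field names are assumptions about the public release rather than details stated in this section.

```python
# Minimal sketch: inspect one MMMU subject via Hugging Face `datasets`.
# The repo id ("MMMU/MMMU"), subject config ("Accounting"), split name, and
# field names below are assumptions about the public release, not guarantees.
from datasets import load_dataset

ds = load_dataset("MMMU/MMMU", "Accounting", split="validation")

example = ds[0]
print(example["question"])   # question text, with <image 1>-style placeholders
print(example["options"])    # multiple-choice options
print(example["answer"])     # gold answer letter, e.g. "B"
print(example["image_1"])    # first interleaved image (a PIL.Image object)
```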
MMMU's goal is to test whether models possess expert-level perception and advanced reasoning skills. It has been used to evaluate state-of-the-art LLMs and Large Multimodal Models (LMMs), including OpenFlamingo and GPT-4, among others. The results reveal significant challenges for even the most advanced models: GPT-4V(ision) achieved only about 56% accuracy. A closer examination of 150 mispredicted GPT-4V cases shows that 35% of errors can be attributed to perception, 29% to a lack of domain knowledge, and 26% to flaws in the reasoning process, with the remaining roughly 10% falling into other categories, signaling specific areas for further research and model improvement.
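As a quick sanity check on those figures, the sketch below converts the reported error shares over the 150 analyzed cases into approximate case counts; the category labels are just shorthand for the breakdown quoted above.

```python
# Convert the reported GPT-4V error shares (over 150 analyzed mispredictions)
# into approximate case counts; the ~10% remainder covers other error types.
TOTAL_CASES = 150
error_shares = {
    "perception": 0.35,
    "lack of domain knowledge": 0.29,
    "reasoning flaws": 0.26,
}

for category, share in error_shares.items():
    print(f"{category}: ~{share * TOTAL_CASES:.0f} cases")

other = 1.0 - sum(error_shares.values())
print(f"other: ~{other * TOTAL_CASES:.0f} cases")
```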
Furthermore, the benchmark reveals clear performance differences across disciplines. In fields such as Art & Design and Humanities & Social Science, where the visual inputs are comparatively manageable, models score higher than in more demanding domains such as Business, Science, Health & Medicine, and Tech & Engineering. Unsurprisingly, open-source models lag behind proprietary models like GPT-4V(ision), highlighting a persistent gap in multimodal understanding capabilities.
MMMU is not a definitive test for Expert AGI, as it does not yet encompass the range of tasks an AGI should handle, and it does not directly measure performance against the 90th percentile of skilled adults. Nevertheless, it serves as a crucial component in evaluating a model’s proficiency in domain-specific knowledge and expert-level reasoning and understanding.
In conclusion, MMMU is a robust and challenging multimodal benchmark that pushes the boundaries of multimodal foundation models. Its approach of interleaving diverse image types with domain-specific questions sets a valuable precedent for evaluating progress toward expert-level AGI and should help drive the AI community toward further advances.