MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (2404.16006v1)
Abstract: Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover only a limited number of multimodal tasks that test rudimentary capabilities, falling short of tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $30$ LVLMs, such as the proprietary GPT-4V and GeminiProVision and the open-source InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
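Because MMT-Bench consists of multi-choice visual questions, evaluation reduces to comparing each model's selected option against the ground-truth letter and aggregating accuracy overall and per subtask. The Python sketch below illustrates this scoring step under assumed field names (`id`, `subtask`, `answer`); it is a minimal illustration, not the benchmark's released evaluation code.

```python
# Minimal scoring sketch (illustrative, not the authors' released code).
# Field names "id", "subtask", and "answer" are assumptions for this example.
from collections import defaultdict


def score_multiple_choice(examples, predictions):
    """Compute overall and per-subtask multiple-choice accuracy.

    examples:    list of dicts with keys "id", "subtask", "answer"
                 (answer is a choice letter such as "A"/"B"/"C"/"D").
    predictions: dict mapping example "id" -> predicted choice letter.
    """
    correct, total = 0, 0
    per_task = defaultdict(lambda: [0, 0])  # subtask -> [correct, total]

    for ex in examples:
        pred = predictions.get(ex["id"], "").strip().upper()
        hit = int(pred == ex["answer"].strip().upper())
        correct += hit
        total += 1
        per_task[ex["subtask"]][0] += hit
        per_task[ex["subtask"]][1] += 1

    overall = correct / total if total else 0.0
    by_task = {task: c / n for task, (c, n) in per_task.items()}
    return overall, by_task


if __name__ == "__main__":
    examples = [
        {"id": "q1", "subtask": "vehicle_driving", "answer": "B"},
        {"id": "q2", "subtask": "embodied_navigation", "answer": "D"},
    ]
    predictions = {"q1": "B", "q2": "A"}
    overall, by_task = score_multiple_choice(examples, predictions)
    print(f"overall accuracy: {overall:.2f}")  # 0.50
    print(by_task)
```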
- Kaining Ying
- Fanqing Meng
- Jin Wang
- Zhiqian Li
- Han Lin
- Yue Yang
- Hao Zhang
- Wenbo Zhang
- Yuqi Lin
- Shuo Liu
- Jiayi Lei
- Quanfeng Lu
- Runjian Chen
- Peng Xu
- Renrui Zhang
- Haozhe Zhang
- Peng Gao
- Yali Wang
- Yu Qiao
- Ping Luo