MMBench: Evaluating Multi-modal Models
The paper "MMBench: Is Your Multi-modal Model an All-around Player?" presents a novel approach to benchmarking large vision-LLMs (LVLMs). Given recent advances in LVLMs, which exhibit significant capabilities in perception and reasoning, assessing these models comprehensively has become challenging. Traditional benchmarks like VQAv2 and COCO Caption offer quantitative metrics but lack detailed ability delineation. Meanwhile, subjective benchmarks such as OwlEval, relying on human evaluation, face scalability and bias issues. MMBench addresses these limitations with a more systematic and objective evaluation methodology.
Core Contributions
MMBench is primarily composed of two components: a curated dataset and a new evaluation strategy, CircularEval. The dataset surpasses existing benchmarks in the variety and number of multiple-choice evaluation questions, organized across 20 ability dimensions. CircularEval yields more robust results by posing each question multiple times with the answer choices circularly shifted, counting the question as solved only if the model answers correctly in every pass; ChatGPT is used to map free-form model outputs onto the pre-defined choices. Together, these measures ensure that models are evaluated on their true ability to produce coherent, contextually relevant predictions rather than rewarded for lucky guesses or answer-position biases.
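To make the evaluation logic concrete, here is a minimal sketch of the CircularEval idea as described above. It is not the paper's implementation: `ask_model` (the LVLM under test) and `extract_choice` (the step that maps free-form output to a choice label) are hypothetical callables introduced only for illustration.

```python
def circular_eval(question, options, answer_key, ask_model, extract_choice):
    """Sketch of CircularEval: pose the same question N times (N = number of
    options), each time with the options circularly shifted. The question
    counts as solved only if the model picks the correct option in every pass."""
    n = len(options)
    labels = [chr(ord("A") + i) for i in range(n)]      # "A", "B", "C", ...
    correct_text = options[labels.index(answer_key)]     # ground-truth option text

    for shift in range(n):
        shifted = options[shift:] + options[:shift]      # circularly shift the option order
        prompt = question + "\n" + "\n".join(
            f"{label}. {opt}" for label, opt in zip(labels, shifted)
        )
        predicted_label = extract_choice(ask_model(prompt), labels, shifted)
        # The correct label moves with the shift, so compare by option text instead.
        if predicted_label is None or shifted[labels.index(predicted_label)] != correct_text:
            return False                                  # one failed pass fails the whole question
    return True
```

Because the correct answer occupies a different position in each pass, a model that merely favors a particular letter cannot pass, which is the robustness property the strategy is designed to provide.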
Numerical Results and Claims
The empirical evaluation conducted with MMBench is extensive, covering 14 well-known large vision-language models. Notably, the inclusion of object localization data in the training set significantly enhances model performance, particularly for Kosmos-2 and Shikra, which demonstrate superior performance across numerous L-2 abilities. In contrast, models like OpenFlamingo and MMGPT perform considerably worse, underscoring the diverse strengths and weaknesses of current LVLMs.
Implications and Future Directions
This research has practical implications in the field of multi-modal AI, providing a robust tool for the comprehensive assessment of model capabilities. The inclusion of abilities such as fine-grained perception, logical reasoning, and social reasoning reflects the need for nuanced evaluation criteria that can better guide future model development. Moreover, the methods introduced with MMBench, especially CircularEval and the use of LLMs like ChatGPT for choice extraction (sketched below), highlight potential avenues for improving evaluation protocols more broadly.
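The choice-extraction step can be pictured as a two-stage matcher: cheap heuristics first, an LLM fallback only when they fail. The sketch below follows that description under stated assumptions; `llm_match` is a hypothetical wrapper around a ChatGPT-style call, not an API from the paper, and the heuristics are illustrative rather than the authors' exact rules.

```python
def extract_choice(model_output, labels, options, llm_match=None):
    """Map a free-form model answer to one of the pre-defined choice labels.
    Try heuristic matching first; fall back to an LLM-based matcher if provided."""
    text = model_output.strip()

    # Heuristic 1: the output starts with a bare label such as "B" or "B.".
    for label in labels:
        if text.upper().startswith(label) and (len(text) == 1 or not text[1].isalnum()):
            return label

    # Heuristic 2: exactly one option's text appears verbatim in the output.
    hits = [label for label, opt in zip(labels, options) if opt.lower() in text.lower()]
    if len(hits) == 1:
        return hits[0]

    # Fallback: delegate ambiguous cases to an LLM matcher; None if unavailable.
    return llm_match(text, labels, options) if llm_match else None
```

A function of this shape plugs directly into the `extract_choice` slot of the CircularEval sketch above, illustrating how heuristic matching and LLM-assisted extraction combine into a single objective scoring pipeline.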
Future research can expand upon this benchmark by incorporating additional ability dimensions, adapting the evaluation method for few-shot learning, and exploring its application across other AI systems. The novel integration of robust evaluation strategies could also influence the development of more sophisticated training paradigms for LVLMs.
Conclusion
MMBench represents a significant stride toward more detailed and reliable evaluation of multi-modal models. By addressing the limitations of previous benchmarks and proposing a novel evaluation framework, it sets the stage for more rigorous and comprehensive assessment of complex AI systems in the domain of vision-language interaction.