MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria (2311.13951v3)
Abstract: Multimodal LLMs (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs mostly assess queries without considering user experience, and thus inadequately address the nuances of creative and associative multimodal tasks. The open-ended and subjective nature of such tasks, however, poses a significant challenge to evaluation, since it is difficult to define ground-truth answers for them. To this end, we propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating evaluation samples across six comprehensive cognitive levels. We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark manifests itself in an 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria. See the online leaderboard at \url{https://mLLM-bench.LLMzoo.com}.
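The pairwise, per-sample-criterion judging described in the abstract can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' released pipeline: it assumes an OpenAI-style vision API, a judge model name (`gpt-4o`), prompt wording, and a helper `encode_image` that are all our own; only the overall protocol (show the judge the image, the question, the sample-specific criterion, and two candidate answers, then ask for a verdict) follows the paradigm described above.

```python
# Illustrative sketch of per-sample-criterion pairwise judging (assumptions noted above).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def judge_pair(image_path: str, question: str, criterion: str,
               answer_a: str, answer_b: str) -> str:
    """Ask a judge MLLM which answer better satisfies this sample's criterion."""
    prompt = (
        f"Question about the image: {question}\n"
        f"Evaluation criterion for this sample: {criterion}\n\n"
        f"Answer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
        "Compare the two answers against the criterion and reply with exactly "
        "one of: 'A', 'B', or 'Tie'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; the paper uses a potent MLLM as judge
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            ],
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

Supplying a criterion per sample is what lets open-ended, subjective tasks be compared without a fixed ground-truth answer: the judge scores each pair of responses against that sample's own rubric rather than against a reference answer.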
Authors: Wentao Ge, Shunian Chen, Guiming Hardy Chen, Junying Chen, Zhihong Chen, Nuo Chen, Wenya Xie, Shuo Yan, Chenghao Zhu, Ziyue Lin, Dingjie Song, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang