MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria (2311.13951v3)
Abstract: Multimodal LLMs (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs mostly assess queries without considering user experience, and thus inadequately address the nuances of creative and associative multimodal tasks. The open-ended and subjective nature of such tasks, however, poses a significant challenge to evaluation, since it is difficult to define ground-truth answers for them. To this end, we propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating evaluation samples across six comprehensive cognitive levels. We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark manifests itself in an 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria. See the online leaderboard at \url{https://mLLM-bench.LLMzoo.com}.
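The pairwise, per-sample-criterion judging described in the abstract can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' released pipeline: it assumes an OpenAI-style vision API, a judge model name (`gpt-4o`), prompt wording, and a helper `encode_image` that are all our own; only the overall protocol (show the judge the image, the question, the sample-specific criterion, and two candidate answers, then ask for a verdict) follows the paradigm described above.

```python
# Illustrative sketch of per-sample-criterion pairwise judging (assumptions noted above).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def judge_pair(image_path: str, question: str, criterion: str,
               answer_a: str, answer_b: str) -> str:
    """Ask a judge MLLM which answer better satisfies this sample's criterion."""
    prompt = (
        f"Question about the image: {question}\n"
        f"Evaluation criterion for this sample: {criterion}\n\n"
        f"Answer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
        "Compare the two answers against the criterion and reply with exactly "
        "one of: 'A', 'B', or 'Tie'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; the paper uses a potent MLLM as judge
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            ],
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

Supplying a criterion per sample is what lets open-ended, subjective tasks be compared without a fixed ground-truth answer: the judge scores each pair of responses against that sample's own rubric rather than against a reference answer.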
Authors: Wentao Ge, Shunian Chen, Guiming Hardy Chen, Junying Chen, Zhihong Chen, Nuo Chen, Wenya Xie, Shuo Yan, Chenghao Zhu, Ziyue Lin, Dingjie Song, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang