MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models (2306.13394v4)

Published 23 Jun 2023 in cs.CV

Abstract: Multimodal LLM (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data application manner and online leaderboards are released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

A Comprehensive Evaluation Benchmark for Multimodal LLMs

The paper "MME: A Comprehensive Evaluation Benchmark for Multimodal LLMs" presents MME, a benchmark designed to evaluate the capabilities of Multimodal LLMs (MLLMs). The authors identify a significant gap in the evaluation of MLLMs—existing methods do not comprehensively assess their diverse abilities. By addressing both perception and cognition, MME provides a robust benchmark encompassing 14 subtasks.

Key Contributions

The authors make several distinct contributions:

  1. Comprehensive Benchmark: MME evaluates MLLMs on both perception and cognition, providing a nuanced measure of their abilities across 14 subtasks (10 perception, 4 cognition) that span coarse-grained and fine-grained recognition as well as reasoning.
  2. Manual Annotation: To avoid data leakage, all instruction-answer pairs are manually constructed. This ensures that the models are evaluated on genuine understanding rather than memorization.
  3. Concise Instructions: The benchmark emphasizes concise instruction design, mitigating the impact of prompt engineering and focusing on model capability.
  4. Evaluation Metrics: Each instruction is a yes/no question, and two questions are posed per image. Accuracy is computed per question, while the stricter accuracy+ counts an image as correct only if both of its questions are answered correctly, reflecting a more comprehensive understanding of the image (a minimal scoring sketch follows this list).
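
The sketch below shows, under stated assumptions, how such a yes/no subtask can be scored. It is not the authors' released evaluation code; the record layout and the helper names (`normalize_answer`, `score_subtask`) are illustrative.

```python
from collections import defaultdict

def normalize_answer(response: str) -> str:
    """Map a free-form model response to 'yes', 'no', or 'other'.

    Anything other than a recognizable yes/no counts as wrong, which also
    surfaces instruction-following failures.
    """
    text = response.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return "other"

def score_subtask(results: list[dict]) -> tuple[float, float, float]:
    """Compute accuracy, accuracy+, and their sum for one subtask.

    `results` is assumed to hold two records per image:
    {"image_id": ..., "prediction": ..., "label": "yes" or "no"}.
    Accuracy is per question; accuracy+ requires both questions about an
    image to be correct. Each subtask therefore tops out at 200 points.
    """
    per_image = defaultdict(list)
    for r in results:
        per_image[r["image_id"]].append(normalize_answer(r["prediction"]) == r["label"])

    answers = [ok for pair in per_image.values() for ok in pair]
    accuracy = 100.0 * sum(answers) / len(answers)
    accuracy_plus = 100.0 * sum(all(pair) for pair in per_image.values()) / len(per_image)
    return accuracy, accuracy_plus, accuracy + accuracy_plus
```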

Detailed Evaluation

The benchmark provides a detailed evaluation of 30 advanced MLLMs, including well-known models such as GPT-4V and BLIP-2. The results reveal substantial variability in capability across models and tasks.

  • Perception Tasks: These involve the recognition and understanding of visual elements such as existence, count, position, and color of objects, as well as fine-grained tasks such as identifying specific scenes or artworks. The paper reveals that models display notable differences, with some excelling in specific areas like object existence while facing challenges in object position perception.
  • Cognition Tasks: These tasks require reasoning that combines visual perception with the knowledge stored in the underlying LLM, covering commonsense reasoning, numerical calculation, text translation, and code reasoning. The results suggest that existing MLLMs need further development to perform consistently well across these subtasks (the sketch below shows how subtask scores roll up into the reported totals).
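
As a rough illustration of how per-subtask scores combine into the totals reported on the MME leaderboard (a perception score out of 2000 across ten subtasks and a cognition score out of 800 across four), the sketch below aggregates outputs of the hypothetical `score_subtask` helper shown earlier; the subtask names follow the paper, but the aggregation code itself is an assumption.

```python
# Subtask names follow the paper; each subtask score is accuracy + accuracy+ (max 200).
PERCEPTION_SUBTASKS = [
    "existence", "count", "position", "color", "poster",
    "celebrity", "scene", "landmark", "artwork", "OCR",
]
COGNITION_SUBTASKS = [
    "commonsense_reasoning", "numerical_calculation",
    "text_translation", "code_reasoning",
]

def aggregate(subtask_scores: dict[str, float]) -> dict[str, float]:
    """Sum per-subtask scores into the two headline totals.

    Perception tops out at 10 * 200 = 2000 points, cognition at 4 * 200 = 800.
    Missing subtasks simply contribute zero here.
    """
    return {
        "perception": sum(subtask_scores.get(n, 0.0) for n in PERCEPTION_SUBTASKS),
        "cognition": sum(subtask_scores.get(n, 0.0) for n in COGNITION_SUBTASKS),
    }
```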

Identified Challenges and Future Directions

The paper highlights several challenges identified through the evaluation:

  • Instruction Following: Some MLLMs fail to follow even the benchmark's concise instructions, responding with free-form text rather than a simple yes or no, which indicates a gap in effective instruction following.
  • Basic Perception and Reasoning: Many models struggle with basic perception tasks, particularly in more nuanced scenarios like counting and spatial recognition, as well as in logical reasoning required for tasks such as arithmetic calculations.
  • Object Hallucination: A prominent issue is that models sometimes describe objects that are not present in the image, highlighting the need for improved visual grounding.

Implications and Future Work

The introduction of MME provides crucial insights into the current capabilities and limitations of MLLMs. The paper suggests that while these models exhibit impressive emergent abilities, there is substantial room for improvement, particularly in reducing hallucinations and following instructions more reliably. Future research can leverage MME to benchmark advances in MLLM architectures and training methods that aim to address these limitations.

This benchmark represents a foundational step in the evaluation of multimodal AI systems, contributing valuable data and insights that can drive the next wave of innovations in AI research and development.

Authors (12)
  1. Chaoyou Fu
  2. Peixian Chen
  3. Yunhang Shen
  4. Yulei Qin
  5. Mengdan Zhang
  6. Xu Lin
  7. Jinrui Yang
  8. Xiawu Zheng
  9. Ke Li
  10. Xing Sun
  11. Yunsheng Wu
  12. Rongrong Ji