Are We on the Right Way for Evaluating Large Vision-Language Models? (2403.20330v2)

Published 29 Mar 2024 in cs.CV

Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some visual-necessary questions without the visual content, indicating that these samples were memorized during large-scale training. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first coarsely selected from current benchmarks with an automated pipeline; human review is then applied to ensure each curated sample exhibits visual dependency, minimal data leakage, and the need for advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

Evaluating Large Vision-Language Models: A New Benchmark and Metrics

Introduction to MMStar and New Metrics

Recent advances in Large Vision-Language Models (LVLMs) have created a need for benchmarks that reliably assess these models' genuine multi-modal capabilities. An examination of current evaluation methodologies identified two significant problems: many samples do not actually require their visual content to be answered, and unintentional data leakage occurs during LLM and LVLM training. To address these issues, the paper introduces MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 rigorously selected, human-reviewed samples. It is designed to evaluate the genuine multi-modal understanding of LVLMs across six core capabilities and eighteen detailed axes. In addition, two new metrics, Multi-modal Gain (MG) and Multi-modal Leakage (ML), are developed to measure the performance gains attributable to multi-modal training and the degree of data leakage, respectively.

Methodology of MMStar Benchmark Creation

The MMStar benchmark starts from a comprehensive collection of samples drawn from existing benchmarks, focusing on the areas where those benchmarks fall short. The curation process involved two distinct stages:

  • Automated Filtering: An initial coarse filtering pass applied a set of criteria intended to ensure visual dependency and minimize data leakage. Eight powerful LLMs were used for this preliminary selection, removing samples that could be answered without visual input or that showed evidence of leakage from LLM training data (a hedged sketch of this stage follows this list).
  • Human Review: Subsequently, a stringent human review process ensured that selected samples necessitate visual understanding, cover a wide array of multi-modal capabilities, and present various difficulty levels. This phase solidified MMStar's goal to offer a benchmark that not only challenges LVLMs across multiple dimensions but does so with high-quality, meticulously vetted samples.
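
The summary does not give the exact filtering criteria, so the following is only a minimal sketch of how the coarse LLM-based stage could work. It assumes a hypothetical `ask_llm(model, question, options)` helper that returns a model's chosen option when shown only the text, and treats a sample as lacking visual dependency (or as likely leaked) if too many text-only LLMs already answer it correctly; the helper name and the threshold are illustrative, not the authors' implementation.

```python
# Hedged sketch of the coarse, LLM-based filtering stage (not the authors' code).
# Assumption: ask_llm(model, question, options) returns the model's chosen option
# string when given only the question and options (no image).
from typing import Callable, Iterable

def coarse_filter(
    samples: Iterable[dict],
    llms: list[str],
    ask_llm: Callable[[str, str, list[str]], str],
    max_text_only_hits: int = 2,   # illustrative threshold, not from the paper
) -> list[dict]:
    """Keep samples that most text-only LLMs fail to answer correctly."""
    kept = []
    for sample in samples:
        # Count how many LLMs answer correctly WITHOUT seeing the image.
        hits = sum(
            ask_llm(model, sample["question"], sample["options"]) == sample["answer"]
            for model in llms
        )
        # Many text-only hits suggest weak visual dependency or data leakage.
        if hits <= max_text_only_hits:
            kept.append(sample)
    return kept
```

Samples surviving this coarse pass would then go to the human-review stage described above.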

Core Capabilities and Dimensions

MMStar benchmarks LVLMs across six core capabilities: Coarse Perception (CP), Fine-grained Perception (FP), Instance Reasoning (IR), Logical Reasoning (LR), Science & Technology (ST), and Mathematics (MA), each split into three detailed axes. This comprehensive structure ensures a holistic evaluation of LVLMs' abilities to process and understand visual and textual content in tandem.
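To make the six-capability, eighteen-axis structure concrete, the sketch below shows one way per-capability scores might be aggregated from per-sample results. The record schema is an assumption, and the axis labels in the example are placeholders, since the summary does not list the paper's detailed axes.

```python
# Hedged sketch: aggregating MMStar-style results by core capability.
# The record schema and placeholder axis labels are illustrative assumptions.
from collections import defaultdict

CORE_CAPABILITIES = ["CP", "FP", "IR", "LR", "ST", "MA"]  # six MMStar capabilities

def accuracy_by_capability(results: list[dict]) -> dict[str, float]:
    """results: [{"capability": "CP", "axis": "axis_1", "correct": True}, ...]"""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["capability"]] += 1
        correct[r["capability"]] += int(r["correct"])
    return {
        cap: correct[cap] / totals[cap]
        for cap in CORE_CAPABILITIES
        if totals[cap] > 0
    }

# Example with placeholder axis labels (not the paper's 18 axes):
demo = [
    {"capability": "CP", "axis": "axis_1", "correct": True},
    {"capability": "MA", "axis": "axis_3", "correct": False},
]
print(accuracy_by_capability(demo))  # {'CP': 1.0, 'MA': 0.0}
```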

Introducing MG and ML Metrics

The paper proposes two new metrics to address these evaluation pitfalls:

  • Multi-modal Gain (MG): quantifies the actual performance improvement attributable to multi-modal training, clarifying how effectively an LVLM leverages visual information beyond text alone.
  • Multi-modal Leakage (ML): assesses the degree to which data leakage, i.e., the unintended inclusion of evaluation samples in training data, inflates scores, enabling fairer comparisons among models (see the sketch after this list).
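
The summary does not spell out how MG and ML are computed, so the sketch below is only a plausible reading: it assumes MG compares an LVLM's benchmark score with and without images, and ML compares the LVLM's image-free score against its LLM backbone's score, clipped at zero. The exact definitions should be taken from the paper; the numbers in the usage example are made up.

```python
# Hedged sketch of the Multi-modal Gain (MG) and Multi-modal Leakage (ML) metrics.
# Assumed definitions (check the paper for the exact formulas):
#   MG = LVLM score with images - LVLM score without images
#   ML = max(0, LVLM score without images - LLM backbone score)

def multi_modal_gain(score_with_images: float, score_without_images: float) -> float:
    """How much the visual input actually helps the LVLM on a benchmark."""
    return score_with_images - score_without_images

def multi_modal_leakage(score_without_images: float, llm_backbone_score: float) -> float:
    """How far the LVLM exceeds its text-only backbone even without images,
    which points to evaluation samples leaking into multi-modal training data."""
    return max(0.0, score_without_images - llm_backbone_score)

# Hypothetical usage with made-up accuracy percentages:
mg = multi_modal_gain(score_with_images=55.0, score_without_images=40.0)      # -> 15.0
ml = multi_modal_leakage(score_without_images=40.0, llm_backbone_score=30.0)  # -> 10.0
```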

Evaluation and Findings

Evaluating 16 state-of-the-art LVLMs on MMStar, and applying the MG/ML metrics across seven popular benchmarks, shows that even the top-performing models fall short on certain core capabilities, underscoring how challenging MMStar is. The MG and ML metrics also reveal clear distinctions among LVLMs in how much they actually gain from multi-modal training and how well they control data leakage.

Implications and Direction for Future Research

The introduction of the MMStar benchmark and the MG and ML metrics marks a significant step towards more accurate evaluation and understanding of LVLMs. The paper's findings underscore the importance of deliberate, careful construction of evaluation benchmarks and metrics to truly advance our comprehension of multi-modal AI capabilities. Looking ahead, the continued expansion of MMStar and more dynamic evaluation methodologies promise to push the boundaries of what we expect from, and how we assess, LVLMs.

References (54)
  1. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  2. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  3. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  5. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
  6. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
  7. Can vision-language models think from a first-person perspective? arXiv preprint arXiv:2311.15596, 2023.
  8. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  9. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  10. OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  11. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  12. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
  13. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
  14. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  15. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024.
  16. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  17. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  18. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  19. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  20. A diagram is worth a dozen images. ArXiv, abs/1603.07396, 2016.
  21. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
  22. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  23. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023.
  24. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  25. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
  26. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  27. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
  28. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
  29. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
  30. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  31. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint arXiv:2305.15023, 2023.
  32. Microsoft. Phi2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023.
  33. NousResearch. Nous-hermes-2-yi-34b. https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B, 2023.
  34. OpenAI. Chatgpt. https://chat.openai.com/, 2023.
  35. OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
  36. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  37. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  38. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
  39. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  40. H. Taud and J.-F. Mas. Multilayer perceptron (mlp). Geomatic approaches for modeling land change scenarios, pages 451–455, 2018.
  41. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  42. I. Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
  43. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  44. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023.
  45. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
  46. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.
  47. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
  48. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  49. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
  50. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  51. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
  52. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
  53. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024.
  54. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors (11)
  1. Lin Chen (384 papers)
  2. Jinsong Li (12 papers)
  3. Xiaoyi Dong (73 papers)
  4. Pan Zhang (153 papers)
  5. Yuhang Zang (54 papers)
  6. Zehui Chen (41 papers)
  7. Haodong Duan (55 papers)
  8. Jiaqi Wang (218 papers)
  9. Yu Qiao (563 papers)
  10. Dahua Lin (336 papers)
  11. Feng Zhao (110 papers)
Citations (118)