NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models (2403.01777v2)

Published 4 Mar 2024 in cs.CL and cs.CV

Abstract: Understanding the reasoning capabilities of Multimodal Large Language Models (MLLMs) is an important area of research. In this study, we introduce a dynamic benchmark, NPHardEval4V, aimed at addressing the existing gaps in evaluating the pure reasoning abilities of MLLMs. Our benchmark aims to provide a venue to disentangle the effect of various factors, such as image recognition and instruction following, from the overall performance of the models, allowing us to focus solely on evaluating their reasoning abilities. It is built by converting textual descriptions of questions from NPHardEval to image representations. Our findings reveal significant discrepancies in reasoning abilities across different models and highlight the relatively weak performance of MLLMs compared to LLMs in terms of reasoning. We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs, demonstrating the differing impacts of multimodal inputs on model performance. Unlike traditional benchmarks, which focus primarily on static evaluations, our benchmark will be updated monthly to prevent overfitting and ensure a more authentic and fine-grained evaluation of the models. We believe that this benchmark can aid in understanding and guide the further development of reasoning abilities in MLLMs. The benchmark dataset and code are available at https://github.com/lizhouf/NPHardEval4V
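As a rough, hypothetical sketch of the conversion the abstract describes, the snippet below renders a textual NPHardEval-style problem instance as an image with Pillow and assembles the three prompting conditions compared in the paper (vision-only, text-only, and combined vision and text). The function names, rendering details, and prompt wording are illustrative assumptions, not code from the NPHardEval4V repository.

```python
# Hypothetical sketch: turn a textual question into an image and build the
# three prompt styles (vision, text, vision+text). Names are illustrative.
from PIL import Image, ImageDraw


def render_question_as_image(question_text: str, path: str = "question.png") -> str:
    """Draw the textual problem description onto a blank canvas and save it."""
    img = Image.new("RGB", (800, 400), color="white")
    draw = ImageDraw.Draw(img)
    draw.multiline_text((20, 20), question_text, fill="black")
    img.save(path)
    return path


def build_prompts(question_text: str, image_path: str) -> dict:
    """Assemble the three prompting conditions evaluated in the benchmark."""
    return {
        "vision_only": {"image": image_path,
                        "text": "Solve the problem shown in the image."},
        "text_only": {"image": None, "text": question_text},
        "vision_and_text": {"image": image_path, "text": question_text},
    }


if __name__ == "__main__":
    q = ("Given the weighted graph described below, find the shortest route "
         "that visits every city exactly once.")
    prompts = build_prompts(q, render_question_as_image(q))
    print(prompts["vision_only"])
```

In the released benchmark the images are generated from NPHardEval's question templates and updated monthly; this sketch only illustrates the general text-to-image conversion and prompt-assembly idea under the stated assumptions.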

Authors (11)
  1. Lizhou Fan (23 papers)
  2. Wenyue Hua (51 papers)
  3. Xiang Li (1002 papers)
  4. Kaijie Zhu (19 papers)
  5. Mingyu Jin (38 papers)
  6. Lingyao Li (38 papers)
  7. Haoyang Ling (2 papers)
  8. Jinkui Chi (1 paper)
  9. Jindong Wang (150 papers)
  10. Xin Ma (105 papers)
  11. Yongfeng Zhang (163 papers)
Citations (13)