Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark (2506.04280v1)
Abstract: With enhanced capabilities and widespread applications, Multimodal LLMs (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the $\textbf{Multimodal Multi-image Reasoning Benchmark (MMRB)}$, the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises $\textbf{92 sub-tasks}$ covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on $\textbf{40 MLLMs}$, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.
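The abstract's sentence-level matching framework can be pictured as scoring each sentence of a model's chain-of-thought against the reference solutions and keeping the best match over the multi-solution annotations. The sketch below is a minimal, hypothetical illustration of that idea: the function names, the token-overlap `judge_match` heuristic, and the 0.5 threshold are assumptions for demonstration only; the paper's actual framework queries an open-source LLM to make the per-sentence match decision.

```python
# Minimal sketch of sentence-level matching for CoT evaluation (illustrative only).
# Assumptions: naive regex sentence splitting and a token-overlap judge stand in
# for the open-source LLM judge described in the paper.
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; a real pipeline would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def judge_match(candidate: str, reference: str) -> bool:
    """Stand-in judge based on token overlap (Jaccard > 0.5).
    The paper's framework would instead prompt an open-source LLM to decide
    whether the candidate sentence expresses the same reasoning step."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / max(len(cand | ref), 1) > 0.5


def sentence_match_score(candidate_cot: str, reference_solutions: list[str]) -> float:
    """Fraction of candidate reasoning sentences matched by at least one sentence
    in some reference solution; the best fraction over all reference solutions
    is returned, reflecting the multi-solution annotation format."""
    cand_sents = split_sentences(candidate_cot)
    if not cand_sents:
        return 0.0
    best = 0.0
    for ref in reference_solutions:
        ref_sents = split_sentences(ref)
        matched = sum(
            any(judge_match(sent, ref_sent) for ref_sent in ref_sents)
            for sent in cand_sents
        )
        best = max(best, matched / len(cand_sents))
    return best


if __name__ == "__main__":
    cot = "The first image shows a cat on a sofa. The second image shows the same cat outside. So the cat moved outside."
    refs = ["Image one shows a cat on a sofa. Image two shows the cat outdoors. Therefore the cat moved outside."]
    print(f"match score: {sentence_match_score(cot, refs):.2f}")
```

Swapping `judge_match` for a call to a locally served open-source LLM is the natural extension, at the cost of one judge query per candidate sentence and reference solution.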
- Ziming Cheng
- Binrui Xu
- Lisheng Gong
- Zuhe Song
- Tianshuo Zhou
- Shiqi Zhong
- Siyu Ren
- Mingxiang Chen
- Xiangchao Meng
- Yuxin Zhang
- Yanlin Li
- Lei Ren
- Wei Chen
- Zhiyuan Huang
- Mingjie Zhan
- Xiaojie Wang
- Fangxiang Feng