- The paper introduces MMSI-Bench, a novel VQA benchmark with 1,000 sophisticated multi-image spatial reasoning questions designed to evaluate multimodal large language models.
- Evaluation of 34 MLLMs on MMSI-Bench reveals a significant performance gap compared to humans, with the best models achieving only 30-40% accuracy versus 97% human accuracy.
- Error analysis highlights key failure modes in MLLMs, such as grounding and spatial logic errors, providing valuable insights for future model development to enhance spatial intelligence.
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
The paper, "MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence," introduces a novel and comprehensive benchmark designed to evaluate multi-image spatial reasoning capabilities within multimodal LLMs (MLLMs). The core contribution of this work is MMSI-Bench, a VQA benchmark curated to rigorously test multi-image spatial intelligence. Unlike existing benchmarks primarily focused on single-image relations, MMSI-Bench challenges models with a set of questions that necessitate reasoning across multiple images, thereby simulating more realistic and complex real-world scenarios.
Methodology and Benchmark Design
MMSI-Bench was crafted by six researchers who devoted over 300 hours to formulating 1,000 challenging multiple-choice questions from a pool of more than 120,000 images. Each question is paired with carefully designed distractors and a detailed human-annotated reasoning process. The dataset spans diverse real-world scenes, including indoor settings, outdoor environments, autonomous driving, robotics, and common activities.
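The summary does not specify a record format, but a single benchmark item can be pictured roughly as follows. This is a minimal sketch; the field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MMSIItem:
    """Hypothetical record for one MMSI-Bench question; field names
    are assumptions for illustration, not the dataset's real keys."""
    images: List[str]     # paths to the question's multiple input images
    question: str         # multiple-choice question text
    choices: List[str]    # answer options, including designed distractors
    answer: str           # gold option label, e.g. "B"
    task: str             # one of the ten spatial reasoning task types
    reasoning: str        # human-annotated step-by-step reasoning process
```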
The benchmark is organized around a taxonomy of ten fundamental spatial reasoning tasks covering the positions, attributes, and motions of cameras, objects, and regions within a scene; one plausible encoding of this taxonomy is sketched below. A multi-step reasoning split further tests models with long-horizon questions that integrate several of these tasks.
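As a rough illustration, ten tasks can be read off the description above: pairwise positional relations among camera, object, and region, plus attribute and motion tasks. The enum below paraphrases that structure; these are not necessarily the paper's official task names:

```python
from enum import Enum

class SpatialTask(Enum):
    """Ten fundamental tasks, inferred from the paper's description of
    positions, attributes, and motions; names are paraphrases."""
    # Positional relations between pairs of scene entities
    POS_CAMERA_CAMERA = "camera-camera position"
    POS_CAMERA_OBJECT = "camera-object position"
    POS_CAMERA_REGION = "camera-region position"
    POS_OBJECT_OBJECT = "object-object position"
    POS_OBJECT_REGION = "object-region position"
    POS_REGION_REGION = "region-region position"
    # Attribute reasoning
    ATTR_MEASUREMENT = "measurement"
    ATTR_APPEARANCE = "appearance"
    # Motion reasoning
    MOTION_CAMERA = "camera motion"
    MOTION_OBJECT = "object motion"
```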
Evaluation Results
The evaluation of 34 open-source and proprietary MLLMs reveals a substantial gap between model and human spatial reasoning. The strongest open-source model reached roughly 30% accuracy, and OpenAI's proprietary o3 reasoning model about 40%, far below the 97% accuracy achieved by humans. This disparity underscores the difficulty of MMSI-Bench and the considerable room for improvement in MLLM spatial reasoning.
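For concreteness, multiple-choice accuracy of the kind reported here is typically computed along these lines. This is a generic scoring sketch, not the paper's evaluation harness; it reuses the `MMSIItem` record sketched earlier, and `model_answer` is a placeholder for an actual MLLM call:

```python
import re
from typing import Callable, Iterable

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a free-form reply."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None

def accuracy(items: Iterable[MMSIItem],
             model_answer: Callable[[MMSIItem], str]) -> float:
    """Fraction of items whose extracted letter matches the gold answer."""
    correct = total = 0
    for item in items:
        pred = extract_choice(model_answer(item))
        correct += int(pred == item.answer)
        total += 1
    return correct / max(total, 1)
```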
Insights from Error Analysis
Using an automated error-analysis pipeline built on the human-annotated reasoning processes, the paper identifies four dominant failure modes: grounding errors, overlap-matching and scene-reconstruction errors, situation-transformation reasoning errors, and spatial-logic errors. This diagnostic approach yields concrete guidance for improving MLLMs' spatial intelligence and points toward systematic refinements of their architectures and training.
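Once each failed item carries one of these four labels (the labeling itself, per the paper, is derived automatically from the annotated reasoning), summarizing a model's failure profile is straightforward. A minimal sketch, assuming the label strings below as paraphrases of the four modes:

```python
from collections import Counter

FAILURE_MODES = [
    "grounding",
    "overlap-matching / scene-reconstruction",
    "situation-transformation reasoning",
    "spatial logic",
]

def failure_profile(error_labels: list[str]) -> dict[str, float]:
    """Share of a model's failed items attributed to each failure mode."""
    counts = Counter(error_labels)
    total = sum(counts.values()) or 1
    return {mode: counts.get(mode, 0) / total for mode in FAILURE_MODES}
```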
Implications and Future Directions
The introduction of MMSI-Bench has significant implications for MLLM development, especially for deployment in real-world applications that demand sophisticated spatial understanding. The benchmark serves as a rigorous test of progress toward embodied AGI with robust multi-image spatial intelligence. Future work can use MMSI-Bench both to evaluate current models and to guide the architectural innovations needed to close the model-human gap the paper identifies.
In summary, MMSI-Bench provides a much-needed, challenging benchmark that serves both as a measure and a catalyst for progress in the field of spatial reasoning for MLLMs. Its design and comprehensive analysis underscore the need for continued research and development in this critical area of AI, with the ultimate aim of achieving human-level spatial intelligence in multimodal systems.