Spatial Mental Modeling from Limited Views (2506.21458v1)

Published 26 Jun 2025 in cs.AI, cs.CL, and cs.CV

Abstract: Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.

Summary

  • The paper introduces the MindCube benchmark to evaluate spatial reasoning in vision-language models, revealing near-chance performance on occluded views.
  • It shows that prompting models to actively generate internal cognitive maps and engage in free-form reasoning leads to measurable accuracy gains.
  • Supervised fine-tuning combined with reinforcement learning significantly boosts performance, reducing the gap with human spatial cognition.

Spatial Mental Modeling from Limited Views: A Comprehensive Analysis

The paper "Spatial Mental Modeling from Limited Views" (2506.21458) addresses a fundamental challenge in vision-LLMs (VLMs): the ability to construct and reason over internal spatial representations from partial visual observations, akin to human spatial mental models. The authors introduce the MindCube benchmark to systematically evaluate and improve VLMs' spatial reasoning under limited and occluded views, and propose a series of methods to scaffold and train VLMs for robust spatial mental modeling.

Problem Formulation and Motivation

Current VLMs excel at passive perception but struggle with spatial reasoning in partially observable environments, particularly when required to infer the layout, positions, and relationships of objects not directly visible. Human cognition leverages spatial mental models—internal, flexible, and often schematic representations—to reason about unseen space, maintain cross-view consistency, and perform mental simulation (e.g., "what if" scenarios). The paper identifies a critical gap: state-of-the-art VLMs perform near chance on tasks requiring such mental modeling, as evidenced by their performance on the newly introduced MindCube benchmark.

MindCube Benchmark

MindCube is a large-scale, multi-view spatial reasoning benchmark comprising 21,154 questions across 3,268 images, organized into 976 multi-view groups. The benchmark is designed to probe three core spatial reasoning abilities:

  • Cognitive Mapping: Inferring the positions of objects from limited views.
  • Perspective-Taking: Reasoning about spatial relationships from different viewpoints, including those of other agents or objects.
  • Mental Simulation: Predicting outcomes of hypothetical movements or viewpoint changes.

The taxonomy of MindCube spans camera movement types (Rotation, Among, Around), visual patterns, "what-if" dynamics, relation queries (agent-object, object-object, etc.), and perspective-taking levels. The dataset is meticulously annotated to ensure that questions target occluded or invisible objects, requiring genuine spatial inference rather than surface-level pattern matching.
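To make the structure of a benchmark item concrete, here is a minimal, hypothetical sketch of how a single MindCube question might be represented in code. All field names and example values are assumptions for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MindCubeItem:
    """Hypothetical record for one MindCube question (schema assumed, not official)."""
    group_id: str                # multi-view group this question belongs to
    image_paths: List[str]       # the limited views given to the model
    setting: str                 # camera movement type, e.g. "rotation", "among", "around"
    question: str                # spatial query, often about occluded or unseen objects
    choices: List[str]           # multiple-choice options
    answer: str                  # ground-truth option letter
    relation_type: Optional[str] = None   # e.g. "object-object", "agent-object"
    cognitive_map: Optional[dict] = None  # optional ground-truth 2D layout annotation

# Invented example, purely for illustration:
item = MindCubeItem(
    group_id="group_0001",
    image_paths=["view_1.jpg", "view_2.jpg"],
    setting="around",
    question="From the viewpoint in view 2, is the lamp to the left of the sofa?",
    choices=["A. Yes", "B. No"],
    answer="A",
    relation_type="object-object",
)
```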

Baseline Evaluation and Key Findings

Seventeen state-of-the-art VLMs, including both open-weight and proprietary models, were evaluated on MindCube. The best-performing model (DeepSeek-VL2-Small) achieved only 47.62% accuracy, with most models performing marginally above random chance (32–38%). Notably, neither multi-image input nor spatial fine-tuning reliably improved performance, and no model demonstrated consistent strength across all spatial reasoning categories. Human annotators, by contrast, achieved over 94% accuracy, highlighting the substantial gap between current VLMs and human spatial cognition.

Scaffolding Spatial Reasoning: Data Structures and Prompting

The authors systematically investigate three data structures as cognitive scaffolds for spatial mental modeling:

  1. View Interpolation: Generating intermediate views to provide perceptual continuity. This approach did not yield significant improvements, indicating that simply increasing visual input does not address the core reasoning deficit.
  2. Cognitive Maps: Providing explicit, structured 2D representations of object layouts (with or without camera/viewpoint information). Directly supplying cognitive maps as input degraded performance, but prompting the model to generate a cognitive map before reasoning led to modest gains.
  3. Free-Form Reasoning: Eliciting step-by-step natural language reasoning chains. This approach improved accuracy by 2.7% over the baseline, and combining map generation with reasoning yielded further improvements.

A key empirical finding is that explicit reasoning is essential: passive provision of structure (e.g., maps or interpolated views) is insufficient. Only when models are prompted to actively construct and reason over internal representations do they exhibit meaningful gains in spatial reasoning.
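To illustrate what the "map-then-reason" prompting strategy looks like in practice, the sketch below assembles a two-step prompt that asks a frozen VLM to first emit a cognitive map and then reason over it before answering. The instruction wording, the JSON map format, and the answer-parsing convention are assumptions made for this sketch, not the paper's exact prompts.

```python
MAP_INSTRUCTION = (
    "First, build a cognitive map of the scene as JSON: list each object with "
    "approximate 2D grid coordinates, plus the camera position and facing for each view."
)
REASON_INSTRUCTION = (
    "Then, reasoning step by step over your cognitive map, answer the question. "
    "Finish with a line of the form 'Answer: <letter>'."
)

def map_then_reason_prompt(question: str, choices: list[str]) -> str:
    """Assemble a two-stage prompt: generate a map first, then reason over it."""
    options = "\n".join(choices)
    return f"{MAP_INSTRUCTION}\n{REASON_INSTRUCTION}\n\nQuestion: {question}\n{options}"

def parse_answer(completion: str) -> str | None:
    """Extract the final answer letter from the model's free-form output."""
    marker = "Answer:"
    if marker in completion:
        return completion.rsplit(marker, 1)[-1].strip()[:1]
    return None
```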

Supervised Fine-Tuning (SFT) and Internal Representation Learning

To address the intrinsic limitations of frozen VLMs, the authors curate 10,000 reasoning chains and 10,000 ground-truth cognitive maps for supervised fine-tuning. Several configurations are explored:

  • Raw QA: Fine-tuning on question-answer pairs alone raises accuracy from 37.8% to 52.3%.
  • Cognitive Map Generation: Fine-tuning to generate cognitive maps boosts map similarity and isomorphism rates (up to 73.8%), but only modestly improves QA accuracy.
  • Free-Form Reasoning: Fine-tuning on reasoning chains yields a 1.2% gain over the QA baseline.
  • Joint Map-Then-Reason: The most effective configuration, where the model is trained to first generate a cognitive map and then perform reasoning, achieves 60.8% accuracy—a gain of 8.5% over the QA baseline.

The training dynamics reveal that joint pressure from both map generation and reasoning tasks leads to functionally effective spatial representations, rather than mere structural replication. This synergy is critical for downstream inference.
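As a concrete but assumed illustration of the joint map-then-reason supervision, the sketch below composes a single SFT example whose target first contains a cognitive map, then a reasoning chain, then the final answer. The <cogmap>/<think> tags and field names are hypothetical, not the paper's exact training format.

```python
import json

def build_map_then_reason_target(example: dict) -> dict:
    """Compose one SFT example: target = cognitive map, then reasoning, then answer.

    The tag format and the example's field names are assumptions for illustration.
    """
    cogmap_json = json.dumps(example["cognitive_map"])
    target = (
        f"<cogmap>{cogmap_json}</cogmap>\n"
        f"<think>{example['reasoning_chain']}</think>\n"
        f"Answer: {example['answer']}"
    )
    return {
        "images": example["image_paths"],
        "prompt": example["question"] + "\n" + "\n".join(example["choices"]),
        "target": target,
    }
```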

Reinforcement Learning (RL) for Spatial Reasoning

Building on the SFT foundation, the authors employ reinforcement learning (RL) with a reward structure that incentivizes both structurally valid outputs and correct answers. RL from scratch provides only marginal gains, but initializing RL from the best SFT checkpoint yields a substantial improvement: accuracy rises to 70.7%, a 9.9% absolute gain over SFT alone. The RL process refines and polishes the internal spatial representations learned during SFT, pushing performance toward the upper bound observed in the paper.
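One plausible shape for the reward described above combines a structure-validity term (is the emitted cognitive map well-formed?) with an answer-correctness term. The weights, tag names, and validity checks in this sketch are assumptions, not the authors' exact reward design.

```python
import json
import re

def spatial_reward(completion: str, gold_answer: str) -> float:
    """Reward = structural validity of the generated map + answer correctness (weights assumed)."""
    reward = 0.0

    # Structure term: the output contains a parseable cognitive map.
    map_match = re.search(r"<cogmap>(.*?)</cogmap>", completion, re.DOTALL)
    if map_match:
        try:
            json.loads(map_match.group(1))
            reward += 0.2   # structurally valid map (weight assumed)
        except json.JSONDecodeError:
            pass

    # Accuracy term: the final answer matches the ground truth.
    ans_match = re.search(r"Answer:\s*([A-Z])", completion)
    if ans_match and ans_match.group(1) == gold_answer:
        reward += 1.0       # correct answer dominates the reward

    return reward
```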

Analysis and Ablations

  • Bottleneck Analysis: Fine-tuning only the LLM component of the VLM achieves nearly all the performance gains of full fine-tuning, indicating that the primary bottleneck for spatial reasoning lies in the reasoning module, not the visual encoder (a minimal freezing sketch follows this list).
  • Curriculum SFT: A two-stage SFT strategy—first fine-tuning on simple QA, then on more complex scaffolds—further improves performance, suggesting that curriculum learning is beneficial for spatial reasoning tasks.
  • Failure Modes: Models often default to local relationship matching and proximity-based guessing when faced with occlusion or complex spatial arrangements, underscoring the need for improved transitive inference and global scene representation.
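A minimal, framework-generic sketch of the LLM-only fine-tuning setup from the bottleneck analysis: freeze the vision encoder and leave the language-model component trainable. The attribute names vision_tower and language_model are assumptions; real VLM implementations expose their submodules under different names.

```python
import torch

def freeze_vision_train_llm(model: torch.nn.Module) -> None:
    """Freeze the visual encoder and keep the LLM trainable (attribute names assumed)."""
    for param in model.vision_tower.parameters():
        param.requires_grad = False          # visual encoder stays fixed
    for param in model.language_model.parameters():
        param.requires_grad = True           # only the reasoning module is updated

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,}")
```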

Implications and Future Directions

The findings have several important implications:

  • Practical Applications: Robust spatial mental modeling is essential for embodied AI, robotics, navigation, and any application requiring reasoning in partially observable or dynamic environments. The MindCube benchmark and the proposed training strategies provide a foundation for developing VLMs capable of such reasoning.
  • Theoretical Impact: The work demonstrates that spatial reasoning in VLMs is not an emergent property of large-scale pretraining or multi-image input alone. Instead, it requires explicit scaffolding, supervision, and reward-driven refinement of internal representations.
  • Future Research: The results suggest several avenues for further investigation:
    • Development of higher-quality SFT datasets for cognitive map generation and reasoning.
    • Exploration of novel training paradigms that further exploit the synergy between structured representation and flexible reasoning.
    • Integration of spatial mental modeling capabilities into embodied agents and real-world systems.

Conclusion

"Spatial Mental Modeling from Limited Views" provides a rigorous, multi-faceted analysis of the limitations and potential of VLMs for spatial reasoning. The introduction of MindCube, the systematic evaluation of scaffolding strategies, and the demonstration of substantial gains through joint map-then-reason training and RL collectively establish a new standard for research in spatial cognition for AI. The work underscores the necessity of actively constructing and utilizing internal structured spatial representations, and sets a clear agenda for future progress in spatially intelligent AI systems.