MindCube Benchmark
- The MindCube Benchmark is a comprehensive evaluation suite designed to rigorously assess and advance vision-language models' capacity for spatial reasoning and building mental models from limited visual input.
- It includes a large dataset of multi-view scenes and over 21,000 spatial reasoning questions covering occlusion, perspective-taking, and dynamic "what-if" scenarios to expose current VLM limitations.
- Evaluations demonstrate that requiring models to explicitly construct and reason over cognitive maps significantly improves performance on spatial tasks, highlighting the importance of internal scaffolding for advanced scene understanding.
The MindCube Benchmark is a vision-language evaluation suite designed to rigorously assess and advance the capacity of Vision-LLMs (VLMs) to form and utilize spatial mental models—that is, internal schematic representations of scenes constructed from limited visual input. MindCube exposes the limitations of current VLMs in scene understanding under partial observation, occlusion, and hypothetical ("what-if") scenario reasoning, and systematically evaluates progress toward human-level spatial inference.
1. Structure and Scope
MindCube comprises 3,268 images grouped into 976 multi-view scene sets, with a total of 21,154 spatial reasoning questions. Images were curated from established sources such as ARKitScenes, DL3DV-10K, WildRGB-D, and self-collected photo sets, ensuring diversity in scene layout, viewpoint, object composition, and ambiguity through occlusion or motion. For each multi-view group, images record specific camera movement patterns:
- Rotation: In-place camera rotation.
- Around: Camera moves around one or more objects along a path.
- Among: Camera traverses among several objects.
- Combinations: Complex movement series, blending the above.
Each group is meticulously annotated across four axes: spatial relationships, object groupings, semantic orientations (including 'facing' or directionality), and occlusion visibility. Scenes present real-world ambiguities, perspective changes, and object–object, agent–object, and agent–agent spatial relations.
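For concreteness, the sketch below shows how one such multi-view group and its four annotation axes might be represented in code; the field names and values are illustrative assumptions rather than the benchmark's released schema.

```python
# Illustrative (hypothetical) record for one MindCube multi-view scene group.
# Field names and values are assumptions for exposition, not the released schema.
scene_group = {
    "group_id": "example_0001",
    "movement_pattern": "around",            # rotation | around | among | combination
    "views": ["view_0.jpg", "view_1.jpg", "view_2.jpg"],
    "annotations": {
        "spatial_relations": [("chair", "left_of", "table")],
        "object_groups": [["chair", "table"], ["lamp"]],
        "orientations": {"chair": "facing:table"},
        "occlusion_visibility": {"lamp": ["view_2.jpg"]},   # views where the lamp is visible
    },
}
```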
2. Question and Task Taxonomy
MindCube’s dataset organizes spatial reasoning questions into a fine-grained taxonomy:
- Spatial reasoning under occlusion: Inferring the existence or properties of unseen or occluded entities.
- Perspective-taking: Both Level 1 (what is visible from a new pose) and Level 2 (reasoning about another entity’s perspective).
- Dynamic “what-if” reasoning: Hypothetical interpretations of the scene following specified rotations, translations, or motion sequences.
- Relational queries: Examining the spatial layout, e.g., "Which object will be on your left after rotating 90°?" or "If you move to location X, what will be in front of you?"
Query construction leverages the combinatorial annotation scheme for exhaustive coverage, yielding 21,154 QA instances covering multiple spatial and dynamic reasoning demands.
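As a rough illustration of how the combinatorial annotation scheme could yield relational and "what-if" questions, here is a hedged sketch; the question template and option format are assumptions, not MindCube's actual generators.

```python
# Hypothetical question generator built on top of annotated spatial relations.
def make_rotation_question(anchor_object: str, candidate_objects: list[str],
                           rotation_deg: int = 90) -> dict:
    """Build a forced-choice 'what-if' question about the scene after rotating in place."""
    question = (f"If you stand at the {anchor_object} and rotate {rotation_deg} degrees "
                f"clockwise, which object will be on your left?")
    options = sorted(candidate_objects)      # answer options drawn from the annotations
    return {"question": question, "options": options}

print(make_rotation_question("chair", ["table", "lamp", "sofa"]))
```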
3. Evaluation Methodology and Metrics
Models are evaluated using standardized prompt templates. Each prompt supplies one or more images (limited views), potentially along with other information (e.g., partial map, reasoning chain), and requests an answer in a forced-choice or free-form format.
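A minimal sketch of how such a prompt might be assembled from its parts (views, an optional scaffold, and a forced-choice question); the wording is an assumption, since the authoritative templates are given in the MindCube appendix.

```python
# Hypothetical prompt assembly for a forced-choice evaluation query.
def build_prompt(question: str, options: list[str], scaffold: str | None = None) -> str:
    lines = ["You are given several views of the same scene."]
    if scaffold:
        lines.append(scaffold)                # e.g. a partial cognitive map or reasoning chain
    lines.append(question)
    lines += [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)
```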
Core metric: QA accuracy, i.e., the proportion of questions answered correctly.
In experiments involving the generation of explicit cognitive maps, further graph-based metrics are used:
- Coverage: Proportion of ground-truth objects present in the map.
- Directional Similarity: Agreement between the pairwise directional relations in the predicted map and those in the ground-truth map.
- Facing Similarity: Agreement between predicted and ground-truth object facing directions.
- Overall Similarity: Aggregate score combining directional and facing similarity.
- Isomorphism: Structural equivalence up to 90° rotations of map relation matrices.
Additional metrics include the valid cognitive map rate (percentage of well-formed and semantically valid outputs).
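A minimal sketch, under simplifying assumptions, of how the coverage and rotation-tolerant isomorphism checks could be computed over a simple object-to-grid-cell map; MindCube's reference implementation operates on relation matrices and may differ in detail.

```python
import numpy as np

# Hypothetical map representation: object name -> (row, col) cell on a small grid.
def coverage(pred_map: dict, gt_map: dict) -> float:
    """Fraction of ground-truth objects that appear in the predicted map."""
    return len(set(pred_map) & set(gt_map)) / max(len(gt_map), 1)

def to_grid(obj_map: dict, size: int = 10) -> np.ndarray:
    """Place object indices (shared, sorted-name order) onto an occupancy grid."""
    grid = np.zeros((size, size), dtype=int)
    for i, name in enumerate(sorted(obj_map), start=1):
        r, c = obj_map[name]
        grid[r, c] = i
    return grid

def isomorphic_up_to_rotation(pred_map: dict, gt_map: dict) -> bool:
    """Structural equivalence allowing 90-degree rotations of the predicted layout
    (assumes both maps contain the same object names)."""
    pred, gt = to_grid(pred_map), to_grid(gt_map)
    return any(np.array_equal(np.rot90(pred, k), gt) for k in range(4))
```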
4. Baseline and Model Performance
MindCube provides a systematic comparison of 17 vision-LLMs, encompassing both open-weight and proprietary architectures. Categories include generic VLMs, spatially fine-tuned models, and models employing different forms of cognitive scaffolding.
- Human ceiling performance: ≈95%.
- Best baseline VLM (DeepSeek-VL2-Small): ≈48% accuracy.
- Text-only ablation: Marked drop in performance, confirming the need for visually grounded reasoning.
Category-specific and per-model breakdowns show that no current model approaches human performance, with consistent difficulty observed on occlusion-heavy and dynamic multi-view tasks.
5. Cognitive Scaffolding Approaches
To examine what facilitates spatial mental model construction, the benchmark evaluates three principal approaches:
- View Interpolation: Interleaving additional (synthesized) in-between views; found to provide no significant gain over the baseline, indicating that mere visual continuity is insufficient.
- Natural Language Reasoning Chains (Chain-of-Thought): Prompting stepwise justification before answer selection. This yields a modest improvement (Raw-QA: 37.8% → Reasoning: 40.5%).
- Cognitive Maps: Explicit mapping of object positions, orientations, and viewpoints on an allocentric grid, either as input or as an intermediate output. The central result is that providing maps as static input is not beneficial unless the model is actively trained to reason over or construct them.
Key finding: The "map-then-reason" strategy—where models are jointly trained to produce a cognitive map as an explicit intermediate representation and then reason about the scene—gives the largest improvement. Supervised fine-tuning with this strategy increases QA accuracy to 60.8% (+23.0% absolute). Applying reinforcement learning (VAGEN algorithm, Group Relative Policy Optimization, reward: +1 for valid output, +5 for correct answer) further boosts performance to 70.7% (+32.9%).
| Method | Accuracy (%) |
|---|---|
| Raw-QA | 37.81 |
| Reasoning Only | 40.48 |
| Map-then-Reason (SFT) | 60.76 |
| Map-then-Reason (SFT + RL) | 70.67 |
| Human Upper Bound | ≈95 |
All results correspond to the MindCube-Tiny subset and the Qwen2.5-VL-3B-Instruct model.
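A hedged sketch of the shaped reward described above (+1 for a well-formed output, +5 for a correct answer); the JSON output format and parsing details are assumptions about how such a reward could be wired into a GRPO-style trainer, not the benchmark's exact implementation.

```python
import json

def mindcube_style_reward(model_output: str, gold_answer: str) -> float:
    """Reward sketch: +1 if the output parses into a map plus answer, +5 more if
    the answer matches the gold label (mirrors the +1 valid / +5 correct scheme)."""
    reward = 0.0
    try:
        parsed = json.loads(model_output)    # assumed format: {"cognitive_map": ..., "answer": "A"}
    except json.JSONDecodeError:
        return reward                        # malformed output earns nothing
    if isinstance(parsed, dict) and "cognitive_map" in parsed and "answer" in parsed:
        reward += 1.0                        # valid, well-formed output
        if str(parsed["answer"]).strip().upper() == gold_answer.strip().upper():
            reward += 5.0                    # correct final answer
    return reward
```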
6. Technical Implementation and Practical Considerations
Fine-tuning experiments employ 10,000 QA pairs, three epochs, learning rate , batch size 256, and AdamW optimizer. RL fine-tuning uses batch size 32. The "map-then-reason" pipeline is implemented via a two-stage prompt: the model first outputs a JSON cognitive map (10×10 grid with object pose and type data), then produces an answer conditioned on this intermediate map.
Full prompt templates, input-output JSON schemas, and example outputs are specified in the MindCube appendix. An explicit directional relation function over objects' grid positions is used to derive pairwise relations, as formally defined in the benchmark's formula list.
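To make the two-stage "map-then-reason" pipeline and the grid-based directional relation concrete, the sketch below shows one plausible wiring; the prompts, the 10×10 grid conventions, and the `query_vlm` callable are placeholders, since the authoritative templates and the formal relation definition live in the MindCube appendix.

```python
def grid_direction(a: tuple[int, int], b: tuple[int, int]) -> str:
    """Hypothetical directional relation of object b relative to object a on a 10x10
    grid with rows growing downward; MindCube's formula list defines the formal version."""
    dr, dc = b[0] - a[0], b[1] - a[1]
    vertical = "below" if dr > 0 else "above" if dr < 0 else ""
    horizontal = "right" if dc > 0 else "left" if dc < 0 else ""
    return "-".join(p for p in (vertical, horizontal) if p) or "same-cell"

def map_then_reason(images, question: str, query_vlm) -> str:
    """Two-stage prompting sketch: (1) elicit a JSON cognitive map, (2) answer the
    question conditioned on that intermediate map. `query_vlm(images, prompt)` is a
    placeholder for whatever VLM inference call is available."""
    map_prompt = ("Describe the scene as a JSON cognitive map on a 10x10 grid, listing "
                  "each object's grid position, type, and facing direction.")
    cognitive_map = query_vlm(images, map_prompt)
    answer_prompt = (f"Cognitive map:\n{cognitive_map}\n\n"
                     f"Using this map and the views, answer: {question}")
    return query_vlm(images, answer_prompt)
```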
7. Impact and Implications
MindCube establishes a rigorous standard and a diagnostic lens for spatial reasoning progress in VLMs under limited-view, real-world scene understanding scenarios. The central insight is that neither the simple addition of visual data nor passive reliance on spatial structures suffices. Rather, internal cognitive scaffolding—actively constructing a cognitive map and reasoning over it via multi-stage training—is both necessary and effective for approaching human-like scene understanding.
Current systems are revealed to be reasoning-bottlenecked rather than perception-bottlenecked. MindCube thus catalyzes next-generation research into vision-LLMs with improved generalization to occluded, dynamic, and hypothetical scenarios, and provides a reproducible basis for comparing advances in visual-spatial intelligence.
Summary Table: MindCube Key Elements
| Aspect | Description |
|---|---|
| Purpose | Evaluate/build VLMs' spatial mental models from partial views |
| Dataset | 3,268 images, 21,154 QAs, 976 groups, controlled for occlusion |
| Tasks | Scene reconstruction, "what-if" reasoning, perspective-taking |
| Metrics | QA accuracy, map validity, similarity, isomorphism |
| Best Method | Map-then-Reason (SFT + RL) |
| Accuracy Base → Best | 37.8% → 60.8% (SFT), 70.7% (SFT + RL) |
| Human Upper Bound | ≈95% |