Determine MLLMs’ ability to capture cross-view consistency for 3D reasoning

Determine whether Multimodal Large Language Models (MLLMs) can effectively capture the detailed spatial information required for robust real-world performance, in particular the cross-view consistency necessary for accurate 3D reasoning.

Background

Cross-view consistency is fundamental to accurate 3D reasoning tasks such as pose estimation, reconstruction, and SLAM, yet current Multimodal LLMs are primarily trained on 2D data that emphasizes continuity between frames rather than 3D spatial integrity. The paper introduces Viewpoint Learning and the Viewpoint-100K dataset to evaluate and enhance spatial reasoning, but it explicitly acknowledges uncertainty about whether existing models can inherently capture the fine-grained spatial information needed for robust, real-world performance.
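
Viewpoint-100K's exact schema is not reproduced here; the sketch below is only an illustration of what probing cross-view consistency can look like, assuming a paired-view question-answer format and a hypothetical `query_mllm` wrapper around a multi-image model API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ViewpointItem:
    image_a: str   # path to the first view of the scene
    image_b: str   # path to the second view of the same scene
    question: str  # e.g. "Did the camera move left or right between these views?"
    answer: str    # ground-truth label, e.g. "left"

def cross_view_accuracy(items: List[ViewpointItem],
                        query_mllm: Callable[[List[str], str], str]) -> float:
    """Accuracy of a multi-image MLLM on viewpoint questions.

    `query_mllm` is a hypothetical wrapper around any multi-image model API:
    it takes a list of image paths plus a text prompt and returns the model's
    textual answer.
    """
    correct = 0
    for item in items:
        prediction = query_mllm([item.image_a, item.image_b], item.question)
        correct += int(item.answer.lower() in prediction.lower())
    return correct / max(len(items), 1)
```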

This uncertainty motivates the proposed two-stage fine-tuning strategy (Supervised Fine-Tuning on viewpoint tasks followed by Reinforcement Learning on broader spatial tasks) to activate spatial reasoning abilities, and it underscores that whether MLLMs can intrinsically understand cross-view consistency remains an open, unresolved question.
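
For concreteness, the following sketch shows only the control flow of such a two-stage recipe, under assumed simplifications (a text-in/text-out model stub, an exact-match reward, and elided gradient updates); it is not the authors' implementation.

```python
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str]      # (prompt referencing the attached views, target answer)
Model = Callable[[str], str]   # stand-in for a multimodal model's text interface

def exact_match_reward(completion: str, reference: str) -> float:
    """Verifiable reward for the RL stage: 1.0 on an exact answer match, else 0.0."""
    return float(completion.strip().lower() == reference.strip().lower())

def supervised_finetune(model: Model, viewpoint_data: Iterable[Example]) -> Model:
    """Stage 1: imitate ground-truth answers on viewpoint tasks.
    Placeholder: a real run would minimize cross-entropy on the target tokens."""
    for _prompt, _target in viewpoint_data:
        pass  # gradient step elided
    return model

def rl_finetune(model: Model, spatial_data: Iterable[Example]) -> Model:
    """Stage 2: optimize a task reward on broader spatial tasks.
    Placeholder: a real run would apply a policy-gradient style update."""
    for prompt, reference in spatial_data:
        completion = model(prompt)
        _reward = exact_match_reward(completion, reference)  # would drive the update
    return model

def two_stage_finetune(model: Model,
                       viewpoint_data: Iterable[Example],
                       spatial_data: Iterable[Example]) -> Model:
    model = supervised_finetune(model, viewpoint_data)  # activate viewpoint skills
    return rl_finetune(model, spatial_data)             # generalize to spatial tasks
```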

References

However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning.

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models (2511.01618 - Zhan et al., 3 Nov 2025) in Abstract (p. 1)