- The paper introduces a dual-encoder design that fuses semantic and geometric features from 2D videos to enhance spatial reasoning.
- It employs space-aware frame sampling to maximize spatial diversity in video inputs while reducing redundancy.
- Empirical results show that Spatial-MLLM outperforms larger baselines on spatial QA benchmarks while using only a small number of input frames.
Enhancing Visual-Based Spatial Intelligence in Multimodal LLMs: Spatial-MLLM Framework
Motivation and Problem Statement
The paper "Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence" (2505.23747) targets the limitations of Multimodal LLMs (MLLMs) in spatial intelligence when processing visual inputs, particularly monocular 2D videos. While advances in MLLMs have substantially improved contextual and semantic reasoning from images and videos, spatial understanding from 2D inputs remains underdeveloped. Existing 3D MLLMs often rely on additional modalities such as point clouds, depth maps, or camera parameters, constraining applicability in environments where only standard video data is available. This work proposes a solution for spatial reasoning solely from 2D video streams, aiming to bridge the gap between semantic comprehension and holistic spatial scene understanding.
Architecture: Dual-Encoder Design and Feature Integration
Spatial-MLLM introduces a dual-encoder architecture that combines a standard 2D visual encoder, initialized from an established video MLLM (e.g., Qwen2.5-VL-3B), with a spatial encoder built on the backbone of a feed-forward visual geometry foundation model (VGGT). The semantic encoder extracts high-level content features, while the spatial encoder recovers implicit 3D structural cues from 2D observations using frame-wise and global self-attention. Features from both streams are aligned at the patch level by a lightweight connector and fused through MLPs into unified visual tokens. This composite representation is then consumed by the LLM backbone for downstream spatial reasoning, without requiring explicit 3D or 2.5D inputs.
This approach addresses the deficiency of CLIP-based visual encoders, which are optimized for semantic feature extraction through extensive image-text pretraining but lack geometric and spatial supervisory signals. By leveraging a geometry foundation model trained on pixel-point pairs, Spatial-MLLM augments spatial priors without modifying the input requirements or increasing computational overhead substantially.
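A minimal sketch of the patch-level fusion idea is shown below. The class name, dimensions, and the additive MLP fusion are assumptions made for illustration; the paper only specifies that a lightweight connector aligns the two token streams per patch and fuses them via MLPs before the LLM backbone.

```python
import torch
import torch.nn as nn

class FeatureConnector(nn.Module):
    """Hypothetical patch-level connector fusing semantic and spatial tokens.

    Both encoders are assumed to emit one token per spatio-temporal patch so
    that fusion can happen patch by patch (an assumption, not the paper's spec).
    """
    def __init__(self, sem_dim: int, spa_dim: int, out_dim: int):
        super().__init__()
        self.sem_proj = nn.Sequential(nn.Linear(sem_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))
        self.spa_proj = nn.Sequential(nn.Linear(spa_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))

    def forward(self, sem_tokens: torch.Tensor, spa_tokens: torch.Tensor) -> torch.Tensor:
        # sem_tokens: (B, N, sem_dim) from the 2D semantic encoder
        # spa_tokens: (B, N, spa_dim) from the VGGT-style spatial encoder, aligned per patch
        fused = self.sem_proj(sem_tokens) + self.spa_proj(spa_tokens)  # simple additive fusion
        return fused  # unified visual tokens passed on to the LLM backbone

# Toy usage: 2 clips, 256 patch tokens each
sem = torch.randn(2, 256, 1024)
spa = torch.randn(2, 256, 768)
tokens = FeatureConnector(1024, 768, 2048)(sem, spa)
print(tokens.shape)  # torch.Size([2, 256, 2048])
```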
Inference Optimization: Space-Aware Frame Sampling
Spatial-MLLM further introduces a space-aware frame sampling strategy for efficient inference under constrained input-length budgets (e.g., VRAM limitations). Unlike uniform frame sampling, the proposed method prioritizes frames that cover distinct spatial regions of the scene. This is operationalized by extracting depth and camera parameters via the spatial encoder, mapping candidate frames to a voxelized scene representation, and running a greedy maximum-coverage algorithm to select frames with maximal spatial diversity. Empirical analysis demonstrates superior spatial coverage compared to uniform sampling, reducing redundancy while preserving coverage of regions that appear only briefly.
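The greedy maximum-coverage step can be sketched as follows. The voxel size, the assumption that each candidate frame's back-projected 3D points are already available, and the function names are illustrative choices, not details taken from the paper.

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.2) -> set:
    """Map a frame's back-projected 3D points to a set of voxel indices."""
    return set(map(tuple, np.floor(points / voxel_size).astype(int)))

def space_aware_sampling(frame_points: list, k: int, voxel_size: float = 0.2) -> list:
    """Greedy maximum-coverage selection of k frames.

    frame_points[i] holds the (N_i, 3) points of candidate frame i; in the
    paper these come from depth and camera parameters predicted by the
    spatial encoder, here they are assumed to be given.
    """
    voxel_sets = [voxelize(p, voxel_size) for p in frame_points]
    covered, selected = set(), []
    for _ in range(min(k, len(frame_points))):
        # pick the frame that adds the most not-yet-covered voxels
        gains = [len(v - covered) if i not in selected else -1
                 for i, v in enumerate(voxel_sets)]
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break  # remaining frames add no new coverage
        selected.append(best)
        covered |= voxel_sets[best]
    return sorted(selected)

# Toy usage: 8 candidate frames of random points, select 4 spatially diverse ones
rng = np.random.default_rng(0)
frames = [rng.uniform(0, 5, size=(500, 3)) + i for i in range(8)]
print(space_aware_sampling(frames, k=4))
```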
Dataset Construction and Training Paradigm
To facilitate model training, the authors curate Spatial-MLLM-120k—a visual-based spatial QA dataset constructed from ScanQA, SQA3D, and additional synthetic spatial QA pairs derived from ScanNet scenes. Questions encompass numerical, multiple-choice, and verbal formats, covering object counting, size estimation, spatial relationships, and appearance order. The dual-stage training pipeline consists of:
- Supervised Fine-Tuning (SFT): The connector module and LLM backbone are trained with a cross-entropy loss while both encoders remain frozen.
- Reinforcement Learning with GRPO: To strengthen chain-of-thought (CoT) spatial reasoning, Group Relative Policy Optimization is applied with reward functions tailored to each QA type and to the reasoning format; a cold-start phase first aligns model outputs with the expected CoT format (a hypothetical per-type reward sketch follows this list).
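The sketch below illustrates what a type-dependent reward for the GRPO stage might look like. The answer-tag convention, the relative-error decay for numerical answers, and the token-overlap proxy for verbal answers are all assumptions for illustration; the paper's actual reward definitions are not reproduced here.

```python
import re

def spatial_qa_reward(prediction: str, answer: str, qa_type: str) -> float:
    """Hypothetical per-type reward in the spirit of the paper's GRPO stage."""
    # assume the final answer is wrapped in <answer>...</answer> tags (assumption)
    m = re.search(r"<answer>(.*?)</answer>", prediction, re.DOTALL)
    if m is None:
        return 0.0  # no parsable answer, no reward
    pred, gold = m.group(1).strip().lower(), answer.strip().lower()

    if qa_type == "numerical":
        try:
            p, g = float(pred), float(gold)
        except ValueError:
            return 0.0
        return max(0.0, 1.0 - abs(p - g) / (abs(g) + 1e-6))  # decays with relative error
    if qa_type == "multiple_choice":
        return 1.0 if pred == gold else 0.0
    # verbal: crude token-overlap proxy for answer matching
    p_tok, g_tok = set(pred.split()), set(gold.split())
    return len(p_tok & g_tok) / max(1, len(g_tok))
```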
Numerical Results and Empirical Claims
Spatial-MLLM demonstrates state-of-the-art results across benchmarks, most notably VSI-Bench, ScanQA, and SQA3D. Despite its modest 4B-parameter scale, the model outperforms both proprietary and open-source MLLMs, including models with much larger parameter counts:
- VSI-Bench: Spatial-MLLM achieves an average accuracy 3.0% higher than Gemini-1.5 Pro, while using substantially fewer input frames (16 vs. 85 on average).
- ScanQA (Val): Surpasses all video-input models and most models that use explicit 3D/2.5D input, achieving notably higher BLEU-1 and CIDEr scores.
- SQA3D (Test): Outperforms video-input baselines on the EM-1 and EM-R metrics, nearly matching or exceeding models that require explicit 3D scene information.
Ablation studies confirm the incremental contributions of the spatial encoder and RL training, as well as the advantage of space-aware frame sampling over uniform sampling.
Practical and Theoretical Implications
Spatial-MLLM demonstrates that integrating geometry-aware priors can significantly enhance spatial intelligence in MLLMs, even absent explicit geometric inputs. Practical implications include broader deployment in robotics, AR/VR, and video QA applications where only monocular video is available. Theoretically, the dual-encoder paradigm opens avenues for implicit spatial layout inference, suggesting that spatial priors, when fused appropriately with semantic features, can match or exceed the performance of models that require explicit 3D data.
The architecture is scalable and model-agnostic, with future directions including larger parameter budgets, broader visual reasoning tasks, and exploration of more complex feature fusion (e.g., cross-attention) to further leverage spatial structure.
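For the cross-attention direction mentioned above, one possible form is sketched below: semantic tokens query the spatial tokens so each semantic patch can attend to geometric context anywhere in the clip. The class name, shared token width, and residual layout are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of a cross-attention fusion variant (a possible future direction)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, sem_tokens: torch.Tensor, spa_tokens: torch.Tensor) -> torch.Tensor:
        # sem_tokens, spa_tokens: (B, N, dim), already projected to a shared width
        attended, _ = self.attn(query=sem_tokens, key=spa_tokens, value=spa_tokens)
        return self.norm(sem_tokens + attended)  # residual fusion of geometric context
```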
Conclusion
Spatial-MLLM offers a robust framework for spatial reasoning from 2D visual inputs, integrating semantic and geometric priors through a dual-encoder design and optimizing frame selection via space-aware sampling. The empirical results substantiate strong spatial understanding without reliance on additional scene modalities. This work shows that multimodal LLMs can attain spatial intelligence from standard video alone, with practical efficacy and room to scale, making it a substantial contribution to visual reasoning research.