- The paper introduces Spatial-MLLM, which fuses a 2D visual encoder with a spatial encoder to integrate semantic features and implicit 3D structure from 2D videos.
- It employs a novel space-aware frame sampling strategy via a greedy maximum coverage algorithm to select frames that maximize spatial context.
- Extensive experiments on benchmarks like VSI-Bench and ScanQA demonstrate that the model, with only 4B parameters, outperforms many larger models and those requiring explicit 3D data.
Here is a summary of the paper "Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence" (arXiv:2505.23747).
The paper addresses the challenge of enhancing the visual-based spatial intelligence of Multimodal LLMs (MLLMs), particularly those processing video inputs, when only 2D observations are available. Existing video MLLMs often struggle with tasks requiring fine-grained 3D scene understanding and reasoning because their visual encoders, typically trained on 2D image-text data (like CLIP), are optimized for semantic content but lack the structural prior necessary to infer 3D geometry from 2D inputs. While 3D MLLMs exist, they usually require explicit 3D or 2.5D data (like point clouds or depth maps), which are not always available in real-world scenarios.
To tackle this, the authors propose Spatial-MLLM, a novel framework designed to infer global spatial layout and relationships directly from 2D video sequences. The core idea is to leverage the strong structural prior of feed-forward visual geometry foundation models, which are trained on pixel-point correspondences and can recover 3D information from purely 2D inputs, and to combine it with the semantic features of standard 2D visual encoders.
The Spatial-MLLM architecture consists of:
- Dual-Encoder:
- A 2D visual encoder (initialized from a general-purpose video MLLM's visual encoder, such as Qwen2.5-VL) extracts rich 2D semantic features from input frames.
- A spatial encoder (initialized from the feature backbone of a visual geometry model like VGGT) extracts implicit 3D structural information across frames.
- Connector: A lightweight module, implemented with MLPs, fuses the features from the 2D and spatial encoders into unified visual tokens. This integrated representation contains both semantic and structural information, enabling the LLM backbone to perform spatial reasoning without needing explicit 3D data.
- LLM Backbone: An LLM (e.g., Qwen2.5-VL's backbone) processes the fused visual tokens along with textual prompts to generate responses.
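To make the fusion step concrete, here is a minimal PyTorch-style sketch of the dual-encoder-plus-connector idea, not the paper's actual implementation: the feature dimensions, the concatenation-then-MLP fusion, and all class and variable names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualEncoderConnector(nn.Module):
    """Sketch: fuse per-token semantic features from the 2D visual encoder with
    per-token structural features from the spatial encoder into unified visual
    tokens for the LLM backbone. Dimensions are placeholders, not the paper's."""
    def __init__(self, dim_2d=1280, dim_3d=1024, dim_llm=2048):
        super().__init__()
        self.proj = nn.Sequential(          # lightweight MLP connector
            nn.Linear(dim_2d + dim_3d, dim_llm),
            nn.GELU(),
            nn.Linear(dim_llm, dim_llm),
        )

    def forward(self, feats_2d, feats_3d):
        # feats_2d: (B, N_tokens, dim_2d) from the 2D visual encoder
        # feats_3d: (B, N_tokens, dim_3d) from the spatial (geometry) encoder
        fused = torch.cat([feats_2d, feats_3d], dim=-1)
        return self.proj(fused)              # (B, N_tokens, dim_llm) visual tokens

# Illustrative call: 16 frames x 196 patches = 3136 visual tokens per clip.
feats_2d = torch.randn(1, 3136, 1280)
feats_3d = torch.randn(1, 3136, 1024)
tokens = DualEncoderConnector()(feats_2d, feats_3d)  # (1, 3136, 2048)
```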
A practical challenge in video MLLMs is handling long video sequences with limited token context length, especially for spatial tasks where observing varied viewpoints is crucial. Spatial-MLLM introduces a space-aware frame sampling strategy used during inference. This strategy utilizes the spatial encoder's capability to decode 3D features into a voxel grid representation of the scene. It then formulates frame selection as a maximum coverage problem over these voxels, solved using a greedy algorithm to choose a limited number of frames that provide the most comprehensive spatial coverage of the scene.
The algorithm for greedy maximum coverage sampling is as follows:
```python
def greedy_max_coverage_sampling(frame_voxel_sets, target_size):
    """
    Selects frames with maximum spatial coverage using a greedy algorithm.

    Args:
        frame_voxel_sets (list): List of sets, where each set contains
            the voxels covered by a frame.
        target_size (int): The desired number of frames to select.

    Returns:
        list: Indices of the selected frames.
    """
    num_frames = len(frame_voxel_sets)
    selected_indices = []
    covered_voxels = set()
    remaining_candidates = list(range(num_frames))

    for _ in range(target_size):
        if not remaining_candidates:
            break
        # Pick the candidate frame that adds the most not-yet-covered voxels.
        best_frame_index = -1
        max_new_coverage = -1
        for i in remaining_candidates:
            new_coverage = len(frame_voxel_sets[i] - covered_voxels)
            if new_coverage > max_new_coverage:
                max_new_coverage = new_coverage
                best_frame_index = i
        # Stop early if no remaining frame covers any new voxel.
        if max_new_coverage == 0:
            break
        selected_indices.append(best_frame_index)
        covered_voxels.update(frame_voxel_sets[best_frame_index])
        remaining_candidates.remove(best_frame_index)

    return selected_indices
```
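As a hypothetical usage sketch: the voxelization step below (quantizing per-frame 3D points, such as those decoded by the spatial encoder, into a voxel grid) is an assumption for illustration, with a made-up voxel size and random points standing in for real predictions.

```python
import numpy as np

def points_to_voxel_set(points_xyz, voxel_size=0.2):
    """Quantize an (N, 3) array of 3D points into a set of integer voxel indices."""
    voxels = np.floor(points_xyz / voxel_size).astype(int)
    return set(map(tuple, voxels))

# frame_points[i] would hold the 3D points recovered for frame i; random data here.
frame_points = [np.random.rand(1000, 3) * 5.0 for _ in range(128)]
frame_voxel_sets = [points_to_voxel_set(p) for p in frame_points]
selected = greedy_max_coverage_sampling(frame_voxel_sets, target_size=16)
```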
To train Spatial-MLLM, the authors constructed a new visual-based spatial question-answering dataset called Spatial-MLLM-120k. This dataset is derived from ScanNet scenes and includes QA pairs covering tasks like object counting, size, distance, direction, and appearance order. Training follows a two-stage pipeline:
- Supervised Fine-Tuning (SFT): The model is initially fine-tuned on Spatial-MLLM-120k. The 2D and spatial encoders are frozen, while the connector and LLM backbone are trained using a standard cross-entropy loss.
- Reinforcement Learning (RL): After an initial "cold start" phase using a small CoT dataset generated by a larger model, the model is further trained using Group Relative Policy Optimization (GRPO). This stage is designed to enhance long-chain-of-thought spatial reasoning capabilities. Task-dependent reward functions are used, including exact match for multiple-choice, mean relative accuracy for numerical answers, and Levenshtein distance for verbal answers, along with a reasoning length reward.
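To illustrate the task-dependent answer rewards, here is a minimal sketch, not the authors' exact implementation: the relative-error threshold sweep for mean relative accuracy follows the common VSI-Bench convention and is an assumption here, the Levenshtein term is normalized by string length for illustration, the question-type labels are made up, and the reasoning-length reward is omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def answer_reward(pred: str, gt: str, question_type: str) -> float:
    """Task-dependent answer reward: exact match for multiple-choice questions,
    mean relative accuracy for numerical answers, and a normalized Levenshtein
    similarity for free-form verbal answers."""
    if question_type == "multiple_choice":
        return float(pred.strip().lower() == gt.strip().lower())
    if question_type == "numerical":
        try:
            p, g = float(pred), float(gt)
        except ValueError:
            return 0.0
        # Mean relative accuracy: average correctness over a sweep of
        # relative-error thresholds (assumed sweep 0.50 ... 0.95).
        thresholds = [0.5 + 0.05 * k for k in range(10)]
        rel_err = abs(p - g) / max(abs(g), 1e-8)
        return sum(rel_err < (1.0 - t) for t in thresholds) / len(thresholds)
    # Verbal answers: 1 minus the normalized edit distance.
    max_len = max(len(pred), len(gt), 1)
    return 1.0 - levenshtein(pred.lower(), gt.lower()) / max_len
```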
To construct the cold start dataset, the authors sample a subset of the training data, generate multiple reasoning paths and answers with a larger model (Qwen2.5-VL-72B), and compute a reward for each generated answer against the ground truth. The generated CoT examples are then adaptively filtered with question-type-dependent reward thresholds, yielding a balanced, high-quality dataset for the initial RL alignment, as sketched below.
```python
import random
import statistics

def build_cold_start_dataset(dataset, model, reward_fn, sample_size, paths_per_item):
    """
    Builds the cold start CoT dataset: sample N_s items, generate K reasoning
    paths per item with a larger model (e.g. Qwen2.5-VL-72B), keep the best
    (thought, answer) pair per item, and filter with per-question-type
    reward thresholds.

    Args:
        dataset (list): Items (question, gt_answer, video, question_type);
            the last field corresponds to M_i in the paper's pseudocode.
        model (callable): model(video, question) -> (thought, answer).
        reward_fn (callable): reward_fn(pred_answer, gt_answer) -> float.
        sample_size (int): Number of items N_s to sample from the dataset.
        paths_per_item (int): Number of reasoning paths K per item.

    Returns:
        list: Selected (thought, answer) pairs forming the cold start dataset.
    """
    subset = random.sample(dataset, sample_size)

    best_rewards = {}  # item index -> highest reward among its K paths
    best_outputs = {}  # item index -> (thought, answer) achieving that reward

    for i, (question, gt_answer, video, _) in enumerate(subset):
        rewards, outputs = [], []
        for _ in range(paths_per_item):
            thought, answer = model(video, question)      # generate one CoT path
            rewards.append(reward_fn(answer, gt_answer))  # score it against ground truth
            outputs.append((thought, answer))
        best_k = max(range(paths_per_item), key=rewards.__getitem__)
        best_rewards[i] = rewards[best_k]
        best_outputs[i] = outputs[best_k]

    # Per-question-type threshold: the median (50th percentile) of best rewards.
    rewards_by_type = {}
    for i, (_, _, _, q_type) in enumerate(subset):
        rewards_by_type.setdefault(q_type, []).append(best_rewards[i])
    type_thresholds = {t: statistics.median(r) for t, r in rewards_by_type.items()}

    # Keep an item's best output only if it clears its type threshold
    # and has a strictly positive reward.
    cold_start_dataset = []
    for i, (_, _, _, q_type) in enumerate(subset):
        if best_rewards[i] >= type_thresholds[q_type] and best_rewards[i] > 0:
            cold_start_dataset.append(best_outputs[i])
    return cold_start_dataset
```
Extensive experiments were conducted on benchmarks like VSI-Bench, ScanQA, and SQA3D. Results demonstrate that Spatial-MLLM achieves state-of-the-art performance across a wide range of visual-based spatial understanding and reasoning tasks. Notably, Spatial-MLLM, despite having only 4 billion parameters and using a limited number of input frames (e.g., 16), significantly outperforms many larger models (up to 72B parameters) and even some models that require explicit 3D or 2.5D data input. Ablation studies confirm the effectiveness of both the dual-encoder architecture and the space-aware frame sampling strategy.
The authors note two limitations: the model and training data have not yet been scaled up further, and the current focus is primarily on spatial intelligence tasks rather than general video understanding. Future work could explore whether integrating spatial structural information also benefits broader video reasoning tasks.