- The paper introduces Spatial-MLLM, which fuses a 2D visual encoder with a spatial encoder to integrate semantic features and implicit 3D structure from 2D videos.
- It employs a novel space-aware frame sampling strategy via a greedy maximum coverage algorithm to select frames that maximize spatial context.
- Extensive experiments on benchmarks like VSI-Bench and ScanQA demonstrate that the model, with only 4B parameters, outperforms many larger models and those requiring explicit 3D data.
Here is a summary of the paper "Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence" (arXiv:2505.23747).
The paper addresses the challenge of enhancing the visual-based spatial intelligence of Multimodal LLMs (MLLMs), particularly those processing video inputs, when only 2D observations are available. Existing video MLLMs often struggle with tasks requiring fine-grained 3D scene understanding and reasoning because their visual encoders, typically trained on 2D image-text data (like CLIP), are optimized for semantic content but lack the structural prior necessary to infer 3D geometry from 2D inputs. While 3D MLLMs exist, they usually require explicit 3D or 2.5D data (like point clouds or depth maps), which are not always available in real-world scenarios.
To tackle this, the authors propose Spatial-MLLM, a novel framework designed to infer global spatial layout and relationships directly from 2D video sequences. The core idea is to leverage the strong structural prior of feed-forward visual geometry foundation models, which are trained on pixel-point correspondences and can recover 3D information from purely 2D inputs, and to combine it with the semantic features of standard 2D visual encoders.
The Spatial-MLLM architecture consists of:
- Dual-Encoder:
- A 2D visual encoder (initialized from a general-purpose video MLLM's visual encoder, such as Qwen2.5-VL) extracts rich 2D semantic features from input frames.
- A spatial encoder (initialized from the feature backbone of a visual geometry model like VGGT) extracts implicit 3D structural information across frames.
- Connector: A lightweight module, implemented with MLPs, fuses the features from the 2D and spatial encoders into unified visual tokens. This integrated representation contains both semantic and structural information, enabling the LLM backbone to perform spatial reasoning without needing explicit 3D data.
- LLM Backbone: An LLM (e.g., Qwen2.5-VL's backbone) processes the fused visual tokens along with textual prompts to generate responses.
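To make the fusion step concrete, here is a minimal PyTorch-style sketch of the dual-encoder-plus-connector idea, not the paper's actual implementation: the feature dimensions, the concatenation-then-MLP fusion, and all class and variable names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualEncoderConnector(nn.Module):
    """Sketch: fuse per-token semantic features from the 2D visual encoder with
    per-token structural features from the spatial encoder into unified visual
    tokens for the LLM backbone. Dimensions are placeholders, not the paper's."""
    def __init__(self, dim_2d=1280, dim_3d=1024, dim_llm=2048):
        super().__init__()
        self.proj = nn.Sequential(          # lightweight MLP connector
            nn.Linear(dim_2d + dim_3d, dim_llm),
            nn.GELU(),
            nn.Linear(dim_llm, dim_llm),
        )

    def forward(self, feats_2d, feats_3d):
        # feats_2d: (B, N_tokens, dim_2d) from the 2D visual encoder
        # feats_3d: (B, N_tokens, dim_3d) from the spatial (geometry) encoder
        fused = torch.cat([feats_2d, feats_3d], dim=-1)
        return self.proj(fused)              # (B, N_tokens, dim_llm) visual tokens

# Illustrative call: 16 frames x 196 patches = 3136 visual tokens per clip.
feats_2d = torch.randn(1, 3136, 1280)
feats_3d = torch.randn(1, 3136, 1024)
tokens = DualEncoderConnector()(feats_2d, feats_3d)  # (1, 3136, 2048)
```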
A practical challenge in video MLLMs is handling long video sequences with limited token context length, especially for spatial tasks where observing varied viewpoints is crucial. Spatial-MLLM introduces a space-aware frame sampling strategy used during inference. This strategy utilizes the spatial encoder's capability to decode 3D features into a voxel grid representation of the scene. It then formulates frame selection as a maximum coverage problem over these voxels, solved using a greedy algorithm to choose a limited number of frames that provide the most comprehensive spatial coverage of the scene.
The algorithm for greedy maximum coverage sampling is as follows:
```python
def greedy_max_coverage_sampling(frame_voxel_sets, target_size):
    """
    Selects frames with maximum spatial coverage using a greedy algorithm.

    Args:
        frame_voxel_sets (list): List of sets, where each set contains
            the voxels covered by a frame.
        target_size (int): The desired number of frames to select.

    Returns:
        list: Indices of the selected frames.
    """
    num_frames = len(frame_voxel_sets)
    selected_indices = []
    covered_voxels = set()
    remaining_candidates = list(range(num_frames))

    for _ in range(target_size):
        if not remaining_candidates:
            break
        # Pick the candidate frame that adds the most not-yet-covered voxels.
        best_frame_index = -1
        max_new_coverage = -1
        for i in remaining_candidates:
            new_coverage = len(frame_voxel_sets[i] - covered_voxels)
            if new_coverage > max_new_coverage:
                max_new_coverage = new_coverage
                best_frame_index = i
        # Stop early if no remaining frame covers any new voxel.
        if max_new_coverage == 0:
            break
        selected_indices.append(best_frame_index)
        covered_voxels.update(frame_voxel_sets[best_frame_index])
        remaining_candidates.remove(best_frame_index)

    return selected_indices
```
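As a hypothetical usage sketch: the voxelization step below (quantizing per-frame 3D points, such as those decoded by the spatial encoder, into a voxel grid) is an assumption for illustration, with a made-up voxel size and random points standing in for real predictions.

```python
import numpy as np

def points_to_voxel_set(points_xyz, voxel_size=0.2):
    """Quantize an (N, 3) array of 3D points into a set of integer voxel indices."""
    voxels = np.floor(points_xyz / voxel_size).astype(int)
    return set(map(tuple, voxels))

# frame_points[i] would hold the 3D points recovered for frame i; random data here.
frame_points = [np.random.rand(1000, 3) * 5.0 for _ in range(128)]
frame_voxel_sets = [points_to_voxel_set(p) for p in frame_points]
selected = greedy_max_coverage_sampling(frame_voxel_sets, target_size=16)
```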
To train Spatial-MLLM, the authors constructed a new visual-based spatial question-answering dataset called Spatial-MLLM-120k. This dataset is derived from ScanNet scenes and includes QA pairs covering tasks like object counting, size, distance, direction, and appearance order. Training follows a two-stage pipeline:
- Supervised Fine-Tuning (SFT): The model is initially fine-tuned on Spatial-MLLM-120k. The 2D and spatial encoders are frozen, while the connector and LLM backbone are trained using a standard cross-entropy loss.
- Reinforcement Learning (RL): After an initial "cold start" phase using a small CoT dataset generated by a larger model, the model is further trained using Group Relative Policy Optimization (GRPO). This stage is designed to enhance long-chain-of-thought spatial reasoning capabilities. Task-dependent reward functions are used, including exact match for multiple-choice, mean relative accuracy for numerical answers, and Levenshtein distance for verbal answers, along with a reasoning length reward.
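To illustrate the task-dependent answer rewards, here is a minimal sketch, not the authors' exact implementation: the relative-error threshold sweep for mean relative accuracy follows the common VSI-Bench convention and is an assumption here, the Levenshtein term is normalized by string length for illustration, the question-type labels are made up, and the reasoning-length reward is omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def answer_reward(pred: str, gt: str, question_type: str) -> float:
    """Task-dependent answer reward: exact match for multiple-choice questions,
    mean relative accuracy for numerical answers, and a normalized Levenshtein
    similarity for free-form verbal answers."""
    if question_type == "multiple_choice":
        return float(pred.strip().lower() == gt.strip().lower())
    if question_type == "numerical":
        try:
            p, g = float(pred), float(gt)
        except ValueError:
            return 0.0
        # Mean relative accuracy: average correctness over a sweep of
        # relative-error thresholds (assumed sweep 0.50 ... 0.95).
        thresholds = [0.5 + 0.05 * k for k in range(10)]
        rel_err = abs(p - g) / max(abs(g), 1e-8)
        return sum(rel_err < (1.0 - t) for t in thresholds) / len(thresholds)
    # Verbal answers: 1 minus the normalized edit distance.
    max_len = max(len(pred), len(gt), 1)
    return 1.0 - levenshtein(pred.lower(), gt.lower()) / max_len
```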
To construct the cold start dataset, the authors sample a subset of the training data, generate multiple reasoning paths and answers with a larger model (Qwen2.5-VL-72B), and compute a reward for each generated answer against the ground truth. The generated CoT examples are then adaptively filtered with question-type-dependent reward thresholds, yielding a balanced, high-quality dataset for the initial RL alignment, as sketched below.
```python
import random
import statistics

def build_cold_start_dataset(dataset, model, reward_fn, sample_size, paths_per_item):
    """
    Builds the cold start CoT dataset: sample N_s items, generate K reasoning
    paths per item with a larger model (e.g. Qwen2.5-VL-72B), keep the best
    (thought, answer) pair per item, and filter with per-question-type
    reward thresholds.

    Args:
        dataset (list): Items (question, gt_answer, video, question_type);
            the last field corresponds to M_i in the paper's pseudocode.
        model (callable): model(video, question) -> (thought, answer).
        reward_fn (callable): reward_fn(pred_answer, gt_answer) -> float.
        sample_size (int): Number of items N_s to sample from the dataset.
        paths_per_item (int): Number of reasoning paths K per item.

    Returns:
        list: Selected (thought, answer) pairs forming the cold start dataset.
    """
    subset = random.sample(dataset, sample_size)

    best_rewards = {}  # item index -> highest reward among its K paths
    best_outputs = {}  # item index -> (thought, answer) achieving that reward

    for i, (question, gt_answer, video, _) in enumerate(subset):
        rewards, outputs = [], []
        for _ in range(paths_per_item):
            thought, answer = model(video, question)      # generate one CoT path
            rewards.append(reward_fn(answer, gt_answer))  # score it against ground truth
            outputs.append((thought, answer))
        best_k = max(range(paths_per_item), key=rewards.__getitem__)
        best_rewards[i] = rewards[best_k]
        best_outputs[i] = outputs[best_k]

    # Per-question-type threshold: the median (50th percentile) of best rewards.
    rewards_by_type = {}
    for i, (_, _, _, q_type) in enumerate(subset):
        rewards_by_type.setdefault(q_type, []).append(best_rewards[i])
    type_thresholds = {t: statistics.median(r) for t, r in rewards_by_type.items()}

    # Keep an item's best output only if it clears its type threshold
    # and has a strictly positive reward.
    cold_start_dataset = []
    for i, (_, _, _, q_type) in enumerate(subset):
        if best_rewards[i] >= type_thresholds[q_type] and best_rewards[i] > 0:
            cold_start_dataset.append(best_outputs[i])
    return cold_start_dataset
```
Extensive experiments were conducted on benchmarks like VSI-Bench, ScanQA, and SQA3D. Results demonstrate that Spatial-MLLM achieves state-of-the-art performance across a wide range of visual-based spatial understanding and reasoning tasks. Notably, Spatial-MLLM, despite having only 4 billion parameters and using a limited number of input frames (e.g., 16), significantly outperforms many larger models (up to 72B parameters) and even some models that require explicit 3D or 2.5D data input. Ablation studies confirm the effectiveness of both the dual-encoder architecture and the space-aware frame sampling strategy.
The authors note two limitations: the model and training data have not yet been scaled up further, and the current focus is primarily on spatial intelligence tasks rather than general video understanding. Future work could explore whether integrating spatial structural information also benefits broader video reasoning tasks.