SpatialVLM: 3D Vision-Language Models

Updated 19 January 2026
  • SpatialVLM is a vision-language model class equipped with 3D spatial reasoning via metric supervision and multimodal data integration.
  • It employs innovations like depth plugins, LiDAR integration, and dual-stream architectures to fuse geometry and semantic cues.
  • It demonstrates strong performance in spatial VQA benchmarks and supports applications in robotics, autonomous driving, and 3D scene editing.

SpatialVLM (Spatial Vision-Language Model) designates a class of vision-language models explicitly endowed with 3D spatial reasoning capabilities via architectural innovation, large-scale spatially grounded data generation, and multi-modal training protocols. Addressing the historical inability of standard VLMs to perform quantitative spatial understanding and metric reasoning, these systems integrate spatially explicit supervision—such as depth maps, LiDAR, semantically labeled 3D point clouds, and synthetic metric QA data—to learn both qualitative and quantitative spatial relations in real-world scenes. This entry collates the core paradigms, datasets, architectures, training methodologies, and empirical results from current SpatialVLM research, with emphasis on both foundational advances and representative instantiations (Chen et al., 2024, Liu et al., 26 May 2025, Wei et al., 30 Dec 2025, Cheng et al., 2024, Chen et al., 22 Sep 2025, Hu et al., 26 Nov 2025, Islam et al., 3 Oct 2025, Sun et al., 18 May 2025).

1. Motivation: Limitations of Conventional VLMs in Spatial Reasoning

Vision-language models such as GPT-4V and PaLM-E achieve state-of-the-art performance on image captioning and general visual QA but fail at tasks requiring quantitative 3D spatial reasoning—measuring distances, comparing elevations, or estimating real-world object sizes—because internet-scale image-text corpora lack metric 3D supervision. Typical 2D datasets encode only qualitative spatial relations (“to the left of,” “above”) and do not permit programmatic access to object-centric depth, pose, or metric distances (Chen et al., 2024). This deprives models of the inductive bias and supervision needed for metric geometric inference, leading to subpar performance in robotics, scene understanding, navigation, and decision-making applications where spatial precision is required.

2. Spatially Grounded Data Generation and Datasets

A primary breakthrough is the automatic synthesis of large-scale 3D spatial VQA datasets from real-world images. For instance, SpatialVLM generates up to 2 billion metric spatial VQA examples over 10 million images (Chen et al., 2024). The pipeline involves semantic filtering (to retain only natural scenes), object-centric mask extraction, monocular or multi-view depth estimation (using models such as ZoeDepth), and coordinate canonicalization with RANSAC plane fitting to align the vertical axis with gravity. For each detected object, the pipeline computes centroids, 3D bounding-box extents, pairwise metric distances (Euclidean in canonicalized world coordinates), and size ratios.
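The final step of the pipeline—turning canonicalized object geometry into templated metric QA pairs—can be sketched as follows. This is a minimal illustration, not the paper's actual code: the object dictionary keys, templates, and the assumption that the third extent axis is gravity-aligned height are all hypothetical.

```python
import numpy as np

def spatial_qa_from_objects(objects):
    """Generate template-based metric QA pairs from canonicalized 3D objects.

    `objects` is a list of dicts with hypothetical keys: a 'name', a 3D
    'centroid' in gravity-aligned world coordinates (meters), and a 3D
    bounding-box 'extent' whose last component is vertical height (meters).
    """
    qa_pairs = []
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            # Pairwise Euclidean distance between object centroids.
            dist = float(np.linalg.norm(
                np.asarray(a["centroid"]) - np.asarray(b["centroid"])))
            qa_pairs.append((
                f"How far apart are the {a['name']} and the {b['name']}?",
                f"About {dist:.2f} meters.",
            ))
            # Size ratio along the gravity-aligned vertical axis.
            ratio = a["extent"][2] / b["extent"][2]
            qa_pairs.append((
                f"How many times taller is the {a['name']} than the {b['name']}?",
                f"Roughly {ratio:.1f}x.",
            ))
    return qa_pairs

objects = [
    {"name": "chair", "centroid": (0.0, 0.0, 0.0), "extent": (0.5, 0.5, 0.9)},
    {"name": "lamp", "centroid": (3.0, 4.0, 0.0), "extent": (0.3, 0.3, 1.8)},
]
print(spatial_qa_from_objects(objects))
```

Running such templates over millions of images with dozens of detected objects each is what yields QA corpora at the billion-example scale.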

Spatial-ViLT constructs datasets with fine-grained depth maps, 3D coordinates, and edge maps obtained via MiDaS and geometric post-processing, enabling multi-task learning for 3D spatial relations (Islam et al., 3 Oct 2025).

The MSMU dataset (Chen et al., 22 Sep 2025) provides 700k VQA pairs with 2.5M metric annotations for tasks including scale estimation, absolute and relative position, reference-object reasoning, and existence, while the SUN-Spot v2.0 dataset (Sun et al., 18 May 2025) specializes in spatial referring expressions via region-marker prompting. Scene graph-derived datasets (OSD, SpatialRGBT-Bench) enable region- and instance-conditioned spatial QA (Cheng et al., 2024).

3. Model Architectures and Spatial Information Fusion

SpatialVLM systems adopt diverse architectural strategies to inject and fuse spatial signals:

  • Metric-Enriched Vision Encoder: Many models employ a ViT backbone for initial vision encoding and augment per-patch features with metric depth encodings. SD-VLM utilizes a Depth Positional Encoding: for each image patch, a sinusoidal embedding function of the real-world depth is added to the patch token, aligning visual tokens along the camera’s z-axis (Chen et al., 22 Sep 2025). Spatial-ViLT integrates decoders that reconstruct depth and 3D maps from the multimodal embedding layer, incentivizing the Transformer’s attention heads to acquire geometry-sensitive features (Islam et al., 3 Oct 2025).
  • Point Clouds and LiDAR Integration: For driving applications, LVLDrive integrates both 2D vision and LiDAR tokens through a Gradual Fusion Q-Former. Initially, only image features are attended to; during staged fine-tuning, LiDAR features (after BEV voxelization and positional embedding) are injected through attention layers controlled by learnable, zero-initialized gates, allowing progressive inclusion of 3D cues while preserving pre-trained visual-text alignment (Wei et al., 30 Dec 2025).
  • Dual-Stream Architectures with Geometry Experts: G²VLM employs a Mixture-of-Transformer-Experts with parallel semantic and geometric pathways: the geometric expert is trained for 3D point and pose reconstruction using only images, and its features are merged with the semantic (standard VLM) stream at each layer, enabling downstream semantic reasoning to directly leverage explicit geometry (Hu et al., 26 Nov 2025).
  • Depth “Plugin” and Region-Aware Reasoning: SpatialRGPT introduces an extensible “depth plugin” attached to standard VLMs, projecting both visual and depth feature maps and pooling them within specified user regions. Region-aware tokens are then routed to the LLM, which learns to predict metric relations (distances, angles, directions) implicitly from supervision (Cheng et al., 2024).
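SD-VLM's Depth Positional Encoding admits a compact sketch: a sinusoidal embedding indexed by each patch's real-world depth, added to the patch token. The embedding dimension, frequency schedule, and maximum depth below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def depth_positional_encoding(depths, dim=64, max_depth=80.0):
    """Sinusoidal encoding of per-patch metric depth (meters).

    Analogous to the standard 1D transformer positional encoding, but
    indexed by real-world depth along the camera z-axis rather than by
    token position. `dim` and `max_depth` are illustrative choices.
    """
    depths = np.asarray(depths, dtype=np.float64)      # (num_patches,)
    half = dim // 2
    freqs = max_depth ** (-np.arange(half) / half)     # geometric frequency ladder
    angles = depths[:, None] * freqs[None, :]          # (num_patches, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Adding the encoding to the patch tokens aligns them along the camera's z-axis:
# patches at equal metric depth receive identical depth embeddings.
patch_tokens = np.random.randn(4, 64)                  # 4 patches, 64-dim features
patch_depths = [1.2, 3.5, 3.5, 10.0]                   # metric depth per patch
spatial_tokens = patch_tokens + depth_positional_encoding(patch_depths)
```

Because the encoding is a deterministic function of depth, two patches on the same fronto-parallel surface share an embedding regardless of their 2D image location.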

4. Training Objectives and Loss Functions

All models are trained via multi-task objectives. Standard next-token cross-entropy for textual answer generation is universally used. For spatial prediction tasks, additional losses include:

  • Metric Regression: Direct supervision for scalar metric outputs (distances, sizes) via an L2 loss, e.g., $L_{L2} = \|\hat{s} - s^*\|^2$ (Chen et al., 22 Sep 2025).
  • Reconstruction Losses: For models reconstructing depth, 3D coordinate, or edge maps from the Transformer embeddings, L2 or cross-entropy losses enforce accuracy at the spatial feature level (Islam et al., 3 Oct 2025).
  • 3D Perception and Grounding: Object detection heads compute a bounding-box regression loss (e.g., $\ell_1$ + GIoU) for explicit object localization in downstream applications such as autonomous driving (Wei et al., 30 Dec 2025).
  • Region–Token Alignment: For spatial referring expression understanding, region–token matching loss aligns visual features with textual markers, crucial in setups using Set-of-Marks prompting (Sun et al., 18 May 2025).
  • Geometry Pretraining Loss: In dual-expert (geometry + semantic) architectures, the geometry expert is pre-trained with point-map, pose, and normal reconstruction objectives; in joint training, either both experts or only the semantic head are updated for spatial reasoning (Hu et al., 26 Nov 2025).
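A generic version of the multi-task objective—next-token cross-entropy plus a weighted L2 metric-regression term—can be written as a short sketch. The scalar weighting and the exact regression head differ across the cited systems; this is the common skeleton, not any one paper's implementation.

```python
import numpy as np

def multitask_loss(token_logits, token_targets, pred_scalars, true_scalars,
                   metric_weight=1.0):
    """Next-token cross-entropy plus an L2 metric-regression term.

    `metric_weight` balancing the two terms is an illustrative
    hyperparameter; real systems tune it per task.
    """
    # Next-token cross-entropy over the answer tokens.
    logits = np.asarray(token_logits, dtype=np.float64)    # (T, vocab)
    logits -= logits.max(axis=-1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(token_targets)), token_targets].mean()

    # L2 regression on scalar metric outputs (distances, sizes).
    l2 = np.mean((np.asarray(pred_scalars) - np.asarray(true_scalars)) ** 2)
    return ce + metric_weight * l2
```

With near-perfect token predictions the cross-entropy term vanishes and the total loss is dominated by the metric error, e.g. `multitask_loss([[10., 0.], [0., 10.]], [0, 1], [2.0], [1.0])` is approximately 1.0.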

5. Empirical Results and Spatial Reasoning Benchmarks

SpatialVLMs demonstrate significant improvement over prior VLMs on both qualitative and quantitative spatial tasks:

  • SpatialVLM (Chen et al., 2024): Achieves 75.2% accuracy on qualitative spatial VQA, +7% over GPT-4V, and produces well-formed numeric outputs in 99% of cases, with 37.2% of answers falling in the accepted range on quantitative benchmarks.
  • SD-VLM (Chen et al., 22 Sep 2025): Surpasses GPT-4o and Intern-VL3-78B on the MSMU-Bench by over 23 percentage points and generalizes to Q-Spatial++ and SpatialRGPT-Bench.
  • Spatial-RGPT (Cheng et al., 2024): Yields 91.8% qualitative and 41.2% quantitative QA success on its SpatialRGBT-Bench.
  • G²VLM/SpaceLM (Hu et al., 26 Nov 2025): Achieves 54.87 on SPAR-Bench, outperforming GPT-4o by 18.5 absolute points, attributed to explicit geometric reasoning from the geometry expert.
  • SpatialViLT (Islam et al., 3 Oct 2025): Improves spatial reasoning accuracy on the challenging VSR dataset, with ensemble models achieving 72.62% overall accuracy, outperforming LXMERT, ViLT, and SpaceLLaVA.
  • LVLDrive (Wei et al., 30 Dec 2025): Reduces open-loop planning L2 displacement error and collision rates compared to vision-only baselines, achieving higher BEV mIoU and language-based evaluation metrics.

Notably, ablation studies consistently demonstrate that spatial supervision—via depth, point clouds, or spatially annotated QA—directly improves metric spatial reasoning without degrading standard VQA or language understanding.

6. Applications and Downstream Use-Cases

SpatialVLMs have directly enabled new classes of downstream applications:

  • Chain-of-Thought Spatial QA: Modular LLMs coordinate with SpatialVLMs by issuing primitive geometric sub-queries (distances, relative positions), then reasoning over results to answer compositional queries (e.g., “Do these objects form a triangle?”) (Chen et al., 2024).
  • Robotics Reward Annotation: Direct, dense reward shaping is possible by querying metric relations between robot actuators and targets, yielding smooth and well-shaped reward landscapes (Chen et al., 2024, Cheng et al., 2024).
  • Autonomous Driving: By explicit LiDAR and image fusion, subject to spatially aware QA supervision, LVLDrive achieves reliable policy generation for navigation and obstacle avoidance (Wei et al., 30 Dec 2025).
  • 3D Scene Generation/Editing: With hierarchical spatial contexts comprising scene portraits, point clouds, and hypergraphs of spatial relations, VLMs can generate, edit, and verify structured 3D environments, supporting interactive editing and path planning (Liu et al., 26 May 2025).
  • Region-Aware Visual QA: SpatialVLMs with region-token alignment enable precise responses to spatial referring expressions and object localization queries (Sun et al., 18 May 2025).
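The chain-of-thought pattern above can be sketched by treating the SpatialVLM as a distance oracle that an LLM queries and reasons over. Here `query_distance` and the toy `SCENE` are hypothetical stand-ins for real metric sub-queries against an image.

```python
import itertools
import math

# Hypothetical stand-in for a SpatialVLM metric sub-query: the real system
# would answer "how far is A from B?" from the image; here we read positions
# (in meters) from a toy tabletop scene.
SCENE = {
    "mug": (0.0, 0.0),
    "plate": (0.3, 0.0),
    "bottle": (0.15, 0.4),
    "book": (0.6, 0.0),     # collinear with the mug and plate
}

def query_distance(obj_a, obj_b):
    return math.dist(SCENE[obj_a], SCENE[obj_b])

def forms_triangle(objects, eps=1e-6):
    """Compositional reasoning over primitive distance sub-queries: three
    objects form a non-degenerate triangle iff every pairwise distance is
    strictly less than the sum of the other two."""
    d = [query_distance(a, b) for a, b in itertools.combinations(objects, 2)]
    return all(d[i] < d[(i + 1) % 3] + d[(i + 2) % 3] - eps for i in range(3))

print(forms_triangle(["mug", "plate", "bottle"]))  # non-collinear: True
print(forms_triangle(["mug", "plate", "book"]))    # collinear: False
```

The point of the pattern is that the VLM only answers primitive metric queries; the compositional geometric logic (here, the triangle inequality) lives in the coordinating reasoner.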

7. Open Problems and Future Directions

Despite rapid progress, key challenges persist:

  • Generalization: Small or numerous objects may still be missed in context construction; scaling to urban-scale 3D environments remains non-trivial (Liu et al., 26 May 2025).
  • Expressivity of Spatial Constraints: Current spatial formalizations are limited to unary, binary, and some ternary relations; dynamic affordances and learning richer constraints is an open area (Liu et al., 26 May 2025).
  • Integration with Real-Time Agents: Tightening loops between spatial reasoning and embodied agents (e.g., robotics simulators) is a major future direction, as is scaling to dynamic scenes and motion-based queries (Liu et al., 26 May 2025, Hu et al., 26 Nov 2025).

In summary, SpatialVLM research establishes that internet-scale spatial QA, direct metric supervision, and explicit multi-modal geometry fusion equip VLMs with robust 3D spatial intelligence, bridging the gap from surface semantic reasoning to quantitative spatial understanding and enabling a wide spectrum of embodied and generative applications (Chen et al., 2024, Liu et al., 26 May 2025, Wei et al., 30 Dec 2025, Islam et al., 3 Oct 2025, Chen et al., 22 Sep 2025, Hu et al., 26 Nov 2025, Cheng et al., 2024, Sun et al., 18 May 2025).
