SpatialBot: Vision-Language Spatial Reasoning
- SpatialBot models are a family of vision-language systems that combine RGB images and depth maps to achieve metric and relation-aware spatial reasoning.
- They employ a modular pipeline with a SigLIP encoder and transformer-based LLMs, optimized via staged pretraining and fine-tuning on dedicated spatial datasets.
- SpatialBot demonstrates significant improvements in depth estimation and robotic manipulation, supporting rigorous benchmarking in embodied AI tasks.
SpatialBot encompasses a family of models and frameworks designed for rigorous spatial reasoning in robotics and embodied AI. Most contemporary usage refers to vision-LLMs (VLMs) that ingest both RGB and depth images to achieve metric and relation-aware spatial understanding for manipulation, navigation, and evaluation. Key systems within this paradigm leverage a fusion of deep learning architectures for high-fidelity depth reasoning, supported by dedicated datasets and benchmarks enabling quantitative measurement of spatial comprehension against state-of-the-art baselines.
1. Architectural Foundations
SpatialBot VLMs adhere to a modular pipeline that combines visual representation learning with LLMs for multimodal reasoning (Cai et al., 19 Jun 2024). The essential components include:
- Dual-Modality Input: Receives both an RGB image $I_{\text{RGB}} \in \mathbb{R}^{H \times W \times 3}$ and a depth map $D \in \mathbb{R}^{H \times W \times C}$, where $C = 1$ for raw depth or $C = 3$ for encoded depth channels.
- Vision Encoder: Both modalities are processed through the SigLIP encoder, yielding dense token embeddings $E_{\text{RGB}}$ and $E_{D}$.
- Multi-Modal Projector and Fusion: Embeddings are concatenated or fused to form an input sequence for the LLM.
- LLM Backbone: The concatenated visual embedding is provided to a frozen or lightly fine-tuned transformer (Phi-2 3B, Phi-3 4B, Qwen-1.5 4B, Llama3-8B).
- Depth API: Enables dynamic querying of depth values through tokens like “Depth(x,y)” during decoding, with live metric feedback from the raw depth map $D$.
This configuration supports precise interpretation and manipulation of 3D scenes, bridging pixelwise depth, object relations, and global spatial reasoning (Cai et al., 19 Jun 2024).
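The data flow can be sketched as follows. The module names (`siglip_encode`, `project`), token counts, and embedding sizes are illustrative stand-ins rather than the released interfaces; the `depth_at` helper only indicates how a “Depth(x,y)” token could be resolved against the raw metric map during decoding.

```python
import numpy as np

# Illustrative stand-ins for SpatialBot's modules; shapes are assumptions.
NUM_TOKENS, VISION_DIM, LLM_DIM = 729, 1152, 2560

def siglip_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in vision encoder: any image array -> (NUM_TOKENS, VISION_DIM) tokens."""
    rng = np.random.default_rng(int(image.sum()) % (2**32))
    return rng.standard_normal((NUM_TOKENS, VISION_DIM))

def project(tokens: np.ndarray) -> np.ndarray:
    """Stand-in multi-modal projector mapping vision tokens into the LLM space."""
    weight = np.ones((VISION_DIM, LLM_DIM)) / VISION_DIM
    return tokens @ weight

def depth_at(depth_map: np.ndarray, x: int, y: int) -> float:
    """Resolve a Depth(x, y) query against the raw metric depth map (here: mm)."""
    return float(depth_map[y, x])

def fuse_modalities(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Encode RGB and depth separately, project both, and concatenate the tokens."""
    e_rgb = project(siglip_encode(rgb))
    e_depth = project(siglip_encode(depth))
    return np.concatenate([e_rgb, e_depth], axis=0)  # fed to the LLM with the prompt

rgb = np.zeros((384, 384, 3), dtype=np.uint8)
depth_mm = np.full((384, 384), 1500, dtype=np.uint32)   # a flat scene 1.5 m away
print(fuse_modalities(rgb, depth_mm).shape)             # (2 * NUM_TOKENS, LLM_DIM)
print(depth_at(depth_mm, x=100, y=200))                 # 1500.0
```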
2. Training Objectives and Losses
SpatialBot models are optimized using staged supervised pretraining and fine-tuning procedures:
- Language Modeling Loss: Standard cross-entropy over output tokens is applied to all QA, description, and spatial instruction outputs: $\mathcal{L}_{\text{LM}} = -\sum_{t} \log p_{\theta}(y_t \mid y_{<t}, E_{\text{RGB}}, E_{D})$.
- Robot Embodiment Loss: For manipulation tasks, the model predicts the 7-DoF end-effector pose as a multi-stream classification problem, with each degree of freedom discretized into 101 bins; this loss is accumulated alongside the QA and control supervision (a discretization sketch follows this list).
- Multi-Task Loss Aggregation: When combining datasets, losses are not weighted; batch composition is balanced, and optional LoRA adapters stabilize tuning.
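A minimal sketch of the embodiment objective under the stated 101-bin discretization is given below. The 7-DoF layout (position plus quaternion) and the workspace bounds are assumptions for illustration, and the cross-entropy is written out explicitly in NumPy rather than taken from the reference training code.

```python
import numpy as np

NUM_BINS = 101  # per the stated discretization of each degree of freedom
DOF = 7         # assumed layout: x, y, z plus a 4-component orientation (quaternion)

def discretize_pose(pose: np.ndarray, lows: np.ndarray, highs: np.ndarray) -> np.ndarray:
    """Map each continuous DoF into one of NUM_BINS uniform bins over its range."""
    frac = (pose - lows) / (highs - lows)
    return np.clip((frac * (NUM_BINS - 1)).round().astype(int), 0, NUM_BINS - 1)

def embodiment_loss(logits: np.ndarray, target_bins: np.ndarray) -> float:
    """Multi-stream classification loss: mean cross-entropy over the 7 DoF streams.

    logits: (DOF, NUM_BINS) unnormalized scores, one stream per degree of freedom.
    target_bins: (DOF,) integer bin indices produced by discretize_pose.
    """
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(DOF), target_bins].mean())

# Hypothetical end-effector pose and workspace bounds.
lows = np.array([-0.5, -0.5, 0.0, -1.0, -1.0, -1.0, -1.0])
highs = np.array([0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0])
pose = np.array([0.1, -0.2, 0.35, 0.0, 0.0, 0.0, 1.0])
bins = discretize_pose(pose, lows, highs)
loss = embodiment_loss(np.zeros((DOF, NUM_BINS)), bins)   # uniform logits -> log(101)
print(bins, round(loss, 3))
```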
Training schedules involve pretraining on LAION-2M, followed by fine-tuning with Bunny_695k, SpatialQA, and SpatialQA-E.
3. SpatialQA and Data Infrastructure
Advancing depth reasoning necessitates purpose-built datasets. SpatialBot is trained on SpatialQA, comprising multi-level RGB-D scenes annotated for varied spatial queries (Cai et al., 19 Jun 2024):
| Level | Example Tasks | Volume (images) |
|---|---|---|
| Low-level | Pixelwise depth query, map description | 20,000 |
| Middle-level | Object depth statistics, proximity ranking | 40,000 |
| High-level | Spatial relations, counting, referring-expression queries | 695,000 |
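The annotation levels in the table can be pictured with QA pairs like the following; the field names and exact phrasings are hypothetical, not the released SpatialQA schema.

```python
# Illustrative QA pairs, one per annotation level; wording and fields are assumed.
spatialqa_examples = [
    {"level": "low",    "question": "What is the depth at pixel (320, 180)?",
     "answer": "1423 mm"},
    {"level": "middle", "question": "Which object is closer to the camera, the mug or the book?",
     "answer": "the mug"},
    {"level": "high",   "question": "How many cups are on the table, and is the red one within reach?",
     "answer": "two cups; the red cup is within reach"},
]
for qa in spatialqa_examples:
    print(f"[{qa['level']}] {qa['question']} -> {qa['answer']}")
```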
Additional sources (KITTI, NYU-Depth, RT-X, SA-1B, 2D-3DS) extend coverage of natural and synthetic scenes. Depth is losslessly encoded as uint24 values split across three uint8 channels, supporting consistent quantification. Each scene is annotated with 2–3 QA pairs to maximize data efficiency in spatial question answering.
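A sketch of a lossless uint24 depth packing consistent with this description is shown below; the byte order is an assumption, since any fixed ordering round-trips exactly.

```python
import numpy as np

def encode_depth_u24(depth_mm: np.ndarray) -> np.ndarray:
    """Pack uint24 depth (e.g. millimetres, < 2**24) into three uint8 channels.

    The high/mid/low byte order here is an assumption; any fixed order is lossless.
    """
    d = depth_mm.astype(np.uint32)
    high = (d >> 16) & 0xFF
    mid = (d >> 8) & 0xFF
    low = d & 0xFF
    return np.stack([high, mid, low], axis=-1).astype(np.uint8)

def decode_depth_u24(encoded: np.ndarray) -> np.ndarray:
    """Invert encode_depth_u24, recovering the exact uint24 depth values."""
    e = encoded.astype(np.uint32)
    return (e[..., 0] << 16) | (e[..., 1] << 8) | e[..., 2]

depth = np.array([[1500, 250000]], dtype=np.uint32)   # 1.5 m and 250 m in mm
assert np.array_equal(decode_depth_u24(encode_depth_u24(depth)), depth)
```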
4. Benchmarking and Performance Metrics
SpatialBot’s spatial reasoning capabilities are rigorously assessed on dedicated and general-purpose evaluation platforms (Cai et al., 19 Jun 2024):
- SpatialBench: 120 scenes, six categories—depth estimation, positional relations, existence, counting, reach/touch, size comparison. Accuracy is computed per task.
- General VLM Benchmarks: MME (perception/cognition), MMBench, SEED-I, VQA-v2, GQA, POPE, testing broad vision-language comprehension.
- Embodied AI (SpatialQA-E): 2,000 robot episodes; pick-and-place, obstacle avoidance, ambiguous instructions.
Depth-estimation accuracy improves from 70.6% (RGB) to 85.8% (RGB-D) with Bunny pretraining, and exceeds 99% after full SpatialQA fine-tuning. Pick-and-place policy success improves by 10–15 percentage points with RGB-D inputs versus RGB alone, and gains on general VLM benchmarks range from +0.9 to +9.1 points across datasets.
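Per-task accuracy on a SpatialBench-style split can be aggregated in a few lines; the record fields below are assumed for illustration rather than taken from the benchmark harness.

```python
from collections import defaultdict

def per_category_accuracy(records):
    """Aggregate exact-match accuracy per benchmark category.

    `records` is a list of dicts with assumed keys: category, prediction, answer.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += int(
            r["prediction"].strip().lower() == r["answer"].strip().lower()
        )
    return {c: hits[c] / totals[c] for c in totals}

records = [
    {"category": "depth", "prediction": "1.4 m", "answer": "1.4 m"},
    {"category": "counting", "prediction": "3", "answer": "2"},
]
print(per_category_accuracy(records))  # {'depth': 1.0, 'counting': 0.0}
```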
5. Implementation Details and Resources
SpatialBot variants are publicly available, supporting standardized replication and extension:
- Backbones: SigLIP (384×384) vision encoder; transformer LLMs (Phi-2 3B, Phi-3 4B, Qwen-1.5 4B, Llama3-8B); CLIP (336×336) in dark/robotic settings.
- Optimization Settings: Separate learning rates are used for the multi-modal projector, image encoder, and LLM, with a lower rate for the larger variants. Training on 8×A100 GPUs takes approximately 15 hours for the 3B models.
- Open Assets: Code and checkpoints on GitHub and HuggingFace:
- https://github.com/BAAI-DCAI/SpatialBot
- hf.co/datasets/RussRobin/SpatialQA, SpatialQA-E, SpatialBench
- hf.co/RussRobin/SpatialBot-3B, full Bunny model zoo
These resources facilitate rapid deployment for evaluation or downstream robotic tasks.
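For example, the released checkpoint and benchmark can be fetched with `huggingface_hub` using the repository ids listed above; model-loading specifics (processor, remote code) follow the repository README and are omitted here.

```python
from huggingface_hub import snapshot_download

# Download the SpatialBot-3B checkpoint and the SpatialBench evaluation data
# into the local Hugging Face cache; repository ids are those listed above.
model_dir = snapshot_download(repo_id="RussRobin/SpatialBot-3B")
bench_dir = snapshot_download(repo_id="RussRobin/SpatialBench", repo_type="dataset")
print(model_dir, bench_dir)
```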
6. Relation to Deep Generative Spatial Models
Earlier architectures for spatial modeling leveraged probabilistic generative frameworks such as Sum-Product Networks (SPNs) to encode joint distributions over geometry and semantics (Pronobis et al., 2016):
- Generative Density: Model the joint distribution $p(x, y)$, where $x$ is a vector of spatial cell features and $y$ a semantic place class.
- Tractable Inference: SPNs enable efficient linear-time upward–downward passes for classification, novelty detection, missing-data imputation, and generative sampling.
- Performance: Achieved 92% place classification accuracy (vs. SVM: 85%, GAN: 80%), ROC AUC of 0.96 for novelty detection, and RMSE of 0.05 for imputation.
This suggests that contemporary SpatialBot VLMs complement rather than supplant generative graphical modeling in spatial reasoning, particularly where explicit uncertainty and tractable density estimation are required.
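As a concrete picture of these inference patterns, the sketch below substitutes a simple tractable joint density (diagonal-Gaussian class-conditionals) for an SPN: place classification is an argmax over the joint, and novelty detection thresholds the negative log-marginal. The class names and data are synthetic.

```python
import numpy as np

# Stand-in tractable joint density p(x, y) = p(y) * p(x | y) with Gaussian
# class-conditionals; an SPN plays the same role with a richer structure.
rng = np.random.default_rng(0)
classes = {"corridor": rng.normal(0.0, 1.0, (200, 4)),
           "office": rng.normal(3.0, 1.0, (200, 4))}

params = {y: (X.mean(axis=0), X.var(axis=0) + 1e-6, len(X)) for y, X in classes.items()}

def log_joint(x, y):
    """log p(x, y) under the stand-in model (diagonal Gaussian likelihood)."""
    mean, var, n = params[y]
    prior = n / sum(p[2] for p in params.values())
    ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return np.log(prior) + ll

def classify(x):
    """Place classification: argmax_y p(x, y)."""
    return max(params, key=lambda y: log_joint(x, y))

def novelty_score(x):
    """Negative log-marginal -log p(x); large values indicate novel inputs."""
    return -np.logaddexp.reduce([log_joint(x, y) for y in params])

x = np.array([2.9, 3.1, 3.0, 2.8])
print(classify(x), round(float(novelty_score(x)), 2))
```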
7. Applications and Scope
SpatialBot models are employed in diverse robotics and embodied AI settings:
- Visual Question Answering (VQA): Depth-sensitive queries spanning pixel, object, and scene-level reasoning.
- Robot Manipulation: Pick-and-place, obstacle avoidance, metric reachability, interpreting ambiguous spatial instructions.
- Evaluation: Quantitative comparison of vision-LLMs, depth estimation, counting, and spatial relation inference.
A plausible implication is that the integration of metric depth with high-capacity LLMs advances spatial understanding across both synthetic and real-world datasets, with direct impact on policy learning, perception, and manipulation in embodied systems (Cai et al., 19 Jun 2024, Pronobis et al., 2016).