SpatialMosaic-Bench: 3D Scene Reasoning

Updated 25 January 2026
  • SpatialMosaic-Bench is a benchmark dataset that evaluates vision-language models on 3D spatial reasoning using sparse, occluded multi-view imagery.
  • The accompanying SpatialMosaicVLM combines a hybrid vision-geometry architecture with instruction tuning to exploit detailed geometric and visual cues.
  • The benchmark features six tailored tasks with comprehensive annotations, enabling rigorous analysis of model performance under partial visibility.

SpatialMosaic is a large-scale instruction-tuning dataset designed to advance vision-language models (VLMs) for 3D scene understanding and spatial reasoning, with a particular emphasis on real-world challenges such as partial visibility, occlusion, and sparse multi-view cues. Constructed atop high-fidelity RGB-D captures and 3D semantic meshes from ScanNet++ indoor environments, SpatialMosaic enables VLMs to reason over fragmented visual information without reliance on explicit 3D reconstruction pipelines. The dataset comprises comprehensive visibility and occlusion annotations, a robust multi-view question-answer (QA) generation pipeline, and an associated benchmark (SpatialMosaic-Bench) for evaluating spatial reasoning capabilities across six core task types. The work introduces the SpatialMosaicVLM hybrid architecture, integrating state-of-the-art 3D reconstruction transformers with vision-language encoders through cross-attention, demonstrating substantial performance gains on challenging spatial reasoning benchmarks (Lee et al., 29 Dec 2025).

1. Data Collection and Annotation Pipeline

SpatialMosaic leverages ScanNet++ (Yeshwanth et al. 2023), selecting 849 real-world indoor scenes (679 for training, 170 for evaluation) and sampling multi-view combinations of 2–5 images per scene. Crucially, sampled views are constrained such that their 3D-point overlap does not exceed 30%, defined via the overlap ratio:

\text{Overlap}(i,j) = \frac{|V^i \cap V^j|}{|V^i \cup V^j|}, \qquad V^i = \bigcup_n \{\text{3D points of instance } n \text{ visible in view } i\}
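
A minimal sketch of this view-pair filter, assuming each view's visible 3D points are available as integer point indices (names and data layout are illustrative, not taken from the paper's release):

```python
def overlap_ratio(points_i: set[int], points_j: set[int]) -> float:
    """IoU of the 3D points visible in two views, i.e. Overlap(i, j) above."""
    union = points_i | points_j
    if not union:
        return 0.0
    return len(points_i & points_j) / len(union)


def is_valid_combination(views: list[set[int]], max_overlap: float = 0.30) -> bool:
    """Keep a 2-5 view combination only if every view pair overlaps by at most 30%."""
    return all(
        overlap_ratio(views[a], views[b]) <= max_overlap
        for a in range(len(views))
        for b in range(a + 1, len(views))
    )
```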

This enforces the use of complementary, non-redundant cues across views. Objects in each view are rigorously annotated with two metrics:

  • Object-level occlusion ratio r_{n,obj} (Eq. 2): fraction of an object’s visible points obscured by other geometry.
  • Field-of-view occlusion ratio r_{n,FoV} (Eq. 5): fraction of the object lying outside the image crop, operationalized by rendering into an extended 2H×2W frame (Eqs. 3–4).

Objects fully visible in all selected views or with >90% occlusion across views are excluded, ensuring the dataset targets realistic “fragmented” partial-visibility regimes.
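
A small sketch of this exclusion rule, under one reading of the criterion (whether the >90% threshold applies per view or to an average across views, and how the two occlusion ratios are combined, are assumptions here):

```python
def keep_instance(occlusion_per_view: list[float]) -> bool:
    """Drop objects that are fully visible in every view or occluded >90% overall.

    occlusion_per_view: combined occlusion ratio (object-level + FoV) of one object
    in each selected view, in [0, 1]; the exact combination rule is assumed.
    """
    fully_visible_everywhere = all(r == 0.0 for r in occlusion_per_view)
    mean_occlusion = sum(occlusion_per_view) / len(occlusion_per_view)
    too_occluded = mean_occlusion > 0.90
    return not (fully_visible_everywhere or too_occluded)
```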

2. Question-Answer Generation and Task Taxonomy

For each filtered multi-view combination:

  1. Collect all object instances visible in at least one, but not all, views.
  2. Transform 3D bounding boxes into the camera coordinates of a randomly selected "query" view via v^{(w)} \to v^{(c)} = R_{wc}(v^{(w)} - t_{wc}).
  3. Generate QA pairs using task-specific templates and distractors formed by opposite or orthogonal spatial relations.
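
Step 2 amounts to a short coordinate transform; a minimal NumPy sketch, assuming R_{wc} is the world-to-camera rotation and t_{wc} the camera position in world coordinates (these conventions are assumptions, not taken from the paper's code):

```python
import numpy as np

def world_to_camera(corners_world: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Map 3D box corners (N, 3) from world to query-view camera coordinates.

    Applies v^(c) = R_wc (v^(w) - t_wc) to each corner.
    """
    return (R_wc @ (corners_world - t_wc).T).T
```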

The SpatialMosaic dataset supports six explicit spatial reasoning and occlusion-aware tasks:

| Task | Response Type | Question Example |
|------|---------------|------------------|
| Object Count | 4-way multiple choice | "How many chairs across these frames?" |
| Best-View Selection | 4-way multiple choice | "Which frame gives the most informative view of the table?" |
| Object Localization | Binary + bbox coordinates | "Is there a monitor in Frame 1? If so, what are its bbox center coordinates?" |
| Occlusion-Aware Existence | Binary | "In Frame 3, is the mouse farther from the camera than the laptop?" |
| Occlusion-Aware Attribute | 4-way single answer | "In Frame 2, which object appears lower than the table?" |
| Spatial Relation | 4-way multiple choice | "Where is X relative to Y?" |

A plausible implication is that the template-and-distractor design of the QA pipeline yields systematically diverse and challenging spatial reasoning scenarios.
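
For concreteness, a hypothetical QA record for the Spatial Relation task might be serialized as follows; the field names and values are illustrative only, and the released format may differ:

```python
qa_sample = {
    "scene_id": "scannetpp_example_scene",   # hypothetical identifier
    "frames": ["frame_0012.jpg", "frame_0187.jpg", "frame_0254.jpg"],
    "query_view": 1,                          # index of the randomly chosen query view
    "task": "spatial_relation",
    "question": "Where is the lamp relative to the sofa?",
    "choices": ["left of", "right of", "behind", "in front of"],  # distractors: opposite/orthogonal relations
    "answer": "left of",
    "visibility_scenario": "partially_visible",
    "coverage": "partial",
}
```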

3. Dataset Composition and Benchmarking

SpatialMosaic comprises approximately 8 million images in the training split, supporting 2 million QA pairs for training and 1 million for evaluation via SpatialMosaic-Bench. Each QA sample is constructed from 2–5 sparsely overlapping views, enabling the study of fragmented visual cues and partial coverage. The evaluation benchmark annotates each QA with visibility scenario (fully vs. partially visible targets) and ground-truth coverage (full vs. partial category instance coverage).

Evaluation metrics reported include:

  • Accuracy: Correct responses for multiple choice and binary tasks.
  • F1 score (optional): token-level score for binary tasks; not reported, as there is no significant class imbalance.
  • Spatial consistency: Comparison of predicted spatial relations to ground-truth 3D box-derived relations.

Per-task distribution in the training split is approximately uniform at ~333,000 QA pairs per task.
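
A minimal sketch of how per-task and per-scenario accuracy could be tabulated from prediction records; the record fields are assumptions mirroring the annotations described above, not a specification of the benchmark's evaluation code:

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """records: iterable of dicts with 'prediction', 'answer', and grouping fields."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        correct[group] += int(r["prediction"] == r["answer"])
    return {group: correct[group] / total[group] for group in total}

# e.g. accuracy_by_group(results, "task") or accuracy_by_group(results, "visibility_scenario")
```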

4. SpatialMosaicVLM Architecture

SpatialMosaicVLM is a hybrid framework integrating two parallel per-view encoders:

  • Visual encoder E_{vis}: a CLIP ViT that outputs patch tokens F_{vis}^{(v)} \in \mathbb{R}^{T_{vis} \times d}.
  • Geometry encoder E_{geo}: a VGGT (2025) reconstruction transformer that outputs multi-view spatial tokens F_{spa} and camera tokens z, combined into F_{geo} \in \mathbb{R}^{(T_{spa}+V) \times d}.

These representations are fused via cross-attention:

F_{fuse} = \mathrm{softmax}\!\left(\frac{(F_{vis} W_q)(F_{geo} W_k)^{\top}}{\sqrt{d_k}}\right) (F_{geo} W_v)

where W_q, W_k, W_v are learnable projection matrices. The fused tokens are projected by a two-layer MLP, concatenated with question tokens, and fed to an LLM (e.g., LLaVA-NeXT). The primary training objective is QA cross-entropy loss; encoders are frozen, and no explicit geometry consistency loss is applied.
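
A minimal PyTorch sketch of the fusion step (single-head, batch dimension omitted); the dimensions and module layout are assumptions consistent with the description above, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse CLIP patch tokens (queries) with VGGT geometry tokens (keys/values)."""

    def __init__(self, d: int, d_k: int):
        super().__init__()
        self.W_q = nn.Linear(d, d_k, bias=False)
        self.W_k = nn.Linear(d, d_k, bias=False)
        self.W_v = nn.Linear(d, d_k, bias=False)
        # stand-in for the two-layer MLP that projects fused tokens into the LLM token space
        self.proj = nn.Sequential(nn.Linear(d_k, d_k), nn.GELU(), nn.Linear(d_k, d_k))

    def forward(self, F_vis: torch.Tensor, F_geo: torch.Tensor) -> torch.Tensor:
        # F_vis: (T_vis, d) visual tokens; F_geo: (T_spa + V, d) geometry + camera tokens
        q, k, v = self.W_q(F_vis), self.W_k(F_geo), self.W_v(F_geo)
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (T_vis, T_spa + V)
        F_fuse = attn @ v                                           # (T_vis, d_k)
        return self.proj(F_fuse)
```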

5. Experimental Results and Performance Analysis

SpatialMosaicVLM demonstrates strong spatial reasoning under challenging multi-view and occlusion scenarios. On SpatialMosaic-Bench:

  • Best open-source VLM baseline (LLaVA-NeXT-Video-7B): 47.8% average accuracy.
  • VLM-3R* fine-tuned: 49.3% average.
  • VLM-3R (re-implemented) fine-tuned: 81.7% average.
  • SpatialMosaicVLM (7B) fine-tuned: 81.8% average, a gain of ≈34 percentage points over the best zero-shot baseline.

Zero-shot transfer is evaluated on VSTI-Bench, where SpatialMosaicVLM achieves 46.8% versus 44.0% for the best baseline, indicating robust generalization to camera-centric and temporal tasks.

Ablation studies reveal:

  • Performance correlates with visibility and overlap metrics: Accuracy degrades smoothly as object occlusion or view overlap increases, verifying these metrics as predictors of difficulty.
  • Under “Partially Visible + Partial Coverage” settings: Baseline models drop >20 points, while SpatialMosaicVLM maintains >65% accuracy on most tasks.
  • Geometry encoder ablation: Using only visual tokens (CLIP) reduces accuracy by ≈12 points, confirming the critical role of explicit multi-view 3D cues.

6. Significance and Implications

SpatialMosaic exemplifies a scalable paradigm for instruction-tuning VLMs in 3D environments with partial observability, sparse context, and strong occlusion. The annotated metrics—object-level occlusion ratio r_{n,obj}, field-of-view occlusion ratio r_{n,FoV}, and multi-view overlap—systematically characterize scene complexity and enable precise control of spatial reasoning difficulty. The hybrid architecture confirms that fine-grained geometry tokens, fused with visual representations, substantially raise the robustness and generalization of VLMs for spatial tasks. This suggests promising avenues for studying VLMs in real-world partial-observability regimes and deploying them in robotics, autonomous systems, and interactive agents facing occlusions and fragmented views (Lee et al., 29 Dec 2025).
