SpatialMosaic: Multi-View 3D Reasoning
- SpatialMosaic is a large-scale dataset and benchmark designed to evaluate multi-view spatial reasoning with fragmented, low-overlap images under challenging occlusion and partial visibility conditions.
- It employs a detailed annotation pipeline using ScanNet++ RGB-D data to generate millions of QA pairs by enforcing stringent spatial and visibility constraints.
- Experimental results show that SpatialMosaicVLM reaches 81.8% average accuracy by effectively integrating 3D geometric cues with vision-language models in complex real-world scenarios.
SpatialMosaic refers both to a general class of methodologies for integrating multiple spatially referenced images, signals, or sensor measurements (which may only partially overlap) into a unified representation, and to several specific frameworks or datasets that operationalize and benchmark multi-view spatial reasoning and image stitching. This entry focuses on the SpatialMosaic dataset and benchmark for vision-language models and its role in spatial reasoning under partial visibility, and situates it within broader spatial-mosaicking paradigms in computer vision, remote sensing, and computational imaging.
1. Core Concepts and Definition
SpatialMosaic is a large-scale, instruction-tuning dataset and benchmark designed to probe multi-view 3D spatial reasoning under challenging real-world conditions: partial visibility, inter-object occlusion, and low camera-view overlap. It consists of 2 million multi-view question-answer pairs for training and a 1 million–QA benchmark (SpatialMosaic-Bench) for evaluation, all based on rich RGB-D scans and 3D object annotations. Each sample comprises 2 to 5 views, a structured natural language spatial reasoning question, and multiple-choice or binary answer options. The defining characteristic is the forced integration of fragmented visual cues across views with intentionally low mutual coverage (maximum pairwise overlap <30%) and pervasive occlusion or truncation (Lee et al., 29 Dec 2025).
2. Dataset Construction and Annotation Pipeline
SpatialMosaic's generation pipeline leverages ScanNet++ RGB-D data and per-instance 3D meshes to automate frame selection, per-object visibility and occlusion metric computation, and spatial relationship annotation. The pipeline operates in three major stages:
- Data Preparation:
- For each 3D object instance and camera view, the object depth map $D_{\text{obj}}$ is compared against the scene depth map $D_{\text{scene}}$ to partition the object's projected points into occluded ($P_{\text{occ}}$) and visible ($P_{\text{vis}}$) sets. The object-occlusion ratio is $r_{\text{occ}} = \frac{|P_{\text{occ}}|}{|P_{\text{occ}}| + |P_{\text{vis}}|}$ (a minimal sketch of these computations appears after this list).
- Field-of-view truncation assesses how much of the object lies outside the current camera's view, using an extended intrinsics matrix over an enlarged canvas. With $P_{\text{trunc}}$ denoting the truncated points, the FoV-occlusion ratio is $r_{\text{fov}} = \frac{|P_{\text{trunc}}|}{|P_{\text{all}}|}$, where $P_{\text{all}}$ is the full set of projected object points.
- QA Generation:
- Multi-view frame sets of 2–5 views are sampled while enforcing low overlap: the maximum pairwise view overlap within each set is kept below 30%.
- QA templates are filled using automatically derived visibility, occlusion, and spatial relations, such as oriented bounding-box–based discrete X/Y/Z relations.
- Each QA is labeled into one of six tasks: Object Count, Best-View Selection, Object Localization, Occlusion-Aware Existence, Occlusion-Aware Attribute, and Occlusion-Aware Spatial Relation. Both binary and 4-way multiple-choice answer formats are included.
- Dataset Balancing and Output:
- After curation, ~3 million raw QA pairs are pruned and balanced to yield 2M training and 1M eval samples, annotated for both coverage and visibility scenarios.
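A minimal Python sketch of the two core quantities above (object-occlusion ratio and low-overlap frame sampling) is shown below. The function names, the depth tolerance `depth_tol`, and the rejection-sampling loop are illustrative assumptions, not the released pipeline code.

```python
import numpy as np
from itertools import combinations

def occlusion_ratio(obj_depth, scene_depth, valid_mask, depth_tol=0.02):
    """Return the object-occlusion ratio r_occ for one view.

    obj_depth:   depth of the object rendered alone, shape (H, W)
    scene_depth: depth of the full scene, shape (H, W)
    valid_mask:  boolean mask of pixels where the object projects, shape (H, W)
    """
    obj_d = obj_depth[valid_mask]
    scene_d = scene_depth[valid_mask]
    # A projected object point counts as occluded when something in the scene
    # lies in front of it (scene depth noticeably smaller than object depth).
    occluded = scene_d < obj_d - depth_tol
    n_occ, n_vis = occluded.sum(), (~occluded).sum()
    return n_occ / max(n_occ + n_vis, 1)

def sample_low_overlap_frames(frames, pairwise_overlap, k, max_overlap=0.3, rng=None):
    """Sample k frames whose pairwise view overlap stays below max_overlap
    (the <30% constraint described above).

    pairwise_overlap: dict mapping a sorted (frame_i, frame_j) tuple to overlap in [0, 1]
    """
    rng = rng or np.random.default_rng()
    for _ in range(1000):  # simple rejection sampling
        idx = rng.choice(len(frames), size=k, replace=False)
        cand = [frames[i] for i in idx]
        if all(pairwise_overlap[tuple(sorted(pair))] < max_overlap
               for pair in combinations(cand, 2)):
            return cand
    return None  # no valid low-overlap set found
```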
3. Task Design and Benchmark Structure
SpatialMosaic-Bench defines six spatial reasoning tasks with diagnostic visibility and coverage codes:
| Task Name | Output Type | Reasoning Challenge |
|---|---|---|
| Object Count | 4-way MCA | Multi-view aggregation with partial counts |
| Best-View Selection | 4-way MCA | Visibility/area maximization |
| Object Localization | 4-way MCA | Occlusion-robust 2D detection |
| Occlusion-Aware Existence | Binary | Depth ordering under occlusion |
| Occlusion-Aware Attribute | 4-way MCA | Attribute reasoning under occlusion |
| Occlusion-Aware Spatial Relation | 4-way MCA | Left/right/front/behind under fragmented views |
In each, models must reason across non-redundant low-overlap views with varying object-level occlusion and coverage. Questions and distractors are structurally generated to require spatial integration, for example, "How many chairs are visible across these four frames?" or "In Frame 4, where does the electric piano appear relative to the hand sanitizer?" (Lee et al., 29 Dec 2025).
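For concreteness, a single benchmark sample might look like the following Python dictionary; the field names and values here are purely illustrative assumptions, not the released schema.

```python
# Illustrative only: field names and values are hypothetical, not the released schema.
example_qa = {
    "scene_id": "scannetpp_scene_0001",          # hypothetical scene identifier
    "task": "Occlusion-Aware Spatial Relation",  # one of the six task types
    "views": ["frame_0012.jpg", "frame_0154.jpg",
              "frame_0307.jpg", "frame_0441.jpg"],  # 2-5 low-overlap frames
    "question": "In Frame 4, where does the electric piano appear "
                "relative to the hand sanitizer?",
    "options": ["left", "right", "in front of", "behind"],  # 4-way MCA
    "answer": "behind",
    # Diagnostic annotations for visibility/coverage analysis
    "max_pairwise_overlap": 0.22,     # enforced to be < 0.3
    "object_occlusion_ratio": 0.41,
}
```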
4. Vision-LLM Architecture: SpatialMosaicVLM
SpatialMosaicVLM is a hybrid vision-LLM architecture tailored to the dataset:
- Input: a set of 2–5 RGB images and a question.
- Visual Encoder: a CLIP-ViT backbone producing per-view patch tokens.
- Geometry Encoder: a pretrained 3D-reconstruction transformer that outputs per-view spatial tokens and camera tokens, which form the geometric input to the fusion module.
- Cross-Attention Fusion: visual patch tokens attend to the geometry tokens via cross-attention; the fused features pass through a 2-layer MLP, are concatenated with the question tokens, and are fed to a frozen LLM (e.g., LLaVA-NeXT). A minimal sketch of this fusion module follows this list.
- An auxiliary coordinate-transformation loss can be added for grounding, but the primary supervision signal is the VQA answer objective.
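The PyTorch sketch below illustrates the fusion step described above, assuming the patch and geometry tokens share a common embedding dimension; the class name, dimensions, residual connection, and layer choices are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class GeometryFusion(nn.Module):
    """Minimal sketch of the cross-attention fusion described above:
    visual patch tokens attend to geometry (spatial + camera) tokens,
    then pass through a 2-layer MLP before being handed to the LLM.
    Dimensions and layer choices are illustrative assumptions."""

    def __init__(self, d_model=1024, n_heads=8, d_llm=4096):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # 2-layer MLP projecting fused features into the LLM embedding space
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_llm)
        )

    def forward(self, patch_tokens, geom_tokens):
        # patch_tokens: (B, N_patches, d_model) from the CLIP-ViT encoder
        # geom_tokens:  (B, N_geom, d_model) spatial + camera tokens
        fused, _ = self.cross_attn(query=patch_tokens,
                                   key=geom_tokens, value=geom_tokens)
        fused = self.norm(patch_tokens + fused)  # residual connection (assumed)
        return self.mlp(fused)  # ready to concatenate with question tokens
```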
This hybridization enables explicit modeling of cross-view geometric consistency, outperforming pure 2D approaches under occlusion and partial-visibility constraints (Lee et al., 29 Dec 2025).
5. Experimental Results and Insights
Empirical results on SpatialMosaic-Bench show the importance of 3D geometry fusion and dataset scale. Notable findings:
- Quantitative Benchmarking:
- SpatialMosaicVLM (7B) achieves 81.8% average accuracy, outperforming all open-source baselines (e.g., LLaVA, VILA, and LongVA, which remain below 50%).
- Ablation: removing the geometry encoder reduces accuracy by 6.6 percentage points; training on the full dataset yields a +20.5 point gain over training on a 10% subset.
- Removing partial visibility cases or low-overlap sampling degrades hard-case performance (up to −18% for occlusion scenarios).
- Zero-Shot Transfer:
- Models pretrained on SpatialMosaic show 46.8% average accuracy on temporal reasoning tasks (e.g., camera-object distance or movement direction) even without further tuning, outperforming larger LLM baselines, suggesting robust spatial priors are induced.
- Qualitative Behavioral Differences:
- SpatialMosaicVLM successfully counts visible objects without over-counting occluded ones and resolves spatial relations under ambiguous occlusions, whereas 2D-only models often default to statistically common or spurious answers.
6. Significance in the Broader Spatial Reasoning and Mosaicking Landscape
SpatialMosaic operationalizes a rigorous, scalable approach to benchmarking spatial reasoning under real-world viewing constraints not addressed by prior datasets that rely on fully visible objects or densely overlapping view coverage. Unlike datasets based on precomputed 3D reconstructions, SpatialMosaic's annotation and task generation directly encode partial visibility, occlusion, and low overlap, mirroring practical multi-camera and robotic scenarios.
The hybrid VLM approach in SpatialMosaicVLM advances beyond simple visual aggregation by leveraging geometric tokens from pretrained 3D reconstruction backbones, imparting consistent spatial reasoning capabilities that standard vision-LLMs lack. This scaffolds the development and assessment of multi-view, 3D-aware LLMs for embodied AI, robotics, and spatially grounded VQA (Lee et al., 29 Dec 2025).
7. Practical Access, Usage, and Fine-Tuning Recommendations
SpatialMosaic and SpatialMosaic-Bench, along with the SpatialMosaicVLM model and ScanNet++–derived multi-view inputs, are released publicly. Recommended fine-tuning and usage parameters include:
- GPU configuration: 8×H200, batch size 4, cosine learning-rate schedule, ZeRO Stage 2 optimization; five training epochs take approximately 16 hours on the full 2M-QA dataset (see the configuration sketch after this list).
- During fine-tuning, it is recommended to freeze both the visual and geometry encoders and to train only the fusion module and LLM heads.
- Inference takes 2–5 RGB frames and a question and returns a selected answer option or generated text; average latency is 0.8 s/sample on A100 hardware. The default input structure, diagnostic annotations, and preprocessed resources are available at the provided repository (Lee et al., 29 Dec 2025).
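The reported setup can be summarized in a configuration sketch like the one below; the key names follow common DeepSpeed/Hugging Face conventions and are assumptions, not an officially released configuration file.

```python
# Hypothetical fine-tuning configuration mirroring the reported setup;
# key names are assumptions, not an official SpatialMosaic release.
train_config = {
    "hardware": {"gpus": 8, "gpu_type": "H200"},
    "per_device_batch_size": 4,
    "epochs": 5,                    # ~16 hours on the full 2M-QA dataset
    "lr_scheduler": "cosine",       # learning-rate value not specified here
    "deepspeed": {"zero_optimization": {"stage": 2}},
    # Freeze both encoders; train only the fusion module and LLM heads.
    "frozen_modules": ["visual_encoder", "geometry_encoder"],
    "trainable_modules": ["fusion", "llm_head"],
    "inference": {"num_views": (2, 5), "avg_latency_s": 0.8},  # latency on A100
}
```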
By integrating a scalable low-overlap multi-view QA pipeline, richly annotated spatial challenges, and an explicit 3D geometry–aware modeling strategy, SpatialMosaic sets a new standard for evaluating and advancing spatial reasoning in vision-LLMs under realistic, fragmented-view conditions.