SpatialMosaicVLM Architecture

Updated 25 May 2026

The paper introduces SpatialMosaicVLM, which fuses visual and geometric tokens via cross-attention for enhanced multi-view spatial reasoning.
It employs a transformer-based geometric reconstructor (VGGT) alongside a frozen CLIP encoder and an LLM to efficiently integrate explicit 3D priors.
Evaluations on the SpatialMosaic benchmark demonstrate its scalability and robustness in handling partial visibility, occlusion, and low-overlap conditions.

SpatialMosaicVLM is a hybrid Vision-Language Model (VLM) architecture designed for robust spatial reasoning from multi-view images, with a strong emphasis on partial visibility, occlusion, and fragmented visual cues. It combines a transformer-based geometric reconstructor, VGGT (Visual Geometry Grounded Transformer), with a frozen CLIP image encoder and an LLM, and uses a cross-attention-based token fusion mechanism to integrate explicit 3D priors for scene understanding. It is introduced in the context of the SpatialMosaic dataset and benchmark, offering a scalable, multi-view instruction-tuning framework for multimodal reasoning in challenging and realistic 3D scenarios (Lee et al., 29 Dec 2025).

1. Model Pipeline and Dataflow

The SpatialMosaicVLM pipeline processes $V$ multi-view RGB images of an indoor scene ($V = 2$ to $5$, each $518{\times}518$ px). Geometric and visual features are extracted in parallel:

Visual encoding: Every input image $I_v$ is normalized, patch-embedded into $32{\times}32=1024$ patches (each $16{\times}16$ px), and encoded by a frozen CLIP ViT-B/16 to yield $F_\text{vis}^{(v)} \in \mathbb{R}^{1024 \times 768}$. All views are concatenated into $F_\text{vis} \in \mathbb{R}^{T_\text{vis} \times d}$, with $T_\text{vis} = 1024V$ and $d = 768$.
Geometric encoding: The frozen VGGT module processes the $V$ views jointly, producing a set of spatial tokens $F_\text{spa} \in \mathbb{R}^{T_\text{spa} \times d}$ and $V$ camera-specific tokens $F_\text{cam} \in \mathbb{R}^{V \times d}$, forming $F_\text{geo} = [F_\text{spa}; F_\text{cam}] \in \mathbb{R}^{(T_\text{spa}+V) \times d}$.
Cross-modal fusion: A single multi-head cross-attention layer (8 heads, head dimension 96) projects $F_\text{vis}$ (queries) and $F_\text{geo}$ (keys/values) with learned projections $W_Q, W_K, W_V$; the per-head output is

$F_\text{fused} = \mathrm{softmax}\!\left(\frac{(F_\text{vis} W_Q)(F_\text{geo} W_K)^\top}{\sqrt{d_k}}\right) F_\text{geo} W_V$

where $d_k = 96$ is the head dimension. Softmax is taken over the geometric-token axis.

Projection and LLM integration: $F_\text{fused}$ passes through a two-layer MLP with GELU activation, producing projected tokens $F_\text{proj}$. The text question is tokenized and embedded via the LLM’s tokenizer, yielding $F_\text{txt}$. The sequence $[F_\text{proj}; F_\text{txt}]$ is fed into a (frozen or lightly tuned) autoregressive LLM decoder (typically the 7B-parameter LLaVA-Next-Video model).
Output: The LLM outputs either a free-form string or logits for multiple-choice QA (typically 4-way or binary).
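
The fusion step in the list above can be sketched as a single attention head in which visual tokens act as queries and geometric tokens act as keys and values. This is a minimal, dependency-free illustration with toy sizes, not the authors' implementation; the function names (`matmul`, `cross_attention_head`, `random_matrix`) and the random weights are assumptions made for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_head(f_vis, f_geo, w_q, w_k, w_v):
    """One head: every visual token attends over all geometric tokens."""
    q = matmul(f_vis, w_q)                 # (T_vis, d_k)
    k = matmul(f_geo, w_k)                 # (T_geo, d_k)
    v = matmul(f_geo, w_v)                 # (T_geo, d_k)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]   # (d_k, T_geo)
    scores = matmul(q, k_t)                # (T_vis, T_geo)
    # Softmax runs along the geometric-token axis (one distribution per visual token).
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for encoder outputs and learned projections."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T_VIS, T_GEO, D, D_K = 6, 4, 8, 4          # toy sizes; the text uses d = 768, d_k = 96
f_vis = random_matrix(T_VIS, D, rng)
f_geo = random_matrix(T_GEO, D, rng)
w_q, w_k, w_v = (random_matrix(D, D_K, rng) for _ in range(3))

fused, weights = cross_attention_head(f_vis, f_geo, w_q, w_k, w_v)
print(len(fused), len(fused[0]))           # one fused row per visual token
```

Because the queries come from the visual stream, the fused output keeps one row per visual patch regardless of how many geometric tokens are supplied.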

A concise module and data-flow mapping is summarized below:

Module	Input	Output
Pre-process	raw RGB views $\{I_v\}_{v=1}^{V}$	normalized patch images
E_vis (CLIP)	patch images	visual tokens $F_\text{vis}$
E_geo (VGGT)	patch images	geometric tokens $F_\text{geo}$ (spatial + camera)
Cross-Attn	$F_\text{vis}$ (queries), $F_\text{geo}$ (keys/values)	fused tokens $F_\text{fused}$
MLP Projector	$F_\text{fused}$	projected tokens $F_\text{proj}$
Token Concat	$F_\text{proj}$, question embeddings $F_\text{txt}$	LLM input tokens
LLM Decoder	tokens	answer logits

This configuration ensures efficient injection of geometric priors into every visual token before language-based reasoning.
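
As a rough check on the data flow, the token counts implied by the stated figures can be computed directly. The sketch below assumes that the LLM receives one projected token per visual patch (which follows from the visual tokens being the attention queries) and uses a hypothetical question length of 32 tokens; the function names are illustrative only.

```python
PATCH_GRID = 32            # 32 x 32 patches per view, as stated in the text
HEADS, HEAD_DIM = 8, 96    # cross-attention configuration from the text

def visual_token_count(num_views: int) -> int:
    """Number of visual tokens after concatenating all views."""
    if not 2 <= num_views <= 5:
        raise ValueError("the text describes between 2 and 5 views")
    return num_views * PATCH_GRID * PATCH_GRID

def llm_sequence_length(num_views: int, num_text_tokens: int) -> int:
    """Projected visual tokens followed by the question tokens."""
    return visual_token_count(num_views) + num_text_tokens

model_width = HEADS * HEAD_DIM   # matches the 768-dimensional CLIP embedding

for v in range(2, 6):
    # num_text_tokens=32 is a hypothetical question length
    print(v, visual_token_count(v), llm_sequence_length(v, num_text_tokens=32))
```

The visual prefix grows linearly with the number of views, so the five-view case places 5120 visual tokens in front of the question.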

2. Geometry Encoder: VGGT

The geometry encoder module is instantiated as VGGT (Visual Geometry Grounded Transformer), a vision transformer tailored for multi-view 3D reconstruction:

Architecture: VGGT consists of 12 alternating layers implementing (i) self-attention among all patch tokens, (ii) cross-view epipolar attention for geometric correspondence, and (iii) MLP feed-forward blocks.
Multi-view aggregation: Explicit epipolar constraints are maintained: patches in view $i$ only attend to geometrically corresponding locations in view $j$ along predicted epipolar lines.
Tokenization: At the final layer, VGGT outputs spatial “point” tokens $F_\text{spa}$ representing aggregated representations in 3D, as well as $V$ camera tokens $F_\text{cam}$.
Output interface: Output tokens $F_\text{geo} = [F_\text{spa}; F_\text{cam}]$ are consumed during cross-modal attention fusion.

VGGT is always used in frozen form, leveraging its pre-trained geometric priors while keeping downstream adaptation lightweight.
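
One way an epipolar attention constraint of the kind described above could be expressed is as a boolean mask built from a fundamental matrix: a patch in one view may attend only to patches in the other view whose centers lie near its epipolar line. The sketch below is a generic two-view geometry illustration, not VGGT's actual code; the function names, the distance threshold, and the rectified-pair fundamental matrix are all assumptions chosen for the example.

```python
import math

def epipolar_line(fundamental, point):
    """Line coefficients (a, b, c) of l = F x for the pixel (u, v) in the query view."""
    u, v = point
    x = (u, v, 1.0)
    return tuple(sum(f * xi for f, xi in zip(row, x)) for row in fundamental)

def distance_to_line(line, point):
    """Perpendicular pixel distance from a point to the line a*u + b*v + c = 0."""
    a, b, c = line
    u, v = point
    return abs(a * u + b * v + c) / math.hypot(a, b)

def epipolar_mask(fundamental, query_centers, key_centers, max_dist):
    """mask[i][j] is True when key patch j lies within max_dist pixels
    of the epipolar line induced by query patch i."""
    mask = []
    for q in query_centers:
        line = epipolar_line(fundamental, q)
        mask.append([distance_to_line(line, k) <= max_dist for k in key_centers])
    return mask

# Assumed example: a rectified pair (pure horizontal translation),
# for which epipolar lines coincide with image rows.
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]

# Centers of a 2 x 2 grid of 16-pixel patches, used for both views.
centers = [(8.0, 8.0), (24.0, 8.0), (8.0, 24.0), (24.0, 24.0)]
mask = epipolar_mask(F_RECTIFIED, centers, centers, max_dist=8.0)
for row in mask:
    print(row)
```

In this rectified example each patch is allowed to attend only to patches in the same image row of the other view; with a general fundamental matrix the allowed band follows a slanted line instead.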

3. Visual-Language Fusion and Downstream Reasoning

SpatialMosaicVLM integrates its geometric and visual representations via cross-attention token fusion, followed by instruction-driven reasoning:

Visual backbone: CLIP ViT-B/16 (frozen) provides a $768$-dimensional patchwise embedding for each image.
Language backbone: LLaVA-Next-Video (7B parameters), deployed in frozen or adapter-tuned form, receives a mixed sequence of projected fused visual tokens and question embeddings.
Cross-attention fusion layer: This layer enables every visual patch token to explicitly aggregate geometry priors from all geometric tokens, with learned projections ($W_Q$, $W_K$, $W_V$) and softmax along the geometry axis.
MLP projection: The output of the fusion ($F_\text{fused}$) passes through a two-layer MLP with GELU activation before being concatenated with text tokens.
LLM integration: The sequence of projected visual tokens followed by question embeddings, $[F_\text{proj}; F_\text{txt}]$, is processed by the LLM, yielding answer logits for QA tasks.

This design supports robust spatial reasoning, especially under partial visibility, occlusion, and low-overlap conditions, outperforming prior VLMs constrained by explicit 3D reconstructions or fragmented off-the-shelf pipelines (Lee et al., 29 Dec 2025).
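
The projection and concatenation steps can be illustrated with a small two-layer MLP using the exact GELU, followed by appending the question embeddings. This is a sketch under stated assumptions: the layer sizes, the random weights, and the helper names (`linear`, `mlp_projector`, `rand`) are hypothetical, and the real hidden width is not taken from the text.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(tokens, weight, bias):
    """Apply y = x W + b to every token; weight has shape (in_dim, out_dim)."""
    return [[sum(x * w for x, w in zip(tok, col)) + b
             for col, b in zip(zip(*weight), bias)]
            for tok in tokens]

def mlp_projector(tokens, w1, b1, w2, b2):
    """Two-layer MLP: linear, GELU, linear."""
    hidden = [[gelu(h) for h in row] for row in linear(tokens, w1, b1)]
    return linear(hidden, w2, b2)

def rand(rows, cols, rng):
    """Hypothetical random matrix standing in for learned weights or tokens."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
FUSED_DIM, HIDDEN_DIM, LLM_DIM = 4, 6, 3     # toy sizes, not the model's
fused_tokens = rand(5, FUSED_DIM, rng)       # 5 fused visual tokens
text_tokens = rand(2, LLM_DIM, rng)          # 2 question-token embeddings

w1, b1 = rand(FUSED_DIM, HIDDEN_DIM, rng), [0.0] * HIDDEN_DIM
w2, b2 = rand(HIDDEN_DIM, LLM_DIM, rng), [0.0] * LLM_DIM

projected = mlp_projector(fused_tokens, w1, b1, w2, b2)
llm_input = projected + text_tokens          # visual tokens first, then the question
print(len(llm_input), len(llm_input[0]))
```

The projector's only structural requirement is that its output width equals the LLM embedding width, so that visual and text tokens can share one sequence.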

4. Training Procedures and Objectives

Training is limited to a lightweight set of modules:

Frozen components: Both the visual encoder ($E_\text{vis}$, CLIP) and geometry encoder ($E_\text{geo}$, VGGT) remain frozen.
Trainable parameters: Only the cross-attention projections, two-layer MLP projector, and LLM adapters (if used) are updated, totaling approx. 100 million parameters.
Optimization:
- Loss: Standard categorical cross-entropy over the answer logits,
$\mathcal{L} = -\sum_{c=1}^{C} y_c \log p_c$

where $C$ is the number of choices (4 for multiple choice, 2 for binary), $y_c$ is the one-hot ground-truth label, and $p_c$ is the softmax probability assigned to choice $c$.
- No auxiliary loss for geometry or contrast is used: geometric priors are enforced exclusively by the frozen encoder.
Training setup:
- Batch size = 4 per GPU on 8 × NVIDIA H200.
- Optimizer: AdamW with zero weight decay, DeepSpeed ZeRO Stage 2.
- Learning rate schedule: cosine decay over 5 epochs.
- Dataset: Up to 2 million QA pairs (full run), 200K QA pairs for prototyping.

This minimal-supervision scheme validates that robust multi-view spatial reasoning can be achieved through explicit 3D geometry fusion with lightweight instruction tuning (Lee et al., 29 Dec 2025).
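
The categorical cross-entropy objective described in this section reduces, for a single question, to the negative log-probability of the correct choice. The following sketch computes it with the log-sum-exp trick; the example logits are made up for illustration and the function name is an assumption.

```python
import math

def cross_entropy(logits, target_index):
    """Return -log softmax(logits)[target_index], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target_index]

# An uninformed model assigns equal logits to every choice.
uniform_4way = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)   # equals log 4
uniform_binary = cross_entropy([0.0, 0.0], target_index=0)           # equals log 2

# A model that strongly prefers the correct choice gets a much lower loss.
confident = cross_entropy([5.0, 0.0, 0.0, 0.0], target_index=0)

print(uniform_4way, uniform_binary, confident)
```

The uniform cases give the chance-level losses, $\log 4$ for 4-way questions and $\log 2$ for binary ones, which are useful reference points when reading training curves.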

5. Implementation Architecture and Hyperparameters

SpatialMosaicVLM’s configuration is as follows:

Input image: $518{\times}518$ resolution, patch size $16{\times}16$, yielding $1024$ patches per view.
Visual embedding dimension: $d = 768$.
VGGT outputs: $T_\text{spa}$ spatial tokens plus $V$ camera tokens; $F_\text{geo}$ thus has shape $(T_\text{spa}+V) \times d$.
Transformer layers: CLIP ViT-B/16 and VGGT both use 12 layers (frozen); the cross-attention fusion block is a single-layer transformer with 8 heads.
LLM: 7B parameter LLaVA-Next-Video.
Trainable parameter count: $\approx$100M for all cross-modal and fusion modules.
Training hardware: 8 × NVIDIA H200 GPUs, batch size 4 per GPU.
Learning rate schedule: cosine decay.
Data: Up to 2M training QA pairs for final runs.

This architecture allows efficient scaling for large instruction-tuned datasets and supports a wide range of scene complexities and visibility conditions.
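
The training figures listed above imply a concrete step budget, which the sketch below works out together with a plain cosine decay. Several things here are assumptions rather than facts from the source: `base_lr` is a hypothetical placeholder, and the sketch assumes no gradient accumulation, no warmup, and decay to zero at the end of training.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    num_qa_pairs: int = 2_000_000   # full run, from the text
    per_gpu_batch: int = 4          # from the text
    num_gpus: int = 8               # from the text
    epochs: int = 5                 # from the text
    base_lr: float = 1e-5           # HYPOTHETICAL placeholder value

    @property
    def global_batch(self) -> int:
        # Assumes no gradient accumulation.
        return self.per_gpu_batch * self.num_gpus

    @property
    def total_steps(self) -> int:
        steps_per_epoch = math.ceil(self.num_qa_pairs / self.global_batch)
        return self.epochs * steps_per_epoch

def cosine_lr(step: int, cfg: TrainConfig) -> float:
    """Cosine decay from base_lr at step 0 to zero at the final step."""
    progress = min(step, cfg.total_steps) / cfg.total_steps
    return 0.5 * cfg.base_lr * (1.0 + math.cos(math.pi * progress))

cfg = TrainConfig()
print(cfg.global_batch, cfg.total_steps)
for step in (0, cfg.total_steps // 2, cfg.total_steps):
    print(step, cosine_lr(step, cfg))
```

Under these assumptions the full run uses a global batch of 32 and 312,500 optimizer steps, with the learning rate reaching half its starting value at the midpoint.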

6. Capabilities, Evaluation, and Significance

SpatialMosaicVLM demonstrates key capabilities for 3D spatial reasoning:

Multi-view and occlusion robustness: By integrating explicit geometry via VGGT, the system maintains strong performance in situations of heavy occlusion, low overlap, and fragmented visual evidence.
Benchmarks: The associated SpatialMosaic-Bench covers six spatial reasoning tasks, each constructed to evaluate challenging partial visibility and occlusion scenarios. Results indicate that the architecture, trained with the SpatialMosaic dataset, outperforms baselines lacking explicit 3D priors (Lee et al., 29 Dec 2025).
Data flow and task generality: The modular pipeline supports straightforward extension to additional views or different types of instruction-driven QA, owing to its token-based interface and frozen backbones.
Instruction-tuning advances: The framework leverages the scale and diversity of the SpatialMosaic dataset (2M QA pairs), enabling comprehensive evaluation of multi-view VLMs in realistic environments.

A plausible implication is that explicitly fusing geometric tokens from a transformer-based reconstructor with visual and language modalities establishes a scalable paradigm for spatial visual reasoning, with direct application to robotics, autonomous navigation, and embodied question answering.

For further implementation details and benchmarking results, see "SpatialMosaic: A Multiview VLM Dataset for Partial Visibility" (Lee et al., 29 Dec 2025).

References (1)

SpatialMosaic: A Multiview VLM Dataset for Partial Visibility (2025)
