SpatialMosaicVLM Architecture

Updated 25 May 2026
  • The paper introduces SpatialMosaicVLM, which fuses visual and geometric tokens via cross-attention for enhanced multi-view spatial reasoning.
  • It employs a transformer-based geometric reconstructor (VGGT) alongside a frozen CLIP encoder and an LLM to efficiently integrate explicit 3D priors.
  • Evaluations on the SpatialMosaic benchmark demonstrate its scalability and robustness in handling partial visibility, occlusion, and low-overlap conditions.

SpatialMosaicVLM is a hybrid Vision-Language Model (VLM) architecture designed for robust spatial reasoning from multi-view images, with a strong emphasis on partial visibility, occlusion, and fragmented visual cues. It combines a transformer-based geometric reconstructor, VGGT (Visual Geometry Grounded Transformer), with a frozen CLIP image encoder and an LLM, and uses a cross-attention-based token fusion mechanism to integrate explicit 3D priors for scene understanding. It is introduced alongside the SpatialMosaic dataset and benchmark, which provide a scalable multi-view instruction-tuning framework for multimodal reasoning in challenging, realistic 3D scenarios (Lee et al., 29 Dec 2025).

1. Model Pipeline and Dataflow

The SpatialMosaicVLM pipeline processes $V$ multi-view RGB images of an indoor scene ($V = 2$ to $5$, each $518 \times 518$ px). Geometric and visual features are extracted in parallel:

  • Visual encoding: Every input image $I_v$ is normalized, patch-embedded into $32 \times 32 = 1024$ patches (each $16 \times 16$ px), and encoded by a frozen CLIP ViT-B/16 to yield $F_\text{vis}^{(v)} \in \mathbb{R}^{1024 \times 768}$. All views are concatenated into $F_\text{vis} \in \mathbb{R}^{T_\text{vis} \times d}$.
  • Geometric encoding: The frozen VGGT module processes all $V$ views jointly, producing a set of spatial tokens $F_\text{spa} \in \mathbb{R}^{T_\text{spa} \times d}$ and a set of camera-specific tokens. Together these form the geometric token sequence $F_\text{geo}$.
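  • Cross-modal fusion: A single multi-head cross-attention layer (8 heads, head dimension 96) projects the visual tokens $F_\text{vis}$ (queries) and the geometric tokens $F_\text{geo}$ (keys/values) with learned projections $W_Q$, $W_K$, $W_V$; per head, the output is

$$F_\text{fuse} = \operatorname{softmax}\!\left(\frac{(F_\text{vis} W_Q)(F_\text{geo} W_K)^\top}{\sqrt{d_h}}\right) F_\text{geo} W_V$$

where $d_h = 96$ is the head dimension. Softmax is taken over the geometric-token axis.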

  • Projection and LLM integration: The fused tokens pass through a two-layer MLP with GELU activation, producing projected visual tokens. The text question is tokenized and embedded via the LLM’s tokenizer, yielding text embeddings. The concatenated sequence of projected visual tokens and text embeddings is fed into a (frozen or lightly tuned) autoregressive LLM decoder (typically the 7B-parameter LLaVA-Next-Video model).
  • Output: The LLM outputs either a free-form string or logits for multiple-choice QA (typically 4-way or binary).
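The fusion step above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cross_attention_fuse`, the random weights, and the geometric token count `t_geo = 64` are placeholders, and no output projection or bias is included because the source mentions only query, key, and value projections.

```python
import numpy as np

def cross_attention_fuse(f_vis, f_geo, w_q, w_k, w_v, num_heads=8):
    """Visual tokens (queries) attend over geometric tokens (keys/values)."""
    t_vis, d = f_vis.shape
    t_geo = f_geo.shape[0]
    d_h = d // num_heads  # 768 / 8 = 96, the head dimension stated in the text

    # Project, then split the channel axis into heads: (heads, tokens, d_h).
    q = (f_vis @ w_q).reshape(t_vis, num_heads, d_h).transpose(1, 0, 2)
    k = (f_geo @ w_k).reshape(t_geo, num_heads, d_h).transpose(1, 0, 2)
    v = (f_geo @ w_v).reshape(t_geo, num_heads, d_h).transpose(1, 0, 2)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)   # (heads, t_vis, t_geo)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over the geometric axis

    fused = (attn @ v).transpose(1, 0, 2).reshape(t_vis, d)  # merge heads
    return fused, attn

rng = np.random.default_rng(0)
d, t_vis, t_geo = 768, 2 * 1024, 64  # two views; t_geo is a placeholder value
f_vis = rng.standard_normal((t_vis, d))
f_geo = rng.standard_normal((t_geo, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = cross_attention_fuse(f_vis, f_geo, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Each visual token keeps its position in the sequence, so the fused output has the same shape as the visual input; only its content is enriched with geometric information.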

A concise module and data-flow mapping is summarized below:

| Module | Input | Output |
|---|---|---|
| Pre-process | raw RGB | normalized patch images |
| E_vis (CLIP) | patch images | visual tokens $F_\text{vis}$ |
| E_geo (VGGT) | patch images | geometric tokens $F_\text{geo}$ |
| Cross-Attn | $F_\text{vis}$, $F_\text{geo}$ | fused tokens |
| MLP Projector | fused tokens | projected visual tokens |
| Token Concat | projected visual tokens, text embeddings | LLM input tokens |
| LLM Decoder | tokens | answer logits |

This configuration ensures efficient injection of geometric priors into every visual token before language-based reasoning.
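The visual token counts implied by the stated image and patch sizes can be checked with a few lines of arithmetic. The sketch below assumes that the visual sequence length is simply the number of views times the patches per view, and that the patch grid is obtained by floor division; since 518 is not a multiple of 16, a 32 × 32 grid covers 512 px per side, and how the remaining 6 px are handled is not stated in the source. The function name `visual_token_shapes` is illustrative.

```python
IMAGE_SIZE = 518   # px per side, from the text
PATCH_SIZE = 16    # px per side, from the text
EMBED_DIM = 768    # CLIP ViT-B/16 token width, from the text

def visual_token_shapes(num_views):
    """Return the token shapes implied by the stated image and patch sizes."""
    patches_per_side = IMAGE_SIZE // PATCH_SIZE               # 518 // 16 = 32
    leftover_px = IMAGE_SIZE - patches_per_side * PATCH_SIZE  # pixels outside the grid
    patches_per_view = patches_per_side ** 2                  # 32 * 32 = 1024
    return {
        "patches_per_side": patches_per_side,
        "leftover_px": leftover_px,
        "per_view": (patches_per_view, EMBED_DIM),
        "all_views": (num_views * patches_per_view, EMBED_DIM),
    }

for num_views in range(2, 6):
    print(num_views, visual_token_shapes(num_views)["all_views"])
```

Under these assumptions the visual sequence grows linearly from 2048 tokens at two views to 5120 tokens at five views.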

2. Geometry Encoder: VGGT

The geometry encoder module is instantiated as VGGT (Visual Geometry Grounded Transformer), a vision transformer tailored for multi-view 3D reconstruction:

  • Architecture: VGGT consists of 12 alternating layers implementing (i) self-attention among all patch tokens, (ii) cross-view epipolar attention for geometric correspondence, and (iii) MLP feed-forward blocks.
  • Multi-view aggregation: Explicit epipolar constraints are maintained: patches in view $i$ attend only to geometrically corresponding locations in view $j$ along predicted epipolar lines.
  • Tokenization: At the final layer, VGGT outputs spatial “point” tokens $F_\text{spa}$, which encode aggregated 3D structure, as well as camera tokens.
  • Output interface: The output tokens are consumed during cross-modal attention fusion.

VGGT is always used in frozen form, leveraging its pre-trained geometric priors while keeping downstream adaptation lightweight.
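To make the epipolar constraint described above concrete, the following sketch builds a boolean attention mask from a fundamental matrix. It is an illustration only: the source says the epipolar lines are predicted but does not say how, so the matrix `F` is supplied by hand here, and the function name `epipolar_mask`, the pixel threshold, and the 4 × 4 patch grid are all hypothetical.

```python
import numpy as np

def epipolar_mask(F, centers_i, centers_j, max_dist_px):
    """mask[a, b] is True when patch b of view j lies within max_dist_px
    of the epipolar line induced by patch a of view i."""
    x_i = np.hstack([centers_i, np.ones((centers_i.shape[0], 1))])  # homogeneous (N_i, 3)
    x_j = np.hstack([centers_j, np.ones((centers_j.shape[0], 1))])  # homogeneous (N_j, 3)
    lines = x_i @ F.T                                   # row a holds the line l_a = F x_a
    norm = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = np.abs(lines @ x_j.T) / norm                 # point-to-line distance (N_i, N_j)
    return dist <= max_dist_px

# Toy setup: a 4x4 grid of 16 px patches and a purely horizontal camera shift,
# for which every epipolar line is the image row of the source patch.
coords = np.arange(4) * 16 + 8.0
u, v = np.meshgrid(coords, coords)
centers = np.stack([u.ravel(), v.ravel()], axis=1)      # (16, 2) as (x, y)
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])

mask = epipolar_mask(F, centers, centers, max_dist_px=8.0)
print(mask.shape, mask.sum(axis=1))
```

In an attention layer, such a mask would typically be applied by setting the scores of disallowed pairs to negative infinity before the softmax, for example with `np.where(mask, scores, -np.inf)`.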

3. Visual-Language Fusion and Downstream Reasoning

SpatialMosaicVLM integrates its geometric and visual representations via cross-attention token fusion, followed by instruction-driven reasoning:

  • Visual backbone: CLIP ViT-B/16 (frozen) provides 768-dimensional patch embeddings for each image.
  • Language backbone: LLaVA-Next-Video (7B parameters), deployed in frozen or adapter-tuned form, receives a mixed sequence of projected fused visual tokens and question embeddings.
  • Cross-attention fusion layer: This layer enables every visual patch token to explicitly aggregate geometry priors from all geometric tokens, with learned projections ($W_Q$, $W_K$, $W_V$) and softmax along the geometry axis.
  • MLP projection: The output of the fusion passes through a two-layer MLP (GELU activation) before being concatenated with text tokens.
  • LLM integration: The concatenated sequence of projected visual tokens and text embeddings is processed by the LLM, yielding answer logits for QA tasks.

This design supports robust spatial reasoning, especially under partial visibility, occlusion, and low-overlap conditions, outperforming prior VLMs constrained by explicit 3D reconstructions or fragmented off-the-shelf pipelines (Lee et al., 29 Dec 2025).
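The projection and concatenation steps can be sketched as below. The hidden width of 1024 and the LLM embedding width of 4096 are assumed values (the source's figures for these did not survive), the tanh approximation of GELU is one common variant, and placing visual tokens before text tokens is likewise an assumption. The function name `project_and_concat` is illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project_and_concat(fused, text_emb, w1, b1, w2, b2):
    """Two-layer MLP on fused visual tokens, then concatenation with text embeddings."""
    hidden = gelu(fused @ w1 + b1)
    visual_for_llm = hidden @ w2 + b2          # now in the LLM embedding width
    return np.concatenate([visual_for_llm, text_emb], axis=0)

rng = np.random.default_rng(0)
d_vis, d_hidden, d_llm = 768, 1024, 4096       # d_hidden and d_llm are assumed values
fused = rng.standard_normal((16, d_vis))       # 16 fused visual tokens (toy size)
text_emb = rng.standard_normal((5, d_llm))     # 5 question-token embeddings (toy size)
w1 = rng.standard_normal((d_vis, d_hidden)) * 0.02
b1 = np.zeros(d_hidden)
w2 = rng.standard_normal((d_hidden, d_llm)) * 0.02
b2 = np.zeros(d_llm)

llm_input = project_and_concat(fused, text_emb, w1, b1, w2, b2)
print(llm_input.shape)
```

The projector's only structural job is to map the 768-wide fused tokens into the LLM's embedding space so that both token types can share one input sequence.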

4. Training Procedures and Objectives

Training is limited to a lightweight set of modules:

  • Frozen components: Both the visual encoder ($E_\text{vis}$, CLIP) and geometry encoder ($E_\text{geo}$, VGGT) remain frozen.
  • Trainable parameters: Only the cross-attention projections, two-layer MLP projector, and LLM adapters (if used) are updated, totaling approx. 100 million parameters.
  • Optimization:
    • Loss: Standard categorical cross-entropy over the answer logits,

    $$\mathcal{L} = -\sum_{c=1}^{C} y_c \log p_c, \qquad p = \operatorname{softmax}(z)$$

    where $z$ are the answer logits, $y$ is the one-hot ground-truth label, and $C$ is the number of choices (4 for multiple choice, 2 for binary).
    • No auxiliary geometric or contrastive loss is used: geometric priors are supplied exclusively by the frozen encoder.

  • Training setup:

    • Batch size = 4 per GPU on 8 × NVIDIA H200.
    • Optimizer: AdamW with zero weight decay, DeepSpeed ZeRO Stage 2.
    • Learning rate schedule: cosine decay over 5 epochs.
    • Dataset: Up to 2 million QA pairs (full run), 200K QA pairs for prototyping.
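The cross-entropy objective listed under Optimization can be written in a few lines of plain Python. This is a generic, numerically stable formulation for a single example, not code from the paper; the function name `cross_entropy` and the example logits are illustrative.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the correct choice under softmax(logits)."""
    m = max(logits)  # subtract the max before exponentiating for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# Uninformative logits give log(C): the loss of a uniform guess.
four_way_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
binary_uniform = cross_entropy([0.0, 0.0], target_index=0)

# A confident, correct prediction gives a loss close to zero.
confident = cross_entropy([8.0, 0.0, 0.0, 0.0], target_index=0)
print(four_way_uniform, binary_uniform, confident)
```

The uniform-guess values, log 4 for four-way questions and log 2 for binary ones, are useful reference points when reading early training curves.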

This minimal-supervision scheme validates that robust multi-view spatial reasoning can be achieved through explicit 3D geometry fusion with lightweight instruction tuning (Lee et al., 29 Dec 2025).

5. Implementation Architecture and Hyperparameters

SpatialMosaicVLM’s configuration is as follows:

  • Input image: $518 \times 518$ resolution, patch size $16 \times 16$, yielding $32 \times 32 = 1024$ patches per view.
  • Visual embedding dimension: $d = 768$.
  • VGGT outputs: $T_\text{spa}$ spatial tokens plus camera tokens, which together form the geometric token sequence.
  • Transformer layers: CLIP ViT-B/16 and VGGT both use 12 layers (frozen); the cross-attention fusion block is a single-layer transformer with 8 heads.
  • LLM: 7B parameter LLaVA-Next-Video.
  • Trainable parameter count: approximately 100M for all cross-modal and fusion modules.
  • Training hardware: 8 × NVIDIA H200 GPUs, batch size 4 per GPU.
  • Learning rate schedule: cosine decay.
  • Data: Up to 2M training QA pairs for final runs.

This architecture allows efficient scaling for large instruction-tuned datasets and supports a wide range of scene complexities and visibility conditions.
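The training budget implied by these settings can be worked out directly, together with a cosine decay schedule. The sketch assumes exactly 2M QA pairs, no gradient accumulation, and no warmup, none of which the source confirms; `PEAK_LR` is a hypothetical placeholder, since the peak learning rate is not given here.

```python
import math

PER_GPU_BATCH = 4          # from the text
NUM_GPUS = 8               # from the text
NUM_QA_PAIRS = 2_000_000   # "up to 2M" in the text; the upper bound is assumed
NUM_EPOCHS = 5             # from the text
PEAK_LR = 1e-4             # hypothetical placeholder, not taken from the source

global_batch = PER_GPU_BATCH * NUM_GPUS                    # 32 samples per step
steps_per_epoch = math.ceil(NUM_QA_PAIRS / global_batch)   # 62,500
total_steps = steps_per_epoch * NUM_EPOCHS                 # 312,500

def cosine_lr(step, total, peak=PEAK_LR, floor=0.0):
    """Cosine decay from peak at step 0 to floor at the final step."""
    progress = min(max(step / total, 0.0), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

print(global_batch, steps_per_epoch, total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps))
```

With gradient accumulation the number of optimizer steps would shrink proportionally, so these step counts are an upper bound for the stated batch configuration.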

6. Capabilities, Evaluation, and Significance

SpatialMosaicVLM demonstrates key capabilities for 3D spatial reasoning:

  • Multi-view and occlusion robustness: By integrating explicit geometry via VGGT, the system maintains strong performance in situations of heavy occlusion, low overlap, and fragmented visual evidence.
  • Benchmarks: The associated SpatialMosaic-Bench covers six spatial reasoning tasks, each constructed to evaluate challenging partial visibility and occlusion scenarios. Results indicate that the architecture, trained with the SpatialMosaic dataset, outperforms baselines lacking explicit 3D priors (Lee et al., 29 Dec 2025).
  • Data flow and task generality: The modular pipeline supports straightforward extension to additional views or different types of instruction-driven QA, owing to its token-based interface and frozen backbones.
  • Instruction-tuning advances: The framework leverages the scale and diversity of the SpatialMosaic dataset (2M QA pairs), enabling comprehensive evaluation of multi-view VLMs in realistic environments.

A plausible implication is that explicitly fusing geometric tokens from a transformer-based reconstructor with visual and language modalities establishes a scalable paradigm for spatial visual reasoning, with direct application to robotics, autonomous navigation, and embodied question answering.


For further implementation details and benchmarking results, see "SpatialMosaic: A Multiview VLM Dataset for Partial Visibility" (Lee et al., 29 Dec 2025).
