SpatialThinker: A 3D Multimodal Reasoning Framework

Updated 16 November 2025
  • SpatialThinker is a multimodal framework that integrates structured scene graph grounding with chain-of-thought reasoning for 3D spatial question answering.
  • It utilizes fine-grained scene graphs and dense spatial rewards to simulate human-like visual perception and logical spatial deductions.
  • The architecture employs reinforcement learning and scene subgraph extraction, achieving significant improvements in both 2D and 3D VQA benchmarks.

SpatialThinker is a framework for multimodal 3D spatial reasoning that combines structured spatial grounding and multi-step chain-of-thought reasoning within a multimodal large language model (MLLM), leveraging fine-grained scene graphs, dense spatial rewards, and online reinforcement learning. The architecture is designed to simulate human-like scene perception, explicitly represent spatial relationships, and progressively reason toward high-fidelity answers in visual question answering (VQA) tasks involving both 2D and 3D relations (Batra et al., 10 Nov 2025).

1. Model Architecture: Scene Graph Grounding and Multimodal Fusion

SpatialThinker adapts Qwen2.5-VL-3B and Qwen2.5-VL-7B backbones equipped with a patch-based ViT-style visual encoder and an autoregressive text decoder. The model operates directly on RGB images, constructing Visual Genome-style scene graphs $G = (V, E)$ as a semantic backbone:

  • Nodes $v_i$: object category labels $\ell_i$, 2D bounding boxes $b_i = (x_1, y_1, x_2, y_2)$.
  • Edges $e_{ij}$: subject–predicate–object tuples encoding spatial relations (e.g., “near”, “above”, “behind”).
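
This node/edge structure can be captured in a few lines; the sketch below is illustrative only, and the class and field names are hypothetical rather than taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Node:
    label: str                               # object category label, e.g. "cup"
    box: tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class Edge:
    subj: int   # index of the subject node in the node list
    pred: str   # spatial predicate, e.g. "near", "above", "behind"
    obj: int    # index of the object node in the node list

@dataclass
class SceneGraph:
    nodes: list[Node]   # V
    edges: list[Edge]   # E
```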

Question-centric scene subgraphs $G_q$ are extracted via lemmatized token matching, providing the minimal context necessary for each query (a sketch of this extraction follows the list below). The model’s input sequence consists of:

  • <observe>: Signal to extract visual features from image input.
  • <scene>: Explicit JSON serialization of $G_q$ (object locations/labels, relationships).
  • <think>: Chain-of-thought prompt for stepwise, logical reasoning.
  • <answer>: Final answer token sequence.
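
A minimal sketch of the question-centric subgraph extraction described above, reusing the hypothetical SceneGraph classes from the earlier sketch (the lemmatizer here is a crude stand-in for a real one such as WordNet or spaCy):

```python
def extract_subgraph(graph: SceneGraph, question: str) -> SceneGraph:
    """Keep only nodes/edges whose labels match question tokens."""
    def lemma(w: str) -> str:
        # Crude stand-in for real lemmatization: lowercase, strip plural "s".
        return w.lower().removesuffix("s")

    q_lemmas = {lemma(tok) for tok in question.split()}
    keep = sorted(i for i, n in enumerate(graph.nodes) if lemma(n.label) in q_lemmas)
    remap = {old: new for new, old in enumerate(keep)}   # re-index surviving nodes
    nodes = [graph.nodes[i] for i in keep]
    edges = [Edge(remap[e.subj], e.pred, remap[e.obj])
             for e in graph.edges if e.subj in remap and e.obj in remap]
    return SceneGraph(nodes, edges)
```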

Image patches and graph tokens are projected into a fused feature space for transformer-based cross-attention. This enables multi-object grounding and spatial relation modeling prior to the reasoning step.
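
The fusion step can be pictured as graph tokens cross-attending over projected image patches. The following PyTorch sketch is illustrative only; all dimensions and module names are assumptions, not the paper’s implementation:

```python
import torch
import torch.nn as nn

class GraphImageFusion(nn.Module):
    """Toy fusion block: graph tokens attend over image patch features."""
    def __init__(self, patch_dim=768, graph_dim=128, d_model=256, n_heads=4):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)   # project ViT patch features
        self.graph_proj = nn.Linear(graph_dim, d_model)   # project graph tokens
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patches, graph_tokens):
        # patches: (B, P, patch_dim); graph_tokens: (B, G, graph_dim)
        kv = self.patch_proj(patches)
        q = self.graph_proj(graph_tokens)
        fused, _ = self.cross_attn(q, kv, kv)   # queries: graph; keys/values: image
        return fused                            # (B, G, d_model), grounded graph tokens
```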

2. Dataset Construction: STVQA-7K Synthesis Pipeline

The STVQA-7K dataset underpins spatial reward supervision:

  • QA Generation: Given human-annotated scene graphs (VG150, extended predicates), Claude Sonnet 4 generates multiple-choice spatial questions, candidate answer options, and ground-truth answers across nine spatial categories (relations, size, orientation, distance, depth, reach, location, count, existence).
  • Predicate Augmentation: The 50 original predicates are extended with 34 additions such as “near,” “beneath,” and “facing_away.”
  • Difficulty Filtering: From 56K generated samples, the top 10K are selected by rated difficulty and by agreement of label predictions (two “blind” GPT-4o checks, pass@2).
  • Scene Graph Alignment: Each sample is pruned to retain only query-relevant nodes and edges. Bounding boxes are kept in pixel coordinates for scale fidelity.
| Step | Input | Output |
|---|---|---|
| Synthetic QA | Scene graph, prompt | Q/A/difficulty/labels |
| External verify | Claude, GPT-4o responses | Accept/reject |
| Postprocess graph | QA, full graph | Subgraph $G_q$ |
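
The verification step in the table can be read as a simple accept/reject rule. Below is a sketch, assuming the two blind verifier predictions are already available as strings; whether pass@2 requires both checks or only one to agree is a paper-level detail, so the threshold is parameterized:

```python
def accept_sample(gt_label: str, blind_preds: list[str], min_agree: int = 2) -> bool:
    """Agreement filter: keep a QA sample only if enough 'blind'
    verifier predictions match the ground-truth label."""
    def norm(s: str) -> str:
        return s.strip().lower()
    hits = sum(norm(p) == norm(gt_label) for p in blind_preds)
    return hits >= min_agree
```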

3. Reinforcement Learning Setup: Multi-Objective Dense Reward

SpatialThinker employs Group-Relative Policy Optimization (GRPO) with a multi-component reward structure:

  • Format Reward ($R_{fmt}$): Enforces strict tagging and JSON validity ($w_{fmt} = 0.1$).
  • Count Reward ($R_{cnt}$): Penalizes errors in predicted object/relation counts ($w_{cnt} = 0.2$), computed as:

$$R_{cnt} = 0.7 \cdot \max\left(0,\, 1 - \frac{|N_{pred,objs} - N_{gt,objs}|}{N_{gt,objs}}\right) + 0.3 \cdot \max\left(0,\, 1 - \frac{|N_{pred,rels} - N_{gt,rels}|}{N_{gt,rels}}\right)$$

  • Accuracy Reward ($R_{acc}$): Binary correctness of the final answer ($w_{acc} = 0.5$).
  • Spatial Reward ($R_{spa}$): Lexicographically gated (activated only when $R_{acc} = 1$), based on Hungarian-matched object pairs with matching cost $C(i,j) = 1 \cdot (1 - \mathrm{IoU}(b_i, b_j)) + 2 \cdot (1 - \mathrm{sim}(\ell_i, \ell_j))$, with CIoU averaged over matched pairs ($w_{spa} = 0.2$).
  • Reward formula (per trajectory $y$; a sketch of the full computation follows this list):

$$R_{total}(y) = \mathbb{I}[R_{fmt} = 1] \cdot \left( w_{fmt} R_{fmt} + w_{cnt} R_{cnt} + w_{acc} R_{acc} + \mathbb{I}[R_{acc} = 1]\, w_{spa} R_{spa} \right)$$

  • Optimization hyperparameters: PPO clipping ($\varepsilon = 0.2$), KL penalty ($\beta = 10^{-2}$), rollout size ($N = 8$), context window (16,384 tokens), batch size (512), learning rate ($10^{-6}$), weight decay ($10^{-2}$).
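
Putting the reward components together, here is a sketch of the gated computation. The `iou` and `label_sim` helpers are hypothetical stand-ins (and plain IoU stands in for the paper’s CIoU); Hungarian matching uses SciPy’s `linear_sum_assignment`. Nodes reuse the `Node` objects from the Section 1 sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

W_FMT, W_CNT, W_ACC, W_SPA = 0.1, 0.2, 0.5, 0.2   # reward weights from the paper

def count_reward(n_pred_obj, n_gt_obj, n_pred_rel, n_gt_rel):
    obj = max(0.0, 1 - abs(n_pred_obj - n_gt_obj) / n_gt_obj)
    rel = max(0.0, 1 - abs(n_pred_rel - n_gt_rel) / n_gt_rel)
    return 0.7 * obj + 0.3 * rel

def spatial_reward(pred_nodes, gt_nodes, iou, label_sim):
    # Matching cost from the paper: 1*(1 - IoU) + 2*(1 - label similarity).
    cost = np.array([[1 * (1 - iou(p.box, g.box)) + 2 * (1 - label_sim(p.label, g.label))
                      for g in gt_nodes] for p in pred_nodes])
    rows, cols = linear_sum_assignment(cost)          # Hungarian matching
    # Reward: mean overlap over matched pairs (CIoU in the paper).
    return float(np.mean([iou(pred_nodes[r].box, gt_nodes[c].box)
                          for r, c in zip(rows, cols)]))

def total_reward(r_fmt, r_cnt, r_acc, r_spa):
    if r_fmt != 1:                                    # format gate: invalid output earns 0
        return 0.0
    spa = W_SPA * r_spa if r_acc == 1 else 0.0        # lexicographic gate on accuracy
    return W_FMT * r_fmt + W_CNT * r_cnt + W_ACC * r_acc + spa
```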

4. Spatial Reasoning Pipeline and Mechanisms

At inference, the model proceeds through four stages for each question (a sketch of parsing the tagged output follows this list):

  1. <observe>: Extract patch features from the resized image ($512 \times 512$ to $2048 \times 2048$).
  2. <scene>: Predict a question-aligned scene graph, output as JSON.
  3. <think>: Use a chain-of-thought template to assemble visual cues, invoke common-sense geometric priors, and perform logical deductions.
  4. <answer>: Emit a short single-token or span answer.
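
Because the format reward enforces strict tagging, the staged output can be recovered with simple pattern matching. A sketch, under the assumption that each stage uses paired opening/closing tags:

```python
import re

def parse_response(text: str) -> dict:
    """Split a model response into its four tagged stages.
    Returns an empty dict if any stage is missing, which would
    correspond to a zero format reward."""
    sections = {}
    for tag in ("observe", "scene", "think", "answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if not m:
            return {}
        sections[tag] = m.group(1).strip()
    return sections
```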

Cross-attention ensures that the reasoning chain is grounded in the spatial configuration of the parsed scene graph, with iterative attention over objects and predicates. Dense rewards reinforce that outputs maintain grounding, quantitative correctness, and spatial coherence.

5. Quantitative Results and Comparative Performance

SpatialThinker-7B demonstrated strong improvements over prior models and baselines:

  • CV-Bench (2D+3D avg accuracy): 78.2% (SpatialThinker-7B, RL) vs. 68.6% (Qwen2.5-VL-7B, base) vs. 79.4% (GPT-4o).
  • 3DSRBench (Orientation, 3D relations): 56.4% vs. GPT-4o (44.3%).
  • BLINK: 79.3% vs. GPT-4o (80.4%).
  • General VQA: 71.2% zero-shot across six real-world and six spatial benchmarks vs. GPT-4o (67.8%) and Claude (61.1%).
  • Ablation (STVQA-7K val): format+accuracy reward only: 74.9%; with fully gated count/spatial rewards: 76.3%; with the filtered dataset: 87.9%.
  • OOD generalization: +7.2% over base (spatial), +5.2% (real-world), outperforming vanilla RL and SFT variants.

6. Limitations, Ablations, and Guidance for Future SpatialThinker Design

Observed bottlenecks:

  • Exact grounding relies on correct scene graph extraction; visual ambiguity can propagate errors downstream.
  • Reward hacking surfaces when the count/spatial components lack gating; making their activation conditional on format and accuracy mitigates collapse.
  • Lexicographic gating and robust data filtering are critical to prevent policy drift and exploitability in RL settings.

Guidance for future extensions:

  • Maintain explicit multimodal fusion via scene graph tokens, processed jointly with image patches for all reasoning steps.
  • Gate spatial reward activation tightly on correct answer predictions.
  • Scale the data synthesis pipeline with broader predicate sets and additional hard negatives.
  • Adopt flexible, context-dependent subgraph representations to ensure composability with diverse spatial queries.
  • Integrate dynamic region proposals or fine-grained spatial attention for improved resolution sensitivity.

7. Significance and Current Impact

SpatialThinker represents a state-of-the-art paradigm for 3D-aware visual question answering in MLLMs, pairing RL-aligned reasoning with explicit scene grounding. Its dense spatial reward structure and use of scene graphs as a multimodal substrate distinguish it from prior art relying solely on text or coarse bounding-box cues (Batra et al., 10 Nov 2025). Performance gains over supervised and sparse-RL baselines on both in-domain and out-of-domain spatial VQA attest to the efficacy of this reward-driven approach. The framework’s modular prompt and reward-gating elements suggest a design blueprint for next-generation spatial reasoning agents compatible with limited-scale data regimes and task-adaptive RL.
