SpatialThinker-7B: Advanced Spatial Reasoning

Updated 16 November 2025
  • SpatialThinker-7B is a multimodal model that integrates high-resolution visual inputs with autoregressive language modeling using reinforcement learning and scene-graph grounding.
  • It employs a novel RL framework with lexicographically gated dense spatial rewards to ensure structured observation, scene analysis, and precise answer generation.
  • The model demonstrates robust performance on both synthetic spatial VQA benchmarks and real-world tasks, outperforming prior supervised and RL baselines.

SpatialThinker-7B is a multimodal LLM (MLLM) specifically engineered for advanced spatial reasoning and visual question answering (VQA) using dense spatial rewards and structured scene-graph grounding. Built on the Qwen2.5-VL-7B backbone, it unifies high-resolution visual inputs with autoregressive language modeling, reinforcing 3D-aware reasoning through an online reinforcement learning (RL) framework and a novel data synthesis pipeline. SpatialThinker-7B demonstrates enhanced spatial understanding and generalizes robustly to real-world and abstract VQA tasks, outperforming both supervised and prior RL baselines as well as established proprietary models, all with markedly data-efficient training.

1. Model Architecture and Reasoning Paradigm

SpatialThinker-7B employs the Qwen2.5-VL-7B vision-language backbone, wherein a high-resolution vision encoder (processing RGB images from 512×512 up to 2048×2048 pixels) pairs with a 7-billion-parameter autoregressive transformer language decoder. All parameters, including visual and language components, are updated during RL.

The model operationalizes reasoning as a sequential policy $\pi_\theta$ over token sequences $y = (s_1, \ldots, s_T, a)$, given image $X_\mathrm{img}$ and text $X_\mathrm{text}$:

$$\pi_\theta(y \mid X_\mathrm{img}, X_\mathrm{text}) = \prod_{t=1}^{T} \pi_\theta(s_t \mid X_\mathrm{img}, X_\mathrm{text}, s_{<t}) \cdot \pi_\theta(a \mid X_\mathrm{img}, X_\mathrm{text}, s_{\le T})$$

The reasoning process is externally structured into four stages, each marked by explicit tokens:

  • <observe>: Natural-language scene description localizing regions of interest
  • <scene>: JSON-encoded subgraph $G_q$ representing objects (with IDs, labels, bounding boxes) and spatial relations (triplets) relevant to the query
  • <think>: Chain-of-thought step detailing hypothesis and deduction based on $G_q$ and the visual context
  • <answer>: Discrete multiple-choice selection

Cross-attention layers fuse vision-derived features with text embeddings; all scene-graph content is passed as explicit JSON within the text stream and treated as regular tokens by the transformer. A hypothetical example of the format appears below.
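The sketch below illustrates the four-stage response format. The tag order follows the description above, but the concrete JSON field names and object IDs are assumptions for illustration; the exact schema used in training may differ.

```python
# Hypothetical example of a SpatialThinker-style structured response.
# The JSON schema inside <scene> is an assumption based on the article's
# description (object IDs, labels, bounding boxes, relation triplets).
EXAMPLE_RESPONSE = """\
<observe>A mug sits on the left edge of a wooden desk, next to a laptop.</observe>
<scene>{"objects": [
  {"id": 0, "label": "mug", "bbox": [112, 340, 198, 452]},
  {"id": 1, "label": "laptop", "bbox": [240, 300, 620, 470]}],
 "relations": [[0, "left_of", 1]]}</scene>
<think>The mug's bounding box lies entirely to the left of the laptop's,
so the mug is to the left of the laptop.</think>
<answer>B</answer>"""
```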

2. STVQA-7K Data Synthesis and Quality Control

SpatialThinker-7B's training leverages STVQA-7K, a curated spatial VQA dataset generated via a semi-automated pipeline:

  • Base graphs and predicates derive from human-annotated Visual Genome (VG150). The predicate set is extended with 34 new spatial predicates (e.g., near/far, taller/shorter, inside/beneath, facing_away).
  • Claude Sonnet 4 is employed to synthesize multiple-choice spatial VQA pairs across nine categories, covering spatial relations, depth ordering, distance, size, orientation, containment, reach/interactions, region location, counting, and existence.
  • Each sample features four answer choices (A–D) and a uniform label distribution.
  • Every instance is rated by Claude Sonnet; from approximately 56,000 generated pairs, the top 10,000 by rating are retained.
  • Filtering is further strengthened by a GPT-4o pass@2 consistency check: a question is kept only if at least one of two GPT-4o answers matches the synthetic label.
  • The final dataset has 7,587 QA pairs, split into 6,895 for training and 692 for validation.
  • Scene-graph adaptation for each QA example uses lemmatized keyword extraction to form question-focused subgraphs $G_q$ (see the sketch after this list), keeping bounding boxes in absolute pixel format to support full scale-aware supervision through CIoU.
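A minimal sketch of question-focused subgraph extraction, under stated assumptions: a naive suffix-stripping lemmatizer stands in for whatever lemmatization the actual pipeline uses, and scene graphs follow the hypothetical JSON structure shown earlier.

```python
# Hypothetical sketch of question-focused subgraph (G_q) extraction.
# Assumes scene graphs of the form:
# {"objects": [{"id": ..., "label": ..., "bbox": [...]}, ...],
#  "relations": [[src_id, predicate, dst_id], ...]}

def naive_lemma(word: str) -> str:
    """Crude lemmatizer: lowercase and strip common plural suffixes.
    The real pipeline presumably uses a proper NLP lemmatizer."""
    w = word.lower().strip("?,.")
    for suffix in ("es", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

def extract_subgraph(question: str, scene_graph: dict) -> dict:
    """Keep only objects whose lemmatized label appears in the question,
    plus relations whose endpoints both survive the filter."""
    q_lemmas = {naive_lemma(tok) for tok in question.split()}
    objects = [o for o in scene_graph["objects"]
               if naive_lemma(o["label"]) in q_lemmas]
    kept_ids = {o["id"] for o in objects}
    relations = [r for r in scene_graph["relations"]
                 if r[0] in kept_ids and r[2] in kept_ids]
    return {"objects": objects, "relations": relations}
```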

3. Reinforcement Learning with Lexicographically Gated Dense Spatial Rewards

The SpatialThinker-7B RL protocol is governed by a lexicographically gated, multi-objective reward composition:

$$R_{\rm total} = \mathbf{I}[R_{\rm fmt}=1]\,\Bigl(w_{\rm fmt}\,R_{\rm fmt} + w_{\rm cnt}\,R_{\rm cnt} + w_{\rm acc}\,R_{\rm acc} + \mathbf{I}[R_{\rm acc}=1]\;w_{\rm spa}\,R_{\rm spa}\Bigr)$$

with $w_{\rm fmt}=0.1$, $w_{\rm cnt}=0.2$, $w_{\rm acc}=0.5$, $w_{\rm spa}=0.2$. The indicator function $\mathbf{I}$ enforces that spatial rewards are only considered once the prior format and accuracy conditions are met.
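This gating transcribes directly into code. The sketch below assumes the four component rewards have already been computed, with the weights given above:

```python
def total_reward(r_fmt: float, r_cnt: float, r_acc: float, r_spa: float,
                 w_fmt: float = 0.1, w_cnt: float = 0.2,
                 w_acc: float = 0.5, w_spa: float = 0.2) -> float:
    """Lexicographically gated reward: everything is zeroed unless the
    format reward is 1, and the spatial term only counts when the
    answer is exactly correct (r_acc == 1)."""
    if r_fmt != 1.0:   # outer gate: malformed output earns nothing
        return 0.0
    reward = w_fmt * r_fmt + w_cnt * r_cnt + w_acc * r_acc
    if r_acc == 1.0:   # inner gate: spatial reward requires a correct answer
        reward += w_spa * r_spa
    return reward
```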

Reward Components:

  • Format ($R_{\rm fmt} \in \{0,1\}$): Ensures the ordered presence of <observe>, <scene>, <think>, and <answer> tags, and that the scene JSON is parseable.
  • Count ($R_{\rm cnt} \in [0,1]$): Penalizes discrepancies between the numbers of predicted and ground-truth objects and relations, using a weighted normalized error (object weight $A_{\rm obj}=0.7$, relation weight $A_{\rm rel}=0.3$).
  • Accuracy ($R_{\rm acc} \in \{0,1\}$): Binary, based on exact match with the correct answer.
  • Spatial ($R_{\rm spa} \in [0,1]$): Applied only if $R_{\rm acc}=1$. Predicted objects are matched to ground-truth objects via the Hungarian algorithm with the composite cost

$$C(o_i, o_j) = A_{\rm spa}\,(1-\mathrm{IoU}(b_i, b_j)) + A_{\rm sem}\,(1-\mathrm{sim}(\ell_i, \ell_j))$$

($A_{\rm spa}=1.0$, $A_{\rm sem}=2.0$), and matched pairs are scored via average Complete IoU (CIoU), which incorporates area overlap, center distance, and aspect-ratio alignment. A code sketch of this step follows.
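The sketch below implements this matching-and-scoring step with SciPy's Hungarian solver. Two assumptions are made: the label-similarity function is a stand-in exact-match test, since the article does not specify how $\mathrm{sim}(\ell_i, \ell_j)$ is computed, and clipping the mean CIoU into [0, 1] is likewise assumed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(b1, b2):
    """Standard IoU for boxes in (x1, y1, x2, y2) pixel format."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter + 1e-9)

def ciou(b1, b2):
    """Complete IoU: IoU minus center-distance and aspect-ratio penalties."""
    i = iou(b1, b2)
    # Squared distance between box centers.
    cx1, cy1 = (b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2
    cx2, cy2 = (b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    # Squared diagonal of the smallest enclosing box.
    ex1, ey1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    ex2, ey2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    # Aspect-ratio consistency term.
    w1, h1 = b1[2] - b1[0], b1[3] - b1[1]
    w2, h2 = b2[2] - b2[0], b2[3] - b2[1]
    v = (4 / np.pi ** 2) * (np.arctan(w2 / h2) - np.arctan(w1 / h1)) ** 2
    alpha = v / (1 - i + v + 1e-9)
    return i - rho2 / c2 - alpha * v

def sim(label_a: str, label_b: str) -> float:
    """Placeholder label similarity (assumption: exact match)."""
    return 1.0 if label_a == label_b else 0.0

def spatial_reward(pred, gt, a_spa=1.0, a_sem=2.0):
    """Hungarian-match predicted to ground-truth objects on the composite
    cost, then score matched pairs by mean CIoU, clipped to [0, 1]."""
    if not pred or not gt:
        return 0.0
    cost = np.array([[a_spa * (1 - iou(p["bbox"], g["bbox"]))
                      + a_sem * (1 - sim(p["label"], g["label"]))
                      for g in gt] for p in pred])
    rows, cols = linear_sum_assignment(cost)
    scores = [ciou(pred[r]["bbox"], gt[c]["bbox"]) for r, c in zip(rows, cols)]
    return float(np.clip(np.mean(scores), 0.0, 1.0))
```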

Policy Optimization:

Group-Relative Policy Optimization (GRPO) is used: for each query, $N=8$ rollouts are sampled and scored, normalized advantages $A^{(i)}$ are computed, and a PPO-style clipped loss (KL penalty coefficient $1\times10^{-2}$) is applied.
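A minimal sketch of the GRPO update under stated assumptions: the clipping range (0.2) is a common PPO default not given in the article, and the KL penalty is taken against a frozen reference policy.

```python
import torch

def grpo_advantages(rewards: list[float]) -> torch.Tensor:
    """Group-relative advantages: normalize the N rollout rewards for
    one query by their group mean and standard deviation."""
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_loss(logp_new, logp_old, advantages, ref_kl,
              clip_eps: float = 0.2, kl_coef: float = 1e-2):
    """PPO-style clipped surrogate with a KL penalty toward a reference
    policy. logp_* are summed token log-probs per rollout."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -(surrogate - kl_coef * ref_kl).mean()
```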

4. Scene Graph Representation and Encoding Strategy

Objects $v_i \in V_q$ are represented by a label $\ell_i$ and bounding box $b_i=(x_1, y_1, x_2, y_2)$; relations $e_{ij} \in E_q$ are triplets of the form $(v_i, r_{ij}, v_j)$, with $r_{ij}$ drawn from the extended spatial predicate set.

  • Scene graphs are not modeled by a separate GNN; rather, they are serialized as JSON and processed as part of the language input stream, as sketched below.
  • The text encoder integrates object IDs, label tokens, numerical bbox tokens, and relation names via the standard transformer pipeline.
  • Region-of-interest (RoI) filtering constrains graphs and rewards to question-relevant subgraphs, avoiding reward dilution and overfitting to incidental scene elements.
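A sketch of this serialization, reusing the hypothetical JSON schema from the response example above (field names are assumptions, not the confirmed training schema):

```python
import json
from dataclasses import dataclass

@dataclass
class SceneObject:
    id: int
    label: str
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2), absolute pixels

def serialize_scene(objects: list[SceneObject],
                    relations: list[tuple[int, str, int]]) -> str:
    """Serialize a subgraph G_q as JSON text; the model consumes this
    string as ordinary tokens, with no separate graph encoder."""
    return json.dumps({
        "objects": [{"id": o.id, "label": o.label, "bbox": list(o.bbox)}
                    for o in objects],
        "relations": [list(r) for r in relations],
    })
```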

5. Training Protocols, Baselines, and Evaluation Benchmarks

Training is conducted from the base model (Qwen2.5-VL-7B) with no prior supervised fine-tuning on STVQA-7K. Reinforcement learning proceeds for approximately 15 hours on 4×NVIDIA H100 GPUs, using AdamW (learning rate $1 \times 10^{-6}$, weight decay $1 \times 10^{-2}$), BF16 precision, and a context length of 16,384 tokens.
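These optimizer settings translate directly to a standard PyTorch configuration. The sketch below uses a stand-in module for the policy and omits the RL loop itself:

```python
import torch

# Stand-in for the loaded Qwen2.5-VL-7B policy; all parameters trainable.
model = torch.nn.Linear(8, 8)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-6,            # reported learning rate
    weight_decay=1e-2,  # reported weight decay
)
# Training runs in bfloat16; autocast is one common way to apply it.
autocast = torch.autocast(device_type="cuda", dtype=torch.bfloat16)
```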

Comparative Baselines:

  • Supervised fine-tuning (SFT) with LoRA (learning rate $1 \times 10^{-4}$, 3 epochs)
  • Vanilla GRPO using only format and accuracy rewards (both weighted 0.5)
  • Proprietary models: GPT-4o, Claude 3.5 Sonnet
  • Open-source generalist MLLMs: Qwen2.5-VL, LLaVA-NeXT, Cambrian-1, VLAA-Thinker
  • Open-source spatial MLLMs: SpaceLLaVA, SpatialRGPT, RoboPoint, SpaceThinker, SpaceOm, SpatialReasoner, SpatialBot, Visionary-R1, SATORI-R1

Evaluation Benchmarks:

  • Six spatial: CV-Bench (2D/3D), BLINK (spatial, relative depth), 3DSRBench, MMVP, SpatialBench, SpatialReasonerEval
  • Six real-world/general VQA: MM-Star, VStarBench, RealWorldQA, MME-RealWorld-Lite, RoboSpatial-Home (configuration/compatibility), HallusionBench

6. Empirical Performance and Ablation Analyses

SpatialThinker-7B demonstrates substantial improvements on both spatial and general VQA metrics:

| Benchmark | SFT | Vanilla GRPO | GPT-4o | SpatialThinker-7B |
|---|---|---|---|---|
| CV-Bench (2D/3D) | 70.0% | 72.7% | 79.4% | 78.2% |
| 3DSRBench | – | – | 44.3% | 56.4% |
| BLINK (avg) | – | – | 80.4% | 79.3% |
| Real-World VQA (avg) | 65.8% | 65.2% | 66.2% | 69.7% |
| 12-Benchmark Mean | 64.0% | 67.8% | – | 71.2% |

  • On 3DSRBench, SpatialThinker-7B attains 56.4%, exceeding GPT-4o by +12.1 percentage points.
  • Across the six spatial benchmarks, it consistently outperforms all open-source spatial MLLMs despite using only about 7,000 training samples, orders of magnitude fewer than competitors trained on millions of examples or with explicit RGB-D input.
  • Real-world VQA metrics indicate robustness and grounding; gains are notable on MM-Star, RoboSpatial-Home, and VStarBench.
  • Dense spatial rewards nearly double the RL improvement (+7.2% versus +4.0% for sparse rewards).

Ablation studies:

  • Naïve addition of count and spatial rewards produces reward hacking (23.7%).
  • Lexicographic gating and RoI filtering restore reward utility (76.3%).
  • Final filtering yields an STVQA-7K validation accuracy of 87.9%.
  • KL regularization at coefficient 0.01 proves optimal (73.7% on CV-Bench with the 3B variant), outperforming both no-KL and $\chi^2$-constraint alternatives.

7. Insights, Contributions, and Limitations

SpatialThinker-7B's principal innovation is the integration of explicit scene-graph grounding with lexicographically gated dense rewards, operationalizing a human-like observe → localize → think → answer pipeline. This structured reward approach notably surpasses sparse RL and SFT while maintaining high data efficiency.

Further, the approach demonstrates robust generalization to both in-domain (synthetic spatial VQA) and out-of-domain (real-world visual and abstract reasoning) benchmarks, with competitive performance relative to leading proprietary and open-source models.

Identified limitations:

  • Requires explicit scene-graph generation and bounding-box labels.
  • Does not yet support implicit spatial reasoning in latent representations; the explicit JSON scene graphs cannot be omitted.
  • Extensions to spatiotemporal vision (e.g., video, navigation) and unified multi-objective policies remain open directions.

References to architecture diagrams and reward dynamics are provided in Figures 1 and 6–7 of the source publication.
