SpatialThinker-7B: Advanced Spatial Reasoning
- SpatialThinker-7B is a multimodal model that integrates high-resolution visual inputs with autoregressive language modeling using reinforcement learning and scene-graph grounding.
- It employs a novel RL framework with lexicographically gated dense spatial rewards to ensure structured observation, scene analysis, and precise answer generation.
- The model demonstrates robust performance on both synthetic spatial VQA benchmarks and real-world tasks, outperforming prior supervised and RL baselines.
SpatialThinker-7B is a multimodal LLM (MLLM) specifically engineered for advanced spatial reasoning and visual question answering (VQA) using dense spatial rewards and structured scene-graph grounding. Built on the Qwen2.5-VL-7B backbone, it unifies high-resolution visual inputs with autoregressive language modeling, reinforcing 3D-aware reasoning through an online reinforcement learning (RL) framework and a novel data synthesis pipeline. SpatialThinker-7B demonstrates enhanced spatial understanding and generalizes robustly to real-world and abstract VQA tasks, outperforming both supervised and prior RL baselines as well as established proprietary models, all with markedly data-efficient training.
1. Model Architecture and Reasoning Paradigm
SpatialThinker-7B employs the Qwen2.5-VL-7B vision-language backbone, wherein a high-resolution vision encoder for RGB images pairs with a 7-billion-parameter autoregressive transformer language decoder. All parameters, including both visual and language components, are updated during RL.
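As a minimal, illustrative sketch (not the authors' training code), the backbone can be loaded through Hugging Face Transformers; the checkpoint id, dtype, and device placement below are assumptions:

```python
# Sketch: loading the Qwen2.5-VL-7B backbone used by SpatialThinker-7B.
# Checkpoint id and dtype are assumptions; the RL fine-tuning itself is not shown.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",   # assumed public base checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# All parameters (vision encoder and language decoder) are trainable during RL.
for p in model.parameters():
    p.requires_grad_(True)
```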
The model operationalizes reasoning as a sequential policy over output token sequences $y = (y_1, \dots, y_T)$, given an image $I$ and question text $x$:

$$\pi_\theta(y \mid I, x) \;=\; \prod_{t=1}^{T} \pi_\theta\!\left(y_t \mid y_{<t}, I, x\right).$$
The reasoning process is externally structured into four stages, each marked by explicit tokens:
- <observe>: Natural-language scene description localizing regions of interest
- <scene>: JSON-encoded subgraph representing objects (with IDs, labels, bounding boxes) and spatial relation triplets relevant to the query
- <think>: Chain-of-thought step detailing hypotheses and deductions based on the <scene> subgraph and the visual context
- <answer>: Discrete multiple-choice selection
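A hypothetical response in this format might look as follows; the objects, coordinates, and selected answer are invented purely for illustration:

```
<observe>A wooden table in the foreground with a red mug on its left edge and a laptop near the center.</observe>
<scene>{"objects": [{"id": 1, "label": "mug", "bbox": [112, 340, 198, 452]},
                    {"id": 2, "label": "laptop", "bbox": [260, 300, 540, 470]}],
        "relations": [[1, "left of", 2]]}</scene>
<think>The mug's bounding box lies entirely to the left of the laptop's, so the mug is to the left of the laptop.</think>
<answer>B</answer>
```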
Vision-derived features are fused with the text embeddings inside the language model; all scene-graph-related content is passed as explicit JSON within the text stream and treated as regular tokens by the language model.
2. STVQA-7K Data Synthesis and Quality Control
SpatialThinker-7B’s training leverages STVQA-7K, a curated spatial VQA dataset generated via a semi-automated pipeline:
- Base scene graphs and predicates derive from human-annotated Visual Genome (VG150); the predicate set is extended with 34 new spatial predicates (e.g., near/far, taller/shorter, inside/beneath, facing_away).
- Claude Sonnet 4 is employed to synthesize multiple-choice spatial VQA pairs across categories covering spatial relations, depth ordering, distance, size, orientation, containment, reach/interactions, region location, counting, and existence.
- Each sample features four answer choices (A–D) and a uniform label distribution.
- Every instance is rated by Claude Sonnet; from approximately 56,000 generated pairs, the top 10,000 by rating are retained.
- Filtering is further strengthened by a GPT-4o pass@2 consistency check: a question is kept if at least one GPT-4o answer matches the synthetic label.
- The final dataset has 7,587 QA pairs, split into 6,895 for training and 692 for validation.
- Scene-graph adaptation for each QA example uses lemmatized keyword extraction to form a question-focused subgraph, keeping bounding boxes in absolute pixel coordinates to support fully scale-aware supervision through CIoU.
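A minimal sketch of this question-focused subgraph selection, assuming a spaCy lemmatizer and a simple dictionary scene-graph schema (both assumptions, not the paper's exact pipeline):

```python
# Sketch: select the question-relevant subgraph by lemmatized keyword overlap.
# The scene-graph schema and the use of spaCy are assumptions for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed

def question_subgraph(question: str, scene_graph: dict) -> dict:
    """Keep objects whose lemmatized label appears in the question,
    plus relations whose endpoints both survive the filter."""
    q_lemmas = {tok.lemma_.lower() for tok in nlp(question) if tok.is_alpha}

    kept_objects = [
        obj for obj in scene_graph["objects"]
        # single-word labels assumed; multi-word labels would need extra handling
        if nlp(obj["label"])[0].lemma_.lower() in q_lemmas
    ]
    kept_ids = {obj["id"] for obj in kept_objects}

    kept_relations = [
        (s, pred, o) for (s, pred, o) in scene_graph["relations"]
        if s in kept_ids and o in kept_ids
    ]
    # Bounding boxes stay in absolute pixel coordinates for CIoU supervision.
    return {"objects": kept_objects, "relations": kept_relations}
```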
3. Reinforcement Learning with Lexicographically Gated Dense Spatial Rewards
The SpatialThinker-7B RL protocol is governed by a lexicographically gated, multi-objective reward composition:

$$R \;=\; w_{\text{fmt}}\, r_{\text{format}} \;+\; w_{\text{count}}\, r_{\text{count}} \;+\; w_{\text{acc}}\, r_{\text{acc}} \;+\; \mathbb{1}\!\left[r_{\text{format}} = 1 \,\wedge\, r_{\text{acc}} = 1\right] w_{\text{spatial}}\, r_{\text{spatial}},$$

where $w_{\text{fmt}}$, $w_{\text{count}}$, $w_{\text{acc}}$, and $w_{\text{spatial}}$ weight the respective components. The indicator function enforces that the spatial reward is only considered once the format and accuracy conditions are met.
Reward Components:
- Format ($r_{\text{format}}$): Ensures the ordered presence of <observe>, <scene>, <think>, and <answer> tags, and that the scene JSON is parseable.
- Count ($r_{\text{count}}$): Penalizes discrepancies between the numbers of predicted and ground-truth objects and relations using a weighted normalized error.
- Accuracy ($r_{\text{acc}}$): Binary, based on exact match with the correct answer.
- Spatial ($r_{\text{spatial}}$): Applied only when the format and accuracy conditions are met; predicted objects are matched to ground-truth objects via the Hungarian algorithm under a composite matching cost, and the match is scored by the average Complete IoU (CIoU), which incorporates area overlap, center distance, and aspect-ratio alignment.
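The gating logic can be sketched as follows; the weight values and function signature are illustrative assumptions rather than the published configuration:

```python
# Sketch of the lexicographically gated reward composition.
# Weight values and helper signatures are illustrative assumptions,
# not the published SpatialThinker-7B configuration.
from dataclasses import dataclass

@dataclass
class RewardWeights:
    fmt: float = 0.25      # assumed
    count: float = 0.25    # assumed
    acc: float = 0.25      # assumed
    spatial: float = 0.25  # assumed

def total_reward(r_format: float, r_count: float, r_acc: float,
                 r_spatial: float, w: RewardWeights = RewardWeights()) -> float:
    """Dense reward with a lexicographic gate: the spatial term only
    contributes when the response is well-formatted AND the answer is correct."""
    gate = 1.0 if (r_format == 1.0 and r_acc == 1.0) else 0.0
    return (w.fmt * r_format
            + w.count * r_count
            + w.acc * r_acc
            + gate * w.spatial * r_spatial)
```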
Policy Optimization:
Group-Relative Policy Optimization (GRPO) is used: for each query, a group of rollouts is sampled and scored, advantages are normalized within the group, and a PPO-style clipped surrogate loss with a KL penalty toward the reference policy is applied (a KL coefficient of 0.01 is reported as optimal in the ablations below).
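A schematic of a single GRPO update on one group of rollouts, with reward scoring and log-probability computation abstracted away; the clip range and KL estimator here are assumptions:

```python
# Schematic GRPO step for one query: group-normalized advantages plus a
# PPO-style clipped surrogate with a KL penalty. Coefficients are assumptions.
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G,) sum of token log-probs, current policy
              logp_old: torch.Tensor,   # (G,) same, behavior policy (detached)
              kl_to_ref: torch.Tensor,  # (G,) KL estimate vs. frozen reference model
              rewards: torch.Tensor,    # (G,) scalar reward per rollout
              clip_eps: float = 0.2,    # assumed clip range
              kl_coef: float = 0.01) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within the group of G rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()

    return policy_loss + kl_coef * kl_to_ref.mean()
```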
4. Scene Graph Representation and Encoding Strategy
Objects are represented by an ID, a label, and a bounding box in absolute pixel coordinates; relations are (subject, predicate, object) triplets, with predicates drawn from the extended spatial predicate set.
- Scene graphs are not modeled by a separate GNN; rather, they are serialized as JSON and processed as part of the language input stream.
- The language model integrates object IDs, label tokens, numerical bounding-box tokens, and relation names via the standard transformer pipeline.
- Region-of-interest (RoI) filtering constrains graphs and rewards to question-relevant subgraphs, avoiding reward dilution and overfitting to incidental scene elements.
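For concreteness, a small sketch of serializing such a subgraph into the <scene> JSON block that enters the text stream; the exact field names and tag wrapper are assumptions:

```python
# Sketch: serialize a question-focused subgraph into the <scene> JSON block.
# Field names and the tag wrapper shown here are illustrative assumptions.
import json

objects = [
    {"id": 1, "label": "mug",    "bbox": [112, 340, 198, 452]},  # absolute pixels
    {"id": 2, "label": "laptop", "bbox": [260, 300, 540, 470]},
]
relations = [[1, "left of", 2]]   # (subject_id, predicate, object_id)

scene_block = "<scene>" + json.dumps(
    {"objects": objects, "relations": relations}) + "</scene>"
print(scene_block)   # fed to the transformer as ordinary text tokens
```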
5. Training Protocols, Baselines, and Evaluation Benchmarks
Training starts from the base Qwen2.5-VL-7B checkpoint with no supervised fine-tuning stage; reinforcement learning is run directly on STVQA-7K for approximately 15 hours on 4×NVIDIA H100 GPUs, using the AdamW optimizer with weight decay, BF16 precision, and a context length of 16,384 tokens.
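The stated setup can be summarized in a configuration sketch; entries not given in the text are left unset and marked as such:

```python
# Consolidated training-setup sketch; values not stated in the text
# (learning rate, weight decay, rollout group size) are deliberately left unset.
rl_config = {
    "base_model": "Qwen2.5-VL-7B",
    "optimizer": "AdamW",            # weight decay enabled, value unspecified
    "learning_rate": None,           # not reproduced here
    "weight_decay": None,            # not reproduced here
    "precision": "bf16",
    "context_length": 16_384,
    "kl_coef": 0.01,                 # reported optimal in the ablations
    "rollouts_per_query": None,      # group size not stated
    "hardware": "4x NVIDIA H100",
    "approx_wall_clock_hours": 15,
}
```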
Comparative Baselines:
- Supervised fine-tuning (SFT) with LoRA for 3 epochs
- Vanilla GRPO using only format/accuracy rewards (both weights 0.5)
- Proprietary models: GPT-4o, Claude 3.5 Sonnet
- Open-source generalist MLLMs: Qwen2.5-VL, LLaVA-NeXT, Cambrian-1, VLAA-Thinker
- Open-source spatial MLLMs: SpaceLLaVA, SpatialRGPT, RoboPoint, SpaceThinker, SpaceOm, SpatialReasoner, SpatialBot, Visionary-R1, SATORI-R1
Evaluation Benchmarks:
- Six spatial: CV-Bench (2D/3D), BLINK (spatial, relative depth), 3DSRBench, MMVP, SpatialBench, SpatialReasonerEval
- Six real-world/general VQA: MM-Star, VStarBench, RealWorldQA, MME-RealWorld-Lite, RoboSpatial-Home (configuration/compatibility), HallusionBench
6. Empirical Performance and Ablation Analyses
SpatialThinker-7B demonstrates substantial improvements on both spatial and general VQA metrics:
| Benchmark | SFT | Vanilla GRPO | GPT-4o | SpatialThinker-7B |
|---|---|---|---|---|
| CV-Bench (2D/3D) | 70.0% | 72.7% | 79.4% | 78.2% |
| 3DSRBench | – | – | 44.3% | 56.4% |
| BLINK (avg) | – | – | 80.4% | 79.3% |
| Real-World VQA (avg) | 65.8% | 65.2% | 66.2% | 69.7% |
| 12-Benchmark Mean | 64.0% | – | 67.8% | 71.2% |

- On 3DSRBench, SpatialThinker-7B attains 56.4%, exceeding GPT-4o by +12.1 percentage points.
- Across the six spatial benchmarks, SpatialThinker-7B consistently outperforms all open-source spatial MLLMs despite using only about 7,000 training samples, orders of magnitude fewer than competitors trained on millions of examples or with explicit RGB-D input.
- Real-world VQA metrics indicate robustness and grounding; gains are notable on MM-Star, RoboSpatial-Home, and VStarBench.
- Dense spatial rewards nearly double the RL improvement over sparse rewards (+7.2% versus +4.0%).
Ablation studies:
- Naïve addition of count+spatial rewards produces reward hacking (23.7%).
- Lexicographic gating and RoI filtering restore reward utility (76.3%).
- Final filtering yields an STVQA-7K validation accuracy of 87.9%.
- Moderate KL regularization proves optimal (a coefficient of 0.01 yields 73.7% on CV-Bench for the 3B variant), outperforming configurations without KL regularization or with alternative KL strengths.
7. Insights, Contributions, and Limitations
SpatialThinker-7B's principal innovation is the integration of explicit scene-graph grounding with lexicographically gated dense rewards, operationalizing a human-like observe → localize → think → answer pipeline. This structured reward approach notably surpasses sparse RL and SFT while maintaining high data efficiency.
Further, the approach demonstrates robust generalization to both in-domain (synthetic spatial VQA) and out-of-domain (real-world visual and abstract reasoning) benchmarks, with competitive performance relative to leading proprietary and open-source models.
Identified limitations:
- Requires explicit scene graph generation and bounding-box labels.
- Implicit spatial reasoning within latent representations, i.e., answering without emitting explicit JSON scene graphs, is not yet supported.
- Extensions to spatiotemporal vision (e.g., video, navigation) and unified multi-objective policies remain open directions.
References to architecture diagrams and reward dynamics are provided in Figures 1 and 6–7 of the source publication.