SpatialThinker-7B: Advanced Spatial Reasoning
- SpatialThinker-7B is a multimodal model that integrates high-resolution visual inputs with autoregressive language modeling using reinforcement learning and scene-graph grounding.
- It employs a novel RL framework with lexicographically gated dense spatial rewards to ensure structured observation, scene analysis, and precise answer generation.
- The model demonstrates robust performance on both synthetic spatial VQA benchmarks and real-world tasks, outperforming prior supervised and RL baselines.
SpatialThinker-7B is a multimodal LLM (MLLM) specifically engineered for advanced spatial reasoning and visual question answering (VQA) using dense spatial rewards and structured scene-graph grounding. Built on the Qwen2.5-VL-7B backbone, it unifies high-resolution visual inputs with autoregressive language modeling, reinforcing 3D-aware reasoning through an online reinforcement learning (RL) framework and a novel data synthesis pipeline. SpatialThinker-7B demonstrates enhanced spatial understanding and generalizes robustly to real-world and abstract VQA tasks, outperforming both supervised and prior RL baselines as well as established proprietary models, all with markedly data-efficient training.
1. Model Architecture and Reasoning Paradigm
SpatialThinker-7B employs the Qwen2.5-VL-7B vision-language backbone, wherein a high-resolution vision encoder for RGB images pairs with a 7-billion-parameter autoregressive transformer language decoder. All parameters, including both visual and language components, are updated during RL.
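As a minimal, illustrative sketch (not the authors' training code), the backbone can be loaded through Hugging Face Transformers; the checkpoint id, dtype, and device placement below are assumptions:

```python
# Sketch: loading the Qwen2.5-VL-7B backbone used by SpatialThinker-7B.
# Checkpoint id and dtype are assumptions; the RL fine-tuning itself is not shown.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",   # assumed public base checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# All parameters (vision encoder and language decoder) are trainable during RL.
for p in model.parameters():
    p.requires_grad_(True)
```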
The model operationalizes reasoning as a sequential policy over output token sequences $y = (y_1, \dots, y_T)$, given an image $I$ and question text $x$:

$$\pi_\theta(y \mid I, x) \;=\; \prod_{t=1}^{T} \pi_\theta\!\left(y_t \mid y_{<t}, I, x\right).$$
The reasoning process is externally structured into four stages, each marked by explicit tokens:
- <observe>: Natural-language scene description localizing regions of interest
- <scene>: JSON-encoded subgraph representing objects (with IDs, labels, bounding boxes) and spatial relation triplets relevant to the query
- <think>: Chain-of-thought step detailing hypotheses and deductions based on the <scene> subgraph and the visual context
- <answer>: Discrete multiple-choice selection
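A hypothetical response in this format might look as follows; the objects, coordinates, and selected answer are invented purely for illustration:

```
<observe>A wooden table in the foreground with a red mug on its left edge and a laptop near the center.</observe>
<scene>{"objects": [{"id": 1, "label": "mug", "bbox": [112, 340, 198, 452]},
                    {"id": 2, "label": "laptop", "bbox": [260, 300, 540, 470]}],
        "relations": [[1, "left of", 2]]}</scene>
<think>The mug's bounding box lies entirely to the left of the laptop's, so the mug is to the left of the laptop.</think>
<answer>B</answer>
```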
Vision-derived features are fused with the text embeddings inside the language model; all scene-graph-related content is passed as explicit JSON within the text stream and treated as regular tokens by the language model.
2. STVQA-7K Data Synthesis and Quality Control
SpatialThinker-7B’s training leverages STVQA-7K, a curated spatial VQA dataset generated via a semi-automated pipeline:
- Base scene graphs and predicates derive from human-annotated Visual Genome (VG150); the predicate set is extended with 34 new spatial predicates (e.g., near/far, taller/shorter, inside/beneath, facing_away).
- Claude Sonnet 4 is employed to synthesize multiple-choice spatial VQA pairs across categories covering spatial relations, depth ordering, distance, size, orientation, containment, reach/interactions, region location, counting, and existence.
- Each sample features four answer choices (A–D) and a uniform label distribution.
- Every instance is rated by Claude Sonnet; from approximately 56,000 generated pairs, the top 10,000 by rating are retained.
- Filtering is further strengthened by a GPT-4o pass@2 consistency check: a question is kept if at least one GPT-4o answer matches the synthetic label.
- The final dataset has 7,587 QA pairs, split into 6,895 for training and 692 for validation.
- Scene-graph adaptation for each QA example uses lemmatized keyword extraction to form a question-focused subgraph, keeping bounding boxes in absolute pixel coordinates to support fully scale-aware supervision through CIoU.
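A minimal sketch of this question-focused subgraph selection, assuming a spaCy lemmatizer and a simple dictionary scene-graph schema (both assumptions, not the paper's exact pipeline):

```python
# Sketch: select the question-relevant subgraph by lemmatized keyword overlap.
# The scene-graph schema and the use of spaCy are assumptions for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed

def question_subgraph(question: str, scene_graph: dict) -> dict:
    """Keep objects whose lemmatized label appears in the question,
    plus relations whose endpoints both survive the filter."""
    q_lemmas = {tok.lemma_.lower() for tok in nlp(question) if tok.is_alpha}

    kept_objects = [
        obj for obj in scene_graph["objects"]
        # single-word labels assumed; multi-word labels would need extra handling
        if nlp(obj["label"])[0].lemma_.lower() in q_lemmas
    ]
    kept_ids = {obj["id"] for obj in kept_objects}

    kept_relations = [
        (s, pred, o) for (s, pred, o) in scene_graph["relations"]
        if s in kept_ids and o in kept_ids
    ]
    # Bounding boxes stay in absolute pixel coordinates for CIoU supervision.
    return {"objects": kept_objects, "relations": kept_relations}
```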
3. Reinforcement Learning with Lexicographically Gated Dense Spatial Rewards
The SpatialThinker-7B RL protocol is governed by a lexicographically gated, multi-objective reward composition:

$$R \;=\; w_{\text{fmt}}\, r_{\text{format}} \;+\; w_{\text{count}}\, r_{\text{count}} \;+\; w_{\text{acc}}\, r_{\text{acc}} \;+\; \mathbb{1}\!\left[r_{\text{format}} = 1 \,\wedge\, r_{\text{acc}} = 1\right] w_{\text{spatial}}\, r_{\text{spatial}},$$

where $w_{\text{fmt}}$, $w_{\text{count}}$, $w_{\text{acc}}$, and $w_{\text{spatial}}$ weight the respective components. The indicator function enforces that the spatial reward is only considered once the format and accuracy conditions are met.
Reward Components:
- Format ($r_{\text{format}}$): Ensures the ordered presence of <observe>, <scene>, <think>, and <answer> tags, and that the scene JSON is parseable.
- Count ($r_{\text{count}}$): Penalizes discrepancies between the numbers of predicted and ground-truth objects and relations using a weighted normalized error.
- Accuracy ($r_{\text{acc}}$): Binary, based on exact match with the correct answer.
- Spatial ($r_{\text{spatial}}$): Applied only when the format and accuracy conditions are met; predicted objects are matched to ground-truth objects via the Hungarian algorithm under a composite matching cost, and the match is scored by the average Complete IoU (CIoU), which incorporates area overlap, center distance, and aspect-ratio alignment.
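The gating logic can be sketched as follows; the weight values and function signature are illustrative assumptions rather than the published configuration:

```python
# Sketch of the lexicographically gated reward composition.
# Weight values and helper signatures are illustrative assumptions,
# not the published SpatialThinker-7B configuration.
from dataclasses import dataclass

@dataclass
class RewardWeights:
    fmt: float = 0.25      # assumed
    count: float = 0.25    # assumed
    acc: float = 0.25      # assumed
    spatial: float = 0.25  # assumed

def total_reward(r_format: float, r_count: float, r_acc: float,
                 r_spatial: float, w: RewardWeights = RewardWeights()) -> float:
    """Dense reward with a lexicographic gate: the spatial term only
    contributes when the response is well-formatted AND the answer is correct."""
    gate = 1.0 if (r_format == 1.0 and r_acc == 1.0) else 0.0
    return (w.fmt * r_format
            + w.count * r_count
            + w.acc * r_acc
            + gate * w.spatial * r_spatial)
```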
Policy Optimization:
Group-Relative Policy Optimization (GRPO) is used: for each query, a group of rollouts is sampled and scored, advantages are normalized within the group, and a PPO-style clipped surrogate loss with a KL penalty toward the reference policy is applied (a KL coefficient of 0.01 is reported as optimal in the ablations below).
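A schematic of a single GRPO update on one group of rollouts, with reward scoring and log-probability computation abstracted away; the clip range and KL estimator here are assumptions:

```python
# Schematic GRPO step for one query: group-normalized advantages plus a
# PPO-style clipped surrogate with a KL penalty. Coefficients are assumptions.
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G,) sum of token log-probs, current policy
              logp_old: torch.Tensor,   # (G,) same, behavior policy (detached)
              kl_to_ref: torch.Tensor,  # (G,) KL estimate vs. frozen reference model
              rewards: torch.Tensor,    # (G,) scalar reward per rollout
              clip_eps: float = 0.2,    # assumed clip range
              kl_coef: float = 0.01) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within the group of G rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()

    return policy_loss + kl_coef * kl_to_ref.mean()
```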
4. Scene Graph Representation and Encoding Strategy
Objects are represented by an ID, a label, and a bounding box in absolute pixel coordinates; relations are (subject, predicate, object) triplets, with predicates drawn from the extended spatial predicate set.
- Scene graphs are not modeled by a separate GNN; rather, they are serialized as JSON and processed as part of the language input stream.
- The language model integrates object IDs, label tokens, numerical bounding-box tokens, and relation names via the standard transformer pipeline.
- Region-of-interest (RoI) filtering constrains graphs and rewards to question-relevant subgraphs, avoiding reward dilution and overfitting to incidental scene elements.
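For concreteness, a small sketch of serializing such a subgraph into the <scene> JSON block that enters the text stream; the exact field names and tag wrapper are assumptions:

```python
# Sketch: serialize a question-focused subgraph into the <scene> JSON block.
# Field names and the tag wrapper shown here are illustrative assumptions.
import json

objects = [
    {"id": 1, "label": "mug",    "bbox": [112, 340, 198, 452]},  # absolute pixels
    {"id": 2, "label": "laptop", "bbox": [260, 300, 540, 470]},
]
relations = [[1, "left of", 2]]   # (subject_id, predicate, object_id)

scene_block = "<scene>" + json.dumps(
    {"objects": objects, "relations": relations}) + "</scene>"
print(scene_block)   # fed to the transformer as ordinary text tokens
```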
5. Training Protocols, Baselines, and Evaluation Benchmarks
Training starts from the base Qwen2.5-VL-7B checkpoint with no supervised fine-tuning stage; reinforcement learning is run directly on STVQA-7K for approximately 15 hours on 4×NVIDIA H100 GPUs, using the AdamW optimizer with weight decay, BF16 precision, and a context length of 16,384 tokens.
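The stated setup can be summarized in a configuration sketch; entries not given in the text are left unset and marked as such:

```python
# Consolidated training-setup sketch; values not stated in the text
# (learning rate, weight decay, rollout group size) are deliberately left unset.
rl_config = {
    "base_model": "Qwen2.5-VL-7B",
    "optimizer": "AdamW",            # weight decay enabled, value unspecified
    "learning_rate": None,           # not reproduced here
    "weight_decay": None,            # not reproduced here
    "precision": "bf16",
    "context_length": 16_384,
    "kl_coef": 0.01,                 # reported optimal in the ablations
    "rollouts_per_query": None,      # group size not stated
    "hardware": "4x NVIDIA H100",
    "approx_wall_clock_hours": 15,
}
```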
Comparative Baselines:
- Supervised fine-tuning (SFT) with LoRA for 3 epochs
- Vanilla GRPO using only format/accuracy rewards (both weights 0.5)
- Proprietary models: GPT-4o, Claude 3.5 Sonnet
- Open-source generalist MLLMs: Qwen2.5-VL, LLaVA-NeXT, Cambrian-1, VLAA-Thinker
- Open-source spatial MLLMs: SpaceLLaVA, SpatialRGPT, RoboPoint, SpaceThinker, SpaceOm, SpatialReasoner, SpatialBot, Visionary-R1, SATORI-R1
Evaluation Benchmarks:
- Six spatial: CV-Bench (2D/3D), BLINK (spatial, relative depth), 3DSRBench, MMVP, SpatialBench, SpatialReasonerEval
- Six real-world/general VQA: MM-Star, VStarBench, RealWorldQA, MME-RealWorld-Lite, RoboSpatial-Home (configuration/compatibility), HallusionBench
6. Empirical Performance and Ablation Analyses
SpatialThinker-7B demonstrates substantial improvements on both spatial and general VQA metrics:
| Benchmark | SFT | Vanilla GRPO | GPT-4o | SpatialThinker-7B |
|---|---|---|---|---|
| CV-Bench (2D/3D) | 70.0% | 72.7% | 79.4% | 78.2% |
| 3DSRBench | – | – | 44.3% | 56.4% |
| BLINK (avg) | – | – | 80.4% | 79.3% |
| Real-World VQA (avg) | 65.8% | 65.2% | 66.2% | 69.7% |
| 12-Benchmark Mean | 64.0% | – | 67.8% | 71.2% |

- On 3DSRBench, SpatialThinker-7B attains 56.4%, exceeding GPT-4o by +12.1 percentage points.
- Across the six spatial benchmarks, SpatialThinker-7B consistently outperforms all open-source spatial MLLMs despite using only about 7,000 training samples, orders of magnitude fewer than competitors trained on millions of examples or with explicit RGB-D input.
- Real-world VQA metrics indicate robustness and grounding; gains are notable on MM-Star, RoboSpatial-Home, and VStarBench.
- Dense spatial rewards nearly double the RL improvement over sparse rewards (+7.2% versus +4.0%).
Ablation studies:
- Naïve addition of count+spatial rewards produces reward hacking (23.7%).
- Lexicographic gating and RoI filtering restore reward utility (76.3%).
- Final filtering yields an STVQA-7K validation accuracy of 87.9%.
- Moderate KL regularization proves optimal (a coefficient of 0.01 yields 73.7% on CV-Bench for the 3B variant), outperforming configurations without KL regularization or with alternative KL strengths.
7. Insights, Contributions, and Limitations
SpatialThinker-7B's principal innovation is the integration of explicit scene-graph grounding with lexicographically gated dense rewards, operationalizing a human-like observe → localize → think → answer pipeline. This structured reward approach notably surpasses sparse RL and SFT while maintaining high data efficiency.
Further, the approach demonstrates robust generalization to both in-domain (synthetic spatial VQA) and out-of-domain (real-world visual and abstract reasoning) benchmarks, with competitive performance relative to leading proprietary and open-source models.
Identified limitations:
- Requires explicit scene graph generation and bounding-box labels.
- Implicit spatial reasoning within latent representations, i.e., answering without emitting explicit JSON scene graphs, is not yet supported.
- Extensions to spatiotemporal vision (e.g., video, navigation) and unified multi-objective policies remain open directions.
References to architecture diagrams and reward dynamics are provided in Figures 1 and 6–7 of the source publication.