SpaceVista-7B: All-Scale Spatial Reasoning
- SpaceVista-7B is a multimodal spatial reasoning model that operates across scales from millimeters to kilometers, addressing indoor, outdoor, and aerial scenarios.
- It employs a dense feature fusion mechanism with scale-aware experts to integrate geometric cues beyond basic semantic features.
- Evaluations on multiple benchmarks demonstrate significant performance gains over baselines, validating its scale-conditioned training and progressive reward design.
SpaceVista-7B is a 7B-parameter multimodal spatial reasoning model for all-scale visual spatial reasoning in videos, designed to operate across real-world spatial scales from millimeters to kilometers (Sun et al., 10 Oct 2025). It is presented together with SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs over 38K video scenes, and SpaceVista-Bench, a manually assembled benchmark for scale-sensitive evaluation. The system is positioned against prior approaches that are largely indoor-only, rely heavily on 3D scans and manual labeling, and often overfit to single-scene distributions without effective scale generalization. Its central technical claim is that all-scale spatial intelligence in MLLMs benefits from three coupled components: a specialist-driven automated data pipeline, dense feature fusion beyond semantics, and scale-aware training with expert routing and progressive rewards (Sun et al., 10 Oct 2025).
1. Problem setting and scale taxonomy
SpaceVista-7B targets spatial reasoning tasks that span tiny object manipulation, tabletop planning, indoor layout understanding, outdoor landmark reasoning, and aerial area estimation (Sun et al., 10 Oct 2025). The motivating observation is that recent spatial reasoning work has made progress on indoor scenes, but remains limited for robotics, autonomous driving, and other settings requiring reliable reasoning across substantially different physical scales.
The paper organizes this challenge around five spatial scales. Tiny tabletop corresponds to approximately $2$ mm–$5$ cm, tabletop to $5$ cm–$2$ m, indoor to $0.5$–$20$ m, outdoor to $0.5$–$500$ m, and drone-view to $10$ m–$0.7$ km (Sun et al., 10 Oct 2025). The full dataset therefore spans the millimeter-to-kilometer range. This taxonomy is not merely descriptive: it functions as a training and evaluation scaffold, and “scale” is treated as an anchor in reward shaping and as an implicit conditioning signal through expert routing.
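The five ranges overlap, so a single physical distance can belong to several scales at once. The sketch below is only an illustration of that point: the dictionary transcribes the ranges listed above in metres, while the names `SCALE_RANGES_M`, `candidate_scales`, and `overall_span_m` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

# Scale ranges in metres, transcribed from the taxonomy in the text.
# The identifiers are illustrative, not names used by the paper.
SCALE_RANGES_M: Dict[str, Tuple[float, float]] = {
    "tiny_tabletop": (0.002, 0.05),  # 2 mm to 5 cm
    "tabletop": (0.05, 2.0),         # 5 cm to 2 m
    "indoor": (0.5, 20.0),           # 0.5 m to 20 m
    "outdoor": (0.5, 500.0),         # 0.5 m to 500 m
    "drone_view": (10.0, 700.0),     # 10 m to 0.7 km
}


def candidate_scales(distance_m: float) -> List[str]:
    """Return every scale whose (inclusive) range contains the distance."""
    return [
        name
        for name, (low, high) in SCALE_RANGES_M.items()
        if low <= distance_m <= high
    ]


def overall_span_m() -> Tuple[float, float]:
    """Smallest lower bound and largest upper bound across all scales."""
    lows, highs = zip(*SCALE_RANGES_M.values())
    return min(lows), max(highs)
```

A distance of 1 m, for example, falls inside the tabletop, indoor, and outdoor ranges simultaneously, which is one reason a distance alone cannot identify the scale of a scene.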
The structured task taxonomy spans 19 task types. General tasks include Position Comparison, Size Comparison, Existence Estimation, Object Counting, Rotation Estimation, Relative Distance, Absolute Distance, Object Size, Route Planning, Appearance Order, Depth Estimation, View Change Inference, Object Matching, and Spatial Relation. Room Size is indoor-specific; Navigation is outdoor-specific; Area Estimation and Route Plan are drone-view-specific; and Object Location, Destination Location, Obstacles Location, and Manipulation Planning are tabletop-specific (Sun et al., 10 Oct 2025).
A plausible implication is that the model is not framed as a single-scene geometric reasoner, but as a scale-conditioned reasoning system over heterogeneous visual regimes. The paper explicitly links manipulation tasks to tiny/tabletop scales, volume and area estimation to indoor/drone scales, and path/navigation to indoor, outdoor, and drone scenarios.
2. Dataset and benchmark infrastructure
SpaceVista-1M contains approximately 1,014,000 QA pairs across 19 task types and 50+ subscene categories, derived from approximately 38,000 video scenes with a QA/scene ratio of approximately 25 (Sun et al., 10 Oct 2025). The curation process is specialist-driven and automated. Its sources include DL3DV for indoor, outdoor, and drone data; WildRGB-D and SMOT for tabletop data; uCO3D for tiny tabletop data; ScanNet and ScanNet++ for indoor scenes; and the authors’ own recorded videos.
Camera intrinsics and extrinsics are either known or reconstructed, for example with COLMAP, which enables metric computation (Sun et al., 10 Oct 2025). For geometry-related supervision during data construction, the pipeline uses Metric3Dv2 and UniDepthV2 for metric depth and normals, together with Video-Depth-Anything and an energy-minimization procedure for temporal consistency. The energy is defined over the metric depth maps and the Video-Depth-Anything maps (Sun et al., 10 Oct 2025).
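To make the idea of reconciling two depth sources concrete, the following sketch fits one scale and one shift for a whole clip so that a temporally consistent depth sequence matches the metric depth in the least-squares sense. This is an assumption chosen for illustration, not the energy used in the paper; the function names and the flat-list depth representation are hypothetical.

```python
def fit_global_scale_shift(consistent_frames, metric_frames):
    """Least-squares (scale, shift) minimising sum((scale * c + shift - m)^2).

    Both inputs are lists of frames, each frame a flat list of depth values.
    A single (scale, shift) is shared by all frames, so the temporal structure
    of the consistent sequence is preserved. Illustrative assumption only.
    """
    cons = [c for frame in consistent_frames for c in frame]
    met = [m for frame in metric_frames for m in frame]
    n = len(cons)
    mean_c = sum(cons) / n
    mean_m = sum(met) / n
    var_c = sum((c - mean_c) ** 2 for c in cons)
    cov = sum((c - mean_c) * (m - mean_m) for c, m in zip(cons, met))
    scale = cov / var_c if var_c > 0 else 1.0
    shift = mean_m - scale * mean_c
    return scale, shift


def apply_scale_shift(consistent_frames, scale, shift):
    """Map every depth value of the consistent sequence into metric units."""
    return [[scale * c + shift for c in frame] for frame in consistent_frames]
```

Sharing one scale and shift across the clip is a deliberate simplification: fitting them per frame would match the metric maps more closely but could reintroduce frame-to-frame flicker.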
Grounded semantics are provided by proprietary DINO-X and GroundingDINO for per-frame categories and boxes, while SAM2 is used for cross-frame mask tracking and object ID association in a Grounded-SAM2 configuration (Sun et al., 10 Oct 2025). QA generation mixes template-based construction for constrained tasks, using 3,000+ curated templates, with LLM-based generation using Qwen2.5-VL-72B-Instruct and Gemini-2.5-Pro for flexible tasks such as planning. CoT rationales are produced with cognition-inspired few-shot prompting using Qwen2.5-VL-72B and are filtered for rationale quality and consistency.
Extended referential inputs are a notable design feature. Each QA item may be conditioned on a point, bounding box, or mask, and each input form has dedicated templates and CoT (Sun et al., 10 Oct 2025). This supports not only descriptive QA but also interactive tasks such as object counting or manipulation with a specified referent.
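One way to picture a QA item with an optional referent is as a small record in which exactly one of the three input forms is set. The layout below is a hypothetical sketch: the class names, field names, and `template_key` helper are assumptions for illustration and are not the released data schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Referent:
    """A referent on one key frame; exactly one of point, box, mask is set."""
    frame_index: int
    point: Optional[Tuple[float, float]] = None              # (x, y)
    box: Optional[Tuple[float, float, float, float]] = None  # (x0, y0, x1, y1)
    mask: Optional[List[List[int]]] = None                   # binary H x W grid

    def kind(self) -> str:
        given = [
            name for name in ("point", "box", "mask")
            if getattr(self, name) is not None
        ]
        if len(given) != 1:
            raise ValueError("exactly one of point, box, or mask must be set")
        return given[0]


@dataclass
class SpatialQA:
    question: str
    answer: str
    referent: Optional[Referent] = None


def template_key(item: SpatialQA) -> str:
    """Pick the template family: 'plain' or the referent's input form."""
    return "plain" if item.referent is None else item.referent.kind()
```

Keying templates on the input form mirrors the statement that each form has its own templates and CoT, without assuming anything about what those templates contain.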
Quality control distinguishes between “perceptual correctness,” used to filter training data, and “strict correctness,” reserved for evaluation (Sun et al., 10 Oct 2025). The benchmark, SpaceVista-Bench, contains more than 3,000 fully human-annotated QA pairs across approximately 500 unique video scenes, with reported accuracy of approximately 99% for benchmark answers. The benchmark is constructed by measuring and recording real objects at tiny and tabletop scales, retrieving landmark statistics from authoritative sources such as Wikipedia for indoor and outdoor scales, and using human annotation for motion and non-distance tasks.
| Component | Scope | Key property |
|---|---|---|
| SpaceVista-1M | $\sim$38K scenes, $\sim$1M QA pairs | Five scales, 19 task types |
| SpaceVista-Bench | $>$3K QA pairs, $\sim$500 scenes | Fully human-annotated, strict correctness |
| Input extensions | Point, bounding box, mask | Referent specification for QA and interaction |
3. Model architecture and multimodal inputs
The language backbone of SpaceVista-7B is Qwen2.5-VL-7B-Instruct, and the vision backbone is the Qwen2.5-VL visual encoder for semantic video tokens (Sun et al., 10 Oct 2025). The architecture adds a dense “beyond semantics” encoder based on DINOv3, described as self-supervised and intended to provide patch-level dense features such as depth-, normal-, and pattern-like cues that semantic tokenizers often miss.
The model accepts RGB video frames, with up to 32 frames during training; evaluation follows the official Qwen2.5-VL decoding settings (Sun et al., 10 Oct 2025). It also supports referential conditioning in key frames through points, boxes, and masks. Although dataset construction uses metric depth, normals, and camera intrinsics/extrinsics to generate labels, the model itself fuses dense self-supervised features from DINOv3 with semantic video features through cross-attention. VGGT geometry can also be injected in ablation, but DINOv3 proved more robust (Sun et al., 10 Oct 2025).
The dense fusion mechanism is defined over semantic video features, written here as $F_{\text{sem}}$, and dense features, written here as $F_{\text{dense}}$. After spatial and temporal alignment, the fused features are computed as
$$F_{\text{fused}} = \mathrm{CA}\left(F_{\text{sem}}, F_{\text{dense}}\right)$$
where $\mathrm{CA}$ denotes multi-layer cross-attention (Sun et al., 10 Oct 2025). Each element of $F_{\text{fused}}$ is then converted to image tokens for the LLM. The paper’s interpretation is that this fusion supplies geometric and dense cues absent from contrastively trained semantic tokenizers.
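The cross-attention step can be sketched in a few lines. The version below is a minimal single-head, single-layer illustration in which semantic tokens act as queries over the dense tokens and the result is added back residually. The query/key roles, the residual connection, and the absence of learned projections are all assumptions made for brevity; the paper's module is multi-layer and its exact form is not reproduced here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(semantic_tokens, dense_tokens):
    """Single-head cross-attention with a residual connection.

    semantic_tokens: list of query vectors (assumed role).
    dense_tokens: list of key/value vectors of the same dimension.
    Returns one fused vector per semantic token.
    """
    dim = len(semantic_tokens[0])
    fused = []
    for query in semantic_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in dense_tokens
        ]
        weights = softmax(scores)
        attended = [
            sum(w * value[j] for w, value in zip(weights, dense_tokens))
            for j in range(dim)
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused
```

Because the semantic tokens are the queries in this sketch, the fused sequence keeps the same length as the semantic sequence, so it can replace the original image tokens one for one.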
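The second architectural pillar is a set of scale-aware LoRA-like experts attached to each language-model layer. In standard LoRA notation, the projection can be written as
$$h = W_0 x + \sum_{i=1}^{N} g_i \, B_i A_i x$$
with learned per-expert gating $g_i$ from a router implemented as an MLP plus softmax (Sun et al., 10 Oct 2025). Here $W_0 \in \mathbb{R}^{d \times k}$ is the frozen base weight, $A_i \in \mathbb{R}^{r \times k}$ and $B_i \in \mathbb{R}^{d \times r}$ are the low-rank factors of expert $i$, and $r \ll \min(d, k)$. The paper reports that only a small fraction of the parameters is fine-tuned per expert.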
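A gated mixture of low-rank experts on top of a frozen weight can be sketched as follows. This is a generic LoRA-style mixture written under assumptions: the router is reduced to a vector of logits passed through softmax (the MLP that would produce them is omitted), and all function and argument names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def expert_projection(x, base_weight, experts, router_logits):
    """Frozen base projection plus a gated sum of low-rank expert updates.

    base_weight: d x k matrix, kept frozen during fine-tuning.
    experts: list of (A, B) pairs with A of shape r x k and B of shape d x r.
    router_logits: one logit per expert (assumed to come from a router MLP).
    Returns the projected vector and the gate values.
    """
    gates = softmax(router_logits)
    hidden = matvec(base_weight, x)
    for gate, (down, up) in zip(gates, experts):
        delta = matvec(up, matvec(down, x))  # low-rank update B A x
        hidden = [h + gate * d for h, d in zip(hidden, delta)]
    return hidden, gates
```

Because only the small `A` and `B` factors and the router are trained in such a scheme, each expert adds few trainable parameters relative to the frozen base weight.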
This expert structure is explicitly motivated by cross-scale knowledge conflicts. The paper identifies naive mixing of mm- and km-scale data as a source of interference, and uses layer-wise routing so that different layers can select different experts for different scale characteristics (Sun et al., 10 Oct 2025). This suggests a decomposition in which scale specialization is distributed across the depth of the LLM rather than encoded as a single global switch.
4. Scale-aware training and progressive rewards
Training proceeds in stages. First, the model is SFT-trained on SpaceVista-1M with CoT rationales to inject foundational spatial knowledge (Sun et al., 10 Oct 2025). During this phase, the vision projection, fusion modules, and scale experts are optimized; the vision tower and projector are typically frozen to preserve pretrained semantics, while the LLM remains trainable. A second SFT step introduces the scale router so that each expert is allocated to appropriate inputs and layers, encouraging specialization without overfitting.
The RL stage uses GRPO with progressive anchors. The anchors are ordered as $\text{semantic} \rightarrow \text{scale} \rightarrow \text{answer}$, reflecting the reasoning sequence described in the paper: identify relevant objects, estimate scene scale via references, then solve the spatial problem (Sun et al., 10 Oct 2025). Rewards combine answer correctness with intermediate anchor consistency. The updated correctness reward aggregates the anchors in that same order; the paper defines the scale term through scene-scale estimates expressed in a common unit and the semantic term through the cosine-style agreement of semantic embeddings for referenced objects (Sun et al., 10 Oct 2025). Anchors are used for RL reward shaping, but not during evaluation beyond the final answer.
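The sketch below shows one way an ordered, anchor-based reward could be assembled: a semantic term from cosine similarity, a scale term from the ratio of two estimates in metres, and an answer term, with each later term counted only if the earlier one clears a threshold. The gating rule, the ratio-based scale score, the threshold, and the equal-weight average are all assumptions for illustration and should not be read as the paper's reward definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def scale_agreement(predicted_m, reference_m):
    """Ratio in (0, 1] of the smaller to the larger estimate, both in metres."""
    if predicted_m <= 0 or reference_m <= 0:
        return 0.0
    return min(predicted_m, reference_m) / max(predicted_m, reference_m)


def progressive_reward(sem_pred, sem_ref, scale_pred_m, scale_ref_m,
                       answer_correct, threshold=0.5):
    """Ordered anchors: semantic, then scale, then answer (hypothetical gating)."""
    r_sem = max(0.0, cosine(sem_pred, sem_ref))
    r_scale = scale_agreement(scale_pred_m, scale_ref_m) if r_sem >= threshold else 0.0
    r_answer = float(answer_correct) if r_scale >= threshold else 0.0
    return (r_sem + r_scale + r_answer) / 3.0
```

Under this particular gating, a correct final answer earns nothing when the scale estimate is far off, which is one way to discourage answers that are right for the wrong reasons.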
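The policy objective is a groupwise normalized GRPO objective with KL regularization to a reference model. Advantages are computed by groupwise normalization,
$$\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}$$
with $r_i$ combining the reward terms described above for the $i$-th response in a group of $G$ sampled responses (Sun et al., 10 Oct 2025). RL is trained for 2.5k steps with GRPO group size 8.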
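Groupwise normalization itself is simple to state in code. The sketch below standardizes the rewards of one group of sampled responses; the small `eps` term and the use of the population standard deviation are implementation choices assumed here, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    avoids division by zero when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With a group size of 8, as reported for the RL stage, a group in which every response earns the same reward yields all-zero advantages and therefore contributes no policy gradient signal.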
A plausible implication is that the reward design attempts to regularize reasoning trajectories rather than only end answers. In the paper’s framing, scale is neither an explicit token nor a purely latent nuisance variable; it is an ordered anchor that stabilizes training and encourages reusable scale-specific abstractions.
5. Empirical performance and ablation results
Evaluation is reported on five benchmarks: the video benchmarks SpaceVista-Bench, VSI-Bench, and STI-Bench, and the multi-image benchmarks MMSI-Bench and SPAR-Bench (Sun et al., 10 Oct 2025). Metrics are accuracy or score percentage depending on each benchmark’s protocol, and decoding follows the Qwen2.5-VL demo settings.
For open-source general baselines in the 7–8B class, LLaVA-OneVision-7B, LLaVA-Next-Video-7B, InternVL3.5-8B, and Qwen2.5-VL-7B score in the ranges 13.6–31.7% on MMSI, 30.6–36.0% on SPAR, 32.4–38.2% on VSI, 29.0–33.2% on STI, and 13.6–28.9% on SpaceVista-Bench (Sun et al., 10 Oct 2025). Open-source specialized baselines including SpaceR-7B, SpatialMLLM-4B, VILASR-7B, and VG-LLM-4B score 21.2–37.6% on SPAR and 28.8–48.4% on VSI, but underperform on the all-scale SpaceVista-Bench, where they obtain 21.2–28.8%.
SpaceVista-7B improves over these baselines across all five benchmarks. Without RL, it achieves 29.1% on MMSI, 38.1% on SPAR, 46.3% on VSI, 35.9% on STI, and 34.5% on SpaceVista-Bench (Sun et al., 10 Oct 2025). With RL, the scores rise to 32.3%, 41.6%, 48.6%, 38.2%, and 36.7%, respectively.
| Model setting | MMSI | SPAR | VSI | STI | SpaceVista-Bench |
|---|---|---|---|---|---|
| SpaceVista-7B | 29.1% | 38.1% | 46.3% | 35.9% | 34.5% |
| SpaceVista-7B with RL | 32.3% | 41.6% | 48.6% | 38.2% | 36.7% |
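The per-benchmark effect of the RL stage follows directly from the two rows of the table. The short calculation below uses only the reported scores; the variable and function names are illustrative.

```python
# Scores in percent, copied from the table above.
WITHOUT_RL = {"MMSI": 29.1, "SPAR": 38.1, "VSI": 46.3, "STI": 35.9, "SpaceVista-Bench": 34.5}
WITH_RL = {"MMSI": 32.3, "SPAR": 41.6, "VSI": 48.6, "STI": 38.2, "SpaceVista-Bench": 36.7}


def rl_gains(before, after):
    """Absolute gain in percentage points per benchmark, rounded to 0.1."""
    return {name: round(after[name] - before[name], 1) for name in before}


def mean_gain(before, after):
    """Average gain across benchmarks, rounded to 0.1 percentage points."""
    gains = rl_gains(before, after)
    return round(sum(gains.values()) / len(gains), 1)
```

The gains are 3.2 points on MMSI, 3.5 on SPAR, 2.3 on VSI, 2.3 on STI, and 2.2 on SpaceVista-Bench, for an average of 2.7 points across the five benchmarks.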
On the SpaceVista-Bench leaderboard, SpaceVista-7B with RL achieves 36.7% overall and is reported as top or second-best across Tiny Tabletop at 33.4%, Tabletop at 37.1%, Indoor at 42.2%, and Outdoor at 34.1% among open-source models (Sun et al., 10 Oct 2025). The paper further states that it outperforms general 7–72B open-source models by approximately 6% or more on comprehensive all-scale spatial reasoning.
Ablation results attribute gains to the three main design choices. In a 3B ablation, a vanilla model obtains 44.4% on VSI and 31.0% on SpaceVista-Bench. Adding a scale anchor yields gains on both benchmarks, as does adding scale plus semantic anchors, and expert fine-tuning yields additional gains on both (Sun et al., 10 Oct 2025). Increasing the number of experts from none to four raises SpaceVista-Bench from 31.0% to 32.9%. For dense encoders, VGGT gives a small or negative change on VSI and SpaceVista-Bench, whereas DINOv3 gives larger gains on both. A further ablation reports that 2.5D renders from geometry are more robust than raw 3D features under low-resolution or noisy inputs.
6. Implementation, limitations, and release status
Main experiments on the 7B model use up to 16 NVIDIA A800 80GB GPUs, while ablations are mainly conducted on a 3B model (Sun et al., 10 Oct 2025). SFT runs for 2 epochs on the CoT subset with DeepSpeed ZeRO-2, bf16 mixed precision, a cosine learning-rate schedule with 10% warmup, and sequences truncated at 32,768 tokens. RL uses GRPO on the multiple-choice and regression subsets for 2.5k steps on 7 GPUs with DeepSpeed, bf16 with flash attention, a per-device batch size of 1, gradient accumulation of 1, weight decay 0.01, inputs of up to 16,384 tokens, outputs of up to 1,024 tokens, evaluation every 200 steps, and vLLM inference at temperature 1.0 with 8 samples per input (Sun et al., 10 Oct 2025).
Training uses up to 32 frames, each at “128 × 28 × 28 pixels” in the paper’s internal tokenizer representation, while inference increases resolution to “256 × 28 × 28 pixels” (Sun et al., 10 Oct 2025). The model wraps CoT-style rationales in a dedicated reasoning tag during SFT and uses <answer> ... </answer> for final outputs, while RL employs <semantic>, <scale>, and <answer> anchors internally for reward shaping only.
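The two resolution settings translate into concrete per-frame pixel budgets. The arithmetic below assumes the quoted figures mean a count of 28 × 28 pixel patches per frame; that reading, and the helper names, are assumptions rather than statements from the paper.

```python
PATCH_SIDE = 28  # side length in pixels of one patch, per the quoted figures


def frame_pixel_budget(patches_per_frame):
    """Total pixels per frame for a given number of 28 x 28 patches."""
    return patches_per_frame * PATCH_SIDE * PATCH_SIDE


def clip_patch_budget(patches_per_frame, num_frames):
    """Total number of patches across all frames of a clip."""
    return patches_per_frame * num_frames


train_pixels = frame_pixel_budget(128)        # 100352 pixels per frame
infer_pixels = frame_pixel_budget(256)        # 200704 pixels per frame
train_patches = clip_patch_budget(128, 32)    # 4096 patches for 32 frames
```

Under this reading, inference uses twice the per-frame pixel budget of training, and a full 32-frame training clip amounts to 4,096 patches.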
The paper identifies several limitations. Current coverage from mm to km still omits $\mu$m-scale surgery, sub-mm industrial precision, and multi-km satellite or cartographic reasoning (Sun et al., 10 Oct 2025). Specialist models for metric depth and grounding introduce noise that can propagate despite temporal consistency and object tracking. Knowledge conflicts are reduced but not eliminated; the model can still memorize typical object sizes, such as chairs at 50–70 cm, and fail on unusual cases. Raw geometry features via VGGT are less robust without strong decoders, while 2.5D renders better align with pretrained image tokenizers. Outdoor and drone scenes remain harder than indoor scenes because they require better long-range scale estimation and semantic grounding.
The stated future direction is to extend coverage to additional scales, integrate domain-specific sensing, improve direct geometry-conditioned reasoning, rebalance the data and multi-scale curricula, and scale model size from 3B to 7B to 32B (Sun et al., 10 Oct 2025). SpaceVista-1M, SpaceVista-Bench, and the SpaceVista-7B model are to be released on the project page, with licensing under CC BY 4.0 or Apache 2.0 consistent with source licenses. The release includes free-form, multiple-choice, and regression QA formats, CoT rationales for SFT, and extended point/box/mask inputs (Sun et al., 10 Oct 2025).
The overall significance of SpaceVista-7B lies in its attempt to formalize all-scale visual spatial reasoning as a unified MLLM problem rather than a collection of indoor-only or scene-specific subproblems. The paper argues that its performance gains arise primarily from dense feature fusion beyond semantics, scale-aware experts that reduce cross-scale knowledge conflicts, and progressive scale-anchored rewards. This suggests a broader research direction in which spatial intelligence is treated as a scale-structured competence requiring both geometric cues and training mechanisms that explicitly manage interference across heterogeneous physical regimes (Sun et al., 10 Oct 2025).