SenseNova-SI Family in Spatial Intelligence
- The SenseNova-SI family is a suite of open-source multimodal models that enhances spatial reasoning through an 8M-sample QA corpus and capability-balanced, large-scale training.
- The approach leverages data scaling and multi-task fine-tuning to achieve significant benchmark gains in complex spatial tasks without altering underlying architectures.
- All model weights and datasets are publicly available, enabling ongoing research, benchmarking, and practical applications in spatial intelligence.
The SenseNova-SI family represents a suite of open-source multimodal foundation models designed to advance spatial intelligence through large-scale, capability-balanced training. Developed atop high-capacity vision-LLMs, SenseNova-SI demonstrates substantial improvements in spatial reasoning and general multimodal tasks by leveraging a systematically curated dataset and principled training approach. All model weights and training data are made publicly available to facilitate ongoing research and downstream applications (Cai et al., 17 Nov 2025).
1. Model Architectures
The SenseNova-SI family comprises four variants, each built on a different open multimodal backbone architecture:
- Bagel-7B-MoT: A 7 billion parameter unified encoder-decoder architecture capable of jointly modeling vision frames and text tokens, originally trained on mixed vision-language tasks.
- Qwen3-VL-8B: An 8 billion parameter model extended from a large LLM with a vision projection module for vision-language understanding.
- InternVL3-2B and InternVL3-8B: Architectures with 2 and 8 billion parameters, respectively, featuring dual-stream visual encoders that interface with a unified multimodal transformer. Notably, these models are trained from scratch on paired image/text data.
Unlike many prior efforts, all spatial intelligence improvements within SenseNova-SI are attributed exclusively to data scaling and multi-task fine-tuning, without architectural modifications.
2. SenseNova-SI-8M Dataset Construction
Central to SenseNova-SI is the SenseNova-SI-8M spatial QA corpus comprising 8 million image-question-answer pairs, systematically labeled within a five-capability taxonomy:
- Metric Measurement (MM): 1.5M samples
- Spatial Relations (SR): 1.6M samples
- Mental Reconstruction (MR): 0.8M samples
- Perspective-Taking (PT): 2.2M samples
- Comprehensive Reasoning (CR): 1.9M samples
The dataset is constructed as follows:
- 0.6M general visual QA items from VQA, GQA, SPEC, VSR, and IconQA.
- 3.3M items from existing spatial QA benchmarks including CLEVR, Open3D-VQA, REL3D, SAT, GRiD-3D, MultiSpa, MindCube, ViCA, VLM-3R, VSI-590K.
- 4.5M newly generated QA pairs leveraging richly annotated 3D datasets such as MessyTable, ScanNet, ScanNet++, SUN RGB-D, CA-1M, Ego-Exo4D, and Matterport3D. These additions specifically target gaps in Perspective-Taking and Mental Reconstruction.
Each question is tagged according to its capability type to ensure balanced representation across spatial competencies.
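Because every sample carries a capability tag, the mixture can be audited programmatically. The sketch below is a minimal illustration, assuming a JSON-lines dump with a `capability` field per record; the field names and file name are hypothetical, not the released schema:

```python
import json
from collections import Counter

# Illustrative record layout; the released SenseNova-SI-8M schema may differ:
# {"image": "...", "question": "...", "answer": "...", "capability": "PT"}
CAPABILITIES = ("MM", "SR", "MR", "PT", "CR")

def capability_mixture(path):
    """Return the fraction of samples per capability tag in a JSON-lines dump."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["capability"]] += 1
    total = sum(counts.values())
    return {c: counts[c] / total for c in CAPABILITIES}

if __name__ == "__main__":
    print(capability_mixture("sensenova_si_8m.jsonl"))  # hypothetical file name
```

With the reported category sizes, such an audit would yield roughly 19% MM, 20% SR, 10% MR, 27% PT, and 24% CR.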
3. Training Regimen and Optimization
Training proceeds via one epoch of multi-task supervised fine-tuning for each model on the SenseNova-SI-8M corpus:
- Optimization: AdamW optimizer, batch size 2048, distributed across 128 GPUs.
- Frame Handling: Up to 16 video frames are sampled per example to capture spatiotemporal cues.
- Loss Function: a capability-weighted cross-entropy objective,
  $$\mathcal{L} = -\sum_{i} w_{c(i)} \, \log p_\theta(a_i \mid x_i, q_i),$$
  with $w_{c}$ proportional to the inverse frequency of each capability class $c$, counteracting dataset imbalance.
- Regularization: Includes AdamW weight decay, dropout within transformer layers, and early stopping after a single epoch to mitigate overfitting. Incorporation of 0.6M general QA examples serves as rehearsal, alleviating catastrophic forgetting of core 2D visual skills.
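As a concrete illustration of the inverse-frequency weighting, the following PyTorch sketch applies per-sample capability weights to a token-level cross-entropy loss. It is an assumption about how such a loss could be implemented (the tensor shapes and the mean-1 normalization are choices made here), not the authors' released training code:

```python
import torch
import torch.nn.functional as F

# Per-capability sample counts (in millions) reported for SenseNova-SI-8M.
CAP_COUNTS = {"MM": 1.5, "SR": 1.6, "MR": 0.8, "PT": 2.2, "CR": 1.9}
CAPS = list(CAP_COUNTS)

# Inverse-frequency weights, normalized to mean 1 so the overall loss scale is kept.
_inv = torch.tensor([1.0 / CAP_COUNTS[c] for c in CAPS])
CAP_WEIGHTS = _inv * len(CAPS) / _inv.sum()

def capability_weighted_loss(logits, targets, cap_ids, ignore_index=-100):
    """Capability-weighted cross-entropy.

    logits:  (batch, seq_len, vocab) answer-token logits
    targets: (batch, seq_len) target token ids; prompt tokens set to ignore_index
    cap_ids: (batch,) long tensor indexing CAPS with each sample's capability tag
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets,
        ignore_index=ignore_index, reduction="none",
    )                                               # (batch, seq_len)
    mask = (targets != ignore_index).float()
    per_sample = (per_token * mask).sum(1) / mask.sum(1).clamp(min=1.0)
    weights = CAP_WEIGHTS.to(per_sample.device)[cap_ids]
    return (weights * per_sample).mean()
```

The paper specifies only that the weights are proportional to inverse class frequency; any constant rescaling is equivalent up to the learning rate.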
4. Benchmark Performance
SenseNova-SI models establish new open-source records on all evaluated spatial intelligence benchmarks while retaining strong general multimodal evaluation scores. The strongest-performing variant, InternVL3-8B, achieves the following:
| Benchmark | SenseNova-SI (InternVL3-8B) | Base InternVL3-8B | Relative Gain |
|---|---|---|---|
| VSI-Bench | 68.7% | 42.1% | +63.2% |
| MMSI-Bench | 43.3% | 28.0% | +54.6% |
| MindCube-Tiny | 85.6% | 41.5% | +106.3% |
| ViewSpatial | 54.6% | 38.6% | +41.5% |
| SITE | 47.7% | 41.1% | +16.1% |
| MMBench-En | 84.9% | 81.7% | +3.9% |
These results demonstrate that the architecture-agnostic, data-driven approach yields significant absolute and relative improvements, particularly on complex spatial reasoning tasks.
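The relative gains in the table follow directly from the absolute scores; the snippet below reproduces them:

```python
# Relative gain = (fine-tuned - base) / base, in percent.
scores = {  # benchmark: (SenseNova-SI InternVL3-8B, base InternVL3-8B)
    "VSI-Bench":     (68.7, 42.1),
    "MMSI-Bench":    (43.3, 28.0),
    "MindCube-Tiny": (85.6, 41.5),
    "ViewSpatial":   (54.6, 38.6),
    "SITE":          (47.7, 41.1),
    "MMBench-En":    (84.9, 81.7),
}
for name, (tuned, base) in scores.items():
    rel = 100.0 * (tuned - base) / base
    print(f"{name}: +{tuned - base:.1f} pts absolute, +{rel:.1f}% relative")
```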
5. Scaling Effects and Emergent Generalization
- Scaling Law: Performance increases rapidly as training data is scaled from 1M to 8M spatial QA samples, approaching a logarithmic plateau at approximately 6M, i.e.,
  $$\text{Accuracy}(N) \approx a + b \log N,$$
  where $N$ is the number of training samples (see the curve-fitting sketch at the end of this section).
- Emergent Generalization: Fine-tuning solely on certain 3D view-transformation QA subsets (e.g., Ego-Exo4D) leads to unanticipated transfer gains (+10–15 points) on related but unseen tasks like Maze Pathfinding and MMSI camera-pose benchmarks. Training with MessyTable-generated correspondence QA similarly boosts sub-task results on MMSI.
- Frame-Count Extrapolation: InternVL3-8B, trained only up to 16 frames per instance, generalizes robustly to 32–64 frame test settings with negligible degradation, outperforming Cambrian-S-7B even with increased frame budgets.
This suggests that the scale and diversity of spatial data not only improve direct QA performance but also enable cross-domain transfer and frame-count robustness beyond the training distribution.
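The logarithmic trend can be made explicit with a least-squares fit of accuracy ≈ a + b·ln N. The checkpoint values below are hypothetical placeholders (only the 8M endpoint matches the reported VSI-Bench score), since per-checkpoint numbers are not reproduced here:

```python
import numpy as np

# Hypothetical (training samples, VSI-Bench accuracy %) checkpoints illustrating
# rapid gains from 1M to 8M with a plateau near 6M, as described in the paper.
N = np.array([1e6, 2e6, 4e6, 6e6, 8e6])
acc = np.array([55.0, 60.0, 65.0, 68.0, 68.7])

# np.polyfit returns [slope, intercept] for a degree-1 fit.
b, a = np.polyfit(np.log(N), acc, deg=1)
print(f"acc(N) ≈ {a:.1f} + {b:.2f} * ln(N)")
print(f"extrapolated at 16M samples: {a + b * np.log(16e6):.1f}")
```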
6. Robustness, Debiasing, and Reasoning Analysis
- Debiasing: On VSI-Debiased, SenseNova-SI drops by ~6 points, compared to ~8 for Cambrian-S, indicating a stronger reliance on visual evidence than on linguistic shortcuts.
- Circular Reordering: In MindCube under "hard circular" re-labeling (sketched after this list), InternVL3-8B decreases from 85.6% to 75.6% (–10 points), while a prior SFT baseline declines by ~30 points, demonstrating limited reliance on superficial text heuristics.
- Zero-Vision Baseline: Removing images from MindCube QA drops most models to near-random (~50%), but SenseNova-SI retains 52.5%, substantiating genuine visual grounding.
- Chain-of-Thought (CoT) Reasoning: Exploration of three CoT formats (GPT-5–annotated freeform, MindCube-style CogMap, SenseNova-SI CogMap) yields a modest +2–5 point improvement on VSI object-relation tasks, requiring 1–2k additional tokens per sample. This suggests current spatial CoT frameworks offer limited benefit and that entirely new modalities—or structured mechanisms beyond text-based reasoning—may be necessary for future progress.
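To make the "hard circular" protocol concrete, the following sketch scores a multiple-choice item as correct only if the model answers correctly under every rotation of the answer options. It is an assumption about the evaluation style (the `ask_model` callable is hypothetical), not MindCube's released harness:

```python
def circular_correct(ask_model, question, options, answer_text):
    """Count a question as correct only if the model picks the right option
    under every circular reordering of the choices."""
    n = len(options)
    for k in range(n):
        rotated = options[k:] + options[:k]           # rotate options by k
        labels = [chr(ord("A") + i) for i in range(n)]
        prompt = question + "\n" + "\n".join(
            f"{label}. {opt}" for label, opt in zip(labels, rotated)
        )
        gold_label = labels[rotated.index(answer_text)]
        if ask_model(prompt).strip() != gold_label:   # model returns e.g. "B"
            return False
    return True
```

Under such a protocol, a model that keys on option position or surface text patterns loses most of its accuracy, which is why the comparatively small 10-point drop is read as evidence against text-only shortcuts.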
7. Applications, Resources, and Prospective Directions
- Embodied Manipulation: On the spatial subset of EmbodiedBench, the best SenseNova-SI model improves success rates by approximately 60% over its architecture-matched baseline (from 10–20% to 16–33%) under both official and spatially-oriented prompts, even without downstream fine-tuning.
- Open Source Release: All four model weights and the full 8M-sample SenseNova-SI-8M dataset are made available on HuggingFace for community use and further benchmarking (a minimal loading sketch follows this list).
- Future Work: Notable avenues include integration of 3D expert encoders (e.g., VGGT), development of spatially structured latent modules, incorporation of graph-based or neural spatial simulators to support richer Chain-of-Thought reasoning in 3D, expansion to video-level spatial planning, broader multi-modal self-supervision in embodied settings, and more granular measurement tasks.
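If the released checkpoints follow the standard transformers layout of their bases, loading one would look roughly like the sketch below. The repository id is a placeholder, and the exact call (e.g. whether `trust_remote_code` is required) depends on the specific backbone:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder repository id; consult the official HuggingFace release for real names.
REPO_ID = "<org>/SenseNova-SI-InternVL3-8B"

tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    REPO_ID,
    torch_dtype=torch.bfloat16,   # half precision to fit an 8B model on one GPU
    trust_remote_code=True,
).eval()
```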
A plausible implication is that future spatial intelligence models may require fundamental shifts in architecture or reasoning paradigms—the present data-driven approach yields marked improvements but also reveals clear saturation points and the limits of current Chain-of-Thought strategies (Cai et al., 17 Nov 2025).