SenseNova-SI-8M: A Large-Scale Multimodal QA Dataset

Updated 19 November 2025
  • SenseNova-SI-8M is a large-scale multimodal dataset comprising approximately 8 million curated image-question-answer pairs designed to enhance spatial intelligence in AI models.
  • It organizes data using a rigorous five-capability taxonomy—metric measurement, spatial relations, mental reconstruction, perspective-taking, and comprehensive reasoning—ensuring balanced task representation.
  • The dataset integrates diverse sources from existing 2D/3D VQA datasets and algorithmically generated QA pairs, providing robust benchmarking and fine-tuning resources for multimodal architectures.

SenseNova-SI-8M is a large-scale multimodal dataset comprising approximately eight million image-question-answer (QA) pairs, specifically curated to advance the spatial intelligence capabilities of foundation models. The corpus is structured according to a rigorous taxonomy encompassing five fundamental spatial intelligence capabilities, supporting comprehensive benchmarking and fine-tuning of multimodal architectures. SenseNova-SI-8M integrates diverse data sources, balancing existing community datasets with algorithmically generated samples derived from densely annotated 3D environments, thereby addressing deficiencies in task representation and facilitating robust model development (Cai et al., 17 Nov 2025).

1. Corpus Structure and Composition

SenseNova-SI-8M consists solely of QA triples, in which each item includes an image (or short video), a question, and its corresponding answer. The compilation and stratification are as follows:

  • General 2D VQA pairs (~0.6 M): Sourced from existing 2D visual QA datasets, including VSR, SPEC, GQA, VQA, and IconQA.
  • Community-sourced spatial QA pairs (~3.3 M): Aggregated from spatially focused datasets—Open3D-VQA, CLEVR-series, REL3D, SAT, GRID-3D, MultiSpa, MindCube, ViCA, VLM-3R, and VSI-590K.
  • Newly generated 3D/multi-view QA pairs (~4.5 M): Automatically synthesized using richly annotated 3D datasets—MessyTable, ScanNet, ScanNet++, SUN RGB-D, CA-1M, Ego-Exo4D, and Matterport3D—through projection of ground-truth meshes and camera poses into image space, and template-based instantiation of spatial relations.

The reported total is approximately 8 million, although a direct sum yields approximately 8.4 million; the discrepancy is unaddressed. No train/validation/test splits or per-category sample counts are detailed.

| Source Category | Examples (millions) | Origin/Datasets |
|---|---|---|
| General 2D VQA | ~0.6 | VSR, SPEC, GQA, VQA, IconQA |
| Community Spatial QA | ~3.3 | Open3D-VQA, CLEVR-series, MindCube, etc. |
| Newly Generated 3D/Multi-view | ~4.5 | ScanNet, Matterport3D, SUN RGB-D, etc. |
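
For concreteness, the following is a minimal sketch of what a single QA record could look like when loaded in Python. The field names (`image`, `question`, `answer`, `capability`, `source`) are illustrative assumptions; the report does not document a record schema.

```python
# Illustrative only: field names and file layout are assumptions,
# not a documented SenseNova-SI-8M schema.
from dataclasses import dataclass

@dataclass
class SpatialQARecord:
    image: str        # path or URL to an image (or short video clip)
    question: str     # natural-language spatial question
    answer: str       # ground-truth answer string
    capability: str   # one of "MM", "SR", "MR", "PT", "CR" (EASI taxonomy)
    source: str       # originating dataset, e.g. "ScanNet", "GQA"

example = SpatialQARecord(
    image="scannet/scene0001_00/frame_000123.jpg",
    question="Which object is closer to the camera, the chair or the table?",
    answer="the chair",
    capability="SR",
    source="ScanNet",
)
```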

2. Taxonomy of Spatial Capabilities

The dataset is explicitly organized according to a five-capability taxonomy, referred to as the EASI scheme (Cai et al., 17 Nov 2025). Each capability subsumes a family of fine-grained tasks:

  1. Metric Measurement (MM): Estimation of real-world scales (distances, sizes, inter-object measures).
  2. Spatial Relations (SR): Inferring positional relationships within a 3D coordinate frame at both egocentric (left–right, up–down, front–back) and allocentric (near–far, large–small) levels.
  3. Mental Reconstruction (MR): Inferring hidden or occluded 3D structure from 2D views, including prediction of visible surfaces and occluded geometry.
  4. Perspective-taking (PT): Reasoning across viewpoint shifts; includes view correspondence (matching across occlusions), camera motion reasoning (relative pose inference), and allocentric transformation (restating relationships across coordinate systems).
  5. Comprehensive Reasoning (CR): Multi-step reasoning that combines multiple spatial skills (MM, SR, MR, PT). CR remains minimally represented and draws primarily on existing datasets.

The relative sample distribution across categories is shown as bar graphs (Fig. 2), but exact counts are not reported.
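
As an illustrative aid (not an artifact of the release), the taxonomy can be encoded as a simple enumeration, which is convenient when stratifying or filtering samples; the filter below assumes records shaped like the hypothetical sketch in Section 1.

```python
from enum import Enum

class EASICapability(Enum):
    """Five-capability EASI taxonomy used to organize SenseNova-SI-8M."""
    METRIC_MEASUREMENT = "MM"        # real-world distances, sizes, inter-object measures
    SPATIAL_RELATIONS = "SR"         # egocentric / allocentric positional relations
    MENTAL_RECONSTRUCTION = "MR"     # inferring occluded 3D structure from 2D views
    PERSPECTIVE_TAKING = "PT"        # view correspondence, camera motion, allocentric transforms
    COMPREHENSIVE_REASONING = "CR"   # multi-step combinations of MM, SR, MR, PT

# Example: filter a list of records (see the record sketch in Section 1) by capability.
def filter_by_capability(records, capability: EASICapability):
    return [r for r in records if r.capability == capability.value]
```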

3. Data Sources, Modalities, and Alignment

Images and video frames are drawn from public 2D VQA resources and 3D/dense-reconstruction datasets such as ScanNet and Matterport3D. Each sample consists of either an image or a video plus one question-answer pair; no additional modalities (e.g., depth maps, point clouds) are distributed. 3D and multi-view annotations serve solely to algorithmically generate QA pairs by projection or template instantiation; the underlying depth and scene annotations remain internal and are not released.

Alignment between questions, answers, and visual content relies on algorithmic synthesis:

  • For newly generated samples (~4.5 M), QAs are derived using ground-truth meshes and structured scene geometry.
  • No manual annotation process or post-generation human validation is documented.

4. Annotation and Curation Methodology

The annotation process for QA generation within the 3D-derived subset is entirely algorithmic. Operations reference ground-truth object meshes, camera parameters, and spatial volumes for scene understanding and question synthesis. Manual annotators, agreement statistics, and explicit validation protocols are not mentioned. The inclusion pipeline admits all public QA pairs from listed datasets (~3.9 M) and complements underrepresented tasks (MR, PT) with synthetic QAs generated to address observed imbalances.

No formal diversity metrics, exclusion criteria, or object-class histograms are provided. The approach to task balancing identifies metric measurement (MM) and spatial relations (SR) as dominant and augments minimal categories (MR, PT) via synthetic sample generation for their respective sub-tasks.
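
To make the algorithmic generation concrete, the sketch below projects two ground-truth object centroids into image space with a pinhole camera model and instantiates a left/right relation template. All function names and the template wording are hypothetical; the actual generation pipeline is not released.

```python
import numpy as np

def project_to_image(point_world, R, t, K):
    """Project a 3D world point into pixel coordinates (pinhole model).
    R, t: world-to-camera rotation (3x3) and translation (3,); K: intrinsics (3x3)."""
    p_cam = R @ point_world + t
    if p_cam[2] <= 0:          # behind the camera: cannot appear in the image
        return None
    p_img = K @ (p_cam / p_cam[2])
    return p_img[:2]           # (u, v) pixel coordinates

def left_right_qa(name_a, centroid_a, name_b, centroid_b, R, t, K):
    """Instantiate a hypothetical 'left of' template from projected centroids."""
    ua = project_to_image(np.asarray(centroid_a, float), R, t, K)
    ub = project_to_image(np.asarray(centroid_b, float), R, t, K)
    if ua is None or ub is None:
        return None            # skip pairs that are not both visible
    question = f"Is the {name_a} to the left of the {name_b}?"
    answer = "yes" if ua[0] < ub[0] else "no"
    return {"question": question, "answer": answer}

# Toy usage with an identity camera pose and a simple intrinsic matrix.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(left_right_qa("chair", [-0.5, 0.0, 2.0], "table", [0.4, 0.0, 2.5], R, t, K))
```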

5. Quantitative Analysis and Statistical Considerations

The technical report does not supply tabulated distributional statistics, sample counts per capability, or derived dataset-level metrics. However, standard analyses in analogous contexts employ metrics such as:

  • Category Proportion: $p_c = N_c / N_{\mathrm{total}}$ for $c \in \{\mathrm{MM}, \mathrm{SR}, \mathrm{MR}, \mathrm{PT}, \mathrm{CR}\}$
  • Shannon Entropy: $H = -\sum_{c} p_c \log p_c$ to assess diversity
  • Class Imbalance Ratio: $I = \max_c N_c / \min_c N_c$

These formulae are not present in the paper, but are canonical for evaluating distributional balance; a plausible implication is that users should calculate these themselves when conducting secondary analyses.
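
A minimal sketch of such a secondary analysis, using hypothetical per-capability counts (the report does not publish them); the functions follow the canonical definitions above.

```python
import math

def balance_metrics(counts):
    """Compute category proportions, Shannon entropy (nats), and class imbalance ratio."""
    total = sum(counts.values())
    proportions = {c: n / total for c, n in counts.items()}
    entropy = -sum(p * math.log(p) for p in proportions.values() if p > 0)
    imbalance = max(counts.values()) / min(counts.values())
    return proportions, entropy, imbalance

# Hypothetical per-capability counts (in millions), purely for illustration.
counts = {"MM": 2.5, "SR": 3.0, "MR": 1.0, "PT": 1.2, "CR": 0.3}
props, H, I = balance_metrics(counts)
print(props)                         # category proportions p_c
print(f"entropy H = {H:.3f} nats")   # maximum is log(5) ~ 1.609 for a uniform split
print(f"imbalance I = {I:.1f}")      # 3.0 / 0.3 = 10.0
```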

6. Comparison with Related Corpora

Relative to existing spatial intelligence corpora, SenseNova-SI-8M expands both scale and scope:

  • Size: 8 M (SenseNova-SI-8M) vs. 590 K (Cambrian-S), 4.1 M (VST), 26 K (SpatialLadder), and 2 B template QAs (SpatialVLM).
  • Coverage: Encompasses all five EASI taxonomy capabilities, notably including Perspective-taking (PT), which prior datasets underrepresent.
  • Taxonomical Organization: Represents the first large corpus structured explicitly around the EASI five-capability scheme.

Most prior datasets prioritize SR and MM, whereas SenseNova-SI-8M extends representation to MR and PT via targeted synthesis.

7. Downstream Use, Benchmarking, and Accessibility

SenseNova-SI-8M serves exclusively as a training and fine-tuning resource for multimodal foundation models (e.g., Qwen3-VL, InternVL3, Bagel). Models derived from training on SenseNova-SI-8M (SenseNova-SI_*) are evaluated on five held-out spatial intelligence benchmarks:

  • VSI-Bench: Video spatial reasoning over sampled frames.
  • MMSI-Bench: Multi-image spatial information integration.
  • MindCube: Limited-view mental reconstruction (MindCube-Tiny test split).
  • ViewSpatial-Bench: Multi-perspective localization tasks.
  • SITE: Over 30 abstract spatial reasoning tasks.

Evaluation protocols adhere to official benchmark prompt formats and scoring methods (see Table 1 and Sec. 4.1 in the cited report). The dataset itself is not partitioned for independent testing.
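
The official scoring methods vary by benchmark; purely as a generic illustration (not the benchmarks' actual scorers), multiple-choice accuracy reduces to normalized exact match over the predicted option, as in the sketch below.

```python
def normalize(ans: str) -> str:
    """Lowercase and strip punctuation/whitespace so 'B.' and ' b' compare equal."""
    return ans.strip().strip(".()").lower()

def multiple_choice_accuracy(predictions, ground_truths):
    """Fraction of items whose normalized prediction matches the reference option."""
    assert len(predictions) == len(ground_truths)
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Toy example: 2 of 3 answers match after normalization.
print(multiple_choice_accuracy(["A", "b.", "C"], ["A", "B", "D"]))  # 0.666...
```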

SenseNova-SI-8M, associated codebases, model checkpoints, and inference recipes (including prompts) are distributed via GitHub and HuggingFace.

Dataset license is not explicitly stipulated; a plausible implication is default adherence to SenseTime Research open-source licensing. Download instructions are specified in the GitHub README, facilitating script-based retrieval of QA JSON files and image indices.


SenseNova-SI-8M marks a significant advance in spatial intelligence training resources, systematically integrating diverse QA pairs mapped to a five-capability taxonomy, algorithmically generating underrepresented spatial tasks, and providing comprehensive downstream evaluation infrastructure, as detailed in (Cai et al., 17 Nov 2025).
