OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

Published 8 Apr 2026 in cs.CL | (2604.07296v2)

Abstract: Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial -- an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model exhibits a substantial average improvement of 19 percent, relatively. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.


Authors (14)

Summary

The paper introduces an open-source, 3D box-centric data engine; models trained on its data gain 14.1% on average in absolute terms, and the best-performing model improves by 19% on average in relative terms over baselines.
The paper details a robust methodology using 3D lifting and scene-graph synthesis to achieve consistent multi-view annotation and precise metric reasoning.
The paper releases both the engine and a 3-million-sample dataset, giving embodied AI, robotics, and vision-language research a reproducible source of spatial supervision.

OpenSpatial: An Open Data Engine for Principled Spatial Intelligence

Motivation and Context

Spatial intelligence underpins embodied decision-making, navigation, and robotics, yet prior datasets and data engines have been constrained by domain specificity, limited scale, and closed-source pipelines. As multi-modal LLMs (MLLMs) progress in visual reasoning, their deficits in spatial generalization—precise metric reasoning, multi-view consistency, and holistic scene understanding—persist due to these foundational data limitations. "OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence" (2604.07296) introduces a transparent, 3D box-centric data engine designed to scale spatial understanding with open, diverse, and high-fidelity supervision.

Figure 1: High-level schematic of the OpenSpatial data engine; models trained on OpenSpatial data demonstrate significantly increased spatial intelligence relative to prior baselines.

Methodology: The OpenSpatial Engine

Design Principles

OpenSpatial addresses systemic obstacles in spatial data production through:

3D box-centric grounding: Annotations and supervision are anchored in object-aligned 3D boxes, ensuring geometric fidelity, view invariance, and consistency in metric reasoning.
3D lifting for scalability: The engine includes an automated pipeline to infer high-quality 3D box priors from sparse or in-the-wild sources, bypassing the limits of fully-curated datasets.
Scene-graph-driven synthesis for diversity: Systematic enumeration over objects, attributes, and relations yields a diversified suite of QA tasks, spanning measurement, relationships, camera perception, multi-view, and scene-level reasoning.
Figure 2: Architecture and statistics of the OpenSpatial data engine, spanning data ingestion, processing/annotation, and a breakdown of task/distribution coverage.
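The scene-graph-driven synthesis principle can be illustrated with a minimal sketch. It assumes a scene graph reduced to named nodes with 3D box centers, and enumerates every object pair to emit one distance question per pair. The `Node` class, the `distance_questions` function, the question template, and the example coordinates are all hypothetical; the paper's actual templates and graph schema are not given in this summary.

```python
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass(frozen=True)
class Node:
    """One scene-graph node: an object label plus its 3D box center (meters)."""
    name: str
    center: tuple  # (x, y, z) in a shared scene frame (assumed)


def distance_questions(nodes):
    """Enumerate every unordered object pair and emit one distance QA per pair."""
    qa = []
    for a, b in combinations(nodes, 2):
        d = math.dist(a.center, b.center)  # Euclidean distance between centers
        qa.append({
            "task": "SM",  # Spatial Measurement
            "question": f"How far apart are the {a.name} and the {b.name}, in meters?",
            "answer": round(d, 2),
        })
    return qa


# Illustrative scene, not taken from the dataset.
scene = [
    Node("chair", (0.0, 0.0, 0.0)),
    Node("table", (3.0, 4.0, 0.0)),
    Node("lamp", (0.0, 0.0, 2.0)),
]
pairs = distance_questions(scene)
```

Because the questions are enumerated from the graph rather than written by hand, adding a new relation type or attribute amounts to adding another generator function over the same nodes.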

The Data Pipeline

The engine ingests multi-view images or video frames, constructs scene-level 3D oriented bounding boxes (OBBs), and projects them to frame-level object attributes with occlusion filtering and depth validation. Each detected object is parametrized as $(x, y, z, x_l, y_l, z_l, r, p, y)$: 3D position $(x, y, z)$, size $(x_l, y_l, z_l)$, and orientation as roll, pitch, and yaw $(r, p, y)$. Note that the final $y$ denotes yaw, not the position coordinate.
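As a rough illustration of the nine-parameter box, the sketch below expands it into eight corner points. It assumes angles in radians and the rotation order $R = R_z(\text{yaw})\,R_y(\text{pitch})\,R_x(\text{roll})$; the summary does not state the paper's angle convention, so both are assumptions, and the function names are hypothetical.

```python
import numpy as np


def rotation_from_rpy(roll, pitch, yaw):
    """R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians (assumed convention)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return rz @ ry @ rx


def obb_corners(box):
    """Return the 8 corners (8x3 array) of a box (x, y, z, x_l, y_l, z_l, roll, pitch, yaw)."""
    center = np.asarray(box[:3], dtype=float)
    size = np.asarray(box[3:6], dtype=float)
    # All sign combinations of the half-extents, in the box's local frame.
    signs = np.array([[i, j, k] for i in (-1, 1) for j in (-1, 1) for k in (-1, 1)])
    local = 0.5 * signs * size
    # Rotate into the scene frame, then translate to the box center.
    return local @ rotation_from_rpy(*box[6:9]).T + center


axis_aligned = obb_corners((1, 2, 3, 2, 4, 6, 0, 0, 0))
yawed = obb_corners((0, 0, 0, 2, 4, 6, 0, 0, np.pi / 2))
```

With zero rotation the corners span the axis-aligned extents directly; with a 90-degree yaw the x and y extents swap, which is a quick sanity check on the chosen convention.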

Downstream, consistent object-frame indices facilitate:

Single-view annotation: Scene graphs drive task generation anchored in explicit object referents, enabling object-object, object-environment, and geometric queries.
Multi-view annotation: Cross-view consistency is enforced by associating globally-referenced 3D OBBs across camera perspectives, supporting QA tasks on object identity, pose change, and spatial layout.
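One way to picture the consistent object-frame indices is as a mapping from frames to globally referenced object IDs. The sketch below, under that assumption, finds which objects are visible in both frames of each frame pair; such shared objects are the natural candidates for multi-view questions. The frame names, object IDs, and `covisible_pairs` function are hypothetical.

```python
from itertools import combinations


def covisible_pairs(frame_to_objects):
    """Map each unordered frame pair to the sorted object IDs visible in both frames."""
    shared = {}
    frames = sorted(frame_to_objects.items())
    for (fa, objs_a), (fb, objs_b) in combinations(frames, 2):
        common = sorted(set(objs_a) & set(objs_b))
        if common:  # keep only frame pairs that share at least one object
            shared[(fa, fb)] = common
    return shared


# Illustrative visibility index, not taken from the dataset.
visibility = {
    "frame_000": ["chair_1", "table_1"],
    "frame_010": ["table_1", "lamp_1"],
    "frame_020": ["lamp_1"],
}
shared_objects = covisible_pairs(visibility)
```

Because the IDs refer to scene-level boxes rather than per-frame detections, no appearance matching is needed at this step; the association is a set intersection.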

The OpenSpatial-3M Dataset

Built with the engine, OpenSpatial-3M is a curriculum-style dataset of 3 million samples covering five primary spatial intelligence categories, each with representative sub-tasks:

Figure 3: Overview and representative examples from OpenSpatial-3M across its five foundational task categories: SM, SR, CP, MC, SAR.

Spatial Measurement (SM): Metric estimation of object scale, localization, and inter-object distances.
Spatial Relationship (SR): Topological and directional relations between entities.
Camera Perception (CP): Sensor-centric reasoning for pose, orientation, and ego-motion.
Multi-view Consistency (MC): Associating objects and layouts across disparate viewpoints.
Scene-Aware Reasoning (SAR): Scene-level understanding, navigation, spatial aggregation, and multi-object planning.

The dataset is organized hierarchically to bridge the gap between egocentric observations and globally consistent 3D reasoning.
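To make the Camera Perception category concrete, the following sketch derives a ground-truth answer for a relative-pose question from two camera poses. It assumes each pose is a 4x4 camera-to-world matrix; the function names and the example poses are hypothetical and not drawn from the paper.

```python
import numpy as np


def relative_pose(cam_a_to_world, cam_b_to_world):
    """Pose of camera B expressed in camera A's frame, as a 4x4 matrix."""
    return np.linalg.inv(cam_a_to_world) @ cam_b_to_world


def rotation_angle_deg(transform):
    """Rotation angle of the 3x3 block, using trace(R) = 1 + 2*cos(theta)."""
    cos_theta = (np.trace(transform[:3, :3]) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from rounding error.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


# Camera A at the origin; camera B rotated 90 degrees about z and shifted 1 m along x.
cam_a = np.eye(4)
cam_b = np.array([
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
rel = relative_pose(cam_a, cam_b)
angle = rotation_angle_deg(rel)
baseline = float(np.linalg.norm(rel[:3, 3]))
```

The angle and baseline can then be turned into question answers such as how far the camera turned or moved between two views.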

Experimental Results

Models trained on OpenSpatial-3M show significant gains across multiple spatial reasoning benchmarks (BLINK, AllAngles, MMSI, etc.), and are also evaluated on general multimodal benchmarks (MMStar, MMMU). Notable results include:

An absolute average improvement of 14.1%, and an average relative improvement of 19% for the best-performing model, over strong baselines on spatial intelligence tasks.
Consistent gains across architectures (e.g., InternVL, Qwen2.5/3, VST) and strong results on recent spatial intelligence benchmarks (2604.07296).
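
The absolute and relative figures are not directly comparable, since they normalize differently. Writing $s_{\text{base}}$ for a baseline's average benchmark score and $s_{\text{model}}$ for the score after training on the dataset (both symbols are introduced here for illustration), the usual definitions are:

$$\Delta_{\text{abs}} = s_{\text{model}} - s_{\text{base}}, \qquad \Delta_{\text{rel}} = \frac{s_{\text{model}} - s_{\text{base}}}{s_{\text{base}}}$$

As a purely hypothetical example, not taken from the paper, a baseline of 50 points rising to 60 points is a 10-point absolute gain and a 20% relative gain.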

Component Analysis and Ablations

Data Generation Modules

Ablation studies confirm the necessity of each module in the engine:

3D box-centric representations outperform point-cloud-centric designs, particularly for metric tasks.
Depth/visibility filters are required to prevent spatial hallucinations and ensure valid supervisory signals.
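A depth-based visibility check of the kind the second ablation refers to might look like the sketch below. It assumes a pinhole camera with intrinsics `K`, a per-pixel depth map in meters, and points already expressed in the camera frame; the function name, the tolerance `tol`, and the test values are hypothetical, and the paper's actual filter may differ.

```python
import numpy as np


def visible_fraction(points_cam, depth_map, K, tol=0.05):
    """Fraction of 3D points (camera frame) whose depth agrees with the depth map."""
    pts = np.asarray(points_cam, dtype=float)
    if len(pts) == 0:
        return 0.0
    h, w = depth_map.shape
    visible = 0
    for x, y, z in pts:
        if z <= 0:
            continue  # behind the camera
        # Pinhole projection to pixel coordinates.
        u = int(round(K[0, 0] * x / z + K[0, 2]))
        v = int(round(K[1, 1] * y / z + K[1, 2]))
        if not (0 <= u < w and 0 <= v < h):
            continue  # projects outside the image
        if abs(depth_map[v, u] - z) <= tol:
            visible += 1  # measured depth matches, so the point is not occluded
    return visible / len(pts)


K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
depth = np.full((100, 100), 2.0)  # a flat surface 2 m from the camera
points = [(0, 0, 2), (0, 0, 3), (0, 0, -1), (10, 0, 2)]
fraction = visible_fraction(points, depth, K)
```

An object could then be dropped from a frame's annotations when this fraction falls below a chosen threshold, so that no question is asked about something the image does not actually show.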

Data and Model Scaling

Scaling experiments highlight:

Data volume and diversity are positively correlated with 3D-Avg spatial performance, though with decreasing marginal returns at scale.
Larger models (3B to 32B parameters) make consistent use of the expanded dataset, with performance improving monotonically with model size on nearly every spatial benchmark.

3D Lifting and Data Source Expansion

Figure 4: Qualitative 3D lifting results – successful annotation of in-the-wild outdoor web data with accurate semantic tags, point cloud recovery, and 3D OBBs.

The novel 3D lifting pipeline allows OpenSpatial to extract densely annotated samples from uncurated web videos, substantially enhancing coverage and scene diversity, particularly for challenging outdoor environments.

Task Diversity

Figure 5: Heatmap analysis of individual and cumulative task contributions reveals that each category (SM, SR, CP, MC, SAR) has a unique and complementary impact on spatial reasoning benchmarks; compositional task integration yields synergistic gains.

The modularity of task synthesis ensures strong complementarity and mitigates "spatial myopia", enabling sustained increases in model robustness as task set complexity grows.

Efficiency

Efficiency upgrades, including parallel execution, message-based pipelining, and feature reuse, have made it computationally feasible to synthesize and process millions of QA pairs at scale.
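Two of these ideas, parallel execution and feature reuse, can be sketched with the standard library alone. The example computes each scene's features once in a first parallel stage and then reuses them across every task for that scene. Message-based pipelining is not shown, and the function names, the stand-in feature, and the job list are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

computed = []  # records which scenes had features computed (list.append is thread-safe)


def scene_features(scene_id):
    """Stand-in for an expensive per-scene computation such as 3D box extraction."""
    computed.append(scene_id)
    return len(scene_id)


def make_qa(job, features):
    """Stand-in for QA synthesis that consumes precomputed scene features."""
    scene_id, task = job
    return f"{task}:{scene_id}:{features}"


def run(jobs, workers=4):
    scene_ids = sorted({scene_id for scene_id, _ in jobs})
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stage 1: compute each scene's features exactly once, in parallel.
        features = dict(zip(scene_ids, pool.map(scene_features, scene_ids)))
        # Stage 2: fan out over (scene, task) jobs, reusing the stored features.
        return list(pool.map(lambda job: make_qa(job, features[job[0]]), jobs))


jobs = [(s, t) for s in ("scene_a", "scene_bb") for t in ("SM", "SR", "CP")]
results = run(jobs)
```

Six QA jobs trigger only two feature computations here; the saving grows with the number of tasks generated per scene.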

Figure 6: Detailed efficiency breakdown across key stages of the data engine, reflecting system-level throughput improvements.

Implications and Future Directions

OpenSpatial changes how spatial intelligence supervision can be produced and shared, with implications in three areas:

For model development: OpenSpatial-3M provides a reproducible, well-structured training resource for architectural and training advances, supporting data-driven scaling, controlled ablations, and reliable measurement of spatial generalization.
For embodied AI and robotics: The 3D-centric, richly-annotated data foundation is highly compatible with downstream embodied reasoning, agent navigation, and manipulation research that demands reliable spatial perception.
For benchmarking and analysis: Open-source release of both the engine and the dataset lowers the barrier for the community to synthesize new spatial tasks, extend coverage to novel environments, and advance the theoretical understanding of spatial cognition in vision-language systems.

Future advances will likely involve: (1) further unbiased coverage expansion (e.g., complex outdoor, industrial, or simulated scenes), (2) integration with closed-loop embodied environments or physics engines, and (3) meta-analyses of what inductive biases and model capacities are required for holistic spatial generalization.

Conclusion

OpenSpatial provides the first open-source, scalable, and modular infrastructure for principled spatial intelligence supervision. Its 3D box-grounded paradigm, extensible annotation pipeline, and comprehensive data curation yield substantial gains across spatial reasoning benchmarks and model architectures. By opening access to high-quality spatial data, OpenSpatial is positioned to accelerate progress in vision-language modeling, embodied AI, and autonomous robotics research (2604.07296).
