- The paper introduces SpatialVID as a large-scale video dataset with dense spatial annotations, enabling unified spatiotemporal modeling.
- It details a three-stage curation pipeline that filters, annotates, and samples millions of video clips to ensure high geometric and semantic accuracy.
- The dataset supports advanced tasks like 3D reconstruction, controllable video generation, and embodied navigation through explicit camera data and dynamic motion insights.
SpatialVID: A Large-Scale Video Dataset with Explicit Spatial Annotations
Motivation and Context
SpatialVID addresses a critical bottleneck in spatial intelligence and world modeling: the lack of large-scale, real-world video datasets with dense geometric and semantic annotations. Existing datasets either provide scale and semantic diversity without explicit 3D information, or offer geometric fidelity but are limited in diversity, scale, and dynamic content. This dichotomy impedes the development of unified models for spatiotemporal reasoning, 3D reconstruction, and controllable video generation. SpatialVID is designed to bridge this gap by providing millions of dynamic, in-the-wild video clips with per-frame camera poses, depth maps, structured captions, and serialized motion instructions.
Dataset Construction and Curation Pipeline
SpatialVID comprises 2.7 million clips (7,089 hours) curated from 21,000+ hours of raw YouTube videos. The curation pipeline consists of three stages: filtering, annotation, and sampling.
- Filtering: Raw videos are manually screened for diverse camera motion and scene content. Clips are segmented into 3–15 s shots and filtered with multi-dimensional criteria: aesthetic quality (a CLIP+MLP predictor), luminance, OCR-detected text interference, and motion intensity (VMAF motion score). Only clips with sufficient parallax, minimal dynamic foreground occlusion, and proper exposure are retained; a minimal filtering sketch appears after this list.
Figure 1: Aesthetics Filtering. Clips are scored for visual appeal; only those above threshold are retained for annotation.
- Annotation: Geometric annotation is performed with an enhanced MegaSaM pipeline that integrates monocular depth priors (UniDepth v2, Depth Anything v2), motion probability maps, and uncertainty-aware bundle adjustment. Dynamic object segmentation is refined with SAM2, and dynamic ratio metrics are computed. Motion instructions are derived from temporally smoothed camera pose sequences and mapped to a controlled vocabulary of cinematographic terms. Structured captions are generated via a two-stage VLM + LLM pipeline (Gemini-2.0-flash, Qwen3-30B-A3B) that integrates scene semantics, camera motion, and hierarchical attributes (weather, lighting, time of day).
- Sampling: A high-quality subset (SpatialVID-HQ, 1,146 hours) is curated by raising quality thresholds and balancing semantic tags and trajectory statistics, ensuring comprehensive coverage for downstream tasks.
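To make the multi-dimensional filtering concrete, here is a minimal sketch, assuming per-clip scores have already been computed by the respective predictors. The field names and threshold values (`aesthetic_min`, `luminance_range`, `text_ratio_max`, `motion_range`) are illustrative placeholders rather than the paper's actual cutoffs.

```python
from dataclasses import dataclass

@dataclass
class ClipScores:
    """Per-clip quality metrics; field names are illustrative, not the paper's schema."""
    aesthetic: float    # CLIP+MLP aesthetic predictor output
    luminance: float    # mean frame luminance in [0, 255]
    text_ratio: float   # fraction of frame area covered by OCR-detected text
    motion: float       # motion-intensity score (e.g., the VMAF motion feature)

# Hypothetical thresholds -- the paper's actual cutoffs may differ.
THRESHOLDS = {
    "aesthetic_min": 4.5,
    "luminance_range": (20.0, 220.0),   # reject under- and over-exposed clips
    "text_ratio_max": 0.05,             # reject clips dominated by on-screen text
    "motion_range": (2.0, 14.0),        # enough parallax, but not shaky footage
}

def keep_clip(s: ClipScores, t: dict = THRESHOLDS) -> bool:
    """Return True if a clip passes every filtering criterion."""
    lo_lum, hi_lum = t["luminance_range"]
    lo_mot, hi_mot = t["motion_range"]
    return (
        s.aesthetic >= t["aesthetic_min"]
        and lo_lum <= s.luminance <= hi_lum
        and s.text_ratio <= t["text_ratio_max"]
        and lo_mot <= s.motion <= hi_mot
    )

# A well-exposed clip with moderate motion and little on-screen text passes.
print(keep_clip(ClipScores(aesthetic=5.1, luminance=110.0, text_ratio=0.01, motion=6.3)))  # True
```

Because filtering precedes annotation in the pipeline, cheap score checks like these gate the far more expensive geometric annotation stage.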
Annotation Modalities and Statistical Properties
SpatialVID provides dense multimodal annotations:
- Camera Poses and Depth Maps: Per-frame extrinsics and depth maps, enabling explicit 3D grounding (a back-projection sketch appears at the end of this section).
- Dynamic Masks: Robust segmentation of moving objects for dynamic scene understanding.
- Motion Instructions: Serialized, interpretable motion commands for controllable video generation and navigation (a derivation sketch follows this list).
- Structured Captions: Multi-level textual descriptions, including scene summaries, shot-level narratives, and hierarchical semantic tags.
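To illustrate how serialized motion instructions might be derived from per-frame camera poses, the sketch below decomposes the relative pose between consecutive (temporally smoothed) frames into translation and yaw/pitch components and maps them to a small cinematographic vocabulary. The axis convention (x right, y down, z forward), the world-from-camera pose format, the thresholds `t_eps` and `r_eps`, and the vocabulary itself are assumptions for illustration and may not match the dataset's released instruction set.

```python
import numpy as np

def relative_motion(T_prev: np.ndarray, T_curr: np.ndarray):
    """Relative camera motion between two 4x4 world-from-camera poses (assumed format)."""
    rel = np.linalg.inv(T_prev) @ T_curr               # current pose expressed in the previous camera frame
    t = rel[:3, 3]                                     # translation: x right, y down, z forward (assumed)
    R = rel[:3, :3]
    yaw = np.degrees(np.arctan2(R[0, 2], R[2, 2]))     # rotation about the vertical axis
    pitch = np.degrees(np.arcsin(np.clip(-R[1, 2], -1.0, 1.0)))
    return t, yaw, pitch

def to_instruction(t, yaw, pitch, t_eps=0.02, r_eps=1.0):
    """Map one relative motion to a (possibly composite) list of cinematographic terms."""
    words = []
    if t[2] > t_eps:   words.append("dolly forward")
    if t[2] < -t_eps:  words.append("dolly backward")
    if t[0] > t_eps:   words.append("truck right")
    if t[0] < -t_eps:  words.append("truck left")
    if yaw > r_eps:    words.append("pan right")
    if yaw < -r_eps:   words.append("pan left")
    if pitch > r_eps:  words.append("tilt up")
    if pitch < -r_eps: words.append("tilt down")
    return words or ["static"]

# Example: the camera advances 0.1 m along its optical axis while rotating 3 degrees.
theta = np.radians(3.0)
T1 = np.eye(4)
T1[:3, :3] = [[np.cos(theta), 0, np.sin(theta)], [0, 1, 0], [-np.sin(theta), 0, np.cos(theta)]]
T1[2, 3] = 0.1
print(to_instruction(*relative_motion(np.eye(4), T1)))  # ['dolly forward', 'pan right']
```

Per-frame labels produced this way can then be run-length merged into the serialized instruction sequences described above.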
Statistical analysis reveals a broad and balanced distribution of motion directions, scene types, and semantic attributes. Caption length distributions show significant enrichment after spatial enhancement, supporting both high-level and fine-grained reasoning.

Figure 2: Motion caption length distribution. Enhanced captions are longer and more informative, supporting richer supervision.
Figure 3: Scene tags distribution. Sunburst chart shows hierarchical coverage of weather, time, crowd density, lighting, and scene type.
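As a concrete picture of the explicit 3D grounding that per-frame extrinsics and depth maps provide, the following sketch back-projects a depth map into a world-space point cloud. The intrinsics matrix, the world-from-camera pose convention, and the array shapes are assumptions; adapt them to the dataset's released formats.

```python
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """Lift an HxW metric depth map to Nx3 world-space points.

    depth : HxW depth in meters
    K     : 3x3 camera intrinsics
    T_wc  : 4x4 world-from-camera pose (assumed convention)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T            # camera-frame rays at unit depth
    pts_cam = rays * depth.reshape(-1, 1)      # scale rays by per-pixel depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ T_wc.T)[:, :3]             # transform into world coordinates

# Example with synthetic data: a flat plane 2 m in front of an identity-pose camera.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = backproject(np.full((480, 640), 2.0), K, np.eye(4))
print(pts.shape)  # (307200, 3)
```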
Comparative Analysis and Quality Assessment
SpatialVID is compared against Panda-70M and other spatial datasets. Key findings:
- Quality: SpatialVID-HQ exhibits superior consistency and higher average scores in aesthetics, luminance, and motion metrics.
- Motion Diversity: Unlike Panda-70M, which is dominated by static videos, SpatialVID features balanced and realistic camera movements, including complex trajectories and turns.
- Reconstruction Feasibility: Over 80% of Panda-70M videos are unsuitable for geometric annotation due to insufficient motion, whereas SpatialVID is explicitly curated for reconstructability (a rough proxy check is sketched below).
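As a rough illustration of why near-static footage is hard to annotate geometrically, the sketch below scores a clip by its average dense optical-flow magnitude and flags clips that barely move. This is a simplified proxy for the idea, not the paper's actual reconstructability criterion; the `min_flow` threshold is an assumption.

```python
import cv2
import numpy as np

def mean_flow_magnitude(gray_frames) -> float:
    """Average dense optical-flow magnitude (pixels per frame pair) over grayscale frames."""
    mags = []
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        # Farneback dense flow; positional args: pyr_scale, levels, winsize,
        # iterations, poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(float(np.linalg.norm(flow, axis=-1).mean()))
    return float(np.mean(mags)) if mags else 0.0

def likely_reconstructable(gray_frames, min_flow: float = 1.0) -> bool:
    """Heuristic gate: near-static clips offer too little apparent motion for SfM.

    Note: this is a crude proxy -- rotation-only motion also produces flow
    without contributing parallax, so a real pipeline needs pose-aware checks.
    """
    return mean_flow_magnitude(gray_frames) >= min_flow  # threshold is an illustrative guess
```

SpatialVID sidesteps this problem at the source by filtering for sufficient parallax during curation.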
Figure 4: Aesthetics Distribution. SpatialVID-HQ demonstrates a compact, high-quality distribution compared to baseline datasets.
Implications for 3D Vision and World Modeling
SpatialVID establishes a new standard for large-scale, multimodal video datasets with explicit spatial annotations. Its contributions enable:
- Scalable Training of 3D-Aware Models: Facilitates data-driven approaches for 3D reconstruction, novel view synthesis, and dynamic scene understanding.
- Controllable Video Generation: Supports training of models with explicit camera and motion control, advancing physically grounded world simulation.
- Embodied Agents and Navigation: Provides dense supervision for navigation, exploration, and interaction in real-world scenarios.
- Unified Spatial Intelligence: Bridges the gap between semantic-rich video corpora and geometry-centric datasets, fostering the development of unified spatiotemporal models.
Limitations and Future Directions
While MegaSaM provides robust geometric annotation, limitations remain in handling highly dynamic scenes, variable focal lengths, and radial distortion. Future integration of more advanced camera estimators (e.g., ViPE) and depth models could further improve annotation quality. The dataset's scale and diversity open avenues for research in physically grounded video generation, embodied AI, and interactive world modeling.
Conclusion
SpatialVID delivers a large-scale, richly annotated video dataset that unifies semantic and geometric information for dynamic, real-world scenes. Its three-stage curation pipeline, dense multimodal annotations, and balanced sampling strategy set a new benchmark for spatial intelligence research. SpatialVID is poised to catalyze advances in 3D reconstruction, controllable video generation, and embodied agents, laying the foundation for future work in physically grounded, interactive world simulation.