
RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements (2504.08212v1)

Published 11 Apr 2025 in cs.CV

Abstract: Recent advances in camera-controllable video generation have been constrained by the reliance on static-scene datasets with relative-scale camera annotations, such as RealEstate10K. While these datasets enable basic viewpoint control, they fail to capture dynamic scene interactions and lack metric-scale geometric consistency, which is critical for synthesizing realistic object motions and precise camera trajectories in complex environments. To bridge this gap, we introduce the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations at https://github.com/ZGCTroy/RealCam-Vid.

Summary

RealCam-Vid: Advancing Video Dataset Development with Dynamic Scenes and Metric-scale Camera Movements

The paper "RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements" presents a significant advancement in the development of video datasets designed for applications in camera-controllable video generation. Authored by Guangcong Zheng, Teng Li, Xianpan Zhou, and Xi Li from Zhejiang University, the paper introduces RealCam-Vid, a dataset designed to overcome prevalent limitations in existing video datasets. The innovative dataset is fully open-source and available at the specified GitHub repository.

Limitations of Existing Datasets

Existing video datasets such as RealEstate10K are constrained by their static scene content and their relative-scale (rather than metric-scale) camera annotations. These constraints limit their utility for generating realistic videos in which accurate camera motion and dynamic scene interactions are essential. Datasets that do cover dynamic scenes, meanwhile, often lack the metric-scale geometric consistency needed to synthesize authentic object motions and camera trajectories in complex environments.

Enhancements in Dataset and Data-Processing Pipeline

The RealCam-Vid dataset fills this gap with high-resolution videos that carry both dynamic scene content and metric-scale camera parameters. The authors design a data-processing pipeline that combines diverse scene dynamics with precise camera trajectories; collection and refinement proceed in five stages:

  1. Video Clip Splitting: Using the split operator from Koala-36M, the method detects scene boundaries by analyzing temporal coherence in feature embeddings, outperforming traditional scene-cut detection tools such as PySceneDetect. This preserves temporal continuity and avoids spurious cuts that would disrupt optical-flow consistency (a minimal sketch of the embedding-similarity idea follows this list).
  2. Motion Intensity Filtering: To discard near-static sequences, this step uses CoTracker to analyze keypoint trajectories and keeps only clips with sufficient dynamic motion. The authors validate this filter empirically; it guards against the model degeneration associated with static camera motions (see the trajectory-statistic sketch after this list).
  3. Caption Granularity: To capture the nuanced correlation between video content and text, the authors use CogVLM2-Caption to produce fine-grained captions with control over temporal context, improving caption quality over coarser conventional captioning methods.
  4. Dynamic Scene Camera Annotation: The MonST3R algorithm provides reliable pose estimation in dynamic scenes by distinguishing static from moving content, which removes the need for per-object motion modeling and reduces annotation effort.
  5. Metric Scale Alignment: Depth and disparity maps are combined to align videos from heterogeneous sources to a consistent metric scale, with the paper presenting optimization techniques to enforce that consistency (a least-squares scale/shift sketch appears after this list).
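
To make step 1 concrete, the sketch below shows the embedding-similarity idea behind scene splitting. It is a minimal illustration, not Koala-36M's actual split operator: `frame_embeddings` is a hypothetical input (per-frame features from any pretrained visual encoder), and the 0.85 threshold is a placeholder.

```python
import numpy as np

def split_scenes(frame_embeddings: np.ndarray, sim_threshold: float = 0.85):
    """Split a video into clips wherever the cosine similarity between
    consecutive frame embeddings drops below a threshold.

    frame_embeddings: (T, D) per-frame feature vectors from any pretrained
    visual encoder. Returns (start, end) frame-index pairs, end exclusive.
    """
    # Normalize rows so the dot product of consecutive rows is cosine similarity.
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    sims = np.sum(emb[:-1] * emb[1:], axis=1)  # (T-1,) consecutive similarities

    clips, start = [], 0
    for t, s in enumerate(sims):
        if s < sim_threshold:       # low temporal coherence -> scene boundary
            clips.append((start, t + 1))
            start = t + 1
    clips.append((start, len(frame_embeddings)))
    return clips
```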
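
Step 2 reduces, in its simplest form, to a statistic over point trajectories. The sketch below assumes CoTracker-style (frames, points, 2) track arrays; the displacement statistic and the threshold are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def motion_intensity(tracks: np.ndarray, visibility: np.ndarray) -> float:
    """Mean per-step displacement (in pixels) of visible keypoints.

    tracks: (T, N, 2) pixel trajectories of N points over T frames, in the
    layout a point tracker such as CoTracker produces.
    visibility: (T, N) boolean mask of which points are visible per frame.
    """
    disp = np.linalg.norm(tracks[1:] - tracks[:-1], axis=-1)  # (T-1, N)
    valid = visibility[1:] & visibility[:-1]                  # visible in both frames
    return float(disp[valid].mean()) if valid.any() else 0.0

def keep_clip(tracks: np.ndarray, visibility: np.ndarray,
              min_motion: float = 2.0) -> bool:
    # The 2-pixel-per-frame threshold is illustrative, not the paper's value.
    return motion_intensity(tracks, visibility) >= min_motion
```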
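
For step 5, a common recipe for aligning relative depth to metric scale is a closed-form least-squares fit of a scale and shift, sketched below under the assumption that a metric depth reference is available for the masked pixels. The paper's actual optimization may differ.

```python
import numpy as np

def align_scale_shift(rel_depth: np.ndarray, metric_depth: np.ndarray,
                      mask: np.ndarray):
    """Closed-form least-squares fit of (scale, shift) mapping a relative
    depth map onto a metric reference over the masked (valid) pixels:
    minimize || s * rel + t - metric ||^2.
    """
    d = rel_depth[mask].ravel()
    m = metric_depth[mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)     # (P, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)
    return float(s), float(t)

# The recovered scale s can then be applied to the camera translations so
# that scene geometry and camera trajectory share one metric frame, e.g.:
#   metric_translation = s * relative_translation
```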

Implications and Future Directions

The advances introduced in RealCam-Vid have significant implications for generative video models. By bridging static- and dynamic-scene capabilities and introducing metric-scale accuracy, the dataset improves the fidelity and real-world applicability of models trained on it. This could stimulate further research in areas that depend on realistic motion synthesis, such as autonomous-vehicle navigation, VR environments, and interactive video technologies.

Future work may further refine the scale-alignment process and dynamic-scene annotation, reducing computational overhead while improving accuracy. As camera-controllable video generation remains an active research area, RealCam-Vid offers a valuable resource for future work on dynamic scene understanding and photorealistic video generation.
