- The paper introduces the concept of integral spatio-temporal consistency in video generation, addressing coherence between plot, camera, and prior content.
- It presents the DropletVideo-10M dataset, a large-scale collection of 10 million videos with detailed captions, designed specifically for training models toward this form of consistency.
- The authors also release the DropletVideo model, a pre-trained foundational model built on the dataset, capable of generating videos with integral spatio-temporal consistency and controllable camera movement.
The paper "DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation" (2503.06053) introduces the concept of integral spatio-temporal consistency in video generation, addressing the coherence between plot, camera movements, and the influence of prior content on subsequent frames. To facilitate research, the authors present the DropletVideo-10M dataset and the DropletVideo model.
Key Contributions
The primary contributions of this work are:
- The formalization of integral spatio-temporal consistency, which requires that objects and scenes newly revealed by camera movement remain coherent with the pre-existing elements of the video.
- The DropletVideo-10M dataset, a large-scale video dataset for training models toward integral spatio-temporal consistency. It comprises 10 million high-quality videos (2.21 billion frames in total) featuring both object motion and camera movement, each paired with a detailed caption averaging 206 words that explicitly describes motion, including the effects of camera movement.
- The DropletVideo model, a pre-trained foundational video generation model built upon the DropletVideo-10M dataset, designed to generate videos exhibiting integral spatio-temporal consistency, including controllable camera movement and plot progression.
- The open-sourcing of the dataset, code, and model weights. The dataset is available for academic and non-commercial use under the CC BY-NC-SA 4.0 license.
Methods
The methodology encompasses dataset construction and model development:
Dataset Curation
- Raw Video Collection: Videos were sourced from YouTube using keywords selected to capture spatio-temporal variations.
- Video Segmentation: An automatic extraction tool based on optical flow estimation was developed to detect camera movements and segment videos accordingly (a minimal sketch of this step follows the list).
- Video Clip Filtering: A classification model based on the Video Swin Transformer categorized camera motion types, excluding clips with a static camera and those edited with artificial effects. Aesthetic and technical image-quality scores, computed with the LAION aesthetics predictor and the DOVER-Technical model respectively, were used for further filtering (see the second sketch after the list).
- Video Captioning: A video-to-text model was fine-tuned to generate detailed captions capturing object motion, camera movements, and visual transitions. GPT-4 was used to correct and improve the captions. Fine-tuning was performed on models like InternVL2-8B, ShareGPT4Video-8B, ShareCaptioner-Video, and MA-LMM. The fine-tuned InternVL2-8B was then used for large-scale caption generation.
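As a rough illustration of the segmentation step, the sketch below estimates dense optical flow between consecutive frames and marks frames whose median flow magnitude exceeds a threshold as camera motion. The threshold, the per-frame statistic, and the minimum clip length are assumptions for illustration, not values reported in the paper.

```python
# Minimal sketch of optical-flow-based camera-motion detection, assuming OpenCV.
import cv2
import numpy as np

def camera_motion_profile(video_path, flow_threshold=1.0):
    """Return a per-frame boolean list: True where the camera appears to move."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    moving = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Median flow magnitude approximates global (camera-induced) motion,
        # since object motion usually affects only part of the frame.
        mag = np.linalg.norm(flow, axis=2)
        moving.append(float(np.median(mag)) > flow_threshold)
        prev_gray = gray
    cap.release()
    return moving

def contiguous_segments(moving, min_len=48):
    """Group frames into clips where the camera-motion state is constant."""
    segments, start = [], 0
    for i in range(1, len(moving) + 1):
        if i == len(moving) or moving[i] != moving[start]:
            if i - start >= min_len:
                segments.append((start, i, moving[start]))
            start = i
    return segments
```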
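The filtering step can be pictured as a sequence of rejection tests per clip. In the sketch below, `classify_camera_motion`, `aesthetic_score`, and `technical_quality_score` are hypothetical stand-ins for the Video Swin Transformer classifier, the LAION aesthetics predictor, and DOVER-Technical; the thresholds are likewise illustrative, not the paper's values.

```python
# Illustrative filtering pass; scoring functions and thresholds are assumptions.
def keep_clip(clip, min_aesthetic=4.5, min_quality=0.5):
    motion_type = classify_camera_motion(clip)   # hypothetical: "static", "pan", "dolly", "edited", ...
    if motion_type in ("static", "edited"):      # drop static shots and artificial effects
        return False
    if aesthetic_score(clip) < min_aesthetic:    # hypothetical LAION-style aesthetics score
        return False
    if technical_quality_score(clip) < min_quality:  # hypothetical DOVER-Technical-style score
        return False
    return True
```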
Model Development (DropletVideo)
- A diffusion-based architecture was employed, combining a 3D causal Variational Autoencoder (VAE) with a Multi-Modal Diffusion Transformer (MMDiT)-style backbone.
- 3D Causal VAE: Encodes and decodes video frames with 3D convolutions that capture both spatial and temporal dimensions, improving efficiency while preserving temporal continuity in the generated videos (see the first sketch after this list).
- 3D Modality-Expert Transformer: Processes textual prompts and video latents jointly, using 3D positional embeddings and multi-modal attention to capture dynamic variation and semantic consistency (see the second sketch after this list).
- Motion Adaptive Generation (MAG): A strategy that conditions generation on a motion intensity parameter (M), allowing the speed of motion in the generated video to be adjusted (see the third sketch after this list).
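To make the 3D causal VAE component concrete, the sketch below shows a causal 3D convolution of the kind commonly used in causal video VAEs: temporal padding is applied only toward the past, so the encoding of a frame never depends on future frames. Channel counts and kernel sizes are illustrative, not taken from the paper.

```python
# Minimal sketch of a causal 3D convolution (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3), stride=(1, 1, 1)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                       # pad past frames only
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                            # x: (B, C, T, H, W)
        # F.pad order for 5-D input: (W_left, W_right, H_top, H_bottom, T_front, T_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

# Example: encode a 16-frame RGB clip without leaking information from future frames.
video = torch.randn(1, 3, 16, 64, 64)
latent = CausalConv3d(3, 8)(video)                   # shape: (1, 8, 16, 64, 64)
```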
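The second sketch shows one common way to give a (T, H, W) grid of video tokens 3D position information: independent sinusoidal embeddings for the temporal and the two spatial axes, concatenated per token. The exact embedding used by DropletVideo may differ; this is only a factorized illustration.

```python
# Illustrative factorized 3D positional embedding.
import torch

def sincos_1d(positions, dim):
    """Standard 1-D sinusoidal embedding: (N,) -> (N, dim), dim must be even."""
    freqs = torch.exp(-torch.arange(0, dim, 2).float() / dim * torch.log(torch.tensor(10000.0)))
    angles = positions.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def pos_embed_3d(t, h, w, dim):
    """Return a (t*h*w, dim) embedding; dim is split across the three axes."""
    d_t, d_s = dim // 2, dim // 4                    # half for time, a quarter per spatial axis
    emb_t = sincos_1d(torch.arange(t), d_t)          # (t, d_t)
    emb_h = sincos_1d(torch.arange(h), d_s)          # (h, d_s)
    emb_w = sincos_1d(torch.arange(w), d_s)          # (w, d_s)
    grid = torch.cat([
        emb_t[:, None, None, :].expand(t, h, w, d_t),
        emb_h[None, :, None, :].expand(t, h, w, d_s),
        emb_w[None, None, :, :].expand(t, h, w, d_s),
    ], dim=-1)
    return grid.reshape(t * h * w, dim)

tokens_pos = pos_embed_3d(t=4, h=8, w=8, dim=64)     # (256, 64)
```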
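The third sketch shows one plausible way to wire the motion intensity parameter M into the backbone: embed it like a diffusion timestep and add it to the timestep embedding, so every block conditioned on the timestep also sees M. Layer sizes and the combination rule are assumptions for illustration, not the paper's exact MAG design.

```python
# Sketch of injecting a scalar motion intensity M as extra conditioning.
import torch
import torch.nn as nn

class MotionConditioner(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def scalar_embed(self, x):                       # x: (B,) scalar per sample
        half = self.dim // 2
        freqs = torch.exp(-torch.arange(half, device=x.device).float()
                          / half * torch.log(torch.tensor(10000.0)))
        angles = x.float()[:, None] * freqs[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def forward(self, timestep, motion_intensity):
        # Sum the motion-intensity embedding with the timestep embedding, so
        # every transformer block conditioned on the timestep also "sees" M.
        return self.mlp(self.scalar_embed(timestep) + self.scalar_embed(motion_intensity))

cond = MotionConditioner()
emb = cond(torch.tensor([500.0]), torch.tensor([2.0]))   # (1, 512)
```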
The DropletVideo model was trained using the DropletVideo-10M dataset, employing the Adam optimizer and mixed-precision training with the DeepSpeed framework.
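A minimal DeepSpeed setup consistent with that description might look like the sketch below. The batch size, learning rate, and precision mode are placeholders rather than the paper's reported hyperparameters, and the model is a stand-in for the DropletVideo network.

```python
# Hedged sketch of Adam + mixed-precision training with DeepSpeed; values are placeholders.
import torch
import deepspeed

# Placeholder standing in for the DropletVideo diffusion transformer.
model = torch.nn.Linear(512, 512)

ds_config = {
    "train_batch_size": 64,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4, "betas": [0.9, 0.999]}},
    "bf16": {"enabled": True},          # or "fp16": {"enabled": True}
    "gradient_clipping": 1.0,
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```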
Results
The paper presents both qualitative and quantitative evaluations of the DropletVideo model.
Qualitative Evaluation
The model generates videos in which objects or scenes newly revealed by camera movement integrate seamlessly with the existing narrative and elements; for example, a camera move can reveal a boat on a lake or an apple in a kitchen scene without disrupting the established scene. It also exhibits strong 3D consistency, maintaining object details and spatial relationships across different camera angles and rotations (e.g., a snowflake viewed from multiple angles). The motion control parameter (M) provides precise control over generation speed and the tempo of visual transitions, and the model supports a variety of camera movements, including trucking, pedestal moves, tilting, dollying, and composite pan-tilt operations.
Quantitative Evaluation
The model was evaluated on VBench++-IST, a revised version of VBench++ with integral spatio-temporal prompts. DropletVideo outperformed the three compared models on most metrics, including I2V Subject, I2V Background, Motion Smoothness, and Camera Motion.
In summary, the paper introduces the DropletVideo-10M dataset and DropletVideo model to address the challenge of integral spatio-temporal consistency in video generation. The dataset and model are made publicly available to facilitate further research in the field.