- The paper introduces the concept of integral spatio-temporal consistency in video generation, addressing coherence between plot, camera, and prior content.
- It presents the DropletVideo-10M dataset, a large-scale collection of 10 million videos with detailed captions, designed specifically for training models toward this form of consistency.
- The authors also release the DropletVideo model, a pre-trained foundational model built on the dataset, capable of generating videos with integral spatio-temporal consistency and controllable camera movement.
The paper "DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation" (2503.06053) introduces the concept of integral spatio-temporal consistency in video generation, addressing the coherence between plot, camera movements, and the influence of prior content on subsequent frames. To facilitate research, the authors present the DropletVideo-10M dataset and the DropletVideo model.
Key Contributions
The primary contributions of this work are:
- The formalization of integral spatio-temporal consistency, which requires that objects and scenes newly revealed by camera movement remain coherent with the pre-existing elements of the video.
- The DropletVideo-10M dataset, a large-scale video dataset for training models toward integral spatio-temporal consistency. It comprises 10 million high-quality videos (2.21 billion frames in total) featuring both object motion and camera movement, each paired with a detailed caption averaging 206 words that explicitly describes motion, including the effects of camera movement.
- The DropletVideo model, a pre-trained foundational video generation model built upon the DropletVideo-10M dataset, designed to generate videos exhibiting integral spatio-temporal consistency, including controllable camera movement and plot progression.
- The open-sourcing of the dataset, code, and model weights. The dataset is available for academic and non-commercial use under the CC BY-NC-SA 4.0 license.
Methods
The methodology encompasses dataset construction and model development:
Dataset Curation
- Raw Video Collection: Videos were sourced from YouTube using keywords selected to capture spatio-temporal variations.
- Video Segmentation: An automatic extraction tool based on optical flow estimation was developed to detect camera movements and segment videos accordingly (a minimal sketch of this step follows the list).
- Video Clip Filtering: A classification model based on the Video Swin Transformer categorized camera motion types, excluding clips with a static camera and those edited with artificial effects. Aesthetic and technical image-quality scores, computed with the LAION aesthetics predictor and the DOVER-Technical model respectively, were used for further filtering (see the second sketch after the list).
- Video Captioning: A video-to-text model was fine-tuned to generate detailed captions capturing object motion, camera movements, and visual transitions. GPT-4 was used to correct and improve the captions. Fine-tuning was performed on models like InternVL2-8B, ShareGPT4Video-8B, ShareCaptioner-Video, and MA-LMM. The fine-tuned InternVL2-8B was then used for large-scale caption generation.
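As a rough illustration of the segmentation step, the sketch below estimates dense optical flow between consecutive frames and marks frames whose median flow magnitude exceeds a threshold as camera motion. The threshold, the per-frame statistic, and the minimum clip length are assumptions for illustration, not values reported in the paper.

```python
# Minimal sketch of optical-flow-based camera-motion detection, assuming OpenCV.
import cv2
import numpy as np

def camera_motion_profile(video_path, flow_threshold=1.0):
    """Return a per-frame boolean list: True where the camera appears to move."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    moving = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Median flow magnitude approximates global (camera-induced) motion,
        # since object motion usually affects only part of the frame.
        mag = np.linalg.norm(flow, axis=2)
        moving.append(float(np.median(mag)) > flow_threshold)
        prev_gray = gray
    cap.release()
    return moving

def contiguous_segments(moving, min_len=48):
    """Group frames into clips where the camera-motion state is constant."""
    segments, start = [], 0
    for i in range(1, len(moving) + 1):
        if i == len(moving) or moving[i] != moving[start]:
            if i - start >= min_len:
                segments.append((start, i, moving[start]))
            start = i
    return segments
```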
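The filtering step can be pictured as a sequence of rejection tests per clip. In the sketch below, `classify_camera_motion`, `aesthetic_score`, and `technical_quality_score` are hypothetical stand-ins for the Video Swin Transformer classifier, the LAION aesthetics predictor, and DOVER-Technical; the thresholds are likewise illustrative, not the paper's values.

```python
# Illustrative filtering pass; scoring functions and thresholds are assumptions.
def keep_clip(clip, min_aesthetic=4.5, min_quality=0.5):
    motion_type = classify_camera_motion(clip)   # hypothetical: "static", "pan", "dolly", "edited", ...
    if motion_type in ("static", "edited"):      # drop static shots and artificial effects
        return False
    if aesthetic_score(clip) < min_aesthetic:    # hypothetical LAION-style aesthetics score
        return False
    if technical_quality_score(clip) < min_quality:  # hypothetical DOVER-Technical-style score
        return False
    return True
```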
Model Development (DropletVideo)
- A diffusion-based architecture was employed, combining a 3D causal Variational Autoencoder (VAE) with a Multi-Modal Diffusion Transformer (MMDiT)-style backbone.
- 3D Causal VAE: Encodes and decodes video frames with 3D convolutions that capture both spatial and temporal dimensions, improving efficiency while preserving temporal continuity in the generated videos (see the first sketch after this list).
- 3D Modality-Expert Transformer: Processes textual prompts and video latents jointly, using 3D positional embeddings and multi-modal attention to capture dynamic variation and semantic consistency (see the second sketch after this list).
- Motion Adaptive Generation (MAG): A strategy that conditions generation on a motion intensity parameter (M), allowing the speed of motion in the generated video to be adjusted (see the third sketch after this list).
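To make the 3D causal VAE component concrete, the sketch below shows a causal 3D convolution of the kind commonly used in causal video VAEs: temporal padding is applied only toward the past, so the encoding of a frame never depends on future frames. Channel counts and kernel sizes are illustrative, not taken from the paper.

```python
# Minimal sketch of a causal 3D convolution (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3), stride=(1, 1, 1)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                       # pad past frames only
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                            # x: (B, C, T, H, W)
        # F.pad order for 5-D input: (W_left, W_right, H_top, H_bottom, T_front, T_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

# Example: encode a 16-frame RGB clip without leaking information from future frames.
video = torch.randn(1, 3, 16, 64, 64)
latent = CausalConv3d(3, 8)(video)                   # shape: (1, 8, 16, 64, 64)
```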
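The second sketch shows one common way to give a (T, H, W) grid of video tokens 3D position information: independent sinusoidal embeddings for the temporal and the two spatial axes, concatenated per token. The exact embedding used by DropletVideo may differ; this is only a factorized illustration.

```python
# Illustrative factorized 3D positional embedding.
import torch

def sincos_1d(positions, dim):
    """Standard 1-D sinusoidal embedding: (N,) -> (N, dim), dim must be even."""
    freqs = torch.exp(-torch.arange(0, dim, 2).float() / dim * torch.log(torch.tensor(10000.0)))
    angles = positions.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def pos_embed_3d(t, h, w, dim):
    """Return a (t*h*w, dim) embedding; dim is split across the three axes."""
    d_t, d_s = dim // 2, dim // 4                    # half for time, a quarter per spatial axis
    emb_t = sincos_1d(torch.arange(t), d_t)          # (t, d_t)
    emb_h = sincos_1d(torch.arange(h), d_s)          # (h, d_s)
    emb_w = sincos_1d(torch.arange(w), d_s)          # (w, d_s)
    grid = torch.cat([
        emb_t[:, None, None, :].expand(t, h, w, d_t),
        emb_h[None, :, None, :].expand(t, h, w, d_s),
        emb_w[None, None, :, :].expand(t, h, w, d_s),
    ], dim=-1)
    return grid.reshape(t * h * w, dim)

tokens_pos = pos_embed_3d(t=4, h=8, w=8, dim=64)     # (256, 64)
```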
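The third sketch shows one plausible way to wire the motion intensity parameter M into the backbone: embed it like a diffusion timestep and add it to the timestep embedding, so every block conditioned on the timestep also sees M. Layer sizes and the combination rule are assumptions for illustration, not the paper's exact MAG design.

```python
# Sketch of injecting a scalar motion intensity M as extra conditioning.
import torch
import torch.nn as nn

class MotionConditioner(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def scalar_embed(self, x):                       # x: (B,) scalar per sample
        half = self.dim // 2
        freqs = torch.exp(-torch.arange(half, device=x.device).float()
                          / half * torch.log(torch.tensor(10000.0)))
        angles = x.float()[:, None] * freqs[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def forward(self, timestep, motion_intensity):
        # Sum the motion-intensity embedding with the timestep embedding, so
        # every transformer block conditioned on the timestep also "sees" M.
        return self.mlp(self.scalar_embed(timestep) + self.scalar_embed(motion_intensity))

cond = MotionConditioner()
emb = cond(torch.tensor([500.0]), torch.tensor([2.0]))   # (1, 512)
```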
The DropletVideo model was trained using the DropletVideo-10M dataset, employing the Adam optimizer and mixed-precision training with the DeepSpeed framework.
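A minimal DeepSpeed setup consistent with that description might look like the sketch below. The batch size, learning rate, and precision mode are placeholders rather than the paper's reported hyperparameters, and the model is a stand-in for the DropletVideo network.

```python
# Hedged sketch of Adam + mixed-precision training with DeepSpeed; values are placeholders.
import torch
import deepspeed

# Placeholder standing in for the DropletVideo diffusion transformer.
model = torch.nn.Linear(512, 512)

ds_config = {
    "train_batch_size": 64,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4, "betas": [0.9, 0.999]}},
    "bf16": {"enabled": True},          # or "fp16": {"enabled": True}
    "gradient_clipping": 1.0,
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```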
Results
The paper presents both qualitative and quantitative evaluations of the DropletVideo model.
Qualitative Evaluation
The model generates videos in which objects or scenes newly revealed by camera movement integrate seamlessly with the existing narrative and elements; for example, a camera move can reveal a boat on a lake or an apple in a kitchen scene without disrupting the established scene. It also exhibits strong 3D consistency, maintaining object details and spatial relationships across different camera angles and rotations (e.g., a snowflake viewed from multiple angles). The motion control parameter (M) provides precise control over generation speed and the tempo of visual transitions, and the model supports a variety of camera movements, including trucking, pedestal moves, tilting, dollying, and composite pan-tilt operations.
Quantitative Evaluation
The model was evaluated on VBench++-IST, a revised version of VBench++ with integral spatio-temporal prompts. DropletVideo outperformed the three compared models on most metrics, including I2V Subject, I2V Background, Motion Smoothness, and Camera Motion.
In summary, the paper introduces the DropletVideo-10M dataset and DropletVideo model to address the challenge of integral spatio-temporal consistency in video generation. The dataset and model are made publicly available to facilitate further research in the field.