Overview of "Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models"
The paper "Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models" introduces a novel model for text-to-video generation, named Vidu, which employs diffusion models with a U-ViT backbone. The integration of U-ViT enhances the model's ability to handle long video sequences, offering a scalable solution to overcome the temporal limitations of previous models. Vidu stands out for generating 1080p videos up to 16 seconds in duration within a single process, exhibiting capabilities in both realistic and imaginative video productions.
Technical Contributions
Vidu builds on diffusion models, which have driven much of the recent progress in high-fidelity image generation, and extends them to video. Its U-ViT backbone treats all inputs, including the diffusion timestep, the text condition, and spatio-temporal video patches, uniformly as tokens, capitalizing on the transformer's inherent strength in modeling long sequences. A video autoencoder compresses the video both spatially and temporally, shortening the token sequence and improving training and inference efficiency.
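To make the token-based design more concrete, the following is a minimal sketch in PyTorch of how a compressed video latent might be split into patch tokens and concatenated with timestep and text tokens into a single transformer sequence. All module names, shapes, and the exact conditioning layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of U-ViT-style tokenization:
# timestep, text, and video patches all become tokens in one sequence.
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    def __init__(self, in_channels=4, patch=2, dim=1024):
        super().__init__()
        # A 3D convolution cuts the (C, T, H, W) latent into non-overlapping
        # spatial patches per frame and projects each patch to an embedding.
        self.proj = nn.Conv3d(in_channels, dim,
                              kernel_size=(1, patch, patch),
                              stride=(1, patch, patch))

    def forward(self, latent):                      # (B, C, T, H, W)
        x = self.proj(latent)                       # (B, D, T, H/p, W/p)
        return x.flatten(2).transpose(1, 2)         # (B, N_tokens, D)

def build_input_tokens(latent, t_emb, text_tokens, tokenizer):
    """Concatenate timestep, text, and video patch tokens into one sequence;
    a U-ViT-style backbone treats all of them uniformly as tokens."""
    video_tokens = tokenizer(latent)                            # (B, N, D)
    return torch.cat([t_emb.unsqueeze(1), text_tokens, video_tokens], dim=1)

# Toy example: an 8-frame, 32x32 latent with 4 channels (shapes are made up).
tok = VideoTokenizer()
latent = torch.randn(1, 4, 8, 32, 32)
t_emb = torch.randn(1, 1024)        # diffusion timestep embedding
text = torch.randn(1, 77, 1024)     # text-encoder output
tokens = build_input_tokens(latent, t_emb, text, tok)
print(tokens.shape)                 # torch.Size([1, 2126, 1024])
```

The point of the sketch is only the design choice it mirrors: because every conditioning signal is just another token, longer videos simply mean longer sequences, which is where transformer backbones scale naturally.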
Training relies on large datasets of text-video pairs. Because manual labeling at this scale is infeasible, the authors use a video captioner to annotate training clips automatically. At inference time, a re-captioning technique adapted from prior work rewrites user prompts into the more detailed, caption-like form the model was trained on, improving how faithfully Vidu interprets user intent.
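As a rough illustration of the inference-time re-captioning step, the snippet below rewrites a terse user prompt into a more detailed, caption-like prompt before sampling. The `rewriter` callable and the `vidu.sample` call are hypothetical stand-ins; the paper does not specify this interface.

```python
# Illustrative sketch only: `rewriter` and `vidu.sample` are hypothetical
# stand-ins for the re-captioning component and the sampling API.
def recaption(user_prompt: str, rewriter) -> str:
    """Expand a terse user prompt into the detailed caption style the model
    was trained on (hypothetical interface)."""
    instruction = (
        "Rewrite the following request as a detailed video caption, "
        "describing subjects, motion, camera movement, and style:\n"
        + user_prompt
    )
    return rewriter(instruction)

# Usage, with any text-rewriting callable (e.g. an LLM wrapper) as `rewriter`:
# detailed_prompt = recaption("a corgi surfing at sunset", rewriter=my_llm)
# video = vidu.sample(detailed_prompt)   # hypothetical sampling call
```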
Results and Performance
Vidu's capabilities are illustrated through a range of sample generations, including videos that exhibit 3D consistency, coherent transitions, and emotional expression. The paper also compares Vidu with Sora, a prominent text-to-video generator; since Sora is not publicly accessible for direct benchmarking, the comparison is qualitative, with Vidu delivering broadly comparable outputs.
Additionally, Vidu adapts to other video generation tasks, such as canny-to-video generation and video prediction, and supports subject-driven generation through DreamBooth-style fine-tuning (a conditioning sketch follows below). These applications further demonstrate Vidu's robustness and versatility beyond plain text-to-video generation.
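The snippet below sketches one plausible way to prepare the conditioning signal for a canny-to-video task: per-frame Canny edge maps stacked into a tensor that could be supplied to the model as an extra condition. How Vidu actually consumes edge maps is not described here, so the downsampling/concatenation step mentioned in the comments is an assumption.

```python
# Hedged sketch: preparing a canny-edge condition for canny-to-video.
# The conditioning interface (extra channel alongside the latent) is an
# assumption, not the paper's documented mechanism.
import cv2
import numpy as np
import torch

def canny_condition(frames, low=100, high=200):
    """frames: list of HxWx3 uint8 RGB arrays -> (1, 1, T, H, W) edge tensor."""
    edges = [cv2.Canny(cv2.cvtColor(f, cv2.COLOR_RGB2GRAY), low, high)
             for f in frames]
    cond = np.stack(edges).astype(np.float32) / 255.0    # (T, H, W) in [0, 1]
    return torch.from_numpy(cond)[None, None]             # (1, 1, T, H, W)

# Usage: the edge tensor would be resized to latent resolution and then
# concatenated with (or cross-attended against) the video latent.
frames = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
          for _ in range(8)]
cond = canny_condition(frames)
print(cond.shape)  # torch.Size([1, 1, 8, 256, 256])
```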
Implications and Future Directions
The development of Vidu represents a significant progression in the capabilities of text-to-video generation models. It offers new possibilities for applications where dynamic and contextually accurate video content is required, such as in media production, interactive storytelling, and virtual reality experiences. The advancements made in Vidu, including its scalability and ability to manage extended video sequences, open avenues for further research in enhancing detail fidelity and subject interaction within generated videos.
The authors acknowledge areas for improvement, pointing out occasional inconsistencies in finer details and interaction physics in generated outputs. These challenges pave the way for possible future explorations into scaling and optimizing the model for even more accurate and realistic video outputs.
In conclusion, the work on Vidu highlights key advancements in text-to-video generation, utilizing diffusion models and U-ViT backbones for producing high-definition videos with improved temporal and spatial coherence. Future research may focus on addressing the identified limitations and exploring the broader implications of such technology in diverse applications.