VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models (2401.09047v1)

Published 17 Jan 2024 in cs.CV

Abstract: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.

Introduction to VideoCrafter2

Researchers from Tencent AI Lab have developed a novel approach to overcome the significant challenge of generating high-quality videos using diffusion models. VideoCrafter2 does not depend on large-scale, high-quality video datasets, which are often out of reach for the research community due to accessibility and copyright issues. Instead, their methodology leverages a combination of low-quality videos and synthesized high-quality images to produce videos that are not only of high visual quality but also exhibit accurate text-video alignment.

Methodology Insights

The conventional way of training video diffusion models is to add temporal modules to a text-to-image (T2I) backbone and train the result on videos. Two distinct approaches exist: fully training all modules of the video model, or training only the temporal modules while keeping the spatial ones fixed. The VideoCrafter2 research examines both and finds a stronger coupling between the spatial and temporal modules when the model is fully trained. This stronger coupling produces more natural motion in generated videos and makes the model resilient to subsequent modification of its spatial modules.
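
The distinction between the two regimes comes down to which parameters receive gradients during video training. Below is a minimal PyTorch-style sketch of that idea; SpatialBlock, TemporalBlock, and set_trainable are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import torch.nn as nn


class SpatialBlock(nn.Module):
    """Stand-in for a 2D block inherited from the T2I (Stable Diffusion) backbone."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)


class TemporalBlock(nn.Module):
    """Stand-in for a 1D temporal layer inserted to model motion across frames."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)


def set_trainable(model: nn.Module, train_spatial: bool, train_temporal: bool) -> None:
    """Toggle gradients per module type to switch between the two regimes."""
    for module in model.modules():
        if isinstance(module, SpatialBlock):
            for p in module.parameters():
                p.requires_grad = train_spatial
        elif isinstance(module, TemporalBlock):
            for p in module.parameters():
                p.requires_grad = train_temporal


# A toy container holding both module types (a real video U-Net interleaves them).
model = nn.ModuleList([SpatialBlock(64), TemporalBlock(64)])

# Regime A, full training: both module types are updated on video data,
# which the paper finds yields a stronger spatial-temporal coupling.
set_trainable(model, train_spatial=True, train_temporal=True)

# Regime B, partial training: spatial weights stay frozen at their T2I values.
set_trainable(model, train_spatial=False, train_temporal=True)
```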

Overcoming Data Constraints

The key innovation of VideoCrafter2 lies in decoupling motion and appearance in video generation. This separation allows the model to learn motion dynamics from low-quality video clips, while picture quality and concept accuracy are improved with synthesized still images, which offer high resolution and compositional complexity. The researchers found that fine-tuning only the spatial modules of the fully trained base model with high-quality images was more effective than fine-tuning both spatial and temporal modules or the other strategies they evaluated.
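
A minimal sketch of that second stage, assuming a PyTorch model in which temporal parameters can be identified by name (the "temporal" substring is an illustrative convention, not taken from the paper's code):

```python
import torch


def build_spatial_finetune_optimizer(model: torch.nn.Module, lr: float = 1e-5):
    """Stage 2: after full training on low-quality videos, fine-tune only the
    spatial modules on high-quality synthesized images.

    Assumes, for illustration only, that temporal parameters contain the
    substring 'temporal' in their names."""
    for name, param in model.named_parameters():
        # Freeze motion-related weights; leave spatial (appearance) weights trainable.
        param.requires_grad = "temporal" not in name
    spatial_params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(spatial_params, lr=lr)
```

Because the temporal weights stay frozen in this stage, the motion learned from the low-quality videos is preserved while the appearance distribution shifts toward the high-quality images.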

Results and Contributions

Comparative analyses show that VideoCrafter2 delivers on multiple fronts, achieving motion consistency and high picture quality through its training pipeline. The work makes three primary contributions:

  • It introduces a method to train high-quality video models without the need for high-quality video datasets.
  • By examining the relationship between spatial and temporal modules, the research identifies key ways to achieve a high-quality video model.
  • An effective training pipeline is designed based on the insights gained about disentangling appearance from motion at the data level.

Conclusion

In summary, VideoCrafter2 represents a significant step forward in video generation using diffusion models. It demonstrates that by smartly leveraging available resources like low-quality videos and high-quality still images, it is possible to generate videos with impressive visual detail and content accuracy. This approach potentially paves the way for advancements in video generation, making it more accessible for researchers and practitioners by circumventing the need for hard-to-obtain high-quality video data.

Authors (7)
  1. Haoxin Chen (12 papers)
  2. Yong Zhang (660 papers)
  3. Xiaodong Cun (61 papers)
  4. Menghan Xia (33 papers)
  5. Xintao Wang (132 papers)
  6. Chao Weng (61 papers)
  7. Ying Shan (252 papers)
Citations (158)