
Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data (2408.10119v1)

Published 19 Aug 2024 in cs.CV and cs.AI

Abstract: Text-to-video (T2V) generation has gained significant attention due to its wide applications to video generation, editing, enhancement and translation, etc. However, high-quality (HQ) video synthesis is extremely challenging because of the diverse and complex motions that exist in the real world. Most existing works struggle to address this problem by collecting large-scale HQ videos, which are inaccessible to the community. In this work, we show that publicly available limited and low-quality (LQ) data are sufficient to train a HQ video generator without recaptioning or finetuning. We factorize the whole T2V generation process into two steps: generating an image conditioned on a highly descriptive caption, and synthesizing the video conditioned on the generated image and a concise caption of motion details. Specifically, we present Factorized-Dreamer, a factorized spatiotemporal framework with several critical designs for T2V generation, including an adapter to combine text and image embeddings, a pixel-aware cross attention module to capture pixel-level image information, a T5 text encoder to better understand motion description, and a PredictNet to supervise optical flows. We further present a noise schedule, which plays a key role in ensuring the quality and stability of video generation. Our model lowers the requirements on detailed captions and HQ videos, and can be directly trained on limited LQ datasets with noisy and brief captions such as WebVid-10M, largely alleviating the cost of collecting large-scale HQ video-text pairs. Extensive experiments on a variety of T2V and image-to-video generation tasks demonstrate the effectiveness of our proposed Factorized-Dreamer. Our source codes are available at https://github.com/yangxy/Factorized-Dreamer/.

Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data

The paper "Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data" by Tao Yang, Yangming Shi, Yunwen Huang, Feng Chen, Yin Zheng, and Lei Zhang, introduces a novel approach to text-to-video (T2V) generation leveraging publicly available, limited, and low-quality (LQ) datasets. This approach is significant given the inherent challenges in video generation, particularly concerning the temporal dimension and the complexity of motion in natural scenes.

Summary of the Approach

The authors propose a factorized spatiotemporal framework named Factorized-Dreamer, which divides the T2V generation process into two distinct stages: (1) generation of a static image from a detailed text prompt using a text-to-image (T2I) model, and (2) transformation of the generated image into a video sequence conditioned on a concise motion description. This division simplifies the challenging task of video generation by effectively leveraging the strengths of existing large-scale T2I models.
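
To make the two-stage factorization concrete, the following Python sketch outlines the control flow under stated assumptions: the `generate_image` and `animate_image` placeholders stand in for a pretrained text-to-image model and the image-to-video stage respectively; these names and interfaces are hypothetical and are not the API of the released code.

```python
from typing import Any

def generate_image(detailed_caption: str) -> Any:
    """Stage 1 placeholder: any strong pretrained text-to-image model,
    driven by a highly descriptive caption (determines spatial content)."""
    raise NotImplementedError("plug in a pretrained T2I model here")

def animate_image(image: Any, motion_caption: str) -> Any:
    """Stage 2 placeholder: the image-to-video generator, conditioned on the
    key image and a concise caption of the desired motion (temporal dynamics)."""
    raise NotImplementedError("plug in the image-to-video stage here")

def factorized_text_to_video(detailed_caption: str, motion_caption: str) -> Any:
    image = generate_image(detailed_caption)      # what the scene looks like
    return animate_image(image, motion_caption)   # how the scene moves
```

The official implementation at https://github.com/yangxy/Factorized-Dreamer/ provides the actual interfaces for both stages.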

Key Components of Factorized-Dreamer

  • T2I Adapter: This module combines image and text embeddings to enhance spatial quality. The adapter updates the computation of cross attention by integrating features from both text and image embeddings.
  • Pixel-Aware Cross Attention (PACA): PACA captures pixel-level details by incorporating latent image features into the attention computation. Combined with the T2I adapter, it lets the model exploit both global and pixel-level image information (a schematic sketch of this combined conditioning follows this list).
  • T5 Text Encoder: Replacing the CLIP text encoder, the T5 text encoder offers improved motion understanding due to its comprehensive training on vast datasets, which include a variety of textual descriptions related to motion.
  • PredictNet: This auxiliary module supervises optical flow, ensuring the generated video exhibits coherent motion. Its integration during the final training phase is crucial for enhancing motion consistency.
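
As a rough illustration of how text and image conditioning can be fused in cross attention, the PyTorch sketch below projects text tokens and image tokens into a shared space and attends over their concatenation. It is a simplified stand-in for the adapter and PACA designs described above, with made-up module names and dimensions rather than the paper's exact architecture; pixel-level latent features would enter analogously as additional key/value tokens.

```python
import torch
import torch.nn as nn

class CombinedCrossAttention(nn.Module):
    """Illustrative cross attention over concatenated text and image tokens."""

    def __init__(self, dim: int, text_dim: int, img_dim: int, heads: int = 8):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, dim)   # project T5 text embeddings
        self.proj_img = nn.Linear(img_dim, dim)     # project image embeddings
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, text_tokens: torch.Tensor,
                img_tokens: torch.Tensor) -> torch.Tensor:
        # x:           (B, N, dim)       video latent tokens (queries)
        # text_tokens: (B, Lt, text_dim)  img_tokens: (B, Li, img_dim)
        context = torch.cat(
            [self.proj_text(text_tokens), self.proj_img(img_tokens)], dim=1
        )
        out, _ = self.attn(query=x, key=context, value=context)
        return x + out  # residual connection around the attention block

# Hypothetical dimensions, for shape checking only.
block = CombinedCrossAttention(dim=320, text_dim=4096, img_dim=1024)
x = torch.randn(2, 16 * 32 * 32, 320)   # e.g. 16 frames of 32x32 latent tokens
y = block(x, torch.randn(2, 77, 4096), torch.randn(2, 257, 1024))
```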

Training and Noise Scheduling

The training strategy is staged: low-resolution video training first, followed by high-resolution refinement, and finally the incorporation of PredictNet for motion supervision. The researchers also address noise scheduling, which is critical in diffusion models. They adopt a shifting and rescaling procedure that realigns the signal-to-noise ratio (SNR) across timesteps so that little residual signal remains at the terminal diffusion step, which stabilizes training and improves video quality; a sketch of this kind of rescaling is shown below.
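
The summary does not give the exact schedule parameters, so as one concrete example of the kind of rescaling involved, the snippet below applies the widely used zero-terminal-SNR rescaling recipe (Lin et al., 2023) to a standard beta schedule. It is an assumption-laden illustration of "shifting and rescaling" a noise schedule, not the authors' actual schedule.

```python
import torch

def rescale_to_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the final timestep carries zero signal (zero SNR)."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    sqrt_ac = alphas_cumprod.sqrt()

    # Shift so that sqrt(alpha_bar_T) = 0, then rescale so sqrt(alpha_bar_0) is unchanged.
    sqrt_ac_0, sqrt_ac_T = sqrt_ac[0].clone(), sqrt_ac[-1].clone()
    sqrt_ac -= sqrt_ac_T
    sqrt_ac *= sqrt_ac_0 / (sqrt_ac_0 - sqrt_ac_T)

    # Convert the adjusted cumulative products back into per-step betas.
    alphas_cumprod = sqrt_ac ** 2
    alphas = alphas_cumprod[1:] / alphas_cumprod[:-1]
    alphas = torch.cat([alphas_cumprod[:1], alphas])
    return 1.0 - alphas

# "Scaled linear" schedule commonly used by latent diffusion models (illustrative).
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
zero_snr_betas = rescale_to_zero_terminal_snr(betas)
print(torch.cumprod(1 - zero_snr_betas, dim=0)[-1])  # ~0: no residual signal at the last step
```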

Empirical Evaluations

Text-to-Video Generation: Factorized-Dreamer demonstrates competitive performance on the EvalCrafter benchmark, particularly in visual quality, motion quality, and temporal consistency. Its ability to synthesize aesthetically pleasing and temporally coherent videos comparable to those of models trained on proprietary datasets affirms the efficacy of the factorized approach.

Image-to-Video Generation: The model also excels in image-to-video (I2V) tasks, achieving high frame and prompt consistency. The user studies further validate its superior performance in visual quality and motion coherence.

Numerical Results

  • EvalCrafter Benchmark: Factorized-Dreamer achieves a sum score of 251, only slightly behind the leading commercial method Gen2. Notably, it outperforms all open-source competitors in visual quality and motion quality metrics.
  • Zero-shot Results on UCF101: The model attains competitive FVD (Fréchet Video Distance) and IS (Inception Score) values, indicating robust performance in zero-shot settings (a sketch of the Fréchet distance computation behind FVD follows this list).
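
For context on the UCF101 metric mentioned above: FVD is the Fréchet distance between Gaussian fits of video features, conventionally extracted with an I3D network, for real versus generated clips. A minimal sketch of that distance, assuming the feature matrices have already been extracted (the feature extractor itself is omitted), might look like this:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between two Gaussian fits; with I3D video features this is FVD.
    Inputs are (num_videos, feature_dim) arrays of precomputed features."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # trim tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```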

Implications and Future Directions

Factorized-Dreamer shows that high-quality video generation is feasible using publicly accessible datasets, a significant step toward democratizing AI. The approach alleviates the need for expensive, large-scale, high-quality video datasets and recaptioning efforts. Future research can explore integrating more sophisticated temporal modeling frameworks to handle long video generation and further enhancements in motion consistency.

Conclusion

Factorized-Dreamer represents a methodical advance in T2V generation without the dependence on extensive high-quality datasets. By leveraging a factorized generation approach, novel adaptation techniques, and sophisticated noise scheduling, it achieves high-quality results in both T2V and I2V tasks. This work paves the way for more accessible and efficient video generation methodologies, promising potential applications in video editing, enhancement, and translation.
