VideoFusion: Advanced Diffusion Models for Video Generation
The paper "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" introduces an innovative approach to video synthesis using diffusion probabilistic models (DPMs). While DPMs have achieved significant progress in image generation tasks, their application to video generation presents challenges due to the increased dimensionality of video data and the need for temporal consistency across frames. This research addresses these challenges by proposing a decomposed diffusion process aimed at improving the quality and coherence of generated videos.
Overview of the Method
The paper introduces a decomposed diffusion structure in which the noise added to video frames is split into two components: a base noise shared by all frames of a clip and a residual noise that varies along the temporal dimension. The base noise captures the content common to the frames, while the residual noise accounts for the frame-to-frame dynamics. Because consecutive frames then share part of their latent noise, the model has less temporal variation to account for at each step, which makes the dependencies between frames easier to learn.
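To make the decomposition concrete, the sketch below builds per-frame noise by mixing one shared base tensor with a fresh residual tensor for each frame. This is a minimal NumPy illustration of the idea; the mixing weight `lam`, the function name, and the tensor shapes are illustrative assumptions rather than the paper's exact notation.

```python
import numpy as np

def decomposed_noise(num_frames, frame_shape, lam=0.5, rng=None):
    """Build per-frame noise from a shared base component and per-frame residuals.

    lam in [0, 1] sets how much of each frame's noise comes from the shared base;
    it is an illustrative parameter (the paper tunes this ratio per dataset).
    """
    rng = rng or np.random.default_rng()
    base = rng.standard_normal(frame_shape)              # shared across all frames
    frames = []
    for _ in range(num_frames):
        residual = rng.standard_normal(frame_shape)      # varies frame to frame
        # Weights keep each frame's noise unit-variance Gaussian: lam + (1 - lam) = 1.
        frames.append(np.sqrt(lam) * base + np.sqrt(1.0 - lam) * residual)
    return np.stack(frames)                              # (num_frames, *frame_shape)

# Example: 16 frames of 64x64x3 noise that share half of their variance.
noise = decomposed_noise(16, (64, 64, 3), lam=0.5)
```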
Furthermore, the authors leverage pretrained image diffusion models, allowing VideoFusion to benefit from image priors and reducing the computational cost typically associated with training such high-dimensional models from scratch. A large image-generative model is employed to predict the base noise; because the base noise is shared by every frame, a single forward pass of this model per denoising step suffices for the whole clip, effectively integrating existing image synthesis capabilities into the video generation framework.
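A denoising step under this division of labor might look like the hypothetical sketch below: the large pretrained image denoiser is run once to estimate the shared base noise, a lighter temporal network estimates the per-frame residuals, and the two predictions are recombined into the full noise estimate. The names `base_image_model` and `residual_video_model`, the choice of the middle frame as input, and the recombination weights are assumptions for illustration, not the authors' exact architecture.

```python
import torch

def denoise_step(frames_t, t, lam, base_image_model, residual_video_model):
    """One hypothetical denoising step with decomposed noise prediction.

    frames_t: noisy clip of shape (num_frames, C, H, W) at diffusion step t.
    base_image_model: pretrained image DPM predicting the shared base noise.
    residual_video_model: smaller temporal model predicting per-frame residuals.
    """
    # Single forward pass of the large image model on one frame (here: the middle one).
    key_frame = frames_t[frames_t.shape[0] // 2].unsqueeze(0)
    base_pred = base_image_model(key_frame, t)            # (1, C, H, W), broadcast below

    # The lighter video model only has to predict the residual component per frame.
    residual_pred = residual_video_model(frames_t, t)     # (num_frames, C, H, W)

    # Recombine into the full per-frame noise estimate used by the usual DPM update.
    return (lam ** 0.5) * base_pred + ((1.0 - lam) ** 0.5) * residual_pred
```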
Strong Numerical Results
The paper includes comprehensive quantitative evaluations demonstrating the effectiveness of VideoFusion. On several benchmark datasets, including UCF101, Sky Time-lapse, and TaiChi-HD, VideoFusion outperforms existing state-of-the-art methods. One notable result is a Fréchet Video Distance (FVD) of 139 on UCF101, a marked improvement over previous models. These results underscore the method's ability to maintain high visual quality and temporal coherence, illustrating its advantages over both GAN-based and earlier diffusion-based approaches.
Implications and Future Directions
The proposed model opens new pathways for enhancing video generation by efficiently leveraging shared content across frames. This method has potential implications for video prediction, interpolation, and texture synthesis, where the preservation of content continuity is critical. By integrating pretrained image models, the approach simplifies the learning process and sets a foundation for further exploring the combination of image and video generation.
Looking forward, VideoFusion's capability to generate consistent long video sequences by retaining base noise across frames invites further exploration into applications requiring narrative consistency over extended periods, such as story-driven animation or content generation for virtual environments. Additionally, the method's adaptability to different dataset characteristics suggests its utility in domain-specific scenarios, where the balance between shared and residual noise can be tuned for optimal performance.
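One way to picture the long-video property mentioned above is sketched below: holding the base noise fixed while resampling only the residual noise lets successive clips share their underlying content. The `sample_clip` interface and the clip-level loop are hypothetical, intended only to illustrate reusing the base noise, not the paper's actual sampling procedure.

```python
import numpy as np

def sample_long_video(sample_clip, num_clips, frame_shape, lam=0.5, rng=None):
    """Hypothetical sketch: reuse one base noise across clips for content consistency.

    sample_clip(base, lam) is assumed to run the full reverse diffusion for one clip,
    mixing the shared `base` tensor into every frame's starting noise.
    """
    rng = rng or np.random.default_rng()
    base = rng.standard_normal(frame_shape)   # fixed content component for the whole video
    clips = [sample_clip(base, lam) for _ in range(num_clips)]
    return np.concatenate(clips, axis=0)      # frames of all clips, sharing base content
```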
In conclusion, the proposed decomposed diffusion framework forms a robust basis for advancing video generation technologies. By addressing the inherent challenges of coherence and high dimensionality, the paper provides a compelling solution that extends the impact of diffusion models from static to dynamic data, fostering future research in the evolving landscape of generative models.