Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models (2305.10474v3)

Published 17 May 2023 in cs.CV, cs.GR, and cs.LG

Abstract: Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.

Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

The paper "Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models" proposes an innovative solution to the task of generating temporally consistent and high-quality videos using diffusion models. Building upon the significant advancements in image synthesis, the authors tackle the unique challenges posed by video generation, such as the necessity for temporal coherence among sequential frames and the computational intensiveness associated with video data.

Core Contributions and Methodology

The primary contribution of this work is a noise prior tailored to video diffusion models. Existing diffusion-based video generators largely adapt image synthesis techniques, extending the per-frame independent noise prior from images to videos. However, this naive extension fails to capture the temporal correlations intrinsic to video, leading to sub-optimal synthesis performance. The authors highlight this limitation by showing that the noise maps corresponding to consecutive video frames are spatially and temporally correlated, and that preserving these correlations is crucial for modeling video dynamics.

Correlated Noise Models: The authors introduce two strategies, a mixed and a progressive noise model, to inject spatial-temporal correlations into the noise used for video frame generation (a minimal code sketch of both follows the list):

  • Mixed Noise Model involves generating noise maps by combining shared and individual components for each frame. This approach injects correlation while maintaining individual noise characteristics.
  • Progressive Noise Model generates noise autoregressively, deriving each frame's noise from the previous frame's noise plus a fresh perturbation. This better mirrors natural video progression, in which consecutive frames change gradually.
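
The sketch below illustrates how such samplers can be written, assuming a unit-variance convention in which a correlation ratio `alpha` splits the noise into shared and independent components; the function names, tensor shapes, and default `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def mixed_noise(num_frames, frame_shape, alpha=2.0, generator=None):
    """Mixed noise prior: one component shared by all frames plus per-frame noise.

    Variances are split as alpha^2/(1+alpha^2) (shared) and 1/(1+alpha^2)
    (independent) so each frame's noise stays unit-variance (assumed convention).
    """
    shared_scale = (alpha**2 / (1.0 + alpha**2)) ** 0.5
    ind_scale = (1.0 / (1.0 + alpha**2)) ** 0.5
    shared = shared_scale * torch.randn(1, *frame_shape, generator=generator)
    independent = ind_scale * torch.randn(num_frames, *frame_shape, generator=generator)
    return shared + independent  # shape: (num_frames, *frame_shape)


def progressive_noise(num_frames, frame_shape, alpha=2.0, generator=None):
    """Progressive noise prior: each frame's noise is a scaled copy of the
    previous frame's noise plus a fresh, smaller perturbation."""
    ind_scale = (1.0 / (1.0 + alpha**2)) ** 0.5
    ar_scale = alpha / (1.0 + alpha**2) ** 0.5
    frames = [torch.randn(*frame_shape, generator=generator)]
    for _ in range(num_frames - 1):
        fresh = ind_scale * torch.randn(*frame_shape, generator=generator)
        frames.append(ar_scale * frames[-1] + fresh)
    return torch.stack(frames)


# Example: correlated noise for an 8-frame clip of 3x64x64 latents.
noise = progressive_noise(8, (3, 64, 64))
```

Larger values of `alpha` increase the correlation between frames, while `alpha = 0` recovers the ordinary per-frame i.i.d. noise prior.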

Architecture and Implementation

The paper describes an efficient cascade of models designed for text-to-video synthesis. It involves a base model for initial low-resolution video generation, followed by temporal interpolation and spatial super-resolution stacks. Each model leverages temporal attention mechanisms and 3D convolutions to enhance temporal consistency. The architecture reuses components from eDiff-I, a high-performing text-to-image diffusion model, optimizing computational efficiency and transfer learning capabilities.
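As a rough illustration of how temporal layers can be interleaved with an image backbone, the block below applies a per-frame 2D convolution followed by attention across the time axis. This is a generic sketch under assumed layer choices and shapes, not the actual eDiff-I or PYoCo block.

```python
import torch
import torch.nn as nn


class FactorizedSpatioTemporalBlock(nn.Module):
    """Generic sketch: per-frame spatial conv, then attention over frames."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Spatial convolution applied independently to every frame.
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Temporal attention: attend across frames at each spatial location.
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        q = self.norm(seq)
        attn, _ = self.temporal_attn(q, q, q)
        seq = seq + attn  # residual keeps the pretrained image features intact
        return seq.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```

The residual connection around the temporal layers is a common design choice in such finetuning setups, since it lets the model fall back to the pretrained image behavior before the temporal layers are trained.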

Experimental Validation

To substantiate their approach, the authors conduct extensive experiments in both small-scale and large-scale settings:

  • Small-Scale Experiments: On the UCF-101 dataset, their model achieves superior scores in Inception Score (IS) and Fréchet Video Distance (FVD), outperforming state-of-the-art models with significantly fewer parameters.
  • Large-Scale Experiments: On both UCF-101 and MSR-VTT datasets, the proposed model sets new benchmarks by obtaining improved zero-shot generation scores, demonstrated through both qualitative and quantitative analyses.

Implications and Future Work

The introduction of correlated noise priors represents a significant step forward in video diffusion modeling, offering a more efficient training paradigm that effectively leverages existing models trained on image data. The strategy not only enhances video quality and temporal coherency but also substantially reduces computational costs.

Looking ahead, the paper suggests exploration into dynamic hyperparameter tuning for noise correlation ratios and further refinement of model components to balance quality and diversity. The potential applications of such robust video synthesis models span various domains, including content creation, augmented reality, and video editing.

In conclusion, the advancements presented in this research redefine the capabilities of diffusion models for video synthesis, providing a fertile ground for further exploration into efficient and scalable video generation methodologies that harness the power of correlated noise modeling.

Authors (10)
  1. Songwei Ge (24 papers)
  2. Seungjun Nah (17 papers)
  3. Guilin Liu (78 papers)
  4. Tyler Poon (3 papers)
  5. Andrew Tao (40 papers)
  6. Bryan Catanzaro (123 papers)
  7. David Jacobs (36 papers)
  8. Jia-Bin Huang (106 papers)
  9. Ming-Yu Liu (87 papers)
  10. Yogesh Balaji (22 papers)
Citations (193)