
FreeInit: Bridging Initialization Gap in Video Diffusion Models (2312.07537v2)

Published 12 Dec 2023 in cs.CV

Abstract: Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that attributes to the unsatisfactory inference quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial noise at inference is intrinsically different from that for training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves temporal consistency of videos generated by diffusion models. Through iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to compensate the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.

Authors (5)
  1. Tianxing Wu (24 papers)
  2. Chenyang Si (36 papers)
  3. Yuming Jiang (73 papers)
  4. Ziqi Huang (20 papers)
  5. Ziwei Liu (368 papers)
Citations (34)

Summary

The Challenge of Temporal Consistency in Video Generation

Video generation models have made leaps forward, producing increasingly realistic and complex video content. However, a persistent challenge remains: temporal consistency. For a generated video to look natural, the subjects and scenes within its frames must maintain visual and semantic coherence over time.

Unveiling an Implicit Gap

At the heart of the problem lies a discrepancy between the frequency distributions of the initial noise used during training and during inference of video diffusion models. Put simply, the spatial-temporal frequency characteristics of the initial noise at inference diverge from those seen during training. This gap is particularly pronounced in the low-frequency components, which the paper finds exert a strong influence on the denoising process.
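The low/high-frequency decomposition at the core of this analysis can be sketched as follows. This is a minimal illustration, not the authors' released code: it splits a `(T, H, W)` noise volume into low- and high-frequency parts with a Gaussian low-pass filter in the 3D FFT domain, and the cutoff value `d0` is an assumed parameter, not taken from the paper's configuration.

```python
import numpy as np

def freq_split(noise, d0=0.25):
    """Split a (T, H, W) noise volume into low- and high-frequency
    components using a Gaussian low-pass mask in the 3D FFT domain.
    d0 is a normalized cutoff (an assumed value for illustration)."""
    T, H, W = noise.shape
    # Centered, normalized frequency coordinates along each axis
    ft = np.fft.fftshift(np.fft.fftfreq(T))
    fh = np.fft.fftshift(np.fft.fftfreq(H))
    fw = np.fft.fftshift(np.fft.fftfreq(W))
    gt, gh, gw = np.meshgrid(ft, fh, fw, indexing="ij")
    dist2 = gt**2 + gh**2 + gw**2
    lowpass = np.exp(-dist2 / (2 * d0**2))  # Gaussian low-pass mask

    spectrum = np.fft.fftshift(np.fft.fftn(noise))
    low = np.fft.ifftn(np.fft.ifftshift(spectrum * lowpass)).real
    high = np.fft.ifftn(np.fft.ifftshift(spectrum * (1 - lowpass))).real
    return low, high

rng = np.random.default_rng(0)
eps = rng.standard_normal((8, 16, 16))
low, high = freq_split(eps)
# The two bands recombine to the original noise volume
assert np.allclose(low + high, eps)
```

Because the low-pass and high-pass masks sum to one, the two bands always reconstruct the original volume exactly; the point of the analysis is that the low band of inference-time Gaussian noise differs statistically from that of the noised training latents.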

Introducing FreeInit

Responding to this issue, FreeInit enhances temporal consistency explicitly during the inference stage. With no additional training required, FreeInit iteratively refines the initial noise, specifically targeting its low-frequency components. This refinement narrows the gap between the training and inference noise distributions, significantly improving both the subject appearance and the temporal consistency of the generated frames.
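The refinement loop described above can be sketched as follows. This is a hypothetical outline assuming stand-in `denoise` and `diffuse` functions, not the authors' implementation: each iteration samples a video latent, re-diffuses it back to the noise level, keeps its low-frequency content, and splices in fresh high-frequency Gaussian noise before re-sampling. The ideal-filter cutoff is an assumed value.

```python
import numpy as np

def lowpass_split(x, cutoff=0.25):
    # Ideal 3D low-pass split in the FFT domain (cutoff is an assumed value)
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in x.shape], indexing="ij")
    mask = sum(f**2 for f in freqs) <= cutoff**2
    low = np.fft.ifftn(np.fft.fftn(x) * mask).real
    return low, x - low

def freeinit_sample(denoise, diffuse, z_T, num_iters=3, rng=None):
    """Hypothetical FreeInit refinement loop (a sketch, not the released code).
    denoise: full sampling pass, initial noise -> clean latent
    diffuse: forward diffusion, clean latent -> noisy latent at timestep T"""
    rng = rng or np.random.default_rng(0)
    z = z_T
    for _ in range(num_iters):
        z0 = denoise(z)                    # sample a video latent from current noise
        zT_prime = diffuse(z0)             # re-diffuse it back to the noise level
        low, _ = lowpass_split(zT_prime)   # keep its low-frequency content
        _, high = lowpass_split(rng.standard_normal(z.shape))
        z = low + high                     # splice in fresh high-frequency noise
    return denoise(z)

# Toy usage with stand-in denoise/diffuse functions (placeholders, not real models)
rng = np.random.default_rng(1)
z_T = rng.standard_normal((4, 8, 8))
out = freeinit_sample(lambda z: 0.5 * z, lambda z0: 2.0 * z0, z_T, rng=rng)
assert out.shape == z_T.shape
```

In practice, `denoise` would be a full DDIM sampling run of the text-to-video model and `diffuse` the forward diffusion to timestep T, which is why each extra iteration adds a full sampling pass of cost.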

Empirical Validation

The method's efficacy is borne out in comprehensive experiments across a variety of text-to-video diffusion models and text prompts. FreeInit consistently delivers notable improvements in the quality of the generated videos, requiring only minor tweaks to the frequency-filter parameters for each model. Moreover, a user study showed a significant preference for videos enhanced by FreeInit, confirming the method's benefits from a qualitative perspective.

Future Implications

While the strategy marks a notable advance, it is not without constraints. Because FreeInit is iterative, it increases computational load and inference time. Techniques such as Coarse-to-Fine Sampling are suggested to mitigate this cost. Despite these challenges, FreeInit stands as a promising stride towards generating more temporally consistent videos directly from textual descriptions, with vast potential in applications ranging from media and entertainment to training and simulation.
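One plausible way to realize the coarse-to-fine idea is to run the early refinement iterations with fewer sampling steps and reserve the full step budget for the final pass. The schedule below is an assumed scheme for illustration, not a scheme specified in the paper.

```python
def coarse_to_fine_steps(num_iters=5, final_steps=50, min_steps=10):
    """Hypothetical step schedule: halve the budget for each earlier
    iteration, clamped to min_steps; only the last pass uses the full budget."""
    return [max(final_steps // 2 ** (num_iters - 1 - i), min_steps)
            for i in range(num_iters)]

schedule = coarse_to_fine_steps()
# Early iterations are cheap; only the final one pays full cost
assert schedule == sorted(schedule)
assert schedule[-1] == 50
```

Since each FreeInit iteration only needs a rough estimate of the low-frequency content, cheap early passes can substantially cut total inference time without changing the final full-budget sample.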
