FreeNoise: Advancing Video Diffusion Models with Efficient Noise Rescheduling
In this paper, the authors tackle the challenge of generating longer text-to-video content with video diffusion models, which traditionally struggle to extend beyond a limited number of frames because of the fixed length they were trained on. The manuscript introduces a novel approach termed "FreeNoise," which augments pre-trained video diffusion models so they can generate lengthy video sequences while maintaining high content consistency and computational efficiency.
Core Contributions
The paper focuses on two main avenues of improvement for video diffusion models:
- Noise Rescheduling for Extended Video Generation: The authors identify a critical shortcoming of existing video diffusion models: they are trained on a fixed number of frames, so a significant train-inference gap opens when longer sequences are requested. FreeNoise addresses this with noise rescheduling, in which the initial noise frames are rearranged using a local noise shuffle strategy combined with window-based attention fusion. This orchestration of noise frames sustains coherence over extended sequences and rectifies the mismatch between training and inference without retraining the model (a sketch of the local shuffle appears after this list).
- Multi-Prompt Motion Injection: Another innovative aspect of FreeNoise is its ability to handle multiple text prompts, a necessity for video content that must evolve over time while maintaining continuity. The paper introduces a "motion injection" technique that leverages the diffusion model's multi-step denoising process to inject new motions at strategically chosen time steps. This enables sequences with substantial motion while managing transitions between video segments dictated by separate text prompts (see the conditioning-schedule sketch below).
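To make the local noise shuffle concrete, below is a minimal PyTorch sketch of how a rescheduled noise sequence could be built: the noise frames used for training-length generation are reused for the extra frames, shuffled within small local windows. The function name, window size, and repetition scheme are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def reschedule_noise(base_noise, total_frames, window=4, generator=None):
    """Illustrative FreeNoise-style noise rescheduling (not the official code).

    base_noise:   [f, C, H, W] initial noise for the f frames the model was trained on.
    total_frames: desired length of the longer video.
    Frames beyond f reuse the original noise frames, shuffled within local windows,
    so the long sequence keeps long-range correlation with the training-length noise.
    """
    f = base_noise.shape[0]
    chunks = [base_noise]  # the first f frames keep their original noise
    while sum(c.shape[0] for c in chunks) < total_frames:
        shuffled = base_noise.clone()
        for start in range(0, f, window):
            end = min(start + window, f)
            perm = start + torch.randperm(end - start, generator=generator)
            shuffled[start:end] = base_noise[perm]  # shuffle only inside this window
        chunks.append(shuffled)
    return torch.cat(chunks, dim=0)[:total_frames]
```

For example, `reschedule_noise(torch.randn(16, 4, 40, 64), total_frames=64)` produces noise for 64 frames while drawing every frame from the original 16 noise maps.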
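The multi-prompt mechanism can likewise be sketched as a conditioning schedule over denoising steps. In the illustrative snippet below, all frames share the base prompt outside a "motion band" of timesteps, and each frame switches to its own segment's prompt inside that band; the band boundaries and tensor layout are assumptions for illustration rather than the paper's exact schedule.

```python
import torch

def select_prompt_embeddings(t, base_emb, segment_embs, frame_to_segment,
                             motion_band=(600, 900)):
    """Illustrative motion-injection-style prompt schedule (thresholds are assumed).

    t:                current denoising timestep (an int in [0, 1000)).
    base_emb:         [L, D] text embedding of the first / base prompt.
    segment_embs:     [S, L, D] embeddings of the per-segment prompts.
    frame_to_segment: [F] long tensor mapping each frame to its prompt segment.
    Returns:          [F, L, D] per-frame conditioning for this step.
    """
    num_frames = frame_to_segment.shape[0]
    if motion_band[0] <= t <= motion_band[1]:
        # steps that mostly shape motion: inject each segment's own prompt
        return segment_embs[frame_to_segment]
    # all other steps: keep the base prompt so appearance and layout stay consistent
    return base_emb.unsqueeze(0).expand(num_frames, -1, -1)
```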
Methodology and Insights
- Temporal Modeling and Attention Fusion: The research dissects the temporal modeling mechanism in video diffusion models, pinpointing the influential role of the initial noise. The authors show that rescheduling the sequence of initial noise frames allows the model to extend video length without deviating significantly from its training conditions. They further employ a window-based fusion method, ensuring that temporal attention within the model's U-Net architecture remains computationally efficient over the longer sequence (a sketch of this fusion appears after this list).
- Computational Efficiency: A notable advantage of FreeNoise is its computational efficiency compared with existing methods for long video generation. In particular, FreeNoise adds only a minimal increase in inference time (approximately 17%), whereas competing methods incur substantially larger overheads.
- Evaluation and Results: Extensive experiments validate the effectiveness of FreeNoise, with metrics such as Fréchet Video Distance (FVD) and CLIP similarity demonstrating the improved quality and consistency of the generated videos. FreeNoise outperforms baseline methods, achieving lower FVD and higher content-consistency scores and underscoring its ability to generate high-fidelity video over extended temporal spans (a hedged sketch of the CLIP-based scoring follows this list).
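As a rough illustration of the window-based fusion described above, the sketch below runs the model's ordinary temporal attention on overlapping windows of the training length and averages the outputs where windows overlap, so no single attention call sees more frames than the model was trained on. The tensor layout, window size, stride, and the `attn_fn` callable are assumptions made for this sketch.

```python
import torch

def windowed_temporal_attention(q, k, v, attn_fn, window=16, stride=4):
    """Illustrative window-based temporal attention fusion.

    q, k, v: [F, N, D] per-frame tensors (F frames, N spatial tokens, D channels).
    attn_fn: the model's usual temporal attention, applied to one window of frames.
    """
    num_frames = q.shape[0]
    if num_frames <= window:
        return attn_fn(q, k, v)
    out = torch.zeros_like(q)
    count = torch.zeros(num_frames, 1, 1, device=q.device, dtype=q.dtype)
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] != num_frames - window:
        starts.append(num_frames - window)  # make sure the final frames are covered
    for s in starts:
        out[s:s + window] += attn_fn(q[s:s + window], k[s:s + window], v[s:s + window])
        count[s:s + window] += 1
    return out / count  # average outputs where windows overlap
```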
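For readers unfamiliar with the consistency metric, CLIP similarity is usually reported in two flavors: alignment between the prompt and each frame, and similarity between consecutive frames. The sketch below computes both with the Hugging Face CLIP implementation; the checkpoint choice and averaging scheme are assumptions and may differ from the paper's exact protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(frames, prompt):
    """frames: list of PIL.Image video frames; prompt: the text prompt."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    text_sim = (img @ txt.T).mean().item()                  # prompt vs. every frame
    frame_sim = (img[:-1] * img[1:]).sum(-1).mean().item()  # consecutive-frame consistency
    return text_sim, frame_sim
```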
Implications and Future Directions
The introduction of FreeNoise represents a significant advancement in the capabilities of video generation models. By efficiently addressing the constraints of pre-trained diffusion models through noise rescheduling and motion injection, the approach sets a benchmark for future developments in generating dynamic, extended video sequences under varying conditions. The ability to maintain content consistency and computational efficiency will likely inspire additional research into more adaptive and extensive generation methodologies. Further exploration could delve into integrating FreeNoise with newer base models or optimizing it for diverse applications in digital media, entertainment, and virtual reality content creation.
In conclusion, the proposed FreeNoise framework not only enhances the scalability and efficacy of existing video diffusion models but also opens new avenues for creative and practical implementations in the rapidly evolving field of AI-driven video content generation.