FIFO-Diffusion for Text-to-Video Generation
Overview
This article dives into a novel method for generating long videos from textual descriptions using a technique called FIFO-Diffusion. This approach leverages pretrained diffusion models, typically used for generating shorter video clips, and extends their capability to produce lengthy, high-quality sequences.
Background and Motivation
Diffusion models have shone brightly in the field of generative AI, especially for images. Video generation with these models, however, brings unique challenges. Traditional video diffusion models treat videos as 4D tensors, adding a temporal axis to the spatial dimensions, which makes long videos expensive to generate directly. The commonly adopted chunked autoregressive strategy, which generates a batch of frames in parallel conditioned on the tail of the previous chunk, often suffers from temporal inconsistency and discontinuous motion at the boundaries between separately generated chunks.
FIFO-Diffusion addresses these issues by proposing a method that doesn't require additional training and can generate videos of arbitrary lengths, maintaining high visual quality throughout the generated video.
Key Components
Diagonal Denoising
At the core of FIFO-Diffusion is a technique termed diagonal denoising. This method processes a fixed number of frames, each at progressively noisier levels, in a first-in-first-out manner using a queue.
- Queue Handling: The frames are stored in a queue, where each new denoising step removes (dequeues) the most denoised frame and introduces (enqueues) a new random noise frame at the highest noise level.
- Sequential Context: Unlike chunked autoregressive methods, FIFO-Diffusion lets every frame attend to a window of earlier, cleaner frames at each step, better preserving temporal consistency as the video progresses.
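The queue mechanics above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `denoise_step` is a hypothetical stand-in for one call to a pretrained video diffusion model, and the latents are plain NumPy arrays.

```python
from collections import deque

import numpy as np


def fifo_diffusion(num_steps, num_output_frames, frame_shape=(4,), seed=0):
    """Toy sketch of FIFO-Diffusion's diagonal denoising queue.

    The queue always holds `num_steps` latent frames at progressively
    noisier levels: the head is nearly clean, the tail is pure noise.
    Each iteration denoises every frame one step, dequeues the fully
    denoised head, and enqueues fresh Gaussian noise at the tail, so
    memory stays constant no matter how long the video gets.
    """
    rng = np.random.default_rng(seed)

    def denoise_step(frames):
        # Hypothetical stand-in for one call to a pretrained video
        # diffusion model, which would denoise all queued frames jointly
        # by one step. Shrinking toward zero keeps the sketch runnable.
        return [0.9 * f for f in frames]

    # Initialize frames at progressively noisier levels (head -> tail).
    queue = deque(
        rng.standard_normal(frame_shape) * (i + 1) / num_steps
        for i in range(num_steps)
    )

    outputs = []
    while len(outputs) < num_output_frames:
        queue = deque(denoise_step(list(queue)))        # one diagonal step
        outputs.append(queue.popleft())                 # dequeue cleanest frame
        queue.append(rng.standard_normal(frame_shape))  # enqueue fresh noise
    return outputs
```

Note that the queue length never changes while `outputs` grows without bound, which is the source of the constant-memory property discussed later.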
[Figure: chunked autoregressive generation (left) versus FIFO-Diffusion's diagonal, queue-based frame progression (right).]
Latent Partitioning
Diagonal denoising, while powerful, introduces a gap between training and inference: the underlying model was trained to denoise frames that all share a single noise level, yet at inference it now sees frames at different levels within one call. To mitigate this, the paper introduces latent partitioning:
- Noise Level Reduction: By increasing the discretization steps and partitioning the frames into multiple blocks, the noise level differences among frames are reduced.
- Parallel Processing: This technique also enables parallel denoising across multiple GPUs, improving computational efficiency.
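The partitioning idea can be illustrated with a small sketch. The function name and the toy integer "noise levels" are illustrative, not from the paper:

```python
def partition_queue(queue, num_partitions):
    """Split a diagonal-denoising queue into contiguous blocks.

    With the timestep schedule refined by a factor of `num_partitions`,
    each block covers only 1/num_partitions of the full noise range, so
    frames within a block sit at much closer noise levels; the blocks
    can also be denoised on separate GPUs in parallel.
    """
    block = len(queue) // num_partitions
    return [queue[i * block:(i + 1) * block] for i in range(num_partitions)]


# 16 discretized noise levels split into 4 blocks: the worst within-block
# gap falls from 15 levels (one undivided queue) to 3 levels.
levels = list(range(16))
blocks = partition_queue(levels, 4)
```

Each model call then only ever sees frames from one block, keeping the per-call noise-level spread small.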
Lookahead Denoising
Lookahead denoising enhances diagonal denoising by allowing noisier frames to benefit from cleaner preceding ones:
- Enhanced Accuracy: Empirical evidence shows that frames processed with lookahead denoising achieve more accurate noise predictions.
- Increased Computation: It involves an increased computational load, but with parallel processing, this overhead can be effectively managed.
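Under the same toy assumptions (a stand-in `model` callable in place of the real diffusion network), lookahead denoising can be sketched as keeping, for each frame, only the prediction made while that frame sat in the noisier half of its window:

```python
def lookahead_denoise(queue, model, window):
    """Sketch of lookahead denoising over a diagonal queue.

    Plain diagonal denoising predicts noise for a whole window in one
    model call, so the noisiest frames in each window get the least help
    from cleaner context. Lookahead denoising instead keeps, for each
    frame, only the prediction from the window in which that frame sits
    in the noisier second half, so it always "looks back" at cleaner
    frames. Windows overlap by half, roughly doubling compute, but the
    extra calls can run in parallel.
    """
    n, half = len(queue), window // 2
    out = list(queue)  # in this sketch, head frames are kept as-is
    for start in range(0, n - window + 1, half):
        pred = model(queue[start:start + window])
        # keep only the noisier second half of the window's prediction,
        # which benefited from the cleaner first half as context
        out[start + half:start + window] = pred[half:]
    return out
```

The paper handles the very first frames with padding; this sketch simply leaves them untouched to stay short.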
Results and Implications
Qualitative Results
The research showcases extensive qualitative results where FIFO-Diffusion is used to generate videos with thousands of frames, maintaining high visual fidelity and coherent motion throughout.
Examples include scenes like a fireworks display over Sydney Harbour and a penguin colony in Antarctica, all generated at 4K resolution without quality degradation.
Quantitative Metrics
The approach was validated against established baselines. Compared with methods such as FreeNoise and Gen-L-Video, FIFO-Diffusion demonstrated superior motion smoothness and visual quality. User studies supported these findings, showing a significant preference for FIFO-Diffusion across criteria, especially motion accuracy and scene consistency.
Computational Efficiency
FIFO-Diffusion can generate videos of arbitrary length with constant memory usage due to its efficient queue handling. This sets it apart from other methods requiring memory proportional to the video length, a crucial advantage for practical application.
Future Directions
While FIFO-Diffusion significantly mitigates issues in long video generation, diagonal denoising introduces a trade-off: the model must denoise frames at mixed noise levels it never saw jointly during training, which warrants further investigation. Future work could integrate diagonal denoising into the training phase, aligning the training and inference settings and potentially improving the method's efficacy even further.
Conclusion
FIFO-Diffusion introduces an innovative way to generate infinitely long, high-quality videos using pretrained text-conditional video models, offering a significant step forward in overcoming the temporal consistency issues of previous methods. Its ability to leverage pretrained models without additional training underscores its practical utility and potential impact across various applications in AI-driven video content creation.