
FIFO-Diffusion: Generating Infinite Videos from Text without Training (2405.11473v4)

Published 19 May 2024 in cs.CV and cs.AI

Abstract: We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional training. This is achieved by iteratively performing diagonal denoising, which simultaneously processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword as the frames near the tail can take advantage of cleaner frames by forward reference but such a strategy induces the discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. Practically, FIFO-Diffusion consumes a constant amount of memory regardless of the target video length given a baseline model, while well-suited for parallel inference on multiple GPUs. We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines. Generated video examples and source codes are available at our project page.

FIFO-Diffusion for Text-to-Video Generation

Overview

This article examines FIFO-Diffusion, a method for generating long videos from textual descriptions. The approach takes pretrained diffusion models, typically used to generate short video clips, and extends them to produce lengthy, high-quality sequences.

Background and Motivation

Diffusion models have been remarkably successful in generative AI, especially for images. Video generation with these models, however, poses unique challenges. Traditional video diffusion models treat videos as 4D tensors, adding a temporal axis to the spatial dimensions, which complicates long video generation. Moreover, commonly adopted chunked autoregressive strategies, in which several frames are predicted in parallel chunk by chunk, often suffer from temporal inconsistency and discontinuous motion between separately generated chunks.

FIFO-Diffusion addresses these issues by proposing a method that doesn't require additional training and can generate videos of arbitrary lengths, maintaining high visual quality throughout the generated video.

Key Components

Diagonal Denoising

At the core of FIFO-Diffusion is a technique termed diagonal denoising. This method processes a fixed number of consecutive frames, each at a progressively higher noise level, in first-in-first-out order using a queue.

  1. Queue Handling: Frames are stored in a queue; each denoising step removes (dequeues) the fully denoised frame at the head and introduces (enqueues) a new random-noise frame at the tail, at the highest noise level.
  2. Sequential Context: Unlike chunked autoregressive methods, FIFO-Diffusion lets every frame reference the cleaner frames ahead of it in the queue, which better preserves temporal consistency as the video progresses.
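
To make the procedure concrete, here is a minimal sketch of the diagonal denoising loop. The `model` callable and its `(latents, timesteps, text_emb)` interface are hypothetical stand-ins, not the authors' actual API; a real implementation would wrap a pretrained video diffusion model's sampling step.

```python
from collections import deque

import torch

def fifo_diagonal_denoising(model, text_emb, num_frames, f=16, T=16,
                            latent_shape=(4, 32, 32)):
    """Sketch of FIFO-Diffusion's diagonal denoising loop.

    Assumes a hypothetical `model(latents, timesteps, text_emb)` that
    advances each of the f queued latents by one denoising step, where
    noise levels increase from the head to the tail of the queue.
    """
    # Assign increasing timesteps t_1 < ... < t_f to the queue slots.
    timesteps = torch.linspace(1, T, f).long()
    # Initialize the queue entirely with Gaussian noise.
    queue = deque(torch.randn(latent_shape) for _ in range(f))
    outputs = []
    while len(outputs) < num_frames:
        latents = torch.stack(list(queue))             # (f, C, H, W)
        latents = model(latents, timesteps, text_emb)  # one diagonal step
        queue = deque(latents.unbind(0))
        outputs.append(queue.popleft())          # dequeue the clean frame
        queue.append(torch.randn(latent_shape))  # enqueue fresh noise
    return torch.stack(outputs)
```

Because the queue always holds exactly f latents, memory stays constant no matter how many frames are generated; for instance, `fifo_diagonal_denoising(lambda x, t, c: 0.9 * x, None, num_frames=100)` runs the loop end to end with a dummy denoiser.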

Comparing Methods

(Figure: chunked autoregressive generation on the left versus FIFO-Diffusion's steady, diagonal frame progression on the right.)

Latent Partitioning

Diagonal denoising, while powerful, can introduce a mismatch between training and inference noise levels. To mitigate this, the paper introduces latent partitioning:

  1. Noise Level Reduction: Increasing the number of discretization steps and partitioning the queued frames into multiple blocks reduces the noise-level differences between adjacent frames within each block.
  2. Parallel Processing: This technique also enables parallel denoising across multiple GPUs, improving computational efficiency.
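
A rough sketch of one partitioned step appears below, reusing the hypothetical `model` interface from the earlier snippet. The finer schedule (n * T steps for an n-block queue of length n * f) and the per-block dispatch are the essential ideas; the exact partitioning in the paper differs in detail.

```python
import torch

def partitioned_diagonal_step(queue, model, text_emb, n, f, T):
    """One diagonal denoising step with latent partitioning (sketch).

    The queue tensor holds n * f latents over a schedule discretized
    into n * T steps, so frames within a block are separated by much
    smaller noise-level gaps than in plain diagonal denoising.
    """
    assert queue.shape[0] == n * f
    timesteps = torch.linspace(1, n * T, n * f).long()
    blocks = []
    for k in range(n):  # blocks are independent -> one GPU per block
        sl = slice(k * f, (k + 1) * f)
        blocks.append(model(queue[sl], timesteps[sl], text_emb))
    return torch.cat(blocks, dim=0)
```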

Lookahead Denoising

Lookahead denoising enhances diagonal denoising by allowing noisier frames to benefit from cleaner preceding ones:

  1. Enhanced Accuracy: Empirical evidence shows that frames processed with lookahead denoising achieve more accurate noise predictions.
  2. Increased Computation: It involves an increased computational load, but with parallel processing, this overhead can be effectively managed.
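
The following sketch illustrates the idea, again with the hypothetical `model` interface from above. It slides a window of f frames in strides of f/2 and commits only each window's noisier back half, so every committed frame is predicted while the model attends to f/2 cleaner frames ahead of it; handling of the f/2 frames at the very head of the queue requires separate treatment and is omitted for brevity.

```python
import torch

def lookahead_denoising_step(queue, model, text_emb, timesteps, f):
    """Sketch of one lookahead denoising pass over a queue tensor.

    Each window of f frames is denoised jointly, but only the noisier
    back half of the prediction is kept, roughly doubling the number
    of model calls. The windows are independent, so the extra cost
    parallelizes across GPUs.
    """
    half = f // 2
    out = queue.clone()
    for start in range(0, queue.shape[0] - f + 1, half):
        window = model(queue[start:start + f],
                       timesteps[start:start + f], text_emb)
        out[start + half:start + f] = window[half:]  # keep back half
    return out
```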

Results and Implications

Qualitative Results

The research showcases extensive qualitative results where FIFO-Diffusion is used to generate videos with thousands of frames, maintaining high visual fidelity and coherent motion throughout.

Generated Long Videos

Examples include scenes like a fireworks display over Sydney Harbour and a penguin colony in Antarctica, all generated at 4K resolution without quality degradation.

Quantitative Metrics

The approach was validated on several established text-to-video baselines. Compared with methods such as FreeNoise and Gen-L-Video, FIFO-Diffusion demonstrated superior motion smoothness and visual quality. User studies supported these findings, showing a significant preference for FIFO-Diffusion across criteria, particularly motion accuracy and scene consistency.

Computational Efficiency

FIFO-Diffusion generates videos of arbitrary length with constant memory usage, since its queue holds a fixed number of frames at any time. This sets it apart from methods whose memory requirements grow with the video length, a crucial advantage for practical applications.

Future Directions

While FIFO-Diffusion significantly mitigates the difficulties of long video generation, it still operates under the training-inference discrepancy introduced by diagonal denoising, which warrants further investigation. Future work could incorporate diagonal denoising into the training phase, aligning the training and inference conditions and potentially improving the method further.

Conclusion

FIFO-Diffusion introduces an innovative way to generate infinitely long, high-quality videos using pretrained text-conditional video models, offering a significant step forward in overcoming the temporal consistency issues of previous methods. Its ability to leverage pretrained models without additional training underscores its practical utility and potential impact across various applications in AI-driven video content creation.

References (31)
  1. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
  2. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a.
  3. VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
  4. SEINE: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:2310.20700, 2023b.
  5. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
  6. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  7. Flexible diffusion modeling of long videos. In NeurIPS, 2022.
  8. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
  9. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  10. Video diffusion models. In NeurIPS, 2022.
  11. CogVideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR, 2023.
  12. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
  13. Videofusion: Decomposed diffusion models for high-quality video generation. In CVPR, 2023.
  14. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  15. Scalable diffusion models with transformers. In ICCV, 2023.
  16. FreeNoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169, 2023.
  17. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  18. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  19. Make-A-Video: Text-to-video generation without text-video data. In ICLR, 2023.
  20. Denoising diffusion implicit models. In ICLR, 2021a.
  21. Score-based generative modeling through stochastic differential equations. In ICLR, 2021b.
  22. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  23. Phenaki: Variable length video generation from open domain textual description. In ICLR, 2023.
  24. MCVD: Masked conditional video diffusion for prediction, generation, and interpolation. In NeurIPS, 2022.
  25. Gen-L-Video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023a.
  26. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023b.
  27. LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023c.
  28. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
  29. Video diffusion models with local-global context guidance. In IJCAI, 2023.
  30. NUWA-XL: Diffusion over diffusion for extremely long video generation. arXiv preprint arXiv:2303.12346, 2023.
  31. MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
Authors (4)
  1. Jihwan Kim
  2. Junoh Kang
  3. Jinyoung Choi
  4. Bohyung Han