Lumiere: A Space-Time Diffusion Model for Video Generation (2401.12945v2)

Published 23 Jan 2024 in cs.CV

Abstract: We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

Introduction

The paper introduces Lumiere, a diffusion model tailored to generating videos from textual descriptions. It is motivated by a central challenge of video synthesis: producing videos that are not only photorealistic but also exhibit diverse, coherent motion over time. Whereas prior models construct videos by rendering distant keyframes and then filling the gaps with temporal super-resolution, Lumiere employs a novel Space-Time U-Net (STUNet) architecture that generates the entire video sequence in a single network pass, down- and up-sampling the signal in both space and time.
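To make the single-pass idea concrete, the sketch below (illustrative Python/PyTorch, with a placeholder stunet function and made-up tensor sizes, not the authors' code) shows one network call over the full (batch, channels, frames, height, width) tensor producing a noise estimate for the entire clip, rather than for sparse keyframes.

```python
# Minimal sketch of single-pass, full-duration denoising (not the authors' code):
# the network sees the entire (B, C, T, H, W) video tensor at once, so one call
# covers every frame, with no separate keyframe + temporal super-resolution stages.
import torch

B, C, T, H, W = 1, 3, 80, 128, 128        # e.g. 80 frames of a 5 s, 16 fps clip
x_t = torch.randn(B, C, T, H, W)          # noisy video at diffusion step t

def stunet(x, t, text_emb):
    """Placeholder Space-Time U-Net: returns a noise prediction with the
    same shape as the input, covering the whole clip in a single pass."""
    return torch.zeros_like(x)            # stand-in prediction

timestep = torch.tensor([500])
text_emb = torch.zeros(B, 77, 768)        # stand-in text embedding
eps_hat = stunet(x_t, timestep, text_emb) # one pass over the full duration
assert eps_hat.shape == x_t.shape
```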

Architectural Overview

Lumiere's U-Net-like architecture is distinctive in that it down- and up-samples the video along both the spatial and the temporal axes. This lets the model process the full temporal duration of the clip within a single pass, which in turn yields more globally coherent motion than prior cascaded approaches, whose pipelines lack temporal down- and up-sampling. The absence of cascaded temporal super-resolution models from Lumiere's pipeline is a salient feature that markedly differentiates it from contemporary systems. A rough sketch of joint space-time down- and up-sampling is given below.
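The layer choices in this sketch (factorized 3D convolutions, stride-2 down-sampling, transposed-convolution up-sampling) are assumptions for illustration rather than the paper's exact modules; the point is that the bottleneck operates on a representation that is coarse in both space and time.

```python
# Minimal sketch (assumed layer choices, not the paper's exact modules):
# a space-time stage that compresses the video along BOTH the spatial and the
# temporal axes, then restores them.
import torch
import torch.nn as nn

class SpaceTimeStage(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Factorized convolutions: 2D spatial conv followed by 1D temporal conv.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Downsample time and space together (stride 2 on T, H, W).
        self.down = nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1)
        # Upsample back with a transposed convolution.
        self.up = nn.ConvTranspose3d(channels, channels, kernel_size=4, stride=2, padding=1)

    def forward(self, x):                    # x: (B, C, T, H, W)
        h = self.temporal(self.spatial(x))
        h_low = self.down(h)                 # (B, C, T/2, H/2, W/2)
        return self.up(h_low), h_low

stage = SpaceTimeStage(channels=8)
video = torch.randn(1, 8, 16, 64, 64)
restored, bottleneck = stage(video)
print(bottleneck.shape)                      # torch.Size([1, 8, 8, 32, 32])
print(restored.shape)                        # torch.Size([1, 8, 16, 64, 64])
```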

Technical Contributions

As their core technical contribution, the authors show how Lumiere circumvents the need for temporal super-resolution modules by directly generating low-resolution videos at the full frame rate. The model then applies spatial super-resolution over overlapping temporal windows, using the MultiDiffusion technique to fuse the windows and keep the synthesis coherent over the entire clip length. Additionally, Lumiere builds on a pre-trained text-to-image diffusion model, selectively training the temporal components of the architecture while preserving the pre-trained model's strengths.
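The window-fusion idea can be sketched as follows; the window size, stride, and uniform averaging weights are illustrative assumptions rather than the paper's exact settings, and the denoiser is a stand-in.

```python
# Minimal sketch of MultiDiffusion-style fusion over temporal windows:
# each window is processed independently, and overlapping frames are blended
# by averaging the per-window predictions, which suppresses seams between segments.
import torch

def fuse_temporal_windows(video, denoise_window, window=16, stride=8):
    """video: (B, C, T, H, W); denoise_window: callable on a (B, C, w, H, W) chunk."""
    B, C, T, H, W = video.shape
    out = torch.zeros_like(video)
    weight = torch.zeros(1, 1, T, 1, 1)
    for start in range(0, max(T - window, 0) + 1, stride):
        end = start + window
        chunk = denoise_window(video[:, :, start:end])   # per-window prediction
        out[:, :, start:end] += chunk
        weight[:, :, start:end] += 1.0
    # clamp guards against frames that fall outside every window
    return out / weight.clamp(min=1.0)

# Usage with a stand-in "denoiser" (identity) on a 40-frame clip:
clip = torch.randn(1, 3, 40, 32, 32)
fused = fuse_temporal_windows(clip, lambda c: c)
```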

Applications and Evaluation

In terms of applications, Lumiere extends beyond plain text-to-video generation to image-to-video translation, style-referenced generation, video inpainting, and more. The evaluation shows that the model produces videos with substantial motion while maintaining visual quality and adhering to the guiding text prompts. Comparative studies report competitive FVD and Inception Score (IS) results on the UCF101 dataset, indicating that the generated videos align closely with human perception of realism.
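For reference, FVD compares the Gaussian statistics of features extracted from real and generated videos via the Fréchet distance; the standard protocol extracts those features with a pretrained I3D video classifier. The sketch below uses random placeholder features and only shows the distance computation itself.

```python
# Minimal sketch of the Frechet distance underlying FVD; the feature arrays here
# are random placeholders standing in for I3D features of real/generated videos.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    """feats_*: (N, D) arrays of per-video features."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):       # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

real = np.random.randn(512, 128)       # placeholder feature matrices
fake = np.random.randn(512, 128)
print(frechet_distance(real, fake))
```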

Conclusion

Lumiere establishes a pioneering approach to video generation, overcoming challenges associated with temporal coherency and complexity. Through its innovative design and performance, it sets a new benchmark in the field and opens up possibilities for numerous creative applications, making content creation more accessible and versatile for users at various skill levels.

References (54)
  1. MultiDiffusion: Fusing diffusion paths for controlled image generation. In ICML, 2023.
  2. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
  3. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023b.
  4. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pp.  6299–6308, 2017.
  5. Chen, T. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
  6. Effectively unbiased FID and Inception Score and where to find them. In CVPR, pp.  6070–6079, 2020.
  7. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In MICCAI, pp. 424–432. Springer, 2016.
  8. Diffusion models in vision: A survey. IEEE T. Pattern Anal. Mach. Intell., 2023a.
  9. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023b.
  10. Shot durations, shot classes, and the increased pace of popular movies, 2015.
  11. Diffusion models beat gans on image synthesis. NeurIPS, 2021.
  12. Breathing life into sketches using text-to-video priors. arXiv preprint arXiv:2311.13608, 2023.
  13. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, pp.  22930–22941, 2023.
  14. Emu Video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  15. Matryoshka diffusion models. arXiv preprint arXiv:2310.15111, 2023.
  16. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  17. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
  18. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  19. Video diffusion models, 2022b.
  20. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  21. Simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023.
  22. Style transfer by relaxed optimal transport and self-similarity. In CVPR, pp.  10051–10060, 2019.
  23. VideoPoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
  24. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
  25. Improved denoising diffusion probabilistic models. In ICML, pp.  8162–8171, 2021.
  26. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022.
  27. Pika labs. https://www.pika.art/, 2023.
  28. Resolution dependent GAN interpolation for controllable image synthesis between domains. In Machine Learning for Creativity and Design NeurIPS 2020 Workshop, 2020.
  29. State of the art on diffusion models for visual computing. arXiv preprint arXiv:2310.07204, 2023.
  30. DreamFusion: Text-to-3D using 2D diffusion. In ICLR, 2023.
  31. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
  32. High-resolution image synthesis with latent diffusion models. In CVPR, pp.  10684–10695, 2022.
  33. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pp.  234–241. Springer, 2015.
  34. RunwayML. Gen-2. https://research.runwayml.com/gen2, 2023.
  35. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp.  1–10, 2022a.
  36. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022b.
  37. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN. Int. J. Comput. Vision, 128(10-11):2586–2606, 2020.
  38. Improved techniques for training GANs. NIPS, 29, 2016.
  39. Make-a-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  40. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, pp.  2256–2265, 2015.
  41. StyleDrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983, 2023.
  42. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  43. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  44. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pp.  6450–6459, 2018.
  45. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  46. Phenaki: Variable length video generation from open domain textual description. In ICLR, 2023.
  47. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
  48. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023b.
  49. Nüwa: Visual synthesis pre-training for neural visual world creation. In ECCV, pp.  720–736. Springer, 2022.
  50. Inflation with diffusion: Efficient temporal adaptation for text-to-video super-resolution, 2024.
  51. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a.
  52. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  3836–3847, 2023b.
  53. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp.  586–595, 2018.
  54. MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
Authors (17)
  1. Omer Bar-Tal
  2. Hila Chefer
  3. Omer Tov
  4. Charles Herrmann
  5. Roni Paiss
  6. Shiran Zada
  7. Ariel Ephrat
  8. Junhwa Hur
  9. Yuanzhen Li
  10. Tomer Michaeli
  11. Oliver Wang
  12. Deqing Sun
  13. Tali Dekel
  14. Inbar Mosseri
  15. Guanghui Liu
  16. Amit Raj
  17. Michael Rubinstein
Citations (143)