Introduction
The paper introduces Lumiere, a diffusion model for generating videos from textual descriptions. It targets a core challenge of video synthesis: producing clips that are not only photorealistic but also exhibit diverse, temporally coherent motion. Whereas prior models typically render distant keyframes and then fill the gaps with temporal super-resolution, Lumiere employs a Space-Time U-Net (STUNet) architecture that generates the entire video sequence in a single network pass by down- and up-sampling in both space and time.
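To make the contrast concrete, below is a minimal, hypothetical sketch of the two strategies. The stub functions (generate_keyframes, temporal_super_resolution, stunet_single_pass) only mimic tensor shapes and are not components of Lumiere; they serve purely to show that the single-pass approach produces the full frame-rate clip directly.

```python
import torch

def generate_keyframes(num_keyframes=8, res=64):
    # Stand-in for a keyframe generator: a few sparse, distant frames.
    return torch.randn(1, 3, num_keyframes, res, res)

def temporal_super_resolution(keyframes, factor=10):
    # Stand-in for temporal super-resolution: duplicates keyframes to reach full frame rate.
    return keyframes.repeat_interleave(factor, dim=2)

def stunet_single_pass(num_frames=80, res=64):
    # Stand-in for a single-pass space-time model: the whole low-resolution clip at once.
    return torch.randn(1, 3, num_frames, res, res)

# Cascaded approach: sparse keyframes first, then temporal upsampling between them.
cascaded_clip = temporal_super_resolution(generate_keyframes())
# Lumiere-style approach: the full frame-rate clip is produced in one forward pass.
single_pass_clip = stunet_single_pass()
print(cascaded_clip.shape, single_pass_clip.shape)  # both (1, 3, 80, 64, 64)
```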
Architectural Overview
Lumiere's U-Net-like architecture is distinctive in that it performs down- and up-sampling across both the spatial and temporal dimensions. By compressing the clip in time as well as space, the model can process the full temporal duration of the video in a single pass, which encourages more globally coherent motion than prior cascaded approaches that down- and up-sample only in space. The absence of cascaded temporal super-resolution models from Lumiere's pipeline is a salient feature that markedly differentiates it from its contemporaries.
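As an illustration of this idea, here is a minimal sketch of a space-time downsampling block, assuming a factorized spatial-then-temporal 3D convolution. The class name and layer choices are assumptions for illustration; the actual STUNet additionally interleaves pre-trained spatial layers, attention, and temporal up-sampling on the decoder side, none of which is reproduced here.

```python
import torch
import torch.nn as nn

class SpaceTimeDownBlock(nn.Module):
    """Hypothetical STUNet-style block that downsamples in both space and time."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Factorized "2D + 1D" convolution: spatial 1x3x3 followed by temporal 3x1x1.
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Pool over frames, height, and width so deeper levels see a compressed clip.
        self.down = nn.AvgPool3d(kernel_size=(2, 2, 2))
        self.act = nn.SiLU()

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        x = self.act(self.spatial(x))
        x = self.act(self.temporal(x))
        return self.down(x)

# Example: an 80-frame, 128x128 clip is compressed to 40 frames at 64x64.
x = torch.randn(1, 64, 80, 128, 128)
block = SpaceTimeDownBlock(64, 128)
print(block(x).shape)  # torch.Size([1, 128, 40, 64, 64])
```

The key point is that the frame axis is reduced along with the spatial axes, so the coarsest level of the network reasons over a short, low-resolution version of the entire clip.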
Technical Contributions
Highlighting the core technical contributions, the authors emphasize that Lumiere avoids temporal super-resolution modules altogether by directly generating low-resolution videos at the full frame rate. A spatial super-resolution stage then upsamples the result; because this stage operates on short temporal windows, a technique called MultiDiffusion is used to reconcile overlapping windows and keep the synthesis coherent over the entire clip length. Additionally, Lumiere builds upon a pre-trained text-to-image diffusion model, fine-tuning only the temporal components added to the architecture while preserving the strengths of the pre-trained spatial layers.
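A minimal sketch of the window-blending idea follows. The function name blend_temporal_windows and the denoise_window callable are hypothetical; the real MultiDiffusion procedure reconciles overlapping windows inside each denoising step of the super-resolution model, rather than as a single averaging pass as shown here.

```python
import torch

def blend_temporal_windows(clip, window, stride, denoise_window):
    """Simplified MultiDiffusion-style blending over temporal windows.

    `denoise_window` stands in for one prediction of the spatial
    super-resolution model on a short window of frames. Overlapping
    predictions are averaged so neighbouring segments agree.
    clip: (B, C, T, H, W); assumes T >= window.
    """
    t = clip.shape[2]
    starts = list(range(0, t - window + 1, stride))
    if starts[-1] != t - window:          # make sure the last frames are covered
        starts.append(t - window)
    out = torch.zeros_like(clip)
    weight = torch.zeros(1, 1, t, 1, 1)
    for s in starts:
        pred = denoise_window(clip[:, :, s:s + window])  # per-window prediction
        out[:, :, s:s + window] += pred                  # accumulate overlapping results
        weight[:, :, s:s + window] += 1.0
    return out / weight                                  # average where windows overlap

# Example: 80 frames processed in 16-frame windows with 8-frame overlap.
clip = torch.randn(1, 3, 80, 32, 32)
blended = blend_temporal_windows(clip, window=16, stride=8, denoise_window=lambda x: x)
print(blended.shape)  # torch.Size([1, 3, 80, 32, 32])
```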
Applications and Evaluation
In terms of applications, Lumiere extends beyond text-to-video generation to image-to-video generation, stylized generation from a reference image, video inpainting, and more. The evaluation demonstrates that the model generates videos with substantial motion while maintaining visual quality and fidelity to the guiding text prompts. Comparative experiments show that Lumiere achieves competitive FVD and Inception Score (IS) results on the UCF101 dataset, indicating that it produces realistic, temporally coherent videos on par with prior state-of-the-art methods.
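To illustrate how a conditional application such as video inpainting is typically set up, here is a small sketch of constructing the conditioning input. It assumes the common recipe of feeding the masked clip together with its binary mask as extra input channels; the function name and the exact conditioning scheme are assumptions for illustration, not details taken from the paper.

```python
import torch

def make_inpainting_condition(video, mask):
    """Hypothetical conditioning input for video inpainting.

    video: (B, C, T, H, W); mask: (B, 1, T, H, W), where 1 marks the region to fill.
    Returns the masked clip concatenated with the mask along the channel axis.
    """
    masked_video = video * (1.0 - mask)            # erase the region to be generated
    return torch.cat([masked_video, mask], dim=1)  # (B, C + 1, T, H, W)

video = torch.randn(1, 3, 80, 128, 128)
mask = torch.zeros(1, 1, 80, 128, 128)
mask[..., 32:96, 32:96] = 1.0                      # inpaint a square region in every frame
cond = make_inpainting_condition(video, mask)
print(cond.shape)  # torch.Size([1, 4, 80, 128, 128])
```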
Conclusion
Lumiere establishes a pioneering approach to video generation, overcoming challenges of temporal coherence and the complexity of cascaded pipelines. Through its design and performance, it sets a new benchmark in the field and opens up numerous creative applications, making content creation more accessible and versatile for users at various skill levels.