Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators (2303.13439v1)

Published 23 Mar 2023 in cs.CV

Abstract: Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .


Analysis of "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators"

The paper "Text2Video-Zero" addresses the task of zero-shot text-to-video generation by leveraging pre-trained diffusion models originally built for text-to-image synthesis. The method generates temporally consistent videos without training on video data, avoiding the computational cost that such training usually entails.

Methodology

The authors build on a pre-trained text-to-image diffusion model, specifically Stable Diffusion (SD), and adapt it for video synthesis by introducing motion dynamics into the latent codes and replacing self-attention with cross-frame attention. These modifications maintain the temporal coherence that video requires without any training or optimization on video data.

Motion Dynamics in Latent Codes: The method integrates motion dynamics by enriching the latent codes with motion information. This process involves sampling the initial latent code and applying global translation vectors to encode motion consistency across frames. This enrichment ensures that generated sequences preserve global scene and background consistency.
Cross-Frame Attention: The method replaces the self-attention layers of the UNet with cross-frame attention, in which each frame attends to the first frame. This preserves the appearance, context, and identity of foreground objects across frames.
Background Smoothing: An optional background smoothing step further improves temporal consistency by applying a convex combination of background-masked latent codes, so that background elements remain stable across frames.
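
The three components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, tensor layouts, the wrap-around shift via np.roll, and the default alpha value are all hypothetical choices made for this example, and the diffusion steps that surround these operations in the actual pipeline are omitted.

```python
import numpy as np


def translate_latents(first_latent, num_frames, direction=(1, 1), scale=1):
    """Build per-frame latents by shifting the first frame's latent.

    Frame k (0-based) is shifted by k * scale * direction. np.roll wraps
    around at the border, which is a simplification chosen for this sketch.
    first_latent has shape (C, H, W).
    """
    dx, dy = direction
    frames = []
    for k in range(num_frames):
        shift = (k * scale * dy, k * scale * dx)  # (rows, cols)
        frames.append(np.roll(first_latent, shift, axis=(-2, -1)))
    return np.stack(frames)  # (num_frames, C, H, W)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(q, k, v):
    """Attention where every frame's queries use frame 0's keys and values.

    q, k, v have shape (frames, tokens, dim). Only k[0] and v[0] are used.
    """
    dim = q.shape[-1]
    k_first = np.broadcast_to(k[:1], k.shape)
    v_first = np.broadcast_to(v[:1], v.shape)
    scores = q @ k_first.transpose(0, 2, 1) / np.sqrt(dim)
    return softmax(scores) @ v_first


def smooth_background(latent, reference, foreground_mask, alpha=0.6):
    """Blend background regions toward a reference latent.

    foreground_mask is 1 on the foreground and 0 on the background.
    The foreground keeps the frame's own latent; the background becomes a
    convex combination of the reference and the frame's latent.
    alpha=0.6 is an arbitrary illustrative default.
    """
    blended = alpha * reference + (1 - alpha) * latent
    return foreground_mask * latent + (1 - foreground_mask) * blended
```

In this sketch, cross_frame_attention on frame 0 reduces to ordinary self-attention, while later frames reuse frame 0's keys and values, which is what ties their appearance to the first frame.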

Applications

The versatility of the proposed method extends beyond text-to-video generation. The authors incorporate ControlNet for conditional generation tasks, allowing video synthesis guided by pose, edge, or depth information without additional training. Furthermore, integration with Video Instruct-Pix2Pix supports instruction-guided video editing, showcasing the adaptability of the approach to various video-related tasks.

Experimental Results

The paper presents several experimental evaluations demonstrating the effectiveness of Text2Video-Zero across different settings, such as text-to-video generation without additional guidance and conditional generation with edge and pose guidance. In comparisons with recent methods such as CogVideo and Tune-A-Video, the proposed approach achieves competitive CLIP scores for text-video alignment and superior temporal consistency.
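
To make the two evaluation notions concrete, the sketch below computes a CLIP-style text-video alignment score as the mean cosine similarity between a text embedding and each frame embedding. It also includes one common proxy for temporal consistency, the mean similarity between consecutive frame embeddings; that proxy is an assumption of this example and is not taken from the paper. The embeddings are assumed to be precomputed by a CLIP-like encoder, the function names are hypothetical, and reported CLIP scores are often rescaled, whereas this sketch returns raw cosine similarities.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def text_video_alignment(text_emb, frame_embs):
    """Mean cosine similarity between the text and every frame."""
    return float(np.mean([cosine(text_emb, f) for f in frame_embs]))


def frame_consistency(frame_embs):
    """Mean cosine similarity between consecutive frames (assumed proxy)."""
    pairs = zip(frame_embs[:-1], frame_embs[1:])
    return float(np.mean([cosine(a, b) for a, b in pairs]))
```

A video whose frames all map to the same embedding scores 1.0 on the consistency proxy, so the measure should be read alongside alignment rather than on its own.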

Implications

The implications of this work are twofold. Practically, the ability to generate videos from text prompts without requiring video data training democratizes access to video synthesis technology, potentially lowering the barrier to entry for video content generation. Theoretically, the exploration of cross-domain applications for diffusion models opens avenues for future research into leveraging pre-trained models for tasks across different media modalities.

Future Directions

Future research could explore the extension of this method to higher resolution videos or the inclusion of more complex motion dynamics. Additionally, integrating more sophisticated attention mechanisms could further improve the fidelity and coherence of the generated content.

In conclusion, "Text2Video-Zero" effectively demonstrates the potential of pre-trained text-to-image models in the video domain, providing a flexible and computationally efficient framework for zero-shot text-to-video generation.


Authors (7)

Levon Khachatryan (2 papers)
Andranik Movsisyan (1 paper)
Vahram Tadevosyan (3 papers)
Roberto Henschel (8 papers)
Zhangyang Wang (374 papers)
Shant Navasardyan (10 papers)
Humphrey Shi (97 papers)


GitHub

GitHub - Picsart-AI-Research/Text2Video-Zero: [ICCV 2023 Oral] Text-to-Image Diffusion Models are Zero-Shot Video Generators (3,863 stars)
