Introduction to Latte: A Novel Approach to Video Generation
In AI-driven video generation, building models that produce high-quality videos remains a difficult challenge, largely because video content is high-dimensional and complex. Recent advances in diffusion models, originally developed for image generation, point to a new frontier for video. Building on these advances, a new approach has been introduced: a Latent Diffusion Transformer model, designated Latte. Latte leverages Transformer blocks to model the spatial and temporal structure of videos within a latent space.
Core Principles of Latte
Latte first employs a variational autoencoder to encode input videos into a latent space. It then extracts spatio-temporal tokens from these latent features and applies a Transformer, an architecture well established for capturing long-range dependencies, to those tokens. Because the number of tokens needed to represent a video is large, Latte introduces four efficient model variants, each decomposing the spatial and temporal dimensions of the input in a different way; one such decomposition is sketched below. This design keeps the token count tractable at each attention step without compromising efficiency.
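To make the decomposition concrete, here is a minimal, illustrative sketch (not the official Latte implementation) of one variant in the alternating style: Transformer blocks that apply spatial attention within each frame, then temporal attention across frames at each spatial location. The layer sizes, module names, and demo shapes are assumptions for illustration only.

```python
# Illustrative sketch of spatio-temporal decomposition over latent video tokens.
# Shapes: B = batch, F = frames, N = tokens per frame, D = hidden size.
import torch
import torch.nn as nn

class SpatioTemporalBlocks(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, depth=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        # Alternate spatial and temporal attention blocks.
        self.spatial = nn.ModuleList(layer() for _ in range(depth))
        self.temporal = nn.ModuleList(layer() for _ in range(depth))

    def forward(self, x):                    # x: (B, F, N, D) latent tokens
        B, F, N, D = x.shape
        for s_blk, t_blk in zip(self.spatial, self.temporal):
            # Spatial attention: tokens attend within each frame.
            x = s_blk(x.reshape(B * F, N, D)).reshape(B, F, N, D)
            # Temporal attention: each spatial location attends across frames.
            x = x.transpose(1, 2).reshape(B * N, F, D)
            x = t_blk(x).reshape(B, N, F, D).transpose(1, 2)
        return x

tokens = torch.randn(2, 16, 256, 768)        # 2 clips, 16 frames, 16x16 patches
print(SpatioTemporalBlocks()(tokens).shape)  # torch.Size([2, 16, 256, 768])
```

Factoring attention this way replaces one attention pass over all F × N tokens jointly with passes over N tokens per frame and F tokens per spatial location, which is where the efficiency gain comes from.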
Refining Video Generation with Latte
A systematic exploration of Transformer-based latent diffusion models for video generation has led to several key findings. Through methodical analysis, best practices have been identified for video clip patch embedding, the injection of timestep-class information, the embedding of temporal positional information, and the overall learning strategies. By integrating these best practices, Latte generates photorealistic videos with temporally coherent content, surpassing other methods across a range of video generation benchmarks; two of these practices are sketched below.
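As a concrete illustration of two of these practices, the hedged sketch below shows one plausible form of uniform per-frame patch embedding combined with a sinusoidal temporal positional embedding. The channel counts, patch size, and module names are assumptions, not the paper's exact configuration.

```python
# Illustrative sketch: per-frame patch embedding + temporal positional embedding.
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(length, dim):
    """Standard sine/cosine positional table of shape (length, dim)."""
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    emb = torch.zeros(length, dim)
    emb[:, 0::2] = torch.sin(pos * div)
    emb[:, 1::2] = torch.cos(pos * div)
    return emb

class VideoPatchEmbed(nn.Module):
    def __init__(self, in_ch=4, patch=2, hidden=768, num_frames=16):
        super().__init__()
        # Non-overlapping patches within each latent frame.
        self.proj = nn.Conv2d(in_ch, hidden, kernel_size=patch, stride=patch)
        self.register_buffer("t_pos", sinusoidal_embedding(num_frames, hidden))

    def forward(self, z):                        # z: (B, F, C, H, W) latents
        B, F, C, H, W = z.shape
        x = self.proj(z.reshape(B * F, C, H, W))  # (B*F, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B*F, N, D) tokens
        x = x.reshape(B, F, -1, x.shape[-1])
        # Temporal positional information, broadcast over spatial tokens.
        return x + self.t_pos[:F].view(1, F, 1, -1)

z = torch.randn(2, 16, 4, 32, 32)    # assumed VAE latents, 2 clips of 16 frames
print(VideoPatchEmbed()(z).shape)    # torch.Size([2, 16, 256, 768])
```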
Assessment and Application
Latte has been rigorously evaluated on multiple video generation datasets, where it achieves strong results as measured by Inception Score (IS), Fréchet Video Distance (FVD), and Fréchet Inception Distance (FID). Beyond generating videos from latent representations, Latte has also shown promising capabilities on text-to-video (T2V) generation tasks: benchmarked against existing T2V models, it exhibits competitive results, indicating its versatility across diverse video generation applications.
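For context on the metrics, FID and FVD both compute a Fréchet distance between Gaussian fits of real and generated features (Inception features for FID, features from a video network such as I3D for FVD). The sketch below implements that standard formula; it is illustrative and not taken from the Latte evaluation code.

```python
# The Fréchet distance underlying both FID and FVD, computed from feature
# matrices with one feature vector per row. Data here is a placeholder.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2))."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2).real       # drop tiny imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(512, 64))              # placeholder feature batches
fake = rng.normal(0.1, 1.0, size=(512, 64))
print(round(frechet_distance(real, fake), 3))  # nonzero for mismatched stats
```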
In conclusion, the proposed Latent Diffusion Transformer model, Latte, represents a significant advance in video generation through its use of Transformer backbones within a latent diffusion framework. With state-of-the-art performance and a versatile extension to T2V tasks, it offers valuable insights and opens new avenues for research in this rapidly evolving field. The full project, including the data supporting these findings, is publicly available, encouraging collaboration and innovation within the community.