REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents
The paper "REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents" introduces an innovative approach to improving the efficiency of video generation models using a highly compressed motion latent space. The authors propose a new framework named Reducio, which leverages a novel image-conditioned variational autoencoder (VAE) to drastically reduce the dimensions of the video latent space and a corresponding diffusion model, Reducio-DiT, to generate high-quality, high-resolution videos with impressive computational efficiency.
Key Contributions
- Video Compression via Reducio-VAE: The paper introduces Reducio-VAE, an image-conditioned VAE that departs from conventional approaches through an aggressive compression strategy exploiting the heavy redundancy in video data. By encoding only minimal motion information while maintaining a high-quality content frame, the VAE compresses the latent space by a factor of 64 compared to standard 2D VAEs, achieving an overall down-sampling factor of 4096. Notably, this approach still outperforms existing models by a significant margin in video reconstruction quality, with a reported gain of about 5 dB in PSNR over common 2D VAEs.
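The compression factors above imply a dramatic shrinkage of the latent tensor. A minimal sketch of that arithmetic, assuming a 16-frame 1024×1024 clip and counting spatiotemporal elements per channel (channel counts are left out for simplicity and are not the paper's exact configuration):

```python
# Latent-size arithmetic implied by the stated compression factors.
# Illustrative only: counts spatiotemporal positions per channel.
frames, height, width = 16, 1024, 1024
pixels = frames * height * width  # input positions per channel

# A standard 2D VAE downsamples each frame 8x spatially -> 8*8 = 64.
latent_2d = pixels // 64

# Reducio-VAE compresses 64x further, i.e. 4096x overall down-sampling.
latent_reducio = pixels // 4096

# The 64x gap between the two latent spaces is the paper's headline factor.
assert latent_2d // latent_reducio == 64
```

This is why diffusion over the Reducio latent space is so much cheaper: the transformer processes 64 times fewer latent positions than it would over a conventional 2D-VAE latent.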
- Two-Stage Video Generation: The paper adopts a two-stage generation process, wherein the initial stage involves generating a content image using a text-to-image model, and the subsequent stage extends this content into a video. This method benefits from the spatial priors learned by state-of-the-art image diffusion models, significantly enhancing the generation quality while relying on fewer computational resources.
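The two-stage dataflow can be sketched as follows. The class and method names here are illustrative placeholders standing in for the actual models, not the paper's API; the stubs exist only to show how the content frame produced in stage one conditions stage two:

```python
class StubT2I:
    """Placeholder for a pretrained text-to-image diffusion model."""
    def __call__(self, prompt):
        return f"content_frame({prompt})"

class StubReducioDiT:
    """Placeholder for Reducio-DiT plus the Reducio-VAE decoder."""
    def sample(self, prompt, content_frame):
        # Diffuse compressed motion latents, conditioned on text + image.
        return f"motion_latents({prompt})"

    def decode(self, content_frame, motion_latents, num_frames):
        # Decode content frame + motion latents back into video frames.
        return [f"frame_{i}" for i in range(num_frames)]

def generate_video(prompt, t2i, video_model, num_frames=16):
    content_frame = t2i(prompt)                          # stage 1: text -> image
    latents = video_model.sample(prompt, content_frame)  # stage 2: image -> motion
    return video_model.decode(content_frame, latents, num_frames)

video = generate_video("a cat surfing", StubT2I(), StubReducioDiT())
```

The key design choice is that all spatial detail comes from the stage-one image model, so the video model only has to learn motion.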
- Efficient Diffusion Transformer (Reducio-DiT): The authors build Reducio-DiT, a diffusion transformer that operates on the highly compressed latent representations from Reducio-VAE and adds an image-condition module to inject semantic and spatial content information. This significantly improves the model's efficiency, achieving a 16.6× speedup over the Lavie model for 1024×1024 videos, with training requiring only 3.2K A100 GPU hours.
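One plausible way to read "injecting semantic and spatial content information" is cross-attention from the motion-latent tokens to image-condition tokens. The pure-Python single-head sketch below illustrates that mechanism only; the shapes and the attention layout are assumptions, not the paper's exact module:

```python
import math
import random

def cross_attention(motion, image):
    """Each motion-latent token attends over the image-condition tokens."""
    d = len(motion[0])
    out = []
    for m in motion:
        # Scaled dot-product scores against every image token.
        scores = [sum(q * k for q, k in zip(m, c)) / math.sqrt(d) for c in image]
        # Numerically stable softmax over the image tokens.
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output: weighted mix of image-token features.
        out.append([sum(w * ci for w, ci in zip(weights, col))
                    for col in zip(*image)])
    return out

random.seed(0)
d = 8
motion_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]
image_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]  # semantic + spatial features
out = cross_attention(motion_tokens, image_tokens)
```

Because the motion latent is so small, this conditioning path is cheap: only a handful of motion tokens ever query the image features.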
Findings and Implications
- High-Efficiency with Quality: Reducio-DiT demonstrates competitive performance in generating high-resolution videos with lower computational costs. The paper reports a rapid generation time of 15.5 seconds for 16-frame videos at 1024×1024 resolution on a single A100 GPU, setting a new benchmark for efficiency in the domain of video diffusion models. This level of efficiency could make high-fidelity video generation more accessible to a broader range of applications in industries that require rapid prototyping or real-time video synthesis, such as virtual reality, gaming, or advertising.
- Potential for Broader Applications: The method's reliance on compact latent spaces and factorized spatial and temporal modeling holds promise for various real-world applications where processing speed and resource use are critical constraints. Such applications could greatly benefit from the Reducio framework due to its ability to yield high-quality results much faster than existing models.
Speculations on Future Developments
The Reducio framework opens up several avenues for future exploration. One potential direction could be further research into even more aggressive compression techniques, perhaps drawing on ideas from video codec methods to refine both spatial and temporal components. Another area could involve adapting the framework to progressively longer video sequences where preserving temporal coherence and content fidelity across frames is essential. Additionally, integrating complementary acceleration strategies, such as rectified flow, could further amplify the model's efficiency and expand its applicability to larger datasets or more complex generative tasks.
In conclusion, this paper introduces a well-thought-out approach to video generation that significantly cuts down computational costs while upholding high quality, marking a substantial stride towards more accessible and efficient video synthesis techniques.