Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition
The paper introduces a new approach to video generation with diffusion models, which are known for their high computational cost and memory demands when applied directly to high-dimensional video data. The proposed approach, termed the Content-Motion Latent Diffusion Model (CMD), improves the efficiency of video generation by building on pretrained image diffusion models. CMD encodes a video into a content frame, which resembles an ordinary 2D image, and a low-dimensional motion latent representation. This decomposition is central to both the computational and memory savings, since the content frame can be generated by an existing, well-trained image diffusion model.
Methodology
The CMD framework consists of an autoencoder that maps a video to a content frame and a motion latent representation. The content frame is computed as a weighted sum of the video frames, so it remains close in appearance to an ordinary still image and can be generated by a pretrained image diffusion model. The motion latent, in turn, captures the temporal dynamics of the video in a low-dimensional latent space. This decomposition lets CMD fine-tune a pretrained image diffusion model to generate the content frame directly, bypassing the need to treat the entire video as a single high-dimensional array. A separate lightweight diffusion model then generates the motion latent conditioned on the content frame.
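To make the weighted-sum construction concrete, the following PyTorch-style sketch shows one plausible way a content frame could be formed from per-pixel temporal weights; the tensor shapes and the frame_weights input are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def content_frame(video: torch.Tensor, frame_weights: torch.Tensor) -> torch.Tensor:
    """Collapse a clip into a single image-like content frame via a weighted sum over time.

    video:         (B, T, C, H, W) tensor of frames (pixel or latent space)
    frame_weights: (B, T, H, W) per-pixel scores predicted by the encoder (assumed shape)
    """
    # Normalize the scores along the temporal axis so the weights at each pixel sum to 1.
    w = torch.softmax(frame_weights, dim=1)            # (B, T, H, W)
    # Weighted average over time; the result has the shape of a single frame.
    return (video * w.unsqueeze(2)).sum(dim=1)         # (B, C, H, W)
```

Because the output keeps the spatial layout and channel count of a single frame, it can be fed to an image diffusion model unchanged.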
Both diffusion models are trained with the standard denoising diffusion probabilistic model (DDPM) objective, but CMD models the target distributions in compact latent spaces rather than in the high-dimensional video pixel space. This yields efficient, high-quality video generation at a fraction of the computational overhead.
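As a point of reference, the sketch below shows the standard epsilon-prediction DDPM training step applied to a compact latent (for example, the motion latent conditioned on the content frame); the denoiser and cond arguments are placeholders rather than the paper's actual modules.

```python
import torch
import torch.nn.functional as F

def latent_ddpm_loss(denoiser, z0, alphas_cumprod, cond):
    """Standard DDPM noise-prediction loss, applied in a compact latent space.

    denoiser:       network predicting the noise added to the latent
    z0:             (B, ...) clean latent codes (e.g., motion latents)
    alphas_cumprod: (num_steps,) cumulative products of the noise schedule
    cond:           conditioning signal (e.g., the content frame)
    """
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)  # random timesteps
    a_bar = alphas_cumprod[t].view(b, *([1] * (z0.dim() - 1)))             # broadcast to latent shape
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise                 # forward process q(z_t | z_0)
    return F.mse_loss(denoiser(z_t, t, cond), noise)                       # predict the injected noise
```

Because the latent is far smaller than the raw video tensor, each training and sampling step is correspondingly cheaper.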
Results
CMD demonstrates its effectiveness across several video generation benchmarks, with significant gains in both speed and resource usage. Notably, CMD is reported to sample a 16-frame video at 512x1024 resolution in 3.1 seconds, about 7.7 times faster than prior leading methods. It also achieves an FVD score of 238.3 on the WebVid-10M benchmark (lower is better), an 18.5% improvement over the previous state-of-the-art score of 292.4.
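The reported relative improvement follows directly from the two FVD scores:

$$\frac{292.4 - 238.3}{292.4} = \frac{54.1}{292.4} \approx 0.185 = 18.5\%.$$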
Implications and Future Prospects
The implications of CMD are twofold. Practically, CMD makes large-scale deployment of video generation systems considerably more feasible by preserving video quality while reducing both compute and sampling time. Theoretically, it points to a promising direction for diffusion models, suggesting how temporal and spatial information can be effectively decoupled in generative models.
Future work might focus on refining the transfer of visual knowledge from static image models to the video domain, improving generalization beyond simple scenarios. Another direction is extending the model to handle videos of varying lengths and resolutions without retraining, potentially by integrating techniques for adaptive latent-space modeling.
In conclusion, CMD represents a significant advance in diffusion-based video generation, improving both efficiency and quality by combining a novel video encoding strategy with existing image diffusion model architectures.