Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation (2402.13729v4)
Abstract: Generating high-quality videos with realistic content is a challenging task due to the intricate high dimensionality and complexity of videos. Several recent diffusion-based methods have shown comparable performance by compressing videos to a lower-dimensional latent space using traditional video autoencoder architectures. However, such methods, which employ standard frame-wise 2D and 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which captures spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video, including: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).
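To make the two-branch design concrete, below is a minimal PyTorch sketch of the encoder structure the abstract describes: a triplane-style 2D projection for global context and a single-level 3D Haar wavelet decomposition feeding 3D convolutions for local volume detail. All module names, layer choices, and the mean-based projection are illustrative assumptions for intuition, not the authors' implementation.

```python
# Minimal sketch of the hybrid encoder idea: 2D triplane projection (global
# context) + 3D Haar wavelet branch (local volume / frequency detail).
# Layer sizes and the averaging projection are illustrative assumptions.
import torch
import torch.nn as nn

def haar_3d(x):
    """Single-level 3D Haar decomposition over (T, H, W).
    x: (B, C, T, H, W) with even T, H, W; returns 8 subbands stacked on channels."""
    def split(v, dim):
        a, b = v.unfold(dim, 2, 2).unbind(-1)
        return (a + b) / 2.0, (a - b) / 2.0  # low-pass, high-pass pair
    lo_t, hi_t = split(x, 2)                 # temporal axis
    bands = []
    for t in (lo_t, hi_t):
        lo_h, hi_h = split(t, 3)             # height axis
        for h in (lo_h, hi_h):
            lo_w, hi_w = split(h, 4)         # width axis
            bands += [lo_w, hi_w]
    return torch.cat(bands, dim=1)           # (B, 8C, T/2, H/2, W/2)

class HybridVideoEncoder(nn.Module):
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        # 2D branch: one conv shared across the three projected planes
        self.plane_enc = nn.Conv2d(in_ch, dim, 3, stride=2, padding=1)
        # 3D branch: convolve the 8 wavelet subbands jointly
        self.vol_enc = nn.Conv3d(8 * in_ch, dim, 3, padding=1)

    def forward(self, video):                         # video: (B, C, T, H, W)
        # Triplane projection: average out one axis per plane (a stand-in
        # for the learned projection used in triplane-style encoders).
        planes = [video.mean(d) for d in (2, 3, 4)]   # (HW), (TW), (TH) planes
        z_planes = [self.plane_enc(p) for p in planes]  # global context latents
        z_vol = self.vol_enc(haar_3d(video))            # local volume latent
        return z_planes, z_vol

# Usage: encode a batch of 16-frame 64x64 clips into the two latent groups.
enc = HybridVideoEncoder()
z_planes, z_vol = enc(torch.randn(2, 3, 16, 64, 64))
```

Per the abstract, a diffusion model would then be trained in the latent space formed by these global and local codes; how the two groups are fused and decoded is specified in the paper itself.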
Authors: Kihong Kim, Haneol Lee, Jihye Park, Seyeon Kim, Seungryong Kim, Jaejun Yoo, KwangHee Lee