Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation (2402.13729v4)
Abstract: Generating high-quality videos with realistic content is a challenging task due to the intricate high dimensionality and complexity of videos. Several recent diffusion-based methods have shown comparable performance by compressing videos to a lower-dimensional latent space using traditional video autoencoder architectures. However, such methods, which employ standard frame-wise 2D and 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which captures spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video, including: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).
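To make the two-branch design concrete, below is a minimal PyTorch sketch of the encoder structure the abstract describes: a triplane-style 2D projection for global context and a single-level 3D Haar wavelet decomposition feeding 3D convolutions for local volume detail. All module names, layer choices, and the mean-based projection are illustrative assumptions for intuition, not the authors' implementation.

```python
# Minimal sketch of the hybrid encoder idea: 2D triplane projection (global
# context) + 3D Haar wavelet branch (local volume / frequency detail).
# Layer sizes and the averaging projection are illustrative assumptions.
import torch
import torch.nn as nn

def haar_3d(x):
    """Single-level 3D Haar decomposition over (T, H, W).
    x: (B, C, T, H, W) with even T, H, W; returns 8 subbands stacked on channels."""
    def split(v, dim):
        a, b = v.unfold(dim, 2, 2).unbind(-1)
        return (a + b) / 2.0, (a - b) / 2.0  # low-pass, high-pass pair
    lo_t, hi_t = split(x, 2)                 # temporal axis
    bands = []
    for t in (lo_t, hi_t):
        lo_h, hi_h = split(t, 3)             # height axis
        for h in (lo_h, hi_h):
            lo_w, hi_w = split(h, 4)         # width axis
            bands += [lo_w, hi_w]
    return torch.cat(bands, dim=1)           # (B, 8C, T/2, H/2, W/2)

class HybridVideoEncoder(nn.Module):
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        # 2D branch: one conv shared across the three projected planes
        self.plane_enc = nn.Conv2d(in_ch, dim, 3, stride=2, padding=1)
        # 3D branch: convolve the 8 wavelet subbands jointly
        self.vol_enc = nn.Conv3d(8 * in_ch, dim, 3, padding=1)

    def forward(self, video):                         # video: (B, C, T, H, W)
        # Triplane projection: average out one axis per plane (a stand-in
        # for the learned projection used in triplane-style encoders).
        planes = [video.mean(d) for d in (2, 3, 4)]   # (HW), (TW), (TH) planes
        z_planes = [self.plane_enc(p) for p in planes]  # global context latents
        z_vol = self.vol_enc(haar_3d(video))            # local volume latent
        return z_planes, z_vol

# Usage: encode a batch of 16-frame 64x64 clips into the two latent groups.
enc = HybridVideoEncoder()
z_planes, z_vol = enc(torch.randn(2, 3, 16, 64, 64))
```

Per the abstract, a diffusion model would then be trained in the latent space formed by these global and local codes; how the two groups are fused and decoded is specified in the paper itself.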
Authors: Kihong Kim, Haneol Lee, Jihye Park, Seyeon Kim, Seungryong Kim, Jaejun Yoo, KwangHee Lee