
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation (2402.13729v4)

Published 21 Feb 2024 in cs.CV

Abstract: Generating high-quality videos that synthesize desired realistic content is a challenging task due to the intricate high dimensionality and complexity of videos. Several recent diffusion-based methods have shown comparable performance by compressing videos to a lower-dimensional latent space using traditional video autoencoder architectures. However, such methods, which employ standard frame-wise 2D and 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which can capture spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video, including: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).
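To make the three ingredients above concrete, here is a minimal PyTorch sketch of the general idea: one branch encodes a 2D projection of the clip into a global-context latent, another applies a 3D Haar wavelet decomposition followed by 3D convolutions to capture local volume and frequency structure, and a decoder consumes the fused latent. This is an illustrative toy, not HVDM itself; the paper's 2D projection, wavelet scheme, losses, and layer configurations differ, and every module name and layer size below is hypothetical.

```python
import math
import torch
import torch.nn as nn

def haar_split(x: torch.Tensor, dim: int):
    """One level of the orthonormal Haar transform along `dim` (size must be even)."""
    even = x.index_select(dim, torch.arange(0, x.size(dim), 2, device=x.device))
    odd = x.index_select(dim, torch.arange(1, x.size(dim), 2, device=x.device))
    return (even + odd) / math.sqrt(2.0), (even - odd) / math.sqrt(2.0)

def haar_dwt3d(x: torch.Tensor) -> torch.Tensor:
    """3D Haar DWT over (T, H, W): (B, C, T, H, W) -> (B, 8C, T/2, H/2, W/2)."""
    bands = [x]
    for dim in (2, 3, 4):  # temporal, height, width axes
        bands = [b for band in bands for b in haar_split(band, dim)]
    return torch.cat(bands, dim=1)  # stack the 8 subbands on the channel axis

class HybridVideoAutoencoder(nn.Module):
    """Toy hybrid autoencoder: a 2D global-context branch plus a 3D wavelet branch."""
    def __init__(self, in_ch: int = 3, latent_ch: int = 4):
        super().__init__()
        # 2D branch: a temporally pooled frame stands in for the paper's
        # triplane-style 2D projection of the video (a deliberate simplification).
        self.enc2d = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, latent_ch, 3, stride=2, padding=1),
        )
        # 3D branch: convolutions over the 8 Haar subbands capture local
        # spatio-temporal volume (and frequency) information.
        self.enc3d = nn.Sequential(
            nn.Conv3d(in_ch * 8, 32, 3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(32, latent_ch, 3, padding=1),
        )
        # Decoder maps the fused latent back to pixel space.
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(2 * latent_ch, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(32, in_ch, (3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, T, H, W)
        g = self.enc2d(x.mean(dim=2))                         # global context: (B, c, H/4, W/4)
        v = self.enc3d(haar_dwt3d(x))                         # local volume: (B, c, T/2, H/4, W/4)
        g = g.unsqueeze(2).expand(-1, -1, v.size(2), -1, -1)  # broadcast plane over time
        return self.dec(torch.cat([g, v], dim=1))             # fused hybrid latent -> video

# Round-trip a 16-frame 64x64 clip through the autoencoder.
video = torch.randn(2, 3, 16, 64, 64)
recon = HybridVideoAutoencoder()(video)
assert recon.shape == video.shape
```

The temporal mean pooling in `enc2d` is a crude stand-in for the paper's projection; the point is only that a compact 2D latent and a wavelet-domain 3D latent can be fused into a single richer representation than either branch alone provides.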

Authors (7)
  1. Kihong Kim
  2. Haneol Lee
  3. Jihye Park
  4. Seyeon Kim
  5. Seungryong Kim
  6. Jaejun Yoo
  7. KwangHee Lee