Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition (2403.14148v1)

Published 21 Mar 2024 in cs.CV and cs.LG

Abstract: Video diffusion models have recently made great progress in generation quality, but are still limited by the high memory and computational requirements. This is because current video diffusion models often attempt to process high-dimensional videos directly. To tackle this issue, we propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. The former represents the common content, and the latter represents the underlying motion in the video, respectively. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation here is the design of a compact latent space that can directly utilize a pretrained image diffusion model, which has not been done in previous latent video diffusion models. This leads to considerably better quality generation and reduced computational costs. For instance, CMD can sample a video 7.7$\times$ faster than prior approaches by generating a video of 512$\times$1024 resolution and length 16 in 3.1 seconds. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4.

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

The paper introduces a novel approach to video generation using diffusion models, which are generally known for their high computational cost and memory demands when applied directly to high-dimensional video data. The proposed approach, termed the Content-Motion Latent Diffusion Model (CMD), aims to significantly enhance efficiency in video generation tasks by leveraging pretrained image diffusion models. CMD introduces a strategically designed encoding mechanism that separates a video into a content frame, akin to a typical 2D image, and a low-dimensional motion latent vector. This decomposition is pivotal in achieving both computational and memory efficiency, as it allows the utilization of existing well-trained image diffusion models for video content generation.

Methodology

The CMD framework consists of an autoencoder that encodes a video into a content frame and a motion latent representation. The content frame is extracted as a weighted sum of the video frames, so it remains highly similar to a traditional static image and can be generated with a pretrained image diffusion model. The motion representation, on the other hand, encapsulates temporal dynamics in a low-dimensional latent space. This decomposition enables CMD to directly utilize and fine-tune pretrained image diffusion models for generating the content frame, bypassing the need to handle the entire video as a high-dimensional array. A lightweight diffusion model is then trained to generate the motion latent conditioned on the content frame, as illustrated in the sketch below.
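To make this decomposition concrete, the following is a minimal, illustrative PyTorch sketch (not the paper's implementation): the content frame is formed as a softmax-weighted sum of frames over time, and a small per-frame encoder produces the motion latent. All module names, layer choices, and dimensions here are assumptions made only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentMotionAutoencoder(nn.Module):
    """Toy content-frame / motion-latent autoencoder (illustrative only)."""

    def __init__(self, channels=3, height=32, width=32, motion_dim=64):
        super().__init__()
        self.shape = (channels, height, width)
        # Predicts per-frame, per-pixel mixing weights for the content frame.
        self.weight_net = nn.Conv3d(channels, 1, kernel_size=3, padding=1)
        # Compresses each frame into a low-dimensional motion code.
        self.motion_enc = nn.Linear(channels * height * width, motion_dim)
        # Toy decoder: a per-frame residual added to the shared content frame.
        self.motion_dec = nn.Linear(motion_dim, channels * height * width)

    def encode(self, video):                       # video: (B, C, T, H, W)
        logits = self.weight_net(video)            # (B, 1, T, H, W)
        weights = torch.softmax(logits, dim=2)     # normalize over time
        content = (weights * video).sum(dim=2)     # (B, C, H, W): image-like
        frames = video.transpose(1, 2).flatten(2)  # (B, T, C*H*W)
        motion = self.motion_enc(frames)           # (B, T, motion_dim)
        return content, motion

    def decode(self, content, motion):
        B, T, _ = motion.shape
        residual = self.motion_dec(motion).view(B, T, *self.shape)
        video = content.unsqueeze(1) + residual    # broadcast content over time
        return video.transpose(1, 2)               # back to (B, C, T, H, W)

# Reconstruction objective on a random toy video.
ae = ContentMotionAutoencoder()
video = torch.randn(2, 3, 16, 32, 32)              # (B, C, T, H, W)
content, motion = ae.encode(video)
loss = F.mse_loss(ae.decode(content, motion), video)
```

In CMD itself, the content frame is subsequently generated by a fine-tuned image diffusion model and the motion latent by a separate lightweight diffusion model; the sketch above only illustrates the encode/decode structure that makes this split possible.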

The training of the diffusion models follows the typical denoising diffusion probabilistic models (DDPM) approach, but uniquely, CMD focuses on modeling the distribution in a compact latent space rather than the higher-dimensional video pixel space. This results in efficient and high-quality video generation while drastically reducing computational overhead.
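As a rough illustration of what modeling the distribution in a compact latent space means in practice, the snippet below sketches a standard DDPM noise-prediction loss applied to a latent (e.g., the motion latent) conditioned on a content frame. The `denoiser` network and the noise schedule are generic placeholders, not the paper's specific choices.

```python
import torch
import torch.nn.functional as F

def ddpm_latent_loss(denoiser, z0, cond, alphas_cumprod):
    """Standard DDPM noise-prediction loss on a compact latent z0.

    denoiser(z_t, t, cond) is assumed to predict the added noise;
    `cond` could be the content frame, and `alphas_cumprod` a precomputed
    noise schedule of shape (num_steps,). Illustrative only.
    """
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    a_bar = alphas_cumprod[t].view(B, *([1] * (z0.dim() - 1)))
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise  # forward diffusion
    return F.mse_loss(denoiser(z_t, t, cond), noise)        # predict the noise
```

At sampling time the order is reversed: the fine-tuned image diffusion model first produces a content frame, the lightweight model then samples a motion latent conditioned on it, and the autoencoder's decoder maps the pair back to a video.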

Results

CMD demonstrates its effectiveness across various video generation benchmarks, yielding significant improvements in both speed and resource utilization. Notably, CMD samples a 512×1024 resolution video of 16 frames in just 3.1 seconds, 7.7 times faster than prior leading methods. Moreover, CMD achieves an FVD score of 212.7 on the WebVid-10M benchmark, a 27.3% improvement over the previous state-of-the-art score of 292.4.
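The relative improvement quoted above follows directly from the two FVD scores (lower is better):

$$\frac{292.4 - 212.7}{292.4} \approx 0.273 \;=\; 27.3\%.$$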

Implications and Future Prospects

The implications of CMD are twofold. Practically, CMD provides a framework that makes deploying video generation systems at scale far more feasible, maintaining video quality while reducing both computational resources and sampling time. Theoretically, it suggests a promising direction for diffusion models, indicating how temporal and spatial information can be effectively decoupled in generative models.

Future development might focus on improving the transfer of visual knowledge from static image models to the video domain, further strengthening generalization beyond simple scenarios. Additionally, extending the model to handle videos of varying lengths and resolutions without retraining is a promising research direction, potentially through adaptive latent-space modeling.

In conclusion, CMD represents a significant advancement in the field of video generation using diffusion models, optimizing both efficiency and quality by integrating novel video encoding strategies and leveraging existing image model architectures.

Authors (6)
  1. Sihyun Yu
  2. Weili Nie
  3. De-An Huang
  4. Boyi Li
  5. Jinwoo Shin
  6. Anima Anandkumar