
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets (2311.15127v1)

Published 25 Nov 2023 in cs.CV

Abstract: We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models.

Introduction to Video Synthesis Techniques

Generative models have made significant strides in synthesizing high-quality images, and extending this success to video generation has become a focal area of research. Video generative models have typically evolved from their image-based counterparts, with researchers modifying existing architectures by introducing temporal layers and adjusting training regimes. However, the influence of the training data and its curation has been comparatively overlooked, even though it is widely acknowledged that the data distribution profoundly impacts generative model performance. This paper tackles these unexplored aspects and presents a method for scaling latent video diffusion models to large datasets, with a focus on text-to-video and image-to-video applications.

Systematic Approach to Data Curation

The paper begins by dissecting the video training process into three critical stages: text-to-image pretraining, video pretraining, and high-quality video finetuning. It posits that pretraining must occur on a well-curated dataset: one distilled from a large unfiltered collection by removing clips with limited motion and other unwanted characteristics. Through empirical analysis, the authors show that pretraining on such refined datasets leads to substantial improvements that carry over even after the finetuning stage.
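
As a minimal sketch of what a motion filter for such curation can look like, assuming OpenCV's Farnebäck dense optical-flow estimator rather than the authors' exact tooling, the code below scores a clip by its mean flow magnitude and keeps it only above a tunable cutoff; the function names, frame stride, and threshold value are illustrative assumptions.

```python
# Hedged sketch of motion-based clip filtering for video data curation.
# Not the paper's exact pipeline: estimator choice, stride, and threshold
# are illustrative assumptions.
import cv2
import numpy as np

def mean_motion_score(video_path: str, sample_stride: int = 8) -> float:
    """Average dense optical-flow magnitude over frame pairs sampled
    sample_stride frames apart (coarser sampling keeps the pass cheap)."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return 0.0
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    scores, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        if idx % sample_stride:
            continue  # skip intermediate frames; flow spans the stride
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        scores.append(np.linalg.norm(flow, axis=-1).mean())  # per-pixel magnitude
        prev_gray = gray
    cap.release()
    return float(np.mean(scores)) if scores else 0.0

def passes_motion_filter(video_path: str, threshold: float = 0.5) -> bool:
    # Keep only clips whose average motion exceeds the (tunable) threshold.
    return mean_motion_score(video_path) >= threshold
```

In a large-scale pipeline such a score would plausibly be computed once per clip during preprocessing and stored alongside captions and other filter signals, so thresholds can be retuned without re-decoding the videos.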

Innovations in Video Diffusion Models

The core of the presented Stable Video Diffusion (SVD) approach lies in a robust base model trained on approximately 600 million video clips. This model acts as a springboard for task-specific finetuning: for text-to-video generation, for instance, human evaluators preferred its outputs over those of contemporary state-of-the-art methods. Beyond direct text-to-video synthesis, SVD also adapts to image-to-video generation, in which a video sequence is generated from a single conditioning image, demonstrating the model's strong motion understanding.
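
Because the weights are publicly released, image-to-video sampling can be exercised directly. The following is a minimal sketch using the Hugging Face diffusers integration of the released checkpoints, an assumption of this summary rather than the paper's own reference code at Stability-AI/generative-models; resolution, seed, and frame rate are illustrative.

```python
# Hedged sketch: animate a single still image with the released SVD weights
# via diffusers' StableVideoDiffusionPipeline (community integration).
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Condition on one image; the model generates the subsequent frames.
image = load_image("input.png").resize((1024, 576))
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

The `decode_chunk_size` argument trades peak memory for decoding speed by decoding the latent frames in chunks rather than all at once.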

Expanding into Multi-View and 3D Spaces

One of the paper's pivotal claims is the model's ability to serve as a strong multi-view 3D prior. After finetuning on suitable multi-view datasets, SVD generates multiple consistent views of an object in a feedforward fashion, outperforming several specialized techniques while requiring significantly fewer computational resources. The paper additionally shows that camera motion can be controlled through camera motion-specific LoRA modules, underscoring the model's versatility.
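
LoRA (low-rank adaptation) trains only a small low-rank correction on top of frozen pretrained weights, which is why a separate adapter per camera motion (for example horizontal pan, zoom, or static) remains cheap to train and store. Below is a generic PyTorch sketch of the construction; exactly where SVD attaches such adapters within its temporal layers is not detailed in this summary, so treat the wiring as an assumption.

```python
# Generic LoRA construction (hedged sketch, not SVD's exact module):
# a frozen linear layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = Wx + (alpha/r) * B(Ax); only the down/up projections train.
        return self.base(x) + self.scale * self.up(self.down(x))
```

Each adapter adds only rank * (in_features + out_features) parameters per wrapped projection, so motion-specific adapters can be swapped at inference time at negligible cost.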

Conclusion and Implications

The authors conclude by affirming the importance of data curation and of a clearly staged training strategy for video diffusion models. They present SVD as a generative video model that not only excels at high-resolution text-to-video and image-to-video synthesis but also sets new standards for multi-view consistency and efficiency in generative video modeling. With code and model weights publicly released, the authors invite further exploration and adoption of their findings by the broader video research community, positioning SVD to foster continued innovation and refinement in AI-powered video synthesis.

Authors (12)
  1. Andreas Blattmann
  2. Tim Dockhorn
  3. Sumith Kulal
  4. Daniel Mendelevitch
  5. Maciej Kilian
  6. Dominik Lorenz
  7. Yam Levi
  8. Zion English
  9. Vikram Voleti
  10. Adam Letts
  11. Varun Jampani
  12. Robin Rombach