AnimateDiff-Lightning: Cross-Model Diffusion Distillation (2403.12706v1)

Published 19 Mar 2024 in cs.CV and cs.AI

Abstract: We present AnimateDiff-Lightning for lightning-fast video generation. Our model uses progressive adversarial diffusion distillation to achieve new state-of-the-art in few-step video generation. We discuss our modifications to adapt it for the video modality. Furthermore, we propose to simultaneously distill the probability flow of multiple base diffusion models, resulting in a single distilled motion module with broader style compatibility. We are pleased to release our distilled AnimateDiff-Lightning model for the community's use.

Authors (2)
  1. Shanchuan Lin (17 papers)
  2. Xiao Yang (158 papers)
Citations (10)

Summary

AnimateDiff-Lightning: A Leap in Few-Step Video Generation Through Cross-Model Diffusion Distillation

Introduction to the Paper

In the constantly evolving landscape of generative video models, sampling efficiency has been a bottleneck limiting their wider application. Among the various approaches, AnimateDiff has emerged as a popular choice because it pairs a learnable temporal motion module with frozen image generation models, leveraging strong image priors to produce temporally coherent frames. However, the iterative nature of diffusion sampling carries a significant computational cost, which is especially pronounced for video. Addressing this challenge, the paper introduces AnimateDiff-Lightning, a model that substantially accelerates video generation while maintaining, and in some respects improving, output quality.
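
To make this architecture concrete, here is a minimal PyTorch sketch of the AnimateDiff design the paper builds on: a trainable temporal attention block added residually on top of a frozen image backbone. The class name, tensor layout, and the "temporal" parameter-naming convention are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Trainable motion module: self-attention across the frame axis, added
    # residually so the frozen image backbone's features pass through intact.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -> attend over frames per spatial token
        b, f, t, d = x.shape
        h = self.norm(x.permute(0, 2, 1, 3).reshape(b * t, f, d))
        out, _ = self.attn(h, h, h)
        out = out.reshape(b, t, f, d).permute(0, 2, 1, 3)
        return x + out

def freeze_image_backbone(unet: nn.Module) -> None:
    # Freeze all spatial (image) weights; only motion-module parameters
    # (assumed here to contain "temporal" in their names) stay trainable.
    for name, p in unet.named_parameters():
        p.requires_grad = "temporal" in name
```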

Methodology

The core innovation of AnimateDiff-Lightning lies in its application of progressive adversarial diffusion distillation specifically tailored for video models. This approach has shown promising results in few-step image generation and is now extended to videos for the first time. The model simultaneously distills the probability flow of multiple base diffusion models, resulting in a distilled motion module that exhibits broader style compatibility.
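
The sketch below illustrates one such distillation update: the student covers a span of the teacher's probability flow in a single step and is trained with a reconstruction term plus an adversarial term. The helper functions, the plain MSE loss, and the solver loop are assumptions for illustration, and the `flow_id` argument anticipates the flow-conditional discriminator described below; the paper's exact objective and scheduler differ.

```python
import torch
import torch.nn.functional as F

def teacher_multi_step(teacher, x_t, t, t_next, cond, substeps=4):
    # Hypothetical helper: walk the teacher's probability flow from noise
    # level t down to t_next in several small solver steps.
    x = x_t
    for s in torch.linspace(t, t_next, substeps + 1)[1:]:
        x = teacher(x, s, cond)
    return x

def distillation_step(student, teacher, discriminator, x_t, t, t_next, cond, flow_id):
    # The student covers the whole span [t, t_next] in ONE step; across
    # training stages the span is progressively widened so the final
    # student needs only a few sampling steps.
    with torch.no_grad():
        target = teacher_multi_step(teacher, x_t, t, t_next, cond)
    pred = student(x_t, t, t_next, cond)

    loss_rec = F.mse_loss(pred, target)              # match the teacher's trajectory
    loss_adv = -discriminator(pred, flow_id).mean()  # adversarial critique
    return loss_rec + loss_adv
```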

Key aspects:

  • Cross-Model Distillation: The proposed distillation methodology effectively combines the probability flow of various base models to distill into a single, shared motion module. This approach not only improves quality across several pre-selected base models but also enhances compatibility with unseen base models.
  • Model and Data Preparation: Using popular base models in both realistic and anime styles, the authors generated an extensive dataset to drive the distillation process. Generating data with the base models themselves mitigates the out-of-distribution problem that arises when anime-style models are distilled on real-world video alone.
  • Flow-Conditional Video Discriminator: The discriminator is extended to accommodate the flows of multiple base models, enabling it to critique each flow trajectory separately. This is a crucial component of the progressive adversarial diffusion distillation technique adapted for video; a minimal sketch follows this list.
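
In code terms, a flow-conditional discriminator can be pictured as an ordinary video discriminator that additionally receives an embedding of which base model's flow produced the sample. The backbone below is a deliberately small placeholder; the paper's actual discriminator architecture differs.

```python
import torch
import torch.nn as nn

class FlowConditionalDiscriminator(nn.Module):
    # One discriminator critiques samples from several base models' flows,
    # conditioned on a learned embedding of the flow's identity.
    def __init__(self, in_ch: int, num_base_models: int, dim: int = 256):
        super().__init__()
        self.flow_embed = nn.Embedding(num_base_models, dim)
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(dim * 2, 1)

    def forward(self, video: torch.Tensor, flow_id: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width); flow_id: (batch,)
        feat = self.backbone(video)
        cond = self.flow_embed(flow_id)
        return self.head(torch.cat([feat, cond], dim=-1))  # real/fake logit
```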

Evaluation & Results

Quantitative and qualitative evaluations demonstrate that AnimateDiff-Lightning sets a new benchmark for few-step video generation quality. Compared with AnimateLCM, the previous state of the art in few-step video generation, it achieves better quality at fewer inference steps across diverse styles, including on unseen base models. Notably, the model remains compatible with existing modules for motion control and with different aspect ratios, underscoring its versatility and practical utility.
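
For practical context, the sketch below shows how the released checkpoints can be run through Hugging Face diffusers' AnimateDiff integration. It follows the pattern of the public release, but the repository name, checkpoint filename, and choice of base model should be treated as assumptions rather than verified specifics.

```python
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, EulerDiscreteScheduler
from diffusers.utils import export_to_gif
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

device, dtype = "cuda", torch.float16
steps = 4  # the release also ships other few-step checkpoints
repo = "ByteDance/AnimateDiff-Lightning"                       # assumed repo id
ckpt = f"animatediff_lightning_{steps}step_diffusers.safetensors"  # assumed filename
base = "emilianJR/epiCRealism"  # any compatible Stable Diffusion v1.5 base model

# Load the distilled motion module and plug it into a frozen image base model.
adapter = MotionAdapter().to(device, dtype)
adapter.load_state_dict(load_file(hf_hub_download(repo, ckpt), device=device))
pipe = AnimateDiffPipeline.from_pretrained(
    base, motion_adapter=adapter, torch_dtype=dtype
).to(device)
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing", beta_schedule="linear"
)

# Few-step sampling: guidance is disabled, matching distilled-model practice.
output = pipe(prompt="a dog running in a field",
              guidance_scale=1.0, num_inference_steps=steps)
export_to_gif(output.frames[0], "animation.gif")
```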

Implications & Future Directions

AnimateDiff-Lightning represents a significant step forward in the development of fast and efficient generative models for video content. By effectively addressing the speed-quality trade-off, it opens new possibilities for real-time applications and complex video generation tasks. The success of cross-model diffusion distillation in this context also suggests a promising avenue for further research, potentially leading to even more versatile and universally applicable distillation modules for various modalities.

Conclusion

In summary, AnimateDiff-Lightning achieves its goal of accelerating video generation without compromising quality through the innovative use of cross-model diffusion distillation. The release of this model to the community is expected to catalyze further advancements in generative AI, particularly in applications requiring high-quality video content generated efficiently.
