FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality (2410.19355v1)

Published 25 Oct 2024 in cs.CV

Abstract: In this paper, we present FasterCache, a novel training-free strategy designed to accelerate the inference of video diffusion models with high-quality generation. By analyzing existing cache-based methods, we observe that directly reusing adjacent-step features degrades video quality due to the loss of subtle variations. We further perform a pioneering investigation of the acceleration potential of classifier-free guidance (CFG) and reveal significant redundancy between conditional and unconditional features within the same timestep. Capitalizing on these observations, we introduce FasterCache to substantially accelerate diffusion-based video generation. Our key contributions include a dynamic feature reuse strategy that preserves both feature distinction and temporal continuity, and CFG-Cache, which optimizes the reuse of conditional and unconditional outputs to further enhance inference speed without compromising video quality. We empirically evaluate FasterCache on recent video diffusion models. Experimental results show that FasterCache can significantly accelerate video generation (e.g., 1.67× speedup on Vchitect-2.0) while keeping video quality comparable to the baseline, and consistently outperform existing methods in both inference speed and video quality.

Authors (7)
  1. Zhengyao Lv (9 papers)
  2. Chenyang Si (36 papers)
  3. Junhao Song (15 papers)
  4. Zhenyu Yang (56 papers)
  5. Yu Qiao (563 papers)
  6. Ziwei Liu (368 papers)
  7. Kwan-Yee K. Wong (51 papers)
Citations (1)

Summary

Accelerating Video Diffusion Models with FasterCache

The paper "FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality" introduces an innovative strategy aimed at accelerating the inference of video diffusion models without compromising the generated video quality. The proposed strategy, named FasterCache, targets the inefficiencies in current diffusion models primarily caused by high computational and memory demands during inference.

Core Contributions

The authors make several contributions to the field of video synthesis through diffusion models:

  1. Dynamic Feature Reuse: A novel dynamic feature reuse strategy addresses the problem that directly reusing adjacent-step features in attention modules tends to degrade video quality. By accounting for subtle yet significant variations between timesteps, the strategy preserves both feature distinction and temporal continuity, retaining small but crucial details through the iterative denoising process (a code sketch of this idea follows the list).
  2. CFG-Cache Optimization: The paper explores the acceleration potential of classifier-free guidance (CFG) and reveals notable redundancy between conditional and unconditional features within the same timestep. Capitalizing on this, CFG-Cache is introduced to optimize the reuse of these outputs: frequency-domain biases between the conditional and unconditional outputs are stored, then dynamically enhanced and reused, accelerating inference without sacrificing visual detail (see the second sketch after this list).
  3. Significant Speedup Achievements: Empirical results show that FasterCache delivers a substantial speedup (up to 1.67× on the Vchitect-2.0 model) while maintaining video quality comparable to the baseline, and it consistently outperforms existing methods on both inference-speed and video-quality benchmarks.

Implications and Future Directions

Practical Implications: The considerable reduction in inference time achieved by FasterCache addresses a major limitation in the practical use of video diffusion models. Because the efficiency gain requires no additional training, the approach is viable for applications that need rapid, high-fidelity video generation, such as virtual reality, special effects, and real-time video synthesis.

Theoretical Implications and Exploration: The paper offers insight into further optimizations of diffusion models, particularly in eliminating redundancy across processing steps. Its approach to feature caching and reuse may inspire future work on other parts of deep networks where similar redundancies can be exploited.

Speculation on AI Developments: As AI continues to evolve, strategies like FasterCache might be instrumental in pushing the boundaries of real-time video synthesis applications. The principles of efficient inference through feature reuse and strategic caching could be extrapolated to other domains of AI, potentially leading to breakthroughs in real-time robotics vision systems, AI-driven simulation environments, and more.

Experimental Validation

The paper details extensive experiments across various video diffusion models, including Open-Sora 1.2, Open-Sora-Plan, Latte, CogVideoX, and Vchitect-2.0. The results underscore FasterCache's applicability across different architectures and its robustness to videos of varying lengths and resolutions. Importantly, the evaluation covers both efficiency (multiply-accumulate operations, or MACs, and latency) and visual quality (measured by VBench, LPIPS, SSIM, and PSNR), ensuring a comprehensive assessment of the method's impact.

Conclusion

The introduction of FasterCache represents a significant step forward in optimizing video diffusion models through a training-free strategy. By intelligently leveraging feature reuse and redundancy in CFG, FasterCache not only boosts inference efficiency but also ensures the preservation of high-quality video outputs. This paper serves as a foundational work for further exploration in enhancing diffusion model efficiency and could have broad implications for real-world applications of AI-generated video content.
