Accelerating Video Diffusion Models with FasterCache
The paper "FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality" introduces an innovative strategy aimed at accelerating the inference of video diffusion models without compromising the generated video quality. The proposed strategy, named FasterCache, targets the inefficiencies in current diffusion models primarily caused by high computational and memory demands during inference.
Core Contributions
The authors make several contributions to the field of video synthesis through diffusion models:
- Dynamic Feature Reuse: The authors observe that naively reusing attention features from adjacent timesteps degrades video quality, because even nearby timesteps differ in subtle but significant ways. Their dynamic feature reuse strategy compensates for these inter-step variations, maintaining both feature distinctiveness and temporal continuity and preserving small but crucial details throughout the iterative denoising process (a minimal sketch follows this list).
- CFG-Cache Optimization: The paper also examines the acceleration potential of classifier-free guidance (CFG) and reveals substantial redundancy between the conditional and unconditional outputs within the same timestep. CFG-Cache exploits this by storing the bias between the two outputs in the frequency domain; at later steps, only one branch is computed and the other is reconstructed from the cached bias with dynamic enhancement of its low- and high-frequency components, accelerating inference without sacrificing visual detail (see the second sketch below).
- Significant Speedup: Empirical results show that FasterCache delivers a substantial speedup (up to 1.67× on the Vchitect-2.0 model) while maintaining video quality comparable to the baseline, and it consistently outperforms existing acceleration methods on both inference-speed and video-quality benchmarks.
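To make the reuse mechanism concrete, here is a minimal sketch of dynamic feature reuse in an attention wrapper. The class name, the full/reuse scheduling, and the `weight` parameter are assumptions for illustration; the sketch only shows the core idea of extrapolating a cached output with an inter-step delta rather than freezing it, and is not the authors' implementation.

```python
import torch

class CachedAttention(torch.nn.Module):
    """Hypothetical wrapper sketching dynamic feature reuse.

    At "full" steps the wrapped attention runs normally and its output
    is cached; at "reuse" steps the output is approximated from the two
    most recent cached outputs, so small inter-step variations are
    carried forward instead of being frozen.
    """

    def __init__(self, attn: torch.nn.Module):
        super().__init__()
        self.attn = attn
        self.prev = None       # output of the last fully computed step
        self.prev_prev = None  # output of the step before that

    def forward(self, x: torch.Tensor, reuse: bool, weight: float) -> torch.Tensor:
        if not reuse or self.prev_prev is None:
            out = self.attn(x)
            self.prev_prev, self.prev = self.prev, out
            return out
        # Dynamic reuse: cached output plus a scaled inter-step delta.
        # Growing `weight` as denoising progresses is an assumption here,
        # not a detail taken from the paper.
        return self.prev + weight * (self.prev - self.prev_prev)
```

In practice such a wrapper would replace each attention block in the denoiser, with the sampler deciding at which timesteps to set `reuse=True`.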
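The CFG-Cache idea can be sketched similarly. The cutoff radius, the band weights `w_low`/`w_high`, and all names here are hypothetical; the sketch only illustrates caching a conditional/unconditional bias in the frequency domain and reapplying it with band-wise enhancement, under the assumption of 2D spatial feature maps.

```python
import torch

def split_freq(x: torch.Tensor, radius: float):
    """Split (..., H, W) features into low/high-frequency spectra via FFT."""
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    yy = torch.linspace(-0.5, 0.5, h).view(-1, 1)
    xx = torch.linspace(-0.5, 0.5, w).view(1, -1)
    mask = ((yy**2 + xx**2).sqrt() <= radius).to(x.device)
    return freq * mask, freq * (~mask)

class CFGCache:
    """Hypothetical CFG-Cache sketch: cache the bias between conditional
    and unconditional outputs in the frequency domain, then reconstruct
    the unconditional output from the conditional one alone."""

    def __init__(self, radius: float = 0.25):
        self.radius = radius
        self.low_bias = None
        self.high_bias = None

    def update(self, cond: torch.Tensor, uncond: torch.Tensor):
        # Full step: both branches were computed, so record their bias.
        c_low, c_high = split_freq(cond, self.radius)
        u_low, u_high = split_freq(uncond, self.radius)
        self.low_bias = u_low - c_low
        self.high_bias = u_high - c_high

    def predict_uncond(self, cond, w_low: float = 1.0, w_high: float = 1.0):
        # Reuse step: only the conditional branch was computed; apply the
        # cached bias with band-wise enhancement weights (an assumption).
        c_low, c_high = split_freq(cond, self.radius)
        freq = (c_low + w_low * self.low_bias) + (c_high + w_high * self.high_bias)
        return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real
```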
Implications and Future Directions
Practical Implications: The substantial reduction in inference time addresses a major obstacle to the practical use of video diffusion models. Because FasterCache is training-free, this efficiency comes at no additional training cost, making it viable for applications that demand rapid, high-fidelity video generation, such as virtual reality, special effects, and real-time video synthesis.
Theoretical Implications and Exploration: The paper offers insight into further optimization opportunities in diffusion models, particularly in exploiting redundancy across processing steps. Its approach to feature caching and reuse may inspire research into other parts of deep learning pipelines where similar optimizations apply.
Speculation on AI Developments: As AI continues to evolve, strategies like FasterCache might be instrumental in pushing the boundaries of real-time video synthesis applications. The principles of efficient inference through feature reuse and strategic caching could be extrapolated to other domains of AI, potentially leading to breakthroughs in real-time robotics vision systems, AI-driven simulation environments, and more.
Experimental Validation
The paper details extensive experiments across several video diffusion models, including Open-Sora 1.2, Open-Sora-Plan, Latte, CogVideoX, and Vchitect-2.0. The results underscore FasterCache's applicability across different architectures and its robustness to videos of varying lengths and resolutions. Importantly, the evaluation covers both efficiency (multiply-accumulate operations, or MACs, and latency) and visual quality (VBench, LPIPS, SSIM, and PSNR), giving a comprehensive assessment of the method's impact; a minimal example of computing the per-frame quality metrics follows.
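For readers who want to reproduce this style of evaluation, here is a minimal sketch of per-frame quality metrics using the `lpips` and `scikit-image` packages. The frame format and helper names are assumptions; VBench scoring is a separate pipeline and is not shown.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(ref: np.ndarray, gen: np.ndarray, lpips_fn) -> dict:
    """Compare one generated frame to a reference; frames are HxWx3 uint8."""
    psnr = peak_signal_noise_ratio(ref, gen, data_range=255)
    ssim = structural_similarity(ref, gen, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_t(ref), to_t(gen)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}

lpips_fn = lpips.LPIPS(net="alex")  # pretrained perceptual metric
# ref_frames, gen_frames: lists of HxWx3 uint8 arrays (placeholders)
# scores = [frame_metrics(r, g, lpips_fn) for r, g in zip(ref_frames, gen_frames)]
```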
Conclusion
FasterCache represents a significant step forward in optimizing video diffusion models through a training-free strategy. By intelligently reusing features and exploiting redundancy in CFG, it boosts inference efficiency while preserving high-quality video output. The paper serves as a foundation for further work on diffusion model efficiency and has broad implications for real-world applications of AI-generated video.