How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models

Published 10 Jun 2025 in cs.CV, cs.AI, and cs.CL | (2506.08351v1)

Abstract: With the rapid development of text-to-vision generation diffusion models, classifier-free guidance has emerged as the most prevalent method for conditioning. However, this approach inherently requires twice as many steps for model forwarding compared to unconditional generation, resulting in significantly higher costs. While previous study has introduced the concept of adaptive guidance, it lacks solid analysis and empirical results, making previous method unable to be applied to general diffusion models. In this work, we present another perspective of applying adaptive guidance and propose Step AG, which is a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment. whose results indicate that restricting classifier-free guidance to the first several denoising steps is sufficient for generating high-quality, well-conditioned images, achieving an average speedup of 20% to 30%. Such improvement is consistent across different settings such as inference steps, and various models including video generation models, highlighting the superiority of our method.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces Step AG, an adaptive guidance strategy that selectively applies classifier-free guidance to reduce computational cost during critical denoising steps.
It demonstrates that applying guidance only during the initial 30%-50% of steps maintains image quality and text alignment, yielding a 20%-30% reduction in inference time.
The proposed approach is versatile, showing consistent performance improvements across both image and video diffusion models, paving the way for standardized acceleration techniques.

Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models

Introduction

The research presented in "How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models" (2506.08351) investigates the efficiency of adaptive guidance strategies in text-to-vision diffusion models, specifically focusing on classifier-free guidance (CFG). The study introduces Step AG, an adaptive strategy that achieves high-quality image generation with reduced computational costs. It offers an insightful analysis of why adaptive guidance works and demonstrates its universal applicability across different models, including video generation models.

Diffusion models [NEURIPS2020_DDPM, song2021DDIM] have become prominent for generative tasks like image, video, and audio synthesis. Classifier-free guidance [ho2022classifier] is a widely used conditioning method that requires twice the model forwarding steps compared to unconditional generation, leading to higher computational costs.

Previous research efforts to mitigate these costs include distillation techniques [meng2023distillation], training-free acceleration methods like caching [ma2024deepcache], and adaptive guidance strategies [castillo2023adaptive]. Unlike them, Step AG offers a simpler, universally applicable approach that does not rely on specific architectural dependencies or intricate hyperparameter tuning.

Adaptive Guidance Mechanism

The core concept of adaptive guidance is to decide whether the conditional or unconditional model predictions (scores) can be used at each denoising step to reduce computation without sacrificing output quality. The paper introduces Step AG, built on the understanding that during later denoising steps, the high signal-to-noise ratio (SNR) renders classifier-free guidance less critical. This leads to a reduction in computational cost by limiting guidance application to the initial 30%-50% of steps.

Implementation and Results

The paper details an empirical evaluation across various diffusion models like Stable-Diffusion and PixArt- $\Sigma$ , demonstrating an average inference time reduction of 20%-30%, while maintaining competitive image and text alignment metrics.

Figure 1: Cosine Similarity $\gamma_t$ of different models. We calculate cosine similarity under their default inference step settings.

Notably, the proposed Step AG strategy consolidates the performance criteria for various generation tasks, providing a consistent improvement across models. This flexibility indicates the potential to standardize acceleration techniques in text-to-vision diffusion settings, making it highly adaptable to existing frameworks.

Performance Considerations

The effectiveness of Step AG compared to traditional CFG and other adaptive strategies like Similarity AG is highlighted by its empirical success across benchmarks. The scalability and implementation ease, due to the predictable computation cost reduction, make it a favorable choice for various applications. The study also discusses the influence of guidance scales and inference steps, asserting that appropriate tuning of these parameters alongside Step AG can enhance practical performance outcomes.

Figure 2: SNR of different diffusion models. This is calculated under training timestep setting. Inference bears a similar trend.

Video Generation Extensions

Step AG's applicability extends to video generation, as evidenced by tests on CogVideoX and ModelScope models, where it facilitated notable acceleration without compromising on the qualitative aspects of generated content. This paves the way for its deployment in more computation-intensive generative contexts.

Figure 3: Performance of Step AG of different video generation models. SPV(ratio) is compared with spv under $p=1.0$ .

Conclusion

The introduction of Step AG in the field of adaptive guidance for diffusion models marks a significant stride in optimizing computational efficiency. Its universal applicability, ease of implementation, and consistent preservation of high-quality generation make it a viable strategy for enhancing the efficiency of existing diffusion-based generative systems. Future research could explore further optimizations and adaptations of Step AG for broader applications in real-time generative systems.

Markdown Report Issue