
Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales (2506.19713v1)

Published 24 Jun 2025 in cs.LG

Abstract: Classifier-free guidance (CFG) has become an essential component of modern conditional diffusion models. Although highly effective in practice, the underlying mechanisms by which CFG enhances quality, detail, and prompt alignment are not fully understood. We present a novel perspective on CFG by analyzing its effects in the frequency domain, showing that low and high frequencies have distinct impacts on generation quality. Specifically, low-frequency guidance governs global structure and condition alignment, while high-frequency guidance mainly enhances visual fidelity. However, applying a uniform scale across all frequencies -- as is done in standard CFG -- leads to oversaturation and reduced diversity at high scales and degraded visual quality at low scales. Based on these insights, we propose frequency-decoupled guidance (FDG), an effective approach that decomposes CFG into low- and high-frequency components and applies separate guidance strengths to each component. FDG improves image quality at low guidance scales and avoids the drawbacks of high CFG scales by design. Through extensive experiments across multiple datasets and models, we demonstrate that FDG consistently enhances sample fidelity while preserving diversity, leading to improved FID and recall compared to CFG, establishing our method as a plug-and-play alternative to standard classifier-free guidance.

Guidance in the Frequency Domain for High-Fidelity Diffusion Sampling at Low CFG Scales

The paper "Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales" (Sadat et al., 24 Jun 2025) presents a systematic analysis and practical enhancement of classifier-free guidance (CFG) in diffusion models by decomposing the guidance signal into frequency components. The authors introduce Frequency-Decoupled Guidance (FDG), a plug-and-play modification to CFG that applies distinct guidance strengths to low- and high-frequency components, yielding improved sample quality and diversity, particularly at low guidance scales.

Motivation and Analysis

Classifier-free guidance is a widely adopted technique in conditional diffusion models, interpolating between conditional and unconditional model predictions to improve sample fidelity and prompt alignment. However, standard CFG applies a uniform guidance scale across all frequency components, leading to a well-known trade-off: high guidance scales improve detail and alignment but reduce diversity and cause oversaturation, while low scales preserve diversity but yield blurry, low-quality samples.
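As a concrete sketch, the standard CFG update combines the two predictions with a single scale w (a minimal illustration; the variable names are ours, not the paper's):

```python
import numpy as np

def cfg_update(pred_cond, pred_uncond, w):
    # Standard classifier-free guidance: push the unconditional prediction
    # toward the conditional one by a single scale w, applied uniformly
    # across all spatial frequencies.
    return pred_uncond + w * (pred_cond - pred_uncond)
```

At w = 1 the update recovers the conditional prediction exactly; at w > 1 it extrapolates past it, which is the regime where oversaturation and diversity loss appear.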

The authors provide a frequency-domain analysis of the CFG update rule, showing that:

  • Low-frequency guidance primarily controls global structure and prompt alignment.
  • High-frequency guidance enhances visual fidelity and detail with minimal impact on global composition.

Empirical evidence demonstrates that excessive low-frequency guidance is the main cause of reduced diversity and oversaturation at high CFG scales, while high-frequency guidance can be increased to improve quality without these adverse effects.
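The low/high split underlying this analysis can be illustrated with a blur-based decomposition (a NumPy sketch of the general idea, not the paper's exact filters): the low-pass component carries coarse structure, the residual carries fine detail, and the two sum back to the original exactly.

```python
import numpy as np

def box_blur(x, k=5):
    # Crude low-pass filter: separable box blur along the last two axes.
    kernel = np.ones(k) / k
    x = np.apply_along_axis(np.convolve, -1, x, kernel, mode="same")
    x = np.apply_along_axis(np.convolve, -2, x, kernel, mode="same")
    return x

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
low = box_blur(img)   # coarse structure (global layout)
high = img - low      # fine detail (edges, texture)
# The split is exact by construction: low + high reconstructs the input.
```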

Frequency-Decoupled Guidance (FDG)

Building on these insights, FDG decomposes the guidance signal at each sampling step into low- and high-frequency components (using, e.g., a Laplacian pyramid or wavelet transform) and applies separate guidance scales to each:

  • A conservative scale is used for low frequencies to avoid diversity loss and oversaturation.
  • A stronger scale is used for high frequencies to enhance detail and fidelity.

This approach is formalized as:

def laplacian_guidance(pred_cond, pred_uncond, guidance_scale):
    # pred_cond, pred_uncond: [B, C, H, W]
    # guidance_scale: (low_freq_scale, high_freq_scale)
    low_freq_scale, high_freq_scale = guidance_scale
    cond_pyr = build_laplacian_pyramid(pred_cond, levels=2)
    uncond_pyr = build_laplacian_pyramid(pred_uncond, levels=2)
    guided_pyr = [
        uncond_pyr[0] + high_freq_scale * (cond_pyr[0] - uncond_pyr[0]),  # high-frequency band
        uncond_pyr[1] + low_freq_scale * (cond_pyr[1] - uncond_pyr[1])    # low-frequency residual
    ]
    return reconstruct_from_pyramid(guided_pyr)

This modification is computationally negligible and requires no retraining, making it compatible with any pretrained diffusion model.
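A self-contained sketch of the full pipeline makes the drop-in nature concrete (our illustrative NumPy implementation of a two-level pyramid, not the authors' code; the scale values are arbitrary examples). Because the decomposition is linear and exactly invertible, setting both scales equal reduces FDG to standard CFG.

```python
import numpy as np

def downsample(x):
    # 2x average pooling over the last two axes ([B, C, H, W], H and W even).
    b, c, h, w = x.shape
    return x.reshape(b, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def upsample(x):
    # Nearest-neighbor 2x upsampling, the inverse direction of downsample.
    return np.repeat(np.repeat(x, 2, axis=2), 2, axis=3)

def fdg(pred_cond, pred_uncond, low_scale, high_scale):
    # Split each prediction into a low-frequency residual and a
    # high-frequency detail band, guide each band separately, recombine.
    low_c, low_u = downsample(pred_cond), downsample(pred_uncond)
    high_c = pred_cond - upsample(low_c)
    high_u = pred_uncond - upsample(low_u)
    low = low_u + low_scale * (low_c - low_u)
    high = high_u + high_scale * (high_c - high_u)
    return upsample(low) + high

rng = np.random.default_rng(0)
cond = rng.standard_normal((1, 3, 16, 16))
uncond = rng.standard_normal((1, 3, 16, 16))
out = fdg(cond, uncond, low_scale=1.5, high_scale=7.5)  # illustrative scales
```

With low_scale == high_scale == w, the output equals uncond + w * (cond - uncond), i.e., plain CFG, which is what makes FDG a strict generalization.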

Empirical Results

Extensive experiments across class-conditional and text-to-image diffusion models (EDM2, DiT-XL/2, Stable Diffusion 2.1/XL/3) demonstrate that FDG:

  • Consistently improves FID and recall over standard CFG at low guidance scales, indicating better quality and diversity.
  • Maintains or improves prompt alignment (as measured by CLIP Score, ImageReward, HPSv2, PickScore).
  • Reduces oversaturation and artifacts associated with high CFG scales.
  • Enhances text rendering in text-to-image models, a regime where standard CFG struggles to balance detail and realism.
  • Is robust across samplers and compatible with distilled models (e.g., SDXL-Lightning), where standard CFG often degrades output quality.

Quantitative results show, for example, FID improvements of 3–4 points and recall increases of 0.05–0.12 across models and datasets, with no loss in precision.

Implementation and Deployment Considerations

  • Integration: FDG can be implemented as a drop-in replacement for the CFG update in the sampling loop, requiring only a frequency decomposition and recombination step per iteration.
  • Frequency Decomposition: Both Laplacian pyramids and wavelet transforms are effective; the method is not sensitive to the specific choice as long as the decomposition meaningfully separates low and high frequencies.
  • Parameter Selection: The low- and high-frequency guidance scales can be tuned empirically; the paper provides recommended values for common models.
  • Computational Overhead: The additional cost is negligible compared to the overall sampling process.
  • Compatibility: FDG is compatible with other guidance and diversity-enhancing techniques (e.g., CADS, APG) and can be combined for further improvements.
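As the bullet on frequency decomposition notes, the specific transform is flexible. One simple alternative split (our FFT low-pass sketch; the radial-mask cutoff is an assumption, not one of the paper's stated choices) uses a circular mask in the centered Fourier spectrum:

```python
import numpy as np

def fft_split(x, cutoff=4):
    # Split a 2D array into low- and high-frequency parts using a
    # circular low-pass mask in the shifted (centered) Fourier spectrum.
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= cutoff ** 2
    low = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
    return low, x - low

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
low, high = fft_split(img)
```

Separate guidance scales would then be applied per band before summing low + high back together, exactly as in the Laplacian-pyramid version.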

Theoretical and Practical Implications

The frequency-domain perspective clarifies the mechanisms by which CFG affects sample quality and diversity, providing a principled basis for decoupling guidance effects. This insight enables:

  • Systematic improvement of sample quality at low guidance scales, avoiding the need for high guidance and its associated drawbacks.
  • A general framework for plug-and-play guidance modification applicable to any conditional diffusion model.
  • Potential for further research into adaptive, content-aware, or multi-band guidance strategies, as well as applications in other generative domains (e.g., video, audio).

Future Directions

  • Adaptive frequency-band guidance: Dynamically adjusting guidance scales based on content or sampling step.
  • Extension to other modalities: Applying frequency-decoupled guidance in video, audio, or 3D generative models.
  • Integration with training: Exploring whether frequency-aware objectives during training further enhance model performance.
  • Analysis of out-of-distribution robustness: Investigating FDG's impact on generalization and robustness in challenging domains.

Conclusion

This work provides a rigorous analysis and practical solution to the longstanding trade-off between quality and diversity in classifier-free guided diffusion models. By decoupling guidance in the frequency domain, FDG enables high-fidelity, diverse sampling at low guidance scales, with minimal implementation burden and broad applicability. The frequency-domain perspective is likely to inform future developments in both the theory and practice of guided generative modeling.

Authors (4)
  1. Seyedmorteza Sadat
  2. Tobias Vontobel
  3. Farnood Salehi
  4. Romann M. Weber