Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales (2506.19713v1)

Published 24 Jun 2025 in cs.LG

Abstract: Classifier-free guidance (CFG) has become an essential component of modern conditional diffusion models. Although highly effective in practice, the underlying mechanisms by which CFG enhances quality, detail, and prompt alignment are not fully understood. We present a novel perspective on CFG by analyzing its effects in the frequency domain, showing that low and high frequencies have distinct impacts on generation quality. Specifically, low-frequency guidance governs global structure and condition alignment, while high-frequency guidance mainly enhances visual fidelity. However, applying a uniform scale across all frequencies -- as is done in standard CFG -- leads to oversaturation and reduced diversity at high scales and degraded visual quality at low scales. Based on these insights, we propose frequency-decoupled guidance (FDG), an effective approach that decomposes CFG into low- and high-frequency components and applies separate guidance strengths to each component. FDG improves image quality at low guidance scales and avoids the drawbacks of high CFG scales by design. Through extensive experiments across multiple datasets and models, we demonstrate that FDG consistently enhances sample fidelity while preserving diversity, leading to improved FID and recall compared to CFG, establishing our method as a plug-and-play alternative to standard classifier-free guidance.


Guidance in the Frequency Domain for High-Fidelity Diffusion Sampling at Low CFG Scales

The paper "Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales" (Sadat et al., 24 Jun 2025) presents a systematic analysis and practical enhancement of classifier-free guidance (CFG) in diffusion models by decomposing the guidance signal into frequency components. The authors introduce Frequency-Decoupled Guidance (FDG), a plug-and-play method that applies distinct guidance strengths to low- and high-frequency bands, thereby improving sample fidelity and diversity without retraining or significant computational overhead.

Motivation and Analysis

CFG is a widely adopted inference technique in conditional diffusion models. It combines the conditional and unconditional model predictions, extrapolating away from the unconditional prediction when the guidance scale exceeds 1, to improve sample quality and prompt alignment. However, standard CFG applies a uniform guidance scale across all frequency components, leading to a well-known trade-off: high guidance scales improve detail and alignment but cause oversaturation and reduce diversity, while low scales preserve diversity but yield blurry, low-fidelity samples.

The authors provide a frequency-domain analysis of the CFG update rule, leveraging linear and invertible transforms such as Laplacian pyramids or wavelet decompositions. They empirically demonstrate that:

Low-frequency guidance primarily controls global structure and condition alignment, but excessive scaling in this band is responsible for reduced diversity and oversaturation.
High-frequency guidance enhances visual fidelity and detail, with minimal impact on diversity or global structure.

This decomposition clarifies the mechanism by which CFG improves quality and exposes the root cause of its adverse effects at high scales.

Frequency-Decoupled Guidance (FDG)

Building on these insights, FDG modifies the CFG update rule by applying separate guidance scales to the low- and high-frequency components of the denoiser outputs. Formally, for conditional ($D_c$) and unconditional ($D_u$) predictions, and frequency decomposition operator $\psi$:

Decompose: $\psi(D_c) = \{D_c^{\text{low}}, D_c^{\text{high}}\}$, $\psi(D_u) = \{D_u^{\text{low}}, D_u^{\text{high}}\}$
Apply guidance: $D^{\text{low}} = D_u^{\text{low}} + w_{\text{low}} (D_c^{\text{low}} - D_u^{\text{low}})$, $D^{\text{high}} = D_u^{\text{high}} + w_{\text{high}} (D_c^{\text{high}} - D_u^{\text{high}})$
Reconstruct: $\hat{D} = \psi^{-1}(\{D^{\text{low}}, D^{\text{high}}\})$

By setting $w_{\text{low}} < w_{\text{high}}$, FDG preserves the diversity and color composition of low guidance scales while enhancing detail akin to high guidance scales. The method is model-agnostic and requires only minor modifications to the sampling loop.
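
As a short check on the formulas above, assume only that $\psi$ is linear and invertible, as the summary states for Laplacian pyramids and wavelets. Writing standard CFG with a single scale $w$ as $D_{\text{cfg}}(w) = D_u + w (D_c - D_u)$ (a notation introduced here for illustration, not taken from the paper), linearity means CFG acts identically on every band:

$$\psi(D_{\text{cfg}}(w)) = \{D_u^{\text{low}} + w (D_c^{\text{low}} - D_u^{\text{low}}),\; D_u^{\text{high}} + w (D_c^{\text{high}} - D_u^{\text{high}})\}$$

FDG with $w_{\text{low}} = w_{\text{high}} = w$ therefore recovers standard CFG, and for general weights the same linearity gives

$$\hat{D} = D_{\text{cfg}}(w_{\text{low}}) + (w_{\text{high}} - w_{\text{low}})\, \psi^{-1}(\{0,\; D_c^{\text{high}} - D_u^{\text{high}}\})$$

Read this way, FDG is CFG at the low scale plus an extra step along the high-frequency part of the guidance direction.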

Pseudocode Example

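def fdg_guidance(pred_cond, pred_uncond, w_low, w_high, laplacian_pyramid):
    """Apply frequency-decoupled guidance to one pair of denoiser predictions.

    `laplacian_pyramid` is assumed to be a callable returning (low, high) bands
    and exposing an `inverse(low, high)` method; any linear, invertible
    decomposition would fit this interface.
    """
    # Decompose both predictions into frequency bands
    cond_low, cond_high = laplacian_pyramid(pred_cond)
    uncond_low, uncond_high = laplacian_pyramid(pred_uncond)
    # Apply a separate guidance scale to each band
    guided_low = uncond_low + w_low * (cond_low - uncond_low)
    guided_high = uncond_high + w_high * (cond_high - uncond_high)
    # Reconstruct the guided prediction from its bands
    return laplacian_pyramid.inverse(guided_low, guided_high)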
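
To make the pseudocode above concrete, here is a minimal runnable sketch in NumPy. It is not the authors' implementation: a box blur stands in for the Laplacian pyramid's low-pass step, and the names `box_blur`, `split_bands`, `merge_bands`, and `fdg`, as well as the kernel size `k`, are assumptions made for illustration.

```python
import numpy as np

def box_blur(x, k=5):
    """Blur the last two axes with a k x k box filter (k odd, edge-padded).

    The operation is linear in x, which is all the guidance rule relies on.
    """
    pad = k // 2
    pad_width = [(0, 0)] * (x.ndim - 2) + [(pad, pad), (pad, pad)]
    padded = np.pad(x, pad_width, mode="edge")
    h, w = x.shape[-2:]
    out = np.zeros_like(x, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[..., dy:dy + h, dx:dx + w]
    return out / (k * k)

def split_bands(x, k=5):
    """Two-band split: low = blurred signal, high = residual detail."""
    low = box_blur(x, k)
    return low, x - low

def merge_bands(low, high):
    """Exact inverse of split_bands."""
    return low + high

def fdg(pred_cond, pred_uncond, w_low, w_high, k=5):
    """Guide the low and high bands with separate scales, then recombine."""
    c_low, c_high = split_bands(pred_cond, k)
    u_low, u_high = split_bands(pred_uncond, k)
    low = u_low + w_low * (c_low - u_low)
    high = u_high + w_high * (c_high - u_high)
    return merge_bands(low, high)
```

Because the split is linear and exactly invertible, calling `fdg` with `w_low == w_high` reproduces standard CFG at that scale, which is a useful sanity check when wiring this into a sampler.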

Empirical Results

The authors conduct extensive experiments on class-conditional and text-to-image diffusion models, including EDM2, DiT-XL/2, Stable Diffusion 2.1, Stable Diffusion XL, and Stable Diffusion 3. Key findings include:

Consistent improvement in FID and recall across all tested models and samplers, indicating better sample quality and diversity.
Superior prompt alignment and human preference metrics (ImageReward, HPSv2, PickScore, CLIP Score) compared to standard CFG, especially at low guidance scales.
Compatibility with fast/distilled samplers (e.g., SDXL-Lightning), where standard CFG often degrades quality but FDG maintains or improves it.
Enhanced text rendering in text-to-image models, as high-frequency guidance can be increased without sacrificing realism.

Quantitative results (see Tables 1 and 2 in the paper) show that FDG achieves lower FID and higher recall than CFG at equivalent or lower computational cost. Ablation studies confirm that the method is robust to the choice of frequency decomposition (Laplacian pyramid vs. wavelet) and works with both single- and multi-level decompositions.
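
The multi-level case mentioned above can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions: 2x2 average pooling and nearest-neighbour upsampling replace whatever kernels the paper uses, spatial sizes are assumed divisible by `2**levels`, and assigning one weight per pyramid band (`weights`) is a hypothetical interface rather than the authors' API.

```python
import numpy as np

def down2(x):
    """Halve the last two axes by 2x2 average pooling (even sizes assumed)."""
    h, w = x.shape[-2:]
    return x.reshape(*x.shape[:-2], h // 2, 2, w // 2, 2).mean(axis=(-3, -1))

def up2(x):
    """Double the last two axes by nearest-neighbour repetition."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def build_pyramid(x, levels):
    """Return [high_0, ..., high_{levels-1}, low], finest band first."""
    bands, cur = [], x
    for _ in range(levels):
        small = down2(cur)
        bands.append(cur - up2(small))  # detail lost by downsampling
        cur = small
    bands.append(cur)  # coarsest low-frequency residual
    return bands

def collapse_pyramid(bands):
    """Exact inverse of build_pyramid."""
    cur = bands[-1]
    for high in reversed(bands[:-1]):
        cur = up2(cur) + high
    return cur

def fdg_multilevel(pred_cond, pred_uncond, weights):
    """Guide each band with its own scale.

    weights[i] applies to band i; the last entry scales the low band,
    so len(weights) == levels + 1.
    """
    levels = len(weights) - 1
    cond = build_pyramid(pred_cond, levels)
    uncond = build_pyramid(pred_uncond, levels)
    guided = [u + w * (c - u) for w, c, u in zip(weights, cond, uncond)]
    return collapse_pyramid(guided)
```

With `weights` of length 2 this reduces to the two-band rule, and with all entries equal it reduces to standard CFG, since both the pyramid and its inverse are linear.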

Implementation and Deployment Considerations

Computational Overhead: The additional cost of frequency decomposition and recomposition (e.g., Laplacian pyramid) is negligible relative to the denoiser forward pass.
Integration: FDG can be implemented as a wrapper around the denoiser output in existing sampling pipelines, requiring only a few lines of code.
Hyperparameters: The choice of $w_{\text{low}}$ and $w_{\text{high}}$ can be tuned per model or application, but the method is not highly sensitive to these values.
Model Compatibility: FDG is applicable to any pretrained diffusion model using CFG, with no retraining or fine-tuning required.
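
The integration point above can be illustrated with a small wrapper. This is a sketch only: the denoiser signature `denoiser(x, sigma, cond)`, the use of `cond=None` for the unconditional pass, and the names `with_guidance`, `dc_split`, and `toy_fdg` are all assumptions, and the toy split (per-channel spatial mean vs. residual) is a deliberately crude stand-in for a Laplacian pyramid or wavelet transform.

```python
import functools
import numpy as np

def dc_split(x):
    """Toy two-band split: per-channel spatial mean (low) and residual (high)."""
    low = x.mean(axis=(-2, -1), keepdims=True)
    return low, x - low

def toy_fdg(pred_cond, pred_uncond, w_low, w_high):
    """Frequency-decoupled guidance using the toy split above."""
    c_low, c_high = dc_split(pred_cond)
    u_low, u_high = dc_split(pred_uncond)
    low = u_low + w_low * (c_low - u_low)
    high = u_high + w_high * (c_high - u_high)
    return low + high  # inverse of dc_split; low broadcasts over space

def with_guidance(denoiser, guidance_fn):
    """Wrap a denoiser so each call returns the guided prediction.

    The rest of the sampling loop stays unchanged and simply calls the
    returned function in place of the raw denoiser.
    """
    def guided(x, sigma, cond):
        pred_cond = denoiser(x, sigma, cond)
        pred_uncond = denoiser(x, sigma, None)  # hypothetical null condition
        return guidance_fn(pred_cond, pred_uncond)
    return guided

# Arbitrary illustrative scales, not values recommended by the paper.
example_guidance = functools.partial(toy_fdg, w_low=1.0, w_high=3.0)
```

A sampler would then call `with_guidance(denoiser, example_guidance)` wherever it previously called the denoiser with standard CFG, which keeps the change confined to the guidance step.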

Theoretical and Practical Implications

The frequency-domain perspective on CFG provides a principled explanation for the observed trade-offs in sample quality and diversity. By decoupling guidance across frequency bands, FDG enables high-fidelity, diverse generation at low guidance scales, which is particularly valuable for applications requiring both realism and variability (e.g., creative content generation, data augmentation, and robust conditional synthesis).

The approach also offers a framework for further research on adaptive or learned frequency-dependent guidance, potentially leading to even more effective and controllable generative models. Additionally, the analysis suggests new directions for understanding and mitigating artifacts in guided diffusion sampling, such as oversaturation and prompt overfitting.

Future Directions

Adaptive or learned frequency guidance schedules that dynamically adjust $w_{\text{low}}$ and $w_{\text{high}}$ during sampling.
Extension to other modalities (e.g., audio, video) where frequency decomposition is natural.
Integration with other diversity-promoting techniques (e.g., CADS, APG) for further improvements.
Theoretical analysis of guidance in the spectral domain to inform model and sampler design.

In summary, this work provides both a conceptual advance in understanding classifier-free guidance and a practical, easily deployable method for improving diffusion model sampling, with strong empirical support across a range of models and tasks.


Authors

Seyedmorteza Sadat
Tobias Vontobel
Farnood Salehi
Romann M. Weber

