Classifier-Free Guidance Approach
- Classifier-Free Guidance is a mechanism in conditional diffusion models that linearly combines conditional and unconditional denoisers to improve sample quality.
- It uses a guidance scale to steer samples towards high conditional-likelihood regions, trading improved fidelity against reduced diversity.
- CFGibbs introduces a stochastic Gibbs-like iterative method to recover the missing Rényi correction, yielding superior fidelity-diversity trade-offs in image and audio synthesis.
Classifier-free guidance (CFG) is a mechanism in conditional diffusion models that linearly combines the outputs of the conditional and unconditional denoisers to increase the fidelity and semantic alignment of generated samples. While CFG is widely adopted for improving visual quality and prompt adherence, it introduces a trade-off: a higher guidance scale enhances sample fidelity at the cost of reduced diversity. Recent work has established that conventional CFG does not correspond to a fully consistent denoising diffusion model (DDM) and omits a theoretically necessary correction, motivating new algorithms such as Classifier-Free Gibbs-like Guidance (CFGibbs) to address this deficiency and improve sample quality-diversity trade-offs (Moufad et al., 27 May 2025).
1. Mathematical Formulation and Operational Principle
Consider a conditional denoising diffusion model with two denoisers evaluated at each noise level $\sigma_t$:
- Unconditional denoiser: $D_\theta(x_t, \sigma_t)$ approximates $\mathbb{E}[x_0 \mid x_t]$
- Conditional denoiser: $D_\theta(x_t, \sigma_t, c)$ estimates $\mathbb{E}[x_0 \mid x_t, c]$, where $c$ indicates the conditioning (e.g., class label, prompt)
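These denoisers determine the corresponding scores through Tweedie's formula; stated here under the variance-exploding convention $x_t = x_0 + \sigma_t\,\varepsilon$ for concreteness:

```latex
% Tweedie's formula (variance-exploding convention x_t = x_0 + sigma_t * eps):
% the posterior mean returned by the denoiser determines the score of p_t.
\mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2\, \nabla_{x_t} \log p_t(x_t)
\qquad\Longleftrightarrow\qquad
\nabla_{x_t} \log p_t(x_t) = \frac{D_\theta(x_t, \sigma_t) - x_t}{\sigma_t^2}.
```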
CFG modifies the update rule by constructing a linear interpolation (in noise or score parameterization):
$$ s_w(x_t, c) \;=\; (1-w)\,\nabla_{x_t}\log p_t(x_t) \;+\; w\,\nabla_{x_t}\log p_t(x_t \mid c), $$
where $w \geq 1$ is the guidance strength. This pushes the sample trajectory preferentially towards regions of high $p_t(c \mid x_t)$, enhancing alignment and sample sharpness while reducing diversity.
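As an illustration, a minimal JAX sketch of the guided combination, with `denoise_uncond` and `denoise_cond` as hypothetical callables standing in for the trained model (not the released cfgig API):

```python
import jax.numpy as jnp

def cfg_denoise(x_t, sigma_t, cond, w, denoise_cond, denoise_uncond):
    """Classifier-free guided denoiser output.

    Linearly combines conditional and unconditional predictions,
        D_w = (1 - w) * D_uncond + w * D_cond,
    which amplifies the conditional signal when w > 1.
    """
    d_uncond = denoise_uncond(x_t, sigma_t)      # estimate of E[x0 | x_t]
    d_cond = denoise_cond(x_t, sigma_t, cond)    # estimate of E[x0 | x_t, c]
    return (1.0 - w) * d_uncond + w * d_cond

# The corresponding guided score follows from Tweedie's formula:
#     s_w = (cfg_denoise(...) - x_t) / sigma_t**2
```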
2. Theoretical Consistency and the Rényi Correction Term
CFG is often equated with sampling from a "tilted" marginal
$$ p_t^{(w)}(x) \;\propto\; p_t(x)\, p_t(c \mid x)^{w}, $$
where $p_t$ is the data distribution at noise scale $\sigma_t$ and $p_t(c \mid x)$ is the classifier or conditional likelihood. However, the CFG combination matches the score of this per-noise-level tilting, not the score of the forward-noised tilted data distribution $\tilde p_t = p_0^{(w)} * \mathcal{N}(0, \sigma_t^2 I)$. The true score of the latter, obtained via Tweedie's formula, is
$$ \nabla_{x_t}\log \tilde p_t(x_t) \;=\; (w-1)\,\nabla_{x_t} R_w\!\big(p(\cdot \mid x_t, c)\,\big\|\,p(\cdot \mid x_t)\big) \;+\; w\,\nabla_{x_t}\log p_t(x_t \mid c) + (1-w)\,\nabla_{x_t}\log p_t(x_t), $$
with $R_w(p \,\|\, q) = \frac{1}{w-1}\log\!\int p(x)^{w}\, q(x)^{1-w}\,\mathrm{d}x$ the Rényi divergence of order $w$, evaluated here between the conditional and unconditional denoising posteriors over $x_0$.
The conventional CFG update implements only the last two terms (the amplified combination of conditional and unconditional scores), while the theoretically necessary first term, $(w-1)\,\nabla_{x_t} R_w\big(p(\cdot \mid x_t, c)\,\|\,p(\cdot \mid x_t)\big)$, is omitted. This missing component is a repulsive force that corrects for excessive concentration, effectively preserving sample diversity; its omission causes mode collapse when the guidance $w$ is strong.
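To make the origin of this correction concrete, here is a compact sketch of the computation (under the variance-exploding convention and the notation above):

```latex
% Noising the tilted data distribution p_0^{(w)}(x_0) prop. to p_0(x_0|c)^w p_0(x_0)^{1-w}
% and applying Bayes' rule inside the integral:
\tilde p_t(x_t)
  \propto \int p_0(x_0 \mid c)^{w}\, p_0(x_0)^{1-w}\,
          \mathcal{N}(x_t; x_0, \sigma_t^2 I)\, \mathrm{d}x_0
  = p_t(x_t \mid c)^{w}\, p_t(x_t)^{1-w}
    \int p(x_0 \mid x_t, c)^{w}\, p(x_0 \mid x_t)^{1-w}\, \mathrm{d}x_0 .
% The remaining integral equals exp{(w-1) R_w(p(.|x_t,c) || p(.|x_t))};
% taking log-gradients recovers the three-term score above.
```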
3. Asymptotics: Rényi Term and Low-Noise Regime
The magnitude of the missing Rényi correction term vanishes as the noise level approaches zero. Specifically, as $\sigma_t \to 0$ the denoising posteriors $p(x_0 \mid x_t, c)$ and $p(x_0 \mid x_t)$ both concentrate around $x_t$, so $R_w\big(p(\cdot \mid x_t, c)\,\|\,p(\cdot \mid x_t)\big)$, and with it the correction gradient, tends to zero.
At late denoising steps (low noise), conventional CFG thus becomes almost correct. However, at higher noise levels (early and mid denoising), the neglected correction leads to systematic discrepancies and overconcentration of samples.
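This vanishing can be checked numerically in a one-dimensional Gaussian model, where both denoising posteriors and the order-$w$ Rényi divergence have closed forms. The sketch below uses made-up means, variances, and guidance strength chosen purely for illustration; it is not code from the paper:

```python
import jax
import jax.numpy as jnp

# 1D Gaussian priors: x0 ~ N(MU, S2) unconditionally, x0|c ~ N(MU_C, S2_C).
MU, S2 = 0.0, 1.0        # unconditional mean / variance (made-up values)
MU_C, S2_C = 1.5, 0.25   # conditional mean / variance (made-up values)
W = 3.0                  # guidance strength (made-up value)

def posterior(x_t, sigma2, mu, s2):
    """Denoising posterior p(x0 | x_t) for x_t = x0 + sigma*eps is Gaussian."""
    v = s2 * sigma2 / (s2 + sigma2)
    m = (sigma2 * mu + s2 * x_t) / (s2 + sigma2)
    return m, v

def renyi_gauss(w, m1, v1, m2, v2):
    """Order-w Renyi divergence R_w(N(m1,v1) || N(m2,v2)), closed form."""
    vw = (1.0 - w) * v1 + w * v2   # must stay positive for R_w to be finite
    return (w * (m1 - m2) ** 2 / (2.0 * vw)
            - jnp.log(vw / (v1 ** (1.0 - w) * v2 ** w)) / (2.0 * (w - 1.0)))

def correction(x_t, sigma2):
    """(w-1) * R_w between conditional and unconditional posteriors."""
    m1, v1 = posterior(x_t, sigma2, MU_C, S2_C)
    m2, v2 = posterior(x_t, sigma2, MU, S2)
    return (W - 1.0) * renyi_gauss(W, m1, v1, m2, v2)

grad_corr = jax.grad(correction)  # gradient of the missing drift w.r.t. x_t

for sigma in [2.0, 1.0, 0.5, 0.1, 0.01]:
    g = grad_corr(0.7, sigma ** 2)
    print(f"sigma={sigma:5.2f}  correction gradient = {float(g):+.5f}")
# The printed gradient shrinks toward 0 as sigma -> 0, matching the asymptotics.
```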
4. Classifier-Free Gibbs-like Guidance (CFGibbs) Algorithm
CFGibbs is proposed to recover the missing repulsive effect and sample from the truly intended "tilted" posterior $p_0^{(w)}(x) \propto p_0(x)\, p(c \mid x)^{w}$. It achieves this using an MCMC-like iterative procedure that alternates between injecting small Gaussian noise and repeated denoising under strong guidance. This interleaving introduces exploration (by adding noise) and exploitation (by denoising with guidance); a code sketch follows the list:
- Start: draw $x_{\max} \sim \mathcal{N}(0, \sigma_{\max}^2 I)$ at the highest noise level
- Initial denoising: run ODE steps from $x_{\max}$ with moderate guidance to obtain $y^{(0)}$
- For $k = 1, \dots, K$ (number of Gibbs iterations):
  - Add noise: $\tilde y^{(k)} = y^{(k-1)} + \sigma_{\mathrm{mix}}\, \varepsilon_k$, $\varepsilon_k \sim \mathcal{N}(0, I)$
  - Denoise: run the ODE from noise level $\sigma_{\mathrm{mix}}$ to $0$ with strong guidance, using a fraction of the denoising steps, to yield $y^{(k)}$
- Output: $y^{(K)}$ as the generated sample
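A minimal JAX sketch of this procedure is given below; `ode_denoise` and all argument names are placeholders assumed for this summary, not the interface of the released implementation:

```python
import jax
import jax.numpy as jnp

def cfgibbs_sample(key, ode_denoise, shape, sigma_max, sigma_mix,
                   w_moderate, w_strong, n_gibbs, n_steps, refine_frac):
    """Gibbs-like guided sampling: alternate small re-noising with short,
    strongly guided denoising. `ode_denoise(x, sigma_start, w, n_steps)` is a
    hypothetical ODE solver running from noise level sigma_start down to 0."""
    key, k0 = jax.random.split(key)
    # Initial pass: full ODE run from sigma_max with moderate guidance.
    x = sigma_max * jax.random.normal(k0, shape)
    y = ode_denoise(x, sigma_max, w_moderate, n_steps)

    refine_steps = max(1, int(refine_frac * n_steps))
    for _ in range(n_gibbs):
        key, kn = jax.random.split(key)
        # Exploration: inject small Gaussian noise at level sigma_mix.
        y_noisy = y + sigma_mix * jax.random.normal(kn, shape)
        # Exploitation: short, strongly guided denoising back to sigma = 0.
        y = ode_denoise(y_noisy, sigma_mix, w_strong, refine_steps)
    return y
```

Each refinement re-noises only to the moderate level $\sigma_{\mathrm{mix}}$ and reuses a short guided ODE run, so the extra cost scales with the number of Gibbs iterations times the refinement fraction rather than with full re-generation.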
As the number of Gibbs iterations $K$ grows and the re-noising level $\sigma_{\mathrm{mix}}$ shrinks, the method converges to the exact tilted density $p_0^{(w)}$. In a one-dimensional Gaussian case, this convergence can be characterized exactly up to a vanishing error term.
5. Empirical Performance Comparison
CFGibbs was evaluated using both image (ImageNet-512; EDM2-S/XXL; 32 Heun steps) and audio (AudioCaps; AudioLDM 2-Large; 200 DDIM steps) benchmarks against several established CFG variants:
- CFG (standard)
- Limited-interval CFG
- CFG++ (manifold-constrained)
- CFGibbs (proposed)
The results demonstrate:
- CFGibbs achieves the lowest or near-lowest FID and FD (Fréchet distance in a learned feature space) scores, with consistently better precision/recall and density/coverage trade-offs
- For text-to-audio, CFGibbs yields the lowest FAD and competitive KL and IS, outperforming all CFG baselines on the corresponding metrics
- The gains in perceptual quality and diversity are in the 10–20% range over standard CFG, with modest runtime overhead (roughly 15–20% over standard CFG for a 500-image batch)
6. Practical Implementation Considerations
- CFGibbs employs Heun's method as the sampler for images and DDIM for audio.
- The noise schedule for images uses a Karras power law (see the sketch after this list); for audio, discrete variance-preserving steps are used.
- Hyperparameters, including the guidance strengths, the number of Gibbs iterations $K$, the re-noising level, and the fraction of denoising steps per refinement, are set per model (EDM2-S/XXL, AudioLDM 2); the specific values are listed in Table A.5 of (Moufad et al., 27 May 2025).
- Code is available at https://github.com/yazidjanati/cfgig (JAX/Flax) and is executable on a single modern GPU.
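For reference, the Karras power-law schedule mentioned above has a standard closed form; the sketch below uses commonly cited EDM defaults ($\sigma_{\min}=0.002$, $\sigma_{\max}=80$, $\rho=7$) as assumptions, with the paper's exact settings in its appendix:

```python
import jax.numpy as jnp

def karras_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Karras et al. power-law noise schedule:
    sigma_i = (sigma_max^(1/rho) + i/(N-1) * (sigma_min^(1/rho) - sigma_max^(1/rho)))^rho,
    a decreasing sequence from sigma_max to sigma_min."""
    ramp = jnp.linspace(0.0, 1.0, n_steps)
    inv_rho_max = sigma_max ** (1.0 / rho)
    inv_rho_min = sigma_min ** (1.0 / rho)
    return (inv_rho_max + ramp * (inv_rho_min - inv_rho_max)) ** rho

print(karras_sigmas(32))  # 32 Heun steps, as in the image experiments
```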
Summary Table of Key Quantitative Results
| Metric (benchmark) | CFG | CFGibbs (proposed) |
|---|---|---|
| FID (ImageNet-512, EDM2-S) | higher | lower |
| FD | higher | lower |
| Precision | lower | higher |
| Recall | lower | higher |
| Coverage | lower | higher |
| FAD (AudioCaps) | higher | lower |
Across both domains, CFGibbs offers superior trade-offs between conditional alignment and diversity, outperforming prior heuristics.
7. Theoretical and Practical Significance
This analysis establishes that conventional classifier-free guidance omits a crucial corrective drift (the Rényi divergence gradient), which is only negligible during the late denoising regime but significant in the early phases where sample contraction occurs. By correcting for this omission using a stochastic Gibbs-like mechanism, CFGibbs recovers the full target distribution up to small discretization/noise errors, increasing both the fidelity and the diversity of generated samples for fixed computational resources. The method is practical, requires no retraining, and introduces only a modest inference-time overhead (Moufad et al., 27 May 2025).