CFG Resolution Weighting (CFG-RW)
- CFG-RW is a method that rectifies the expectation shift in conventional classifier-free guidance by modifying the coefficient constraints.
- It relaxes the traditional linear sum-to-one restriction, enforcing a zero-mean property to maintain diffusion process consistency.
- Empirical results show that CFG-RW improves FID and conditional alignment across various diffusion samplers with minimal computational overhead.
CFG Resolution Weighting (CFG-RW), more rigorously characterized as Rectified Classifier-Free Guidance (ReCFG), refers to a post-hoc modification of the coefficient selection used for classifier-free guidance in diffusion model sampling. Conventional classifier-free guidance (CFG) employs a linear combination of conditional and unconditional score estimates, governed by coefficients that sum to unity. However, this approach introduces a systematic bias—an "expectation shift"—which theoretically disrupts the reciprocity of the reverse diffusion process. CFG Resolution Weighting corrects this bias by relaxing the “sum-to-one” constraint, instead solving for guidance coefficients that enforce a zero-mean property of the combined score, thereby restoring theoretical consistency with the forward–reverse SDE/ODE framework and improving sampling fidelity in conditional generative modeling (Xia et al., 24 Oct 2024).
1. Theoretical Basis and Expectation Shift
Standard CFG replaces the true conditional score with a weighted mixture:

$$\nabla_{x_t}\log \tilde p_w(x_t \mid c) = w\,\nabla_{x_t}\log p(x_t \mid c) + (1 - w)\,\nabla_{x_t}\log p(x_t), \qquad w > 1.$$

In the $\epsilon$-prediction formulation this corresponds to

$$\tilde\epsilon(x_t, c) = w\,\epsilon_\theta(x_t, c) + (1 - w)\,\epsilon_\theta(x_t).$$

While this sharpens the conditional distribution $p(x_t \mid c)$, it violates the zero-mean property $\mathbb{E}_{q(x_t \mid c)}[\tilde\epsilon(x_t, c)] = 0$ that is critical for diffusion-theoretic reversibility. Specifically,

$$\mathbb{E}_{q(x_t \mid c)}[\epsilon_\theta(x_t, c)] \neq \mathbb{E}_{q(x_t \mid c)}[\epsilon_\theta(x_t)]$$

and thus,

$$\mathbb{E}_{q(x_t \mid c)}[\tilde\epsilon(x_t, c)] = w\,\mathbb{E}[\epsilon_\theta(x_t, c)] + (1 - w)\,\mathbb{E}[\epsilon_\theta(x_t)] \neq 0.$$

This expectation shift prevents the reverse process from precisely inverting the forward diffusion, resulting in a systematic bias away from $q(x_0 \mid c)$ (Xia et al., 24 Oct 2024).
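The shift can be made concrete with a small numerical sketch. The setup below is purely illustrative (not from the paper): a two-class 1-D Gaussian mixture where the optimal conditional and unconditional noise predictors have closed forms. The conditional prediction is zero-mean under the conditional forward marginal, while the CFG combination is not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical numbers): two equiprobable classes with means +/-2
# and unit variance; VP-style forward process x_t = a*x0 + b*eps.
mu = np.array([2.0, -2.0])
a, b = 0.8, 0.6                       # a^2 + b^2 = 1
var_t = a**2 * 1.0 + b**2             # Var(x_t | c) = 1.0 here

def eps_cond(x_t, c):
    """Optimal conditional noise predictor E[eps | x_t, c] (closed form)."""
    return b * (x_t - a * mu[c]) / var_t

def eps_uncond(x_t):
    """Optimal unconditional predictor: posterior-weighted over classes."""
    ll = np.stack([-(x_t - a * m) ** 2 / (2 * var_t) for m in mu])
    post = np.exp(ll - ll.max(axis=0))
    post /= post.sum(axis=0)
    mu_bar = (post * mu[:, None]).sum(axis=0)   # posterior mean of mu_c
    return b * (x_t - a * mu_bar) / var_t

# Sample x_t from the forward process conditioned on class 0.
n, w = 200_000, 3.0
x0 = rng.normal(mu[0], 1.0, n)
x_t = a * x0 + b * rng.normal(0.0, 1.0, n)

e_c = eps_cond(x_t, 0)
e_u = eps_uncond(x_t)
cfg = w * e_c + (1 - w) * e_u         # standard sum-to-one CFG combination

print(f"E[eps_c] = {e_c.mean():+.4f}  (zero-mean, as theory requires)")
print(f"E[eps_u] = {e_u.mean():+.4f}  (nonzero under q(x_t | c))")
print(f"E[cfg]   = {cfg.mean():+.4f}  (expectation shift, w = {w})")
```

The conditional branch averages to zero, but the unconditional branch does not under the conditional marginal, so the weighted combination inherits a nonzero mean.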
2. Derivation of Rectified Guidance Weights
CFG Resolution Weighting introduces two free coefficients, $\gamma_1$ and $\gamma_0$, corresponding to the conditional and unconditional branches:

$$\tilde s(x_t, c) = \gamma_1\,\nabla_{x_t}\log p(x_t \mid c) + \gamma_0\,\nabla_{x_t}\log p(x_t),$$

with the $\epsilon$-space form:

$$\tilde\epsilon(x_t, c) = \gamma_1\,\epsilon_\theta(x_t, c) + \gamma_0\,\epsilon_\theta(x_t).$$

A zero-expectation (annihilation) constraint is enforced:

$$\gamma_1\,\mathbb{E}_{q(x_t \mid c)}[\epsilon_\theta(x_t, c)] + \gamma_0\,\mathbb{E}_{q(x_t \mid c)}[\epsilon_\theta(x_t)] = 0.$$

Estimating $\mathbb{E}[\epsilon_\theta(x_t, c)]$ and $\mathbb{E}[\epsilon_\theta(x_t)]$ via Monte Carlo, the optimal coefficient is found in closed form:

$$\gamma_0 = -\gamma_1\,\rho(t, c), \qquad \rho(t, c) = \frac{\mathbb{E}[\epsilon_\theta(x_t, c)]}{\mathbb{E}[\epsilon_\theta(x_t)]}.$$

Practically, $\gamma_1$ is set to the guidance strength $w$, so

$$\gamma_0 = -w\,\rho(t, c),$$

with the practical constraints $\gamma_1 > 0$ and $\gamma_0 < 0$ typically satisfied for $w > 1$.
The relationship to original CFG is outlined as follows:
| Approach | Coefficient Form | Constraint |
|---|---|---|
| CFG | $w\,\epsilon_\theta(x_t, c) + (1 - w)\,\epsilon_\theta(x_t)$ | Coefficients sum to one |
| ReCFG | $\gamma_1\,\epsilon_\theta(x_t, c) + \gamma_0\,\epsilon_\theta(x_t)$, $\gamma_0 = -\gamma_1\,\rho(t, c)$ | Zero expectation; no sum constraint |
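The closed form reduces to a one-line computation once the expectations are estimated. A minimal sketch (the function name and the example expectation values are illustrative, not from the paper):

```python
def recfg_coefficients(mean_eps_c, mean_eps_u, w):
    """Closed-form rectified weights: gamma1 = w, gamma0 = -w * rho.

    mean_eps_c / mean_eps_u are Monte Carlo estimates of the per-(t, c)
    expectations of the conditional / unconditional noise predictions.
    """
    rho = mean_eps_c / mean_eps_u
    return w, -w * rho

# Hypothetical expectation estimates for one (t, c) pair.
m_c, m_u = 0.031, 0.029
g1, g0 = recfg_coefficients(m_c, m_u, w=3.0)

# The annihilation constraint holds by construction (up to float error).
residual = g1 * m_c + g0 * m_u
print(g1, g0, residual)
```

Note that no sum-to-one constraint ties $\gamma_1$ and $\gamma_0$ together; the zero-expectation condition alone fixes $\gamma_0$ once $\gamma_1 = w$ is chosen.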
3. Computation of Resolution Weights
CFG-RW requires precomputing the ratio

$$\rho(t, c) = \frac{\mathbb{E}_{q(x_t \mid c)}[\epsilon_\theta(x_t, c)]}{\mathbb{E}_{q(x_t \mid c)}[\epsilon_\theta(x_t)]}$$

for each condition $c$ and timestep $t$. This is achieved through a single-pass Monte Carlo estimate across the dataset:
- Initialize accumulators $S_c = 0$, $S_u = 0$, $N = 0$ for each pair $(t, c)$.
- For each data sample $x_0$ with condition $c$ and each time $t$:
  - Draw $\epsilon \sim \mathcal{N}(0, I)$, set $x_t = \alpha_t x_0 + \sigma_t \epsilon$.
  - Compute $\epsilon_c = \epsilon_\theta(x_t, c)$ and $\epsilon_u = \epsilon_\theta(x_t)$.
  - Accumulate $S_c \leftarrow S_c + \epsilon_c$, $S_u \leftarrow S_u + \epsilon_u$, $N \leftarrow N + 1$.
- After traversal, set $\bar\epsilon_c = S_c / N$, $\bar\epsilon_u = S_u / N$, and $\rho(t, c) = \bar\epsilon_c / \bar\epsilon_u$.
This lookup table keeps runtime overhead minimal, as the coefficients are retrieved rather than recomputed during sampling (Xia et al., 24 Oct 2024).
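The single-pass procedure above can be sketched as follows. This is a simplified sketch under assumed interfaces (a `model(x_t, c, t)` callable where `c=None` means unconditional, per-sample averaging into a scalar per $(t, c)$; the toy model in the demo is purely illustrative):

```python
import numpy as np

def build_rho_table(model, data, timesteps, alphas, sigmas, rng):
    """One-pass Monte Carlo estimate of rho(t, c) = E[eps_c] / E[eps_u].

    `data` yields (x0, c) pairs; `alphas`/`sigmas` give the forward
    schedule at each timestep. Interfaces here are assumptions.
    """
    sums = {}                                    # (t, c) -> [S_c, S_u, N]
    for x0, c in data:
        for i, t in enumerate(timesteps):
            eps = rng.standard_normal(np.shape(x0))
            x_t = alphas[i] * x0 + sigmas[i] * eps     # forward diffusion
            e_c = model(x_t, c, t)                      # conditional
            e_u = model(x_t, None, t)                   # unconditional
            acc = sums.setdefault((t, c), [0.0, 0.0, 0])
            acc[0] += float(np.mean(e_c))
            acc[1] += float(np.mean(e_u))
            acc[2] += 1
    return {key: (s_c / n) / (s_u / n) for key, (s_c, s_u, n) in sums.items()}

# Demo with a toy linear "model" (a real model is a neural network).
rng = np.random.default_rng(0)
toy_model = lambda x_t, c, t: x_t - (0.0 if c is None else 0.1)
data = [(np.full(4, 5.0), 1), (np.full(4, 4.0), 1)]
table = build_rho_table(toy_model, data,
                        timesteps=[0.5], alphas=[0.9], sigmas=[0.4], rng=rng)
print(table)
```

At sampling time, a dictionary lookup of `table[(t, c)]` replaces any extra network evaluation, which is why the runtime cost is negligible.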
4. Integration with Diffusion Model Samplers
Most state-of-the-art diffusion samplers (e.g., DDIM, Euler–Maruyama, EDM2, SD3) use the following procedure in each denoising step:
```python
for t in timesteps:
    eps_c = model(x_t, c, t)                 # conditional prediction
    eps_u = model(x_t, t)                    # unconditional prediction
    eps = beta_c * eps_c + beta_u * eps_u    # CFG: beta_c + beta_u = 1
    x_t = sampler_step(x_t, eps, t)
```

Integrating ReCFG requires changing only the combination line, replacing the fixed unconditional weight with the precomputed ratio $\rho(t, c)$:

```python
for t in timesteps:
    eps_c = model(x_t, c, t)
    eps_u = model(x_t, t)
    eps = w * eps_c + (-w * rho(t, c)) * eps_u   # zero-expectation weights
    x_t = sampler_step(x_t, eps, t)
```
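To make the integration concrete, here is a self-contained toy sampler using a deterministic DDIM-style update. The noise predictor, schedule values, and constant `rho` are all stand-ins (a real deployment uses a trained network and the precomputed lookup table):

```python
import numpy as np

T = 8
alpha_bar = np.linspace(0.95, 0.05, T)    # toy cumulative-alpha schedule

def dummy_model(x_t, c, t):
    """Stand-in noise predictor; c=None means unconditional."""
    shift = 0.0 if c is None else 0.05
    return 0.1 * x_t + shift

def rho(t, c):
    """Stand-in for the precomputed lookup table; ~1 in practice."""
    return 0.98

def sample_recfg(x_T, c, w=3.0):
    x_t = x_T
    for i in reversed(range(1, T)):
        a_t, a_prev = alpha_bar[i], alpha_bar[i - 1]
        eps_c = dummy_model(x_t, c, i)
        eps_u = dummy_model(x_t, None, i)
        eps = w * eps_c + (-w * rho(i, c)) * eps_u   # ReCFG combination
        # Deterministic DDIM step: predict x0, then re-noise to step i-1.
        x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x_t = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps
    return x_t

out = sample_recfg(np.zeros(4), c=1)
print(out)
```

Because only the guidance-combination line differs from standard CFG, the same change drops into any sampler that consumes a single combined noise estimate per step.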
5. Empirical Performance and Ablation Highlights
Empirical studies show quantifiable gains in both fidelity and conditional faithfulness:
| Model/Dataset | Standard CFG | ReCFG (CFG-RW) | Metric | Change |
|---|---|---|---|---|
| LDM, ImageNet 256×256, 20 steps | FID ≈ 18.9 | FID ≈ 16.9 | FID | ↓ 2.0 |
| EDM2-S, ImageNet 512×512, 63 steps | FID ≈ 5.9 | FID ≈ 4.8 | FID | ↓ 1.1 |
| SD3, CC12M 512×512, 25 steps | CLIP ≈ 0.268, FID ≈ 72.2 | CLIP ≈ 0.270, FID ≈ 71.8 | CLIP, FID | ↑0.002, ↓0.4 |
Ablation reveals:
- Lookup table estimates saturate in performance after ≈300 traversals per condition.
- The mean ratio $\rho(t, c)$ varies minimally across conditions $c$, justifying use of a global or average $\rho(t)$ for open-vocabulary text models with negligible loss (CLIP-Score loss ≤ 0.001).
- Storing pixel-wise $\rho(t, c)$ yields marginal additional benefit over a scalar per $(t, c)$.
A one-dimensional Gaussian toy example demonstrates that standard CFG systematically shifts the mean, while ReCFG recovers the exact mean and reduces variance (Xia et al., 24 Oct 2024).
6. Practical Recommendations
- Guidance strength $w$ mediates the trade-off between fidelity and diversity, with optimal performance at moderate values (up to $\approx 5$) for class-conditional models and higher values for open-vocabulary prompting.
- In practice, $\rho(t, c) \approx 1$, so $\gamma_0 \approx -w$ with minor corrections; thus, ReCFG closely approximates boosting the conditional branch by $w$ while enforcing zero expectation in the combined prediction.
- For high-resolution synthesis, the estimated ratio $\rho(t, c)$ stabilizes as denoising progresses, preserving stability in late denoising steps.
- ReCFG can be rapidly implemented post-hoc for any pretrained conditional diffusion model with negligible computational overhead and consistent performance gains.
7. Significance and Implications
CFG Resolution Weighting enforces the theoretical zero-mean property absent in standard CFG by removing the linear coefficient constraint, aligning the sampling process with the requirements of diffusion SDE/ODE theory. The post-hoc nature and closed-form solution for the rectified coefficients permit integration without retraining or architecture changes. Empirical results indicate systematic improvements in FID and conditional alignment for both class-labeled and open-vocabulary generative tasks. The minimal variation in $\rho(t, c)$ across conditions suggests potential for further optimization in lookup-table storage and runtime efficiency. A plausible implication is that this approach may generalize beyond the diffusion samplers currently demonstrated, offering a template for theoretical corrections to guidance heuristics in other generative domains (Xia et al., 24 Oct 2024).