Zero-CFG: Optimized Guidance for Flow Models
- Zero-CFG is an enhanced classifier-free guidance method that integrates optimized scaling and zero-init to refine velocity corrections in flow matching models.
- It significantly improves sample fidelity and controllability during early generation stages by mitigating estimation error compared to standard CFG.
- Empirical benchmarks in image and video synthesis showcase quantitative gains (e.g., lower FID scores) with negligible computational overhead.
Zero-CFG (CFG-Zero*) is an enhanced classifier-free guidance (CFG) method for conditional generative models using flow matching. By addressing limitations in the original CFG approach, Zero-CFG introduces “optimized scale” and “zero-init” mechanisms to improve both sample fidelity and controllability—especially during early generation stages where standard CFG can degrade performance. The method demonstrates significant quantitative and qualitative improvements across image and video generation benchmarks, with negligible computational overhead (Fan et al., 24 Mar 2025).
1. Classifier-Free Guidance in Flow Matching Models
Classifier-free guidance (CFG) is a widely used inference-time strategy in diffusion and flow-based generative models. In this setting, generative sampling is formulated as a time-indexed ODE (or SDE) trajectory $(x_t)_{t \in [0,1]}$ transforming samples from a simple prior $p_0$ (e.g., a standard Gaussian) to a structured data distribution $p_1$. The model is trained to estimate a velocity field $v_\theta$ via

$$\frac{dx_t}{dt} = v_\theta(x_t, t, c),$$

where $c$ denotes a condition such as a text prompt or class label. Learning proceeds by minimizing

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, x_1, t}\left[\lVert v_\theta(x_t, t, c) - (x_1 - x_0)\rVert^2\right],$$

with $x_t = t\,x_1 + (1 - t)\,x_0$ under the standard rectified-flow interpolation.
CFG extends this framework by jointly learning a conditional branch $v_\theta(x_t, t, c)$ and an unconditional branch $v_\theta(x_t, t, \varnothing)$ within the same network. At inference, a guidance-weighted blend constructs

$$v^{\text{CFG}} = v_\theta(x_t, t, \varnothing) + \omega\left(v_\theta(x_t, t, c) - v_\theta(x_t, t, \varnothing)\right),$$

with $\omega > 1$ amplifying the influence of the condition $c$. This increases alignment to the condition while risking out-of-distribution artifacts.
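As a concrete illustration, the blend above can be sketched in a few lines of Python. The plain-vector representation and function name are illustrative assumptions, not the paper's implementation:

```python
def cfg_velocity(v_cond, v_uncond, omega):
    """Standard CFG blend: v_u + omega * (v_c - v_u), elementwise.

    v_cond / v_uncond are the conditional and unconditional velocity
    estimates (plain Python lists here); omega is the guidance weight.
    """
    return [vu + omega * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]
```

With $\omega = 1$ the blend reduces to the conditional branch; $\omega > 1$ extrapolates past it, away from the unconditional estimate.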
2. Analysis: Limitations of Standard CFG
When the velocity model is underfitted, especially during early training or at the initial steps of the ODE solver ($t \approx 0$), CFG can exacerbate estimation error. For target distributions where the ground-truth velocity $v^{*}$ is available (e.g., Gaussian mixtures), empirical analysis shows

$$\lVert v^{\text{CFG}} - v^{*} \rVert \;>\; \lVert 0 - v^{*} \rVert = \lVert v^{*} \rVert$$

in these early steps, meaning that even a zero velocity would outperform the CFG-blended estimate. Therefore, standard CFG not only fails to improve guidance but can actively worsen the sample trajectory relative to a naïve baseline.
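This failure mode is easy to reproduce numerically. The vectors below are hypothetical stand-ins for an underfitted model near $t = 0$, where the true velocity is small but both branch estimates are far off:

```python
import math

def l2(v):
    """Euclidean norm of a plain-list vector."""
    return math.sqrt(sum(x * x for x in v))

# Hypothetical toy values: small ground-truth velocity, poor estimates.
v_star = [0.1, 0.1]      # ground-truth velocity near t = 0
v_cond = [1.0, -0.5]     # underfitted conditional estimate
v_uncond = [0.8, -0.6]   # underfitted unconditional estimate
omega = 2.0

v_cfg = [vu + omega * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]
err_cfg = l2([a - b for a, b in zip(v_cfg, v_star)])  # error of CFG blend
err_zero = l2(v_star)                                 # error of zero velocity
assert err_cfg > err_zero  # zero velocity is closer to the ground truth
```

For these (assumed) values the CFG blend lands further from the true velocity than doing nothing at all, matching the inequality above.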
3. Optimized Scale Mechanism
Zero-CFG introduces an “optimized scale” scalar $s$ to reweight the unconditional branch prior to blending. The new guided velocity is

$$v' = s \cdot v_u + \omega\left(v_c - s \cdot v_u\right),$$

with $s$ chosen to minimize

$$\min_{s}\; \lVert v_c - s \cdot v_u \rVert^2 .$$

By closed-form projection,

$$s^{*} = \frac{\langle v_c, v_u \rangle}{\lVert v_u \rVert^2},$$

and the final guided velocity can be reparametrized as

$$v' = \omega\, v_c + (1 - \omega)\, s^{*} v_u,$$

where $v_c = v_\theta(x_t, t, c)$ and $v_u = v_\theta(x_t, t, \varnothing)$.
This mechanism replaces the raw unconditional velocity with the projection of the conditional velocity onto its direction, so that the guidance term optimally corrects the model's estimation error, especially in underfitted regimes.
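A minimal sketch of the optimized-scale computation and the resulting guided velocity, assuming plain-vector inputs (function names are illustrative, not the authors' code):

```python
def optimized_scale(v_cond, v_uncond, eps=1e-12):
    """Closed-form projection factor s* = <v_c, v_u> / ||v_u||^2."""
    dot = sum(c * u for c, u in zip(v_cond, v_uncond))
    norm_sq = sum(u * u for u in v_uncond)
    return dot / (norm_sq + eps)  # eps guards against a zero-norm branch

def zero_cfg_velocity(v_cond, v_uncond, omega):
    """Guided velocity: s* v_u + omega * (v_c - s* v_u), elementwise."""
    s = optimized_scale(v_cond, v_uncond)
    return [s * vu + omega * (vc - s * vu) for vc, vu in zip(v_cond, v_uncond)]
```

Note that when $v_c$ and $v_u$ are parallel, $s^{*}$ rescales $v_u$ to match $v_c$ exactly and the guidance correction term vanishes; the cost is a single dot product per step.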
4. Zero-Init: Early Step Correction
The “zero-init” strategy is motivated by the observation that, for the initial timesteps ($t \approx 0$), even the optimized guided velocity is inferior to simply zero. Thus, Zero-CFG zeros out the velocity in the first $K$ ODE steps:

$$v'(x_t, t, c) = 0 \quad \text{for the first } K \text{ solver steps}.$$

The default $K = 1$ (zeroing only the first step) is typically sufficient. This procedure guarantees that the model's generated trajectory is not misdirected by severely erroneous velocity estimates in early timesteps, limiting the compounding of error over the full integration.
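Putting both mechanisms together, a simple Euler sampler might look like the following sketch. The `model(x, t, cond)` interface, with `None` standing in for the unconditional input, is an assumed placeholder rather than any particular library's API:

```python
def optimized_scale(v_cond, v_uncond, eps=1e-12):
    """Closed-form s* = <v_c, v_u> / ||v_u||^2."""
    dot = sum(c * u for c, u in zip(v_cond, v_uncond))
    return dot / (sum(u * u for u in v_uncond) + eps)

def sample_zero_cfg(model, x, cond, num_steps, omega, zero_init_steps=1):
    """Euler ODE integration from t = 0 to t = 1 with Zero-CFG guidance."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        if i < zero_init_steps:
            # Zero-init: skip the earliest, least reliable estimates.
            v = [0.0] * len(x)
        else:
            v_c = model(x, t, cond)   # conditional branch
            v_u = model(x, t, None)   # unconditional branch
            s = optimized_scale(v_c, v_u)
            v = [s * vu + omega * (vc - s * vu) for vc, vu in zip(v_c, v_u)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With a constant-velocity toy model and four steps, zeroing the first step simply removes one quarter of the total displacement, illustrating how zero-init only suppresses the earliest update.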
5. Key Mathematical Formulations
Comparison between standard CFG and Zero-CFG is summarized by the velocity update formulas:
| Variant | Guided Velocity Formula |
|---|---|
| Standard CFG | $v_u + \omega\,(v_c - v_u)$ |
| Zero-CFG | $s^{*} v_u + \omega\,(v_c - s^{*} v_u)$, with $v' = 0$ for the first $K$ steps |

where $v_c = v_\theta(x_t, t, c)$ and $v_u = v_\theta(x_t, t, \varnothing)$.
The “optimized scale” $s^{*} = \langle v_c, v_u \rangle / \lVert v_u \rVert^2$ is a closed-form projection factor computed at each step.
6. Empirical Performance and Benchmarks
CFG-Zero* exhibits accelerated and more accurate convergence on toy targets (e.g., 2D Gaussian mixtures), with reduced error norms in the early solver steps compared to standard CFG. On real-world tasks:
- ImageNet-256 with pre-trained SiT-XL (700M parameters):
- Baseline conditional: IS=125.1, FID=9.41
- Standard CFG: IS=257.0, FID=2.23
- ADG: IS=257.9, FID=2.37
- CFG++: IS=257.0, FID=2.25
- Zero-CFG: IS=258.87, FID=2.10, sFID=4.59, Precision=0.80, Recall=0.61
- Text-to-image synthesis: Four leading models (Lumina-Next, SD3, SD3.5, Flux) demonstrate consistent improvements in both Aesthetic and CLIP scores:
- Example (Lumina-Next): Aesthetic 6.85→7.03, CLIP 34.09→34.37
- Compositional and qualitative benchmarks: On T2I-CompBench++ and user studies, Zero-CFG yields +0.02–0.04 gains in compositional fidelity, and 72% average user preference over standard CFG (82% on SD3.5 for detail preservation).
- Text-to-video generation (Wan-2.1): On VBench, total score moves from 83.99 (CFG) to 84.06 (Zero-CFG), with increased motion smoothness (+0.92) and spatial relationship accuracy (+1.09).
In all settings, the addition of optimized scaling and zero-init incurs negligible overhead—one dot-product per step and a trivial conditional branch.
7. Implications in Other Modalities and Limitations
CFG variants successful in image generation, including Zero-CFG, do not universally improve synthesis in other modalities such as zero-shot text-to-speech (TTS). Empirical evaluation establishes that techniques like zero-init and advanced reweighting do not yield consistent benefit for TTS, where modality- and text-embedding-specific factors dominate performance. In those domains, selective timestep-dependent CFG is preferred, and efficacy depends on the specific model and language combination (Zheng et al., 24 Sep 2025).
A plausible implication is that projection-based and initialization corrections like those in Zero-CFG are highly effective for flow-matching image and video models but require modality-specific adaptation for domains with more complex conditional structures or training dynamics.
8. Future Directions
Potential future research directions include:
- Extending optimized projection and initialization techniques to other conditional generative modalities (e.g., audio, molecular generation), with careful calibration to condition representation.
- Unifying Zero-CFG with separated-condition, timestep-adaptive CFG in multimodal settings.
- Systematic analysis of error propagation under various data and model regimes to refine initialization and scaling schedules.
The class of CFG improvements typified by Zero-CFG continues to be a key area for boosting fidelity and controllability in conditional generative modeling. Further innovation will require detailed analysis of both the guidance mechanism and the underlying model’s error profile at each generative timestep (Fan et al., 24 Mar 2025, Zheng et al., 24 Sep 2025).