- The paper presents UCGM, a novel framework that unifies training and sampling for diffusion, flow-matching, and consistency models using a tunable consistency ratio.
- The paper details techniques such as second-order difference estimation for gradient stabilization and Beta distribution-based time sampling to enhance training efficiency.
- The paper demonstrates that UCGM-S, the unified sampling algorithm, reduces the number of function evaluations (NFEs) while maintaining or improving image fidelity, outperforming the samplers these models were originally paired with.
This paper, "Unified Continuous Generative Models" (2505.07447), introduces a novel framework, UCGM, designed to unify the training, sampling, and understanding of various continuous generative models, including diffusion, flow-matching, and consistency models. The core motivation is to address the current fragmentation in the field, where these models are often treated distinctly, leading to disparate methodologies and hindering cross-pollination of research.
The UCGM framework comprises two main components: UCGM-T (Trainer) and UCGM-S (Sampler).
UCGM-T: Unified Training
UCGM-T is built upon a unified training objective parameterized by a consistency ratio λ∈[0,1]. This ratio allows a single training setup to produce models suitable for different inference regimes:
- When λ is close to 0, the model behaves like traditional multi-step diffusion or flow-matching models.
- As λ approaches 1, the model transitions towards few-step consistency-like models.
The unified objective is formulated in terms of learning targets derived from unified transport coefficients $\alpha(t), \gamma(t), \hat{\alpha}(t), \hat{\gamma}(t)$, which operate over a continuous time interval $t \in [0,1]$. The paper shows how existing paradigms like EDM, Optimal Transport Flow Matching, and Simplifying Consistency Models can be recovered as specific parameterizations within this unified framework (see Table 1).
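To make the role of these coefficients concrete, the following sketch constructs the corrupted sample $\tilde{x}_t = \alpha(t)\cdot z + \gamma(t)\cdot x$ using the linear coefficients $\alpha(t)=t$, $\gamma(t)=1-t$ typical of OT flow matching; this specific coefficient choice is an illustrative assumption, and the other paradigms in Table 1 plug in different functions.

```python
import torch

# Illustrative transport coefficients: the linear (OT flow-matching style)
# interpolant. EDM or consistency-model parameterizations would use
# different functions of t.
def alpha(t):
    return t           # weight on the noise component

def gamma(t):
    return 1.0 - t     # weight on the clean-data component

def interpolate(x, z, t):
    """Build the corrupted sample x_t = alpha(t) * z + gamma(t) * x.

    x: clean data batch, z: Gaussian noise with the same shape,
    t: per-sample times in [0, 1] with shape (batch,).
    """
    t = t.view(-1, *([1] * (x.dim() - 1)))  # broadcast t over non-batch dims
    return alpha(t) * z + gamma(t) * x
```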
Practical implementation of UCGM-T involves several key techniques:
- Stabilizing Gradient as λ→1: Training few-step models can be unstable. The paper proposes using a second-order difference estimation for the derivative term in the loss and applying numerical truncation (clipping to [−1,1]) to stabilize gradients, particularly under FP16 precision (see the first sketch after this list).
- Unified Distribution Transformation of Time: Instead of uniform sampling or complex non-linear transforms, the paper samples time t from a Beta distribution Beta(θ1,θ2). By adjusting θ1 and θ2, various time-sampling distributions used in prior work (e.g., logit-normal, uniform) can be approximated, performing importance sampling that accelerates training (see the second sketch after this list).
- Learning Enhanced Score Function: To reduce reliance on computationally expensive Classifier-Free Guidance (CFG) during inference, UCGM-T incorporates a training-time self-boosting mechanism. This involves modifying the target score function based on conditional and unconditional model outputs, guiding the training towards generating high-fidelity samples even without CFG. An efficient approximation using a pre-trained multi-step model is proposed for training few-step models.
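A minimal sketch of the gradient stabilization trick, assuming the derivative term is approximated by a central (second-order accurate) finite difference in t; the step size and the exact quantity being differentiated are illustrative assumptions rather than the paper's precise formulation:

```python
import torch

def stabilized_time_derivative(F, x_t, t, eps=1e-3):
    """Second-order (central) finite-difference estimate of dF/dt,
    followed by numerical truncation to [-1, 1].

    F: the network viewed as a function F(x, t); eps: an illustrative
    step size. The clipping mirrors the paper's truncation, which is
    especially helpful when training in FP16.
    """
    dF_dt = (F(x_t, t + eps) - F(x_t, t - eps)) / (2.0 * eps)
    return torch.clamp(dF_dt, -1.0, 1.0)
```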
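The time transformation amounts to drawing t from a Beta distribution; a minimal sketch follows, with placeholder parameter values rather than the paper's settings:

```python
import torch

def sample_times(batch_size, theta1=2.0, theta2=1.2, device="cpu"):
    """Draw training times t ~ Beta(theta1, theta2) on (0, 1).

    theta1 = theta2 = 1 recovers uniform sampling; other settings skew
    the density toward time regions that matter most for training,
    approximating schedules such as the logit-normal used in prior work.
    (The parameter values above are placeholders.)
    """
    return torch.distributions.Beta(theta1, theta2).sample((batch_size,)).to(device)
```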
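The self-boosting mechanism resembles applying guidance to the training target rather than at inference time; the sketch below shows a generic guidance-style extrapolation with enhancement ratio ζ, which is an assumption about the form and not necessarily the paper's exact combination rule:

```python
def enhanced_target(F_cond, F_uncond, zeta):
    """Guidance-style extrapolation of the training target.

    F_cond / F_uncond: model outputs with and without conditioning;
    zeta: enhancement ratio. This only illustrates the general idea of
    self-boosting; UCGM-T's exact combination rule may differ.
    """
    return F_cond + zeta * (F_cond - F_uncond)
```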
Algorithm 1 provides the detailed steps for UCGM-T, outlining the sampling of data and time, computation of enhanced targets based on the consistency ratio λ and enhancement ratio ζ, and the loss calculation using the defined transport coefficients.
UCGM-S: Unified Sampling
UCGM-S is a flexible sampling algorithm designed to work with models trained by UCGM-T and also to accelerate sampling from pre-trained models developed under distinct prior paradigms. The core iterative sampling process involves:
- Decomposition: At time t, the current sample $\tilde{x}_t$ is decomposed into estimated clean ($\hat{x}_t$) and noise ($\hat{z}_t$) components using the model output $F_{\theta}(\tilde{x}_t, t)$ and the inverse transform functions $f^{x}, f^{z}$.
- Reconstruction: The sample for the next time step t′ is reconstructed by combining the estimated components using the transport coefficients: $\tilde{x}_{t'} = \alpha(t') \cdot \hat{z}_t + \gamma(t') \cdot \hat{x}_t$.
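A minimal sketch of one such step, under the simplifying assumption that the model directly predicts the clean component and that the inverse transforms $f^{x}, f^{z}$ reduce to algebraically inverting $\tilde{x}_t = \alpha(t)\cdot\hat{z}_t + \gamma(t)\cdot\hat{x}_t$ (the paper's actual inverse transforms depend on the chosen parameterization):

```python
def ucgm_like_step(model, x_t, t, t_next, alpha, gamma):
    """One decomposition + reconstruction step (illustrative).

    Assumes model(x_t, t) returns an estimate x_hat of the clean sample;
    the noise estimate is recovered by inverting
    x_t = alpha(t) * z_hat + gamma(t) * x_hat (requires alpha(t) > 0).
    """
    x_hat = model(x_t, t)                                  # estimated clean component
    z_hat = (x_t - gamma(t) * x_hat) / alpha(t)            # estimated noise component
    return alpha(t_next) * z_hat + gamma(t_next) * x_hat   # reconstruct at t'
```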
Two enhancement techniques are introduced in UCGM-S:
- Extrapolating the Estimation: Inspired by guidance techniques, the algorithm extrapolates the estimates $\hat{x}$ and $\hat{z}$ at sampling step $i$ using the estimates from the previous step: $\hat{x} \gets \hat{x}_i + \kappa \cdot (\hat{x}_i - \hat{x}_{i-1})$ and $\hat{z} \gets \hat{z}_i + \kappa \cdot (\hat{z}_i - \hat{z}_{i-1})$, where κ is the extrapolation ratio. This self-boosting at sampling time significantly improves generation fidelity and reduces the required number of steps, without additional model evaluations (see the sketch after this list).
- Incorporating Stochasticity: A stochastic term is added during reconstruction: $\tilde{x}_{t'} = \alpha(t') \cdot (\sqrt{1-\rho} \cdot \hat{z}_t + \sqrt{\rho} \cdot z) + \gamma(t') \cdot \hat{x}_t$, where $z$ is Gaussian noise and ρ is the stochasticity ratio. This enhances sample diversity. Empirically, setting ρ=λ is found to work well.
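Building on the step sketched above, here is how both enhancements could slot into the reconstruction; the caching of previous estimates and the default values of κ and ρ are illustrative placeholders:

```python
import torch

def enhanced_reconstruct(x_hat, z_hat, prev_x_hat, prev_z_hat,
                         t_next, alpha, gamma, kappa=0.3, rho=0.0):
    """Reconstruction with estimate extrapolation and stochastic injection.

    Extrapolation reuses estimates cached from the previous sampling step,
    so it adds no extra model evaluations; rho = 0 recovers deterministic
    sampling.
    """
    if prev_x_hat is not None:  # no extrapolation on the very first step
        x_hat = x_hat + kappa * (x_hat - prev_x_hat)
        z_hat = z_hat + kappa * (z_hat - prev_z_hat)
    noise = torch.randn_like(z_hat)
    z_mix = (1.0 - rho) ** 0.5 * z_hat + rho ** 0.5 * noise
    return alpha(t_next) * z_mix + gamma(t_next) * x_hat
```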
Algorithm 2 details the UCGM-S algorithm, including initialization, the iterative decomposition and reconstruction steps, extrapolation with κ, and incorporation of stochasticity with ρ. It supports different orders of solvers (ν=1,2). The paper demonstrates that classical samplers are special cases of UCGM-S.
Experimental Results
The paper presents extensive experiments on ImageNet-1K (512×512 and 256×256) and CIFAR-10 (32×32). Key findings include:
- UCGM-S as a Plug-and-Play Accelerator: UCGM-S can be applied to models trained with prior methods (like EDM2, DDT, REPA-E) to significantly reduce NFEs while maintaining or improving FID. For instance, applying UCGM-S to a pre-trained EDM2-XXL (1.5B params, 512×512) improves FID from $1.91$ to $1.88$ while reducing NFEs from $63$ to $40$. For REPA-E-XL (675M params, 256×256), FID improves from $1.26$ to $1.06$ with NFEs reduced from $500$ to $80$.
- UCGM-T + UCGM-S Synergy (Multi-step λ=0): Training and sampling with UCGM achieves SOTA or competitive results at low NFEs. For 256×256 with E2E-VAE, a 675M model trained with UCGM-T achieved $1.21$ FID at $40$ NFEs, outperforming previous SOTA methods requiring much higher NFEs. Even at $20$ NFEs, it achieved $1.30$ FID.
- UCGM-T + UCGM-S Synergy (Few-step λ=1): In the few-step regime, UCGM achieves SOTA performance, surpassing specialized consistency models and GANs. For 512×512 with DC-AE, a 675M model reached $1.75$ FID with only $2$ NFEs, better than sCD-XXL (1.5B params, $1.88$ FID, $2$ NFEs). For 256×256 with VA-VAE, it achieved $1.42$ FID at $2$ NFEs, significantly better than IMM-XL/2 ($1.99$ FID, $16$ NFEs).
- Ablation Studies: The studies demonstrate that the consistency ratio λ controls the optimal NFE range for sampling (high λ favors low NFE). The enhanced training objective consistently improves performance. The extrapolation ratio κ in sampling is particularly beneficial for very low NFE scenarios, with mid-range values ($0.25-0.5$) offering good speed-quality trade-offs.
Practical Implications
The UCGM framework offers several significant practical benefits:
- Efficiency: Both training and inference (sampling) can be made significantly more efficient. UCGM-T includes training-time optimizations (time transformation, self-boosting), and UCGM-S provides sampling acceleration methods compatible with various models.
- Reduced Computational Cost: Generating high-fidelity images requires substantially fewer model evaluations compared to many existing methods, crucial for large-scale deployment and real-time applications.
- Unified Development: Providing a single framework and objective for different types of continuous generative models simplifies research and development, allowing advancements in one area to potentially benefit others.
- Compatibility: UCGM-S can be applied as a plug-and-play accelerator for existing pre-trained models, increasing their practical utility without requiring retraining.
- High Fidelity without CFG: The training-time self-boosting helps models generate high-quality samples directly, reducing or eliminating the need for expensive CFG during inference.
The paper's open-source code implementation at https://github.com/LINs-lab/UCGM
facilitates practical application and further research into unified generative modeling.