- The paper presents UCGM, a novel framework that unifies training and sampling for diffusion, flow-matching, and consistency models using a tunable consistency ratio.
- The paper details techniques such as second-order difference estimation for gradient stabilization and Beta distribution-based time sampling to enhance training efficiency.
- The paper demonstrates that UCGM-S, the unified sampling algorithm, reduces the number of function evaluations (NFEs) while maintaining or improving image fidelity, outperforming the samplers these models were originally paired with.
This paper, "Unified Continuous Generative Models" (2505.07447), introduces a novel framework, UCGM, designed to unify the training, sampling, and understanding of various continuous generative models, including diffusion, flow-matching, and consistency models. The core motivation is to address the current fragmentation in the field, where these models are often treated distinctly, leading to disparate methodologies and hindering cross-pollination of research.
The UCGM framework comprises two main components: UCGM-T (Trainer) and UCGM-S (Sampler).
UCGM-T: Unified Training
UCGM-T is built upon a unified training objective parameterized by a consistency ratio λ∈[0,1]. This ratio allows a single training setup to produce models suitable for different inference regimes:
- When λ is close to 0, the model behaves like traditional multi-step diffusion or flow-matching models.
- As λ approaches 1, the model transitions towards few-step consistency-like models.
The unified objective is formulated in terms of learning targets derived from unified transport coefficients $\alpha(t), \gamma(t), \hat{\alpha}(t), \hat{\gamma}(t)$, which operate over a continuous time interval $t \in [0,1]$. The paper shows how existing paradigms like EDM, Optimal Transport Flow Matching, and Simplifying Consistency Models can be recovered as specific parameterizations within this unified framework (see Table 1).
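To make the role of these coefficients concrete, the following sketch constructs the corrupted sample $\tilde{x}_t = \alpha(t)\cdot z + \gamma(t)\cdot x$ using the linear coefficients $\alpha(t)=t$, $\gamma(t)=1-t$ typical of OT flow matching; this specific coefficient choice is an illustrative assumption, and the other paradigms in Table 1 plug in different functions.

```python
import torch

# Illustrative transport coefficients: the linear (OT flow-matching style)
# interpolant. EDM or consistency-model parameterizations would use
# different functions of t.
def alpha(t):
    return t           # weight on the noise component

def gamma(t):
    return 1.0 - t     # weight on the clean-data component

def interpolate(x, z, t):
    """Build the corrupted sample x_t = alpha(t) * z + gamma(t) * x.

    x: clean data batch, z: Gaussian noise with the same shape,
    t: per-sample times in [0, 1] with shape (batch,).
    """
    t = t.view(-1, *([1] * (x.dim() - 1)))  # broadcast t over non-batch dims
    return alpha(t) * z + gamma(t) * x
```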
Practical implementation of UCGM-T involves several key techniques:
- Stabilizing Gradient as λ→1: Training few-step models can be unstable. The paper proposes using a second-order difference estimation for the derivative term in the loss and applying numerical truncation (clipping to [−1,1]) to stabilize gradients, particularly under FP16 precision (see the first sketch after this list).
- Unified Distribution Transformation of Time: Instead of uniform sampling or complex non-linear transforms, the paper samples time t from a Beta distribution Beta(θ1,θ2). By adjusting θ1 and θ2, various time-sampling distributions used in prior work (e.g., logit-normal, uniform) can be approximated, performing importance sampling that accelerates training (see the second sketch after this list).
- Learning Enhanced Score Function: To reduce reliance on computationally expensive Classifier-Free Guidance (CFG) during inference, UCGM-T incorporates a training-time self-boosting mechanism. This involves modifying the target score function based on conditional and unconditional model outputs, guiding the training towards generating high-fidelity samples even without CFG. An efficient approximation using a pre-trained multi-step model is proposed for training few-step models.
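A minimal sketch of the gradient stabilization trick, assuming the derivative term is approximated by a central (second-order accurate) finite difference in t; the step size and the exact quantity being differentiated are illustrative assumptions rather than the paper's precise formulation:

```python
import torch

def stabilized_time_derivative(F, x_t, t, eps=1e-3):
    """Second-order (central) finite-difference estimate of dF/dt,
    followed by numerical truncation to [-1, 1].

    F: the network viewed as a function F(x, t); eps: an illustrative
    step size. The clipping mirrors the paper's truncation, which is
    especially helpful when training in FP16.
    """
    dF_dt = (F(x_t, t + eps) - F(x_t, t - eps)) / (2.0 * eps)
    return torch.clamp(dF_dt, -1.0, 1.0)
```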
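The time transformation amounts to drawing t from a Beta distribution; a minimal sketch follows, with placeholder parameter values rather than the paper's settings:

```python
import torch

def sample_times(batch_size, theta1=2.0, theta2=1.2, device="cpu"):
    """Draw training times t ~ Beta(theta1, theta2) on (0, 1).

    theta1 = theta2 = 1 recovers uniform sampling; other settings skew
    the density toward time regions that matter most for training,
    approximating schedules such as the logit-normal used in prior work.
    (The parameter values above are placeholders.)
    """
    return torch.distributions.Beta(theta1, theta2).sample((batch_size,)).to(device)
```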
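The self-boosting mechanism resembles applying guidance to the training target rather than at inference time; the sketch below shows a generic guidance-style extrapolation with enhancement ratio ζ, which is an assumption about the form and not necessarily the paper's exact combination rule:

```python
def enhanced_target(F_cond, F_uncond, zeta):
    """Guidance-style extrapolation of the training target.

    F_cond / F_uncond: model outputs with and without conditioning;
    zeta: enhancement ratio. This only illustrates the general idea of
    self-boosting; UCGM-T's exact combination rule may differ.
    """
    return F_cond + zeta * (F_cond - F_uncond)
```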
Algorithm 1 provides the detailed steps for UCGM-T, outlining the sampling of data and time, computation of enhanced targets based on the consistency ratio λ and enhancement ratio ζ, and the loss calculation using the defined transport coefficients.
UCGM-S: Unified Sampling
UCGM-S is a flexible sampling algorithm designed to work with models trained by UCGM-T and also to accelerate sampling from pre-trained models developed under distinct prior paradigms. The core iterative sampling process involves:
- Decomposition: At time t, the current sample $\tilde{x}_t$ is decomposed into estimated clean ($\hat{x}_t$) and noise ($\hat{z}_t$) components using the model output $F_{\theta}(\tilde{x}_t, t)$ and the inverse transform functions $f^{x}, f^{z}$.
- Reconstruction: The sample for the next time step t′ is reconstructed by combining the estimated components using the transport coefficients: $\tilde{x}_{t'} = \alpha(t') \cdot \hat{z}_t + \gamma(t') \cdot \hat{x}_t$.
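A minimal sketch of one such step, under the simplifying assumption that the model directly predicts the clean component and that the inverse transforms $f^{x}, f^{z}$ reduce to algebraically inverting $\tilde{x}_t = \alpha(t)\cdot\hat{z}_t + \gamma(t)\cdot\hat{x}_t$ (the paper's actual inverse transforms depend on the chosen parameterization):

```python
def ucgm_like_step(model, x_t, t, t_next, alpha, gamma):
    """One decomposition + reconstruction step (illustrative).

    Assumes model(x_t, t) returns an estimate x_hat of the clean sample;
    the noise estimate is recovered by inverting
    x_t = alpha(t) * z_hat + gamma(t) * x_hat (requires alpha(t) > 0).
    """
    x_hat = model(x_t, t)                                  # estimated clean component
    z_hat = (x_t - gamma(t) * x_hat) / alpha(t)            # estimated noise component
    return alpha(t_next) * z_hat + gamma(t_next) * x_hat   # reconstruct at t'
```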
Two enhancement techniques are introduced in UCGM-S:
- Extrapolating the Estimation: Inspired by guidance techniques, the algorithm extrapolates the estimates $\hat{x}$ and $\hat{z}$ at sampling step $i$ using the estimates from the previous step: $\hat{x} \gets \hat{x}_i + \kappa \cdot (\hat{x}_i - \hat{x}_{i-1})$ and $\hat{z} \gets \hat{z}_i + \kappa \cdot (\hat{z}_i - \hat{z}_{i-1})$, where κ is the extrapolation ratio. This self-boosting at sampling time significantly improves generation fidelity and reduces the required number of steps, without additional model evaluations (see the sketch after this list).
- Incorporating Stochasticity: A stochastic term is added during reconstruction: $\tilde{x}_{t'} = \alpha(t') \cdot (\sqrt{1-\rho} \cdot \hat{z}_t + \sqrt{\rho} \cdot z) + \gamma(t') \cdot \hat{x}_t$, where $z$ is Gaussian noise and ρ is the stochasticity ratio. This enhances sample diversity. Empirically, setting ρ=λ is found to work well.
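Building on the step sketched above, here is how both enhancements could slot into the reconstruction; the caching of previous estimates and the default values of κ and ρ are illustrative placeholders:

```python
import torch

def enhanced_reconstruct(x_hat, z_hat, prev_x_hat, prev_z_hat,
                         t_next, alpha, gamma, kappa=0.3, rho=0.0):
    """Reconstruction with estimate extrapolation and stochastic injection.

    Extrapolation reuses estimates cached from the previous sampling step,
    so it adds no extra model evaluations; rho = 0 recovers deterministic
    sampling.
    """
    if prev_x_hat is not None:  # no extrapolation on the very first step
        x_hat = x_hat + kappa * (x_hat - prev_x_hat)
        z_hat = z_hat + kappa * (z_hat - prev_z_hat)
    noise = torch.randn_like(z_hat)
    z_mix = (1.0 - rho) ** 0.5 * z_hat + rho ** 0.5 * noise
    return alpha(t_next) * z_mix + gamma(t_next) * x_hat
```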
Algorithm 2 details the UCGM-S algorithm, including initialization, the iterative decomposition and reconstruction steps, extrapolation with κ, and incorporation of stochasticity with ρ. It supports different orders of solvers (ν=1,2). The paper demonstrates that classical samplers are special cases of UCGM-S.
Experimental Results
The paper presents extensive experiments on ImageNet-1K (512×512 and 256×256) and CIFAR-10 (32×32). Key findings include:
- UCGM-S as a Plug-and-Play Accelerator: UCGM-S can be applied to models trained with prior methods (like EDM2, DDT, REPA-E) to significantly reduce NFEs while maintaining or improving FID. For instance, applying UCGM-S to a pre-trained EDM2-XXL (1.5B params, 512×512) improves FID from $1.91$ to $1.88$ while reducing NFEs from $63$ to $40$. For REPA-E-XL (675M params, 256×256), FID improves from $1.26$ to $1.06$ with NFEs reduced from $500$ to $80$.
- UCGM-T + UCGM-S Synergy (Multi-step λ=0): Training and sampling with UCGM achieves SOTA or competitive results at low NFEs. For 256×256 with E2E-VAE, a 675M model trained with UCGM-T achieved $1.21$ FID at $40$ NFEs, outperforming previous SOTA methods requiring much higher NFEs. Even at $20$ NFEs, it achieved $1.30$ FID.
- UCGM-T + UCGM-S Synergy (Few-step λ=1): In the few-step regime, UCGM achieves SOTA performance, surpassing specialized consistency models and GANs. For 512×512 with DC-AE, a 675M model reached $1.75$ FID with only $2$ NFEs, better than sCD-XXL (1.5B params, $1.88$ FID, $2$ NFEs). For 256×256 with VA-VAE, it achieved $1.42$ FID at $2$ NFEs, significantly better than IMM-XL/2 ($1.99$ FID, $16$ NFEs).
- Ablation Studies: The studies demonstrate that the consistency ratio λ controls the optimal NFE range for sampling (high λ favors low NFE). The enhanced training objective consistently improves performance. The extrapolation ratio κ in sampling is particularly beneficial for very low NFE scenarios, with mid-range values ($0.25-0.5$) offering good speed-quality trade-offs.
Practical Implications
The UCGM framework offers several significant practical benefits:
- Efficiency: Both training and inference (sampling) can be made significantly more efficient. UCGM-T includes training-time optimizations (time transformation, self-boosting), and UCGM-S provides sampling acceleration methods compatible with various models.
- Reduced Computational Cost: Generating high-fidelity images requires substantially fewer model evaluations compared to many existing methods, crucial for large-scale deployment and real-time applications.
- Unified Development: Providing a single framework and objective for different types of continuous generative models simplifies research and development, allowing advancements in one area to potentially benefit others.
- Compatibility: UCGM-S can be applied as a plug-and-play accelerator for existing pre-trained models, increasing their practical utility without requiring retraining.
- High Fidelity without CFG: The training-time self-boosting helps models generate high-quality samples directly, reducing or eliminating the need for expensive CFG during inference.
The paper's open-source code implementation at https://github.com/LINs-lab/UCGM
facilitates practical application and further research into unified generative modeling.