Unified Consistency Framework for Unified Models

Updated 5 September 2025
  • The paper establishes that enforcing denoiser consistency via the reverse martingale property is both necessary and sufficient for satisfying the score Fokker–Planck equation.
  • It unifies distinct approaches—stochastic diffusion, ODE invariance, and FPE penalties—using a key parameter λ to seamlessly transition between regimes.
  • Practical implications include accelerated sampling, reduced inference cost, and improved evaluation of multimodal and cyclic applications.

The Unified Consistency Framework for Unified Models (UCF-UM) establishes a comprehensive theoretical and empirical foundation for modeling and evaluating generative systems that unify multiple tasks—such as image-to-text (I2T) and text-to-image (T2I)—within a single architecture. By integrating consistency principles across model design, training, and evaluation, UCF-UM provides the mathematical and methodological tools to ensure alignment, stability, and interpretability of unified models across diverse modalities and repeated cross-modal cycles.

1. Theoretical Equivalence of Consistency Notions

UCF-UM is grounded in the rigorous equivalence of several independently developed “consistency” conditions for diffusion-type generative models (Lai et al., 2023). The key notions unified are:

  • Consistency in the reverse stochastic process: For a denoiser $h(x,t)$ aligned with the reverse SDE, consistency is defined via the reverse martingale property, i.e., $h(x,t) = \mathbb{E}_p[x(t_0) \mid X_t = x]$.
  • Consistency along a deterministic trajectory: For an ODE-denoiser along the probability flow ODE, the model requires a functional $f(x(t), t)$ to be invariant along ODE solutions.
  • Score Fokker–Planck equation (score FPE) regularization: The induced score $s(x,t) = (h(x,t)-x)/\sigma^2(t)$ should satisfy $\partial_t s = \frac{1}{2}g^2(t)\left[\Delta_x s + \nabla_x \|s\|_2^2\right]$ (a one-dimensional finite-difference sketch of this residual follows the list).
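
To make the last notion concrete, here is a minimal one-dimensional sketch (assuming user-supplied callables `h`, `sigma`, and `g`; an illustration, not the FP-Diffusion training code) of the induced score and a finite-difference check of the score-FPE residual:

```python
def score_from_denoiser(h, sigma, x, t):
    """Induced score s(x, t) = (h(x, t) - x) / sigma(t)^2 for a 1-D denoiser h."""
    return (h(x, t) - x) / sigma(t) ** 2

def score_fpe_residual(s, g, x, t, eps=1e-4):
    """Finite-difference residual of the 1-D score FPE:
    d_t s - (1/2) g(t)^2 * (d_xx s + d_x(s^2)); vanishes when s satisfies the PDE."""
    ds_dt = (s(x, t + eps) - s(x, t - eps)) / (2 * eps)
    ds_dx = (s(x + eps, t) - s(x - eps, t)) / (2 * eps)
    d2s_dx2 = (s(x + eps, t) - 2.0 * s(x, t) + s(x - eps, t)) / eps ** 2
    grad_s_sq = 2.0 * s(x, t) * ds_dx          # d/dx of s(x, t)^2
    return ds_dt - 0.5 * g(t) ** 2 * (d2s_dx2 + grad_s_sq)

# Sanity check: with sigma(t) = t and data concentrated at 0, s(x, t) = -x / t^2
# and g(t)^2 = d(sigma^2)/dt = 2t, so the residual is ~0 up to discretization error.
res = score_fpe_residual(lambda x, t: -x / t ** 2, lambda t: (2 * t) ** 0.5, 0.3, 1.0)
```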

A central parameter $\lambda$ interpolates between SDE and ODE regimes:

$$dx(t) = -\frac{1+\lambda}{2}\,g^2(t)\,s(x,t)\,dt + \lambda\, g(t)\, d\tilde{w}_t$$

Setting $\lambda=1$ recovers the reverse SDE, while $\lambda=0$ yields the probability flow ODE limit. Under mild regularity, Theorem 1 of (Lai et al., 2023) states that consistency for the SDE-denoiser and the ODE-denoiser is equivalent when $\lambda = 0$, and Theorem 2 proves that enforcing denoiser consistency (via the martingale property) is both necessary and sufficient for the corresponding score to satisfy the score FPE.
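
To make the role of $\lambda$ concrete, a minimal Euler–Maruyama sketch of the interpolated reverse dynamics might look as follows; `score_fn` and `g` are assumed user-supplied callables, and sampling integrates in reverse time, so `dt` is negative:

```python
import numpy as np

def interpolated_reverse_step(x, t, dt, score_fn, g, lam, rng):
    """One Euler-Maruyama step of dx = -((1+lam)/2) g(t)^2 s(x,t) dt + lam g(t) dW.
    lam = 1 recovers the reverse SDE; lam = 0 gives the probability-flow ODE.
    `dt` is negative because sampling integrates from t = T down toward t = 0."""
    drift = -0.5 * (1.0 + lam) * g(t) ** 2 * score_fn(x, t)
    noise = lam * g(t) * np.sqrt(abs(dt)) * rng.standard_normal(x.shape)
    return x + drift * dt + noise
```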

This theoretical unification clarifies that various empirical regularizations—whether via mean-squared error across reverse SDE paths, enforcing constant-of-motion constraints along ODEs, or FPE penalties—are manifestations of a single mathematical property.

2. Consistency Models and Methodologies

UCF-UM synthesizes and relates prominent models that enforce these different consistency types:

| Model | Consistency Notion | Primary Regularization Objective |
|---|---|---|
| Consistent Diffusion Model (CDM) | SDE-denoiser (stochastic) | Reverse martingale loss: the denoiser must predict the clean sample consistently across time |
| Consistency Model (CM) | ODE-denoiser (deterministic) | Enforce invariance along the ODE trajectory; enables one-step/few-step sampling |
| FP-Diffusion | Score FPE | Penalize residuals of the score Fokker–Planck PDE in the training objective |

In all three approaches, the denoiser or the score is regularized so as to predict the “clean” data distribution for all time steps, whether through probabilistic integration, ODE invariants, or PDE satisfaction.
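
As an illustration of the Consistency Model row, a minimal sketch of the ODE-invariance objective is shown below; `x_t` and `x_s` are assumed to lie on the same probability-flow ODE trajectory, and using an EMA copy as the stop-gradient target is a common choice rather than a requirement stated here:

```python
import torch
import torch.nn.functional as F

def ode_consistency_loss(f_theta, f_ema, x_t, t, x_s, s):
    """Penalize disagreement between clean-sample predictions at two points
    (x_t, t) and (x_s, s) lying on the same probability-flow ODE trajectory."""
    with torch.no_grad():
        target = f_ema(x_s, s)   # stop-gradient target from a slowly updated copy
    return F.mse_loss(f_theta(x_t, t), target)
```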

Models such as UniCMs (Xu et al., 8 Feb 2025) further generalize this to the multimodal case. Discrete denoising for images (via masked token prediction) and for text (via Jacobi parallelization of autoregressive decoding) are both unified as denoising trajectories. Consistency distillation then aligns the student’s prediction at any stage in the trajectory with the ground-truth output generated by the teacher at the “clean” endpoint.
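
A minimal sketch of the discrete, token-level counterpart of this distillation follows; the helper names and masking convention are illustrative, and the actual UniCMs objective may weight or schedule positions differently:

```python
import torch
import torch.nn.functional as F

def discrete_consistency_distillation(student_logits, teacher_tokens, noised_mask):
    """Push the student's token distribution at an intermediate point of a
    masked-token denoising trajectory toward the teacher's clean-endpoint tokens.
    student_logits: [B, L, V]; teacher_tokens: [B, L]; noised_mask: [B, L] bool."""
    loss = F.cross_entropy(
        student_logits.flatten(0, 1),   # [B*L, V]
        teacher_tokens.flatten(),        # [B*L]
        reduction="none",
    )
    mask = noised_mask.flatten().float()
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```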

3. Unified Framework: Training and Sampling

The unified framework for training and sampling is realized by parameterizing the training objective with a “consistency ratio” parameter $\lambda \in [0,1]$, as proposed in Unified Continuous Generative Models (UCGM) (Sun et al., 12 May 2025):

$$\mathcal{L}(\theta) = \mathbb{E}_{(\mathbf{z},\mathbf{x})\sim p(\mathbf{z},\mathbf{x}),\, t}\left[\frac{1}{\hat{\omega}(t)}\left\|\mathbf{f}^{\mathbf{x}}(\mathbf{F}_\theta(\mathbf{x}_t,t), \mathbf{x}_t, t) - \mathbf{f}^{\mathbf{x}}(\mathbf{F}_\theta(\mathbf{x}_{\lambda t},\lambda t), \mathbf{x}_{\lambda t}, \lambda t)\right\|_2^2\right]$$
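
A minimal PyTorch-style sketch of this objective is given below; the `make_noisy`, `f_x`, and `w_hat` helpers are placeholders standing in for UCGM's transport-coefficient machinery, and placing a stop-gradient on the $\lambda t$ branch is an assumption here rather than a quoted detail:

```python
import torch
import torch.nn.functional as F

def unified_consistency_loss(F_theta, f_x, make_noisy, w_hat, x, z, t, lam):
    """Compare data-space predictions at times t and lam * t for the same (x, z) pair;
    lam near 0 recovers a multi-step diffusion / flow-matching style target,
    lam near 1 a few-step consistency-style target."""
    x_t = make_noisy(x, z, t)
    x_lt = make_noisy(x, z, lam * t)
    pred_t = f_x(F_theta(x_t, t), x_t, t)
    with torch.no_grad():                    # treat the lam * t branch as the target
        pred_lt = f_x(F_theta(x_lt, lam * t), x_lt, lam * t)
    return F.mse_loss(pred_t, pred_lt) / w_hat(t)
```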

By dialing $\lambda$ from $0$ (multi-step diffusion/flow matching) to $1$ (few-step consistency), the same model can be trained and interpreted within either regime, dissolving the divide between the paradigms. The framework adopts a unified transport-coefficient formalism to handle various noise schedules and generation objectives, and incorporates a self-boosting mechanism to enhance stability near the consistency (few-step) limit:

$$\mathbf{x}^\star \gets \mathbf{x} + \zeta\Bigl(\mathbf{f}^{\mathbf{x}}(\mathbf{F}_\theta(\mathbf{x}_t, t), \mathbf{x}_t, t)-\mathbf{f}^{\mathbf{x}}(\mathbf{F}_\theta^{\varnothing}(\mathbf{x}_t, t), \mathbf{x}_t, t)\Bigr)$$

This allows the model to achieve high sample quality with substantially fewer evaluations.
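
Read literally, the self-boosting update can be sketched as follows; the second forward pass `F_ref` stands in for $\mathbf{F}_\theta^{\varnothing}$, whose exact instantiation (e.g., an unconditional or frozen copy of the network) is not specified in this summary and is therefore an assumption:

```python
def self_boost_target(x, x_t, t, F_theta, F_ref, f_x, zeta):
    """x* <- x + zeta * (f_x(F_theta(x_t, t), x_t, t) - f_x(F_ref(x_t, t), x_t, t)):
    nudge x by the gap between the two data-space predictions, scaled by zeta."""
    return x + zeta * (f_x(F_theta(x_t, t), x_t, t) - f_x(F_ref(x_t, t), x_t, t))
```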

On the sampling side, a unified iterative scheme decomposes and reconstructs the sample at each time using the estimated clean and noise components, further stabilized by extrapolation and stochasticity.
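
One step of such a decompose-and-reconstruct sampler might be sketched as follows; the `to_clean`/`to_noise` split and the `alpha`/`sigma` schedule functions are illustrative assumptions rather than the exact UCGM update:

```python
def decompose_reconstruct_step(x_t, t, t_next, model, to_clean, to_noise, alpha, sigma):
    """Estimate the clean and noise components at time t, then recombine them
    at t_next under the noise schedule: x_next = alpha(t_next) * x_hat + sigma(t_next) * z_hat."""
    out = model(x_t, t)
    x_hat = to_clean(out, x_t, t)    # estimated clean component
    z_hat = to_noise(out, x_t, t)    # estimated noise component
    return alpha(t_next) * x_hat + sigma(t_next) * z_hat
```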

4. Multimodal and Cyclic Applications

UCF-UM extends naturally to unified multimodal models and cyclic evaluation settings. In UniCMs, both text and image are cast as denoising trajectories over discrete tokens, with consistency distillation enforcing alignment at every stage (Xu et al., 8 Feb 2025).

To assess the preservation of shared semantics over repeated cross-modal conversions, UCF-UM specifies a cyclic “Telephone Game” protocol (Mollah et al., 4 Sep 2025). Alternating between I2T and T2I, models are evaluated for semantic drift by tracking embedding similarities and object-level task compliance over multiple generations.

The cyclic protocol introduces these primary metrics:

| Metric | Description |
|---|---|
| MCD | Mean Cumulative Drift: aggregated embedding similarity across cycles |
| SDR | Semantic Drift Rate: power-law decay parameter for the similarity loss |
| MGG | Multi-Generation GenEval: average per-generation object-level GenEval score |

Explicitly, mean cumulative drift is defined as

$$MCD_{\delta} = \frac{1}{G} \sum_{g=1}^G S_{\delta}(g), \qquad S_{\delta}(g) = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \mathrm{sim}\bigl(\mathrm{inp}_d, M^{(g)}_{d,\delta}\bigr)$$

where the evaluation is run for $G$ generations and over all dataset items in $\mathcal{D}$.
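
A minimal sketch of the cyclic protocol and the MCD computation; the `i2t`, `t2i`, and `sim` callables stand in for the model under test and the chosen embedding similarity:

```python
import numpy as np

def telephone_game(image, i2t, t2i, sim, G):
    """Alternate I2T and T2I for G generations, recording the similarity of each
    regenerated image to the original input."""
    sims, current = [], image
    for _ in range(G):
        caption = i2t(current)            # image -> text
        current = t2i(caption)            # text -> image
        sims.append(sim(image, current))  # drift of generation g from the input
    return sims

def mean_cumulative_drift(sims_per_item):
    """MCD from a [G, |D|] array: average over dataset items to get S(g) per
    generation, then average over the G generations."""
    return np.asarray(sims_per_item).mean(axis=1).mean()
```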

This setting exposes semantic mismatches between a model’s generative and analytic components not observed in single-pass benchmarks, revealing—for example—that high single-step GenEval or FID scores do not guarantee stability under cyclic modality transitions.

5. Evaluation Protocols and Benchmarks

UCF-UM motivates new evaluation methodologies beyond single-stage benchmarks. UniEval (Li et al., 15 May 2025) offers a unified protocol encompassing both generation and understanding, structured via the UniBench dataset (1,234 prompts, 4,231 QA pairs, 13 top-level and 81 fine-grained tags) and the UniScore metric:

$$s = \frac{1}{n} \sum_{i=1}^n s^1_i, \qquad s^1_i = \frac{1}{m} \sum_{j=1}^m s^2_j, \qquad s^2_j = \frac{1}{k} \sum_{l=1}^k o_l$$

Here, $o_l$ records per-option binary performance, which is recursively aggregated over tags and test cases, ensuring that self-consistency between a model’s generation and understanding is directly measured.
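
A minimal pure-Python sketch of this nested aggregation; how the two outer levels map onto tags versus test cases follows UniEval's grouping and is assumed here:

```python
def uniscore(cases):
    """cases[i][j] is the list of per-option binary outcomes o_l for the j-th
    sub-group of the i-th group; averages are taken options -> s^2 -> s^1 -> s."""
    def mean(xs):
        return sum(xs) / len(xs)
    return mean([mean([mean(options) for options in groups]) for groups in cases])
```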

Cyclic consistency evaluation, as in the “Telephone Game” protocol, is paired with new benchmarks such as ND400 (combining out-of-distribution and fine-grained cases), exposing substantial variation in cross-modal semantic retention even for models with comparable performance on static benchmarks (Mollah et al., 4 Sep 2025).

6. Practical Implications and Future Directions

The Unified Consistency Framework elucidates the shared underpinnings of accelerated sampling, high sample quality, and improved likelihood estimation in unified generative models. Key practical implications include:

  • Training efficiency and flexibility: The same unified loss function and model architecture support a spectrum from slow, highly accurate multi-step sampling to ultra-fast few-step or even one-step generation.
  • Reduced inference cost: Techniques such as trajectory segmentation and curriculum learning enable shorter sampling paths without sacrificing output fidelity.
  • Broader model generalizability: By evaluating and regularizing for cyclic consistency, UCF-UM supports deployment where cross-modal round-trip consistency is critical, such as in iterative editing and robust content understanding.

Recommended future research directions include:

  • Developing seamless hybrid regularizers that blend denoising score matching (DSM), martingale, and FPE penalties for robust, unified optimization.
  • Extending consistency unification to new generative classes beyond SDEs/ODEs, such as Schrödinger bridge models or latent ODEs.
  • Investigating optimization landscapes and stability conditions attendant to unified regularization in high-dimensional models.
  • Expanding evaluation protocols to further quantify robustness and cross-domain generalizability.

7. Open Source Resources and Community Adoption

Implementation code for evaluation, model training, and benchmarking following UCF-UM principles is released at:

These resources enable replication and further extension, allowing researchers to apply UCF-UM aligned methodologies across a broad spectrum of generative modeling and multimodal tasks.


UCF-UM systematically unifies disparate approaches to consistency in generative modeling, situating them within a common mathematical and methodological framework that informs both the construction of state-of-the-art models and the development of comprehensive, reliability-focused evaluation protocols. This integration advances the theory and practice of unified generative modeling across modalities and use cases, providing a robust basis for both empirical performance and statistical interpretability (Lai et al., 2023, Xu et al., 8 Feb 2025, Sun et al., 12 May 2025, Li et al., 15 May 2025, Mollah et al., 4 Sep 2025).