
Score Distillation of Flow Matching Models (2509.25127v1)

Published 29 Sep 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. We will make the PyTorch implementation publicly available.

Summary

  • The paper introduces a unified framework that demonstrates the equivalence between Gaussian diffusion and flow matching objectives using score distillation techniques.
  • It extends Score identity Distillation (SiD) to flow-matching models, achieving efficient, high-quality text-to-image synthesis in just four sampling steps.
  • Empirical evaluations reveal competitive FID, CLIP, and GenEval scores across models ranging from 0.6B to 12B parameters, showcasing robust scalability.

Score Distillation of Flow Matching Models: A Unified Framework for Fast Text-to-Image Generation

Introduction and Motivation

Diffusion models have established state-of-the-art performance in image synthesis, but their slow iterative sampling remains a significant bottleneck for practical deployment. Recent advances in distillation have enabled one- or few-step generation, dramatically accelerating inference. Flow matching, originally proposed as a distinct generative modeling paradigm, has been shown to be theoretically equivalent to diffusion under Gaussian assumptions. This equivalence raises a critical question: can the highly effective score distillation techniques developed for diffusion models be directly applied to flow-matching models, particularly for text-to-image (T2I) generation with large-scale architectures such as DiT, SANA, SD3, and FLUX?

This work provides a rigorous theoretical and empirical investigation of this question. The authors present a unified derivation—eschewing SDE/ODE formalism in favor of Bayes’ rule and conditional expectations—that demonstrates the equivalence of Gaussian diffusion and flow matching objectives, up to differences in loss weighting and timestep scheduling. Building on this, they extend Score identity Distillation (SiD) to a broad class of pretrained flow-matching T2I models, showing that with minimal adaptation, SiD can distill these models into efficient four-step generators in both data-free and data-aided settings.

Theoretical Unification of Diffusion and Flow Matching

The core theoretical contribution is a derivation that unifies the objectives of Gaussian diffusion and flow matching models. All such models corrupt data via a linear combination of the clean image and Gaussian noise:

$$x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where the signal-to-noise ratio (SNR), $\text{SNR}_t = \alpha_t^2 / \sigma_t^2$, decreases monotonically with $t$.
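As a concrete illustration, the corruption process can be sketched in a few lines of NumPy. The rectified-flow schedule $\alpha_t = 1 - t$, $\sigma_t = t$ used here is one common instantiation, not the only one the framework covers:

```python
import numpy as np

# Rectified-flow schedule: alpha_t = 1 - t, sigma_t = t,
# so SNR_t = (1 - t)^2 / t^2 decreases monotonically on (0, 1).
alpha = lambda t: 1.0 - t
sigma = lambda t: t

def corrupt(x0, t, rng=None):
    """Return x_t = alpha_t * x0 + sigma_t * eps, with eps ~ N(0, I)."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    return alpha(t) * x0 + sigma(t) * eps, eps

snr = lambda t: alpha(t) ** 2 / sigma(t) ** 2
```

At $t \to 0$ the sample is almost clean (large SNR); at $t \to 1$ it is almost pure noise.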

The authors show that the optimal solutions for $x_0$-prediction, $\epsilon$-prediction, $v$-prediction, and velocity prediction in rectified flow are all linear transformations of the conditional mean $\mathbb{E}[x_0 \mid x_t]$. The only substantive difference between these formulations is the weighting of timesteps in the training loss, which determines the effective influence of each $t$ on the learned parameters.
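To make the claim concrete, the standard identities below express two of these predictors through $\mathbb{E}[x_0 \mid x_t]$; this is a sketch under the Gaussian corruption above, and the velocity line assumes the rectified-flow schedule $\alpha_t = 1 - t$, $\sigma_t = t$:

```latex
% epsilon-prediction: invert x_t = alpha_t x_0 + sigma_t eps in expectation
\mathbb{E}[\epsilon \mid x_t]
  = \frac{x_t - \alpha_t\,\mathbb{E}[x_0 \mid x_t]}{\sigma_t}

% rectified-flow velocity v = eps - x_0, with alpha_t = 1 - t, sigma_t = t:
\mathbb{E}[v \mid x_t]
  = \mathbb{E}[\epsilon \mid x_t] - \mathbb{E}[x_0 \mid x_t]
  = \frac{x_t - \mathbb{E}[x_0 \mid x_t]}{t}
```

Each is an affine function of $\mathbb{E}[x_0 \mid x_t]$ with coefficients depending only on $t$, which is why the parameterizations differ only through effective timestep weighting.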

This is formalized by expressing the overall loss as:

$$L_\phi = \mathbb{E}_{t \sim p(t)}\, \mathbb{E}_{x_t \sim p(x_t)}\!\left[ w_t \cdot \frac{\alpha_t^2}{\sigma_t^2}\, L_\phi(x_t) \right]$$

where $w_t$ is a weighting factor and $p(t)$ is the timestep distribution. The product $w_t\, p(t)$ defines the effective weight-normalized distribution $\pi(t)$ over timesteps. The practical implication is that, for Gaussian-based models, the choice of $w_t$ and $p(t)$, not the underlying generative process, determines empirical performance differences.

Figure 1: Density plots of various noise schedules mapped to $t \in (0, 1)$ by aligning their SNR, illustrating the effect of different weighting schemes on the effective timestep distribution.
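The effective distribution $\pi(t) \propto w_t\, p(t)$ can be computed numerically; the grid-based normalization below is an illustrative assumption, not part of the paper:

```python
import numpy as np

def effective_distribution(w, p, ts):
    """Normalize w(t) * p(t) on a uniform grid so it integrates to 1,
    giving the effective weight-normalized timestep distribution pi(t)."""
    vals = w(ts) * p(ts)
    dt = ts[1] - ts[0]
    return vals / (vals.sum() * dt)

# Example: uniform p(t) with weighting w_t = 1 - t emphasizes low t.
ts = np.linspace(0.01, 0.99, 99)
pi = effective_distribution(lambda t: 1.0 - t,
                            lambda t: np.ones_like(t),
                            ts)
```

Different `(w, p)` pairs that share the same product yield the same $\pi(t)$, and hence the same effective training objective.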

Score Distillation for Flow-Matching Models

The SiD framework, previously validated for diffusion models, is extended to flow-matching models with DiT backbones. The key insight is that the teacher's $x_0$-prediction can be recovered from the velocity prediction via:

$$f_{\phi}(x_t, t, c) = x_t - t\, v_{\phi}^{\text{FM}}(x_t, t, c)$$

where $c$ is the text condition. Classifier-free guidance (CFG) is incorporated by linearly combining conditional and unconditional predictions, with a default scale of 4.5.
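A minimal sketch of these two steps, assuming the rectified-flow schedule $\alpha_t = 1 - t$, $\sigma_t = t$; the function names are illustrative, not from the paper's codebase:

```python
import numpy as np

def x0_from_velocity(x_t, t, v):
    """Teacher x0-prediction from a flow-matching velocity:
    f_phi(x_t, t, c) = x_t - t * v_phi(x_t, t, c)."""
    return x_t - t * v

def cfg(pred_cond, pred_uncond, scale=4.5):
    """Classifier-free guidance: linear combination of conditional
    and unconditional predictions (default scale 4.5)."""
    return pred_uncond + scale * (pred_cond - pred_uncond)
```

With an exact velocity $v = \epsilon - x_0$ on a sample $x_t = (1-t)x_0 + t\epsilon$, `x0_from_velocity` recovers $x_0$ exactly.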

The distilled generator is trained using Fisher divergence minimization in a data-free setting, alternating updates between the generator and a “fake” flow-matching network initialized from the teacher. The generator loss is:

$$L_{\theta}(x_t^{(k)}) = w_t \left(f_{\phi}(x_t^{(k)}, t_k, c) - f_{\psi}(x_t^{(k)}, t_k, c)\right)^{\top} \left(f_{\psi}(x_t^{(k)}, t_k, c) - x_g^{(k)}\right)$$

where $w_t = 1 - t$ by default.
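The generator loss can be rendered as a simplified NumPy sketch; in actual training, both $f_\phi$ and $f_\psi$ are evaluated on noised generator samples, and updates alternate with the fake network:

```python
import numpy as np

def sid_generator_loss(f_teacher, f_fake, x_g, t):
    """Data-free SiD generator loss:
    w_t * <f_phi - f_psi, f_psi - x_g>, with w_t = 1 - t by default.

    f_teacher, f_fake: (batch, ...) x0-predictions of teacher / fake net.
    x_g: (batch, ...) generator outputs. t: scalar timestep in (0, 1).
    """
    w = 1.0 - t
    a = (f_teacher - f_fake).reshape(len(f_teacher), -1)
    b = (f_fake - x_g).reshape(len(x_g), -1)
    # Per-sample inner product, weighted and averaged over the batch.
    return float(np.mean(w * np.sum(a * b, axis=-1)))
```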

When additional data are available, adversarial learning is incorporated via DiffusionGAN, using spatial pooling of DiT features for the discriminator, without introducing extra parameters.

Empirical Evaluation

The SiD-DiT framework is evaluated on a diverse set of flow-matching T2I models: SANA (rectified flow and TrigFlow), SD3-Medium, SD3.5-Medium, SD3.5-Large, and FLUX.1-dev, spanning 0.6B to 12B parameters. The experiments demonstrate:

  • Data-free distillation: SiD-DiT achieves FID, CLIP, and GenEval scores comparable to or better than the teacher models, with only four sampling steps.
  • Adversarial enhancement: Incorporating additional data via adversarial loss (SiD$_2^a$-DiT) further reduces FID, especially for SANA and SD3 models.
  • Scalability: The method scales to large models (e.g., SD3.5-Large, FLUX.1-dev) using BF16 and FSDP, with a single codebase and minimal hyperparameter tuning.
  • Robustness: The same training configuration is effective across all tested architectures and parameter scales.

Figure 2: Qualitative results of the four-step SiD-DiT generator distilled from SD3.5-Large in a data-free setting.

Figure 3: Qualitative results of the four-step SiD-DiT generator distilled from FLUX.1-dev in a data-free setting.

Figure 4: Qualitative results of the four-step SiD-DiT and SiD$_2^a$-DiT generators distilled from SD3-Medium, compared against Flash Diffusion SD3 and the teacher model SD3-Medium.

Figure 5: Qualitative results of the four-step SiD-DiT and SiD$_2^a$-DiT generators distilled from SD3.5-Medium, compared against SD3.5-Turbo-Medium and the teacher model SD3.5-Medium.

Figure 6: Qualitative results from the four-step SiD-DiT and SiD$_2^a$-DiT generators distilled from SD3.5-Large, compared against SD3.5-Turbo-Large and the teacher SD3.5-Large.

Figure 7: Qualitative results from the four-step SiD-DiT and SiD$_2^a$-DiT generators distilled from the SANA-Sprint teacher (1.6B), compared against SANA-Sprint 1.6B and the teacher.

Analysis of Loss Reweighting and Timestep Scheduling

The empirical study of loss reweighting reveals that restricting the generator loss to higher $t$ intervals (i.e., heavier noise) yields visually appealing but less detailed images, while lower $t$ intervals enhance detail at the expense of vividness. The chosen combination of $p(t)$ and $w_t$ in SiD-DiT provides full coverage over $t$, balancing these trade-offs and delivering strong performance across all tested models.

Figure 8: Qualitative generations from distilled SANA by restricting $t$ to different intervals, illustrating the effect of loss reweighting on image characteristics.

Implementation Considerations

  • Precision and memory: AMP and BF16 are used to maximize throughput and minimize memory usage, with BF16 required for the largest models.
  • Distributed training: FSDP is employed for efficient multi-GPU training.
  • Data-free operation: SiD-DiT requires no real images, relying solely on the teacher model for supervision.
  • Adversarial extension: When additional data are available, adversarial loss can be incorporated without architectural changes.

Implications and Future Directions

This work provides the first systematic evidence that score distillation applies broadly to flow-matching T2I models, resolving prior concerns about stability and soundness. The unified theoretical perspective clarifies that practical differences between diffusion and flow matching arise primarily from loss weighting and timestep scheduling, not from the generative process itself.

Practically, this enables the rapid distillation of large-scale T2I models into efficient few-step generators, facilitating deployment in latency-sensitive applications. The approach is robust across architectures and parameter scales, requiring minimal adaptation.

Theoretically, the results suggest that future research on generative modeling and fast sampling can focus on unified frameworks that abstract away from the specifics of diffusion or flow matching, instead optimizing the effective weighting of training losses and timestep schedules.

Potential future developments include:

  • Tailoring distillation objectives to model-specific guidance mechanisms (e.g., learned guidance in FLUX).
  • Systematic exploration of loss reweighting and timestep scheduling for further performance gains.
  • Extending the framework to non-Gaussian or non-linear generative processes.

Conclusion

The paper establishes a unified theoretical and empirical foundation for score distillation in flow-matching models, demonstrating that efficient, high-quality, few-step text-to-image generation is achievable across a wide range of architectures with a single, robust framework. This work bridges the gap between diffusion and flow matching, providing a practical path forward for scalable, accelerated generative modeling.
