Flow Generator Matching (2410.19310v1)
Abstract: In the realm of Artificial Intelligence Generated Content (AIGC), flow-matching models have emerged as a powerhouse, owing to their robust theoretical underpinnings and strong capability for large-scale generative modeling. These models achieve state-of-the-art performance, but this brilliance comes at a cost: sampling from them is notoriously demanding on computational resources, since it requires solving multi-step numerical ordinary differential equations (ODEs). Against this backdrop, this paper presents Flow Generator Matching (FGM), a novel approach with theoretical guarantees that accelerates the sampling of flow-matching models into one-step generation while maintaining the original performance. On the CIFAR10 unconditional generation benchmark, our one-step FGM model achieves a new record Fréchet Inception Distance (FID) score of 3.08 among few-step flow-matching-based models, outperforming the original 50-step flow-matching models. Furthermore, we use FGM to distill Stable Diffusion 3, a leading text-to-image flow-matching model based on the MM-DiT architecture. The resulting one-step text-to-image model, MM-DiT-FGM, demonstrates outstanding industry-level performance. Evaluated on the GenEval benchmark, MM-DiT-FGM delivers remarkable generation quality, rivaling multi-step models while requiring only a single generation step.
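Conceptually, the speedup comes from replacing a multi-step ODE solve with a single network evaluation. The following is a minimal, self-contained PyTorch sketch of that contrast; the toy MLPs and the names velocity_field, one_step_generator, and sample_ode are illustrative assumptions for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins on 2-D data; the real models are large image networks.
velocity_field = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))      # v(x_t, t)
one_step_generator = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 2))  # g(z)

def sample_ode(v, z, num_steps=50):
    """Multi-step Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x = z
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0], 1), i * dt)          # broadcast current time to the batch
        x = x + dt * v(torch.cat([x, t], dim=-1))        # one network evaluation per step
    return x

def sample_one_step(g, z):
    """A distilled one-step generator: a single network evaluation."""
    return g(z)

z = torch.randn(16, 2)
x_multi = sample_ode(velocity_field, z, num_steps=50)    # 50 forward passes
x_single = sample_one_step(one_step_generator, z)        # 1 forward pass
```

With 50 Euler steps the multi-step sampler costs 50 forward passes per batch, while the distilled generator costs exactly one; this gap is the source of the wall-clock savings the abstract describes.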