Boosting Latent Diffusion with Flow Matching (2312.07360v3)
Abstract: Visual synthesis has recently seen significant leaps in performance, largely due to breakthroughs in generative models. Diffusion models have been a key enabler, as they excel in image diversity. However, this comes at the cost of slow training and synthesis, which is only partially alleviated by latent diffusion. To this end, flow matching is an appealing approach due to its complementary characteristics of faster training and inference but less diverse synthesis. We demonstrate that introducing flow matching between a frozen diffusion model and a convolutional decoder enables high-resolution image synthesis at reduced computational cost and model size. A small diffusion model can then effectively provide the necessary visual diversity, while flow matching efficiently enhances resolution and detail by mapping from the small to a high-dimensional latent space. These latents are then projected to high-resolution images by the subsequent convolutional decoder of the latent diffusion approach. Combining the diversity of diffusion models, the efficiency of flow matching, and the effectiveness of convolutional decoders, state-of-the-art high-resolution image synthesis is achieved at $1024^2$ pixels with minimal computational cost. Further scaling up our method, we can reach resolutions of up to $2048^2$ pixels. Importantly, our approach is orthogonal to recent approximation and speed-up strategies for the underlying model, making it easily integrable into the various diffusion model frameworks.
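The abstract describes training a flow matching model to transport a small latent (from the frozen diffusion model) to a high-dimensional latent, which the convolutional decoder then renders. A minimal sketch of this idea, assuming the common straight-line (rectified-flow style) interpolant; the function names are illustrative and not taken from the paper's code:

```python
import numpy as np

def fm_training_pair(z0, z1, t):
    """Conditional flow matching pair for one training example.
    z0: source latent (e.g. the upsampled small diffusion latent),
    z1: target high-dimensional latent. Returns the interpolant z_t
    and the velocity the network is trained to regress."""
    zt = (1.0 - t) * z0 + t * z1   # straight-line interpolant
    v = z1 - z0                    # constant target velocity along the line
    return zt, v

def euler_sample(z0, velocity_fn, steps=8):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with Euler steps,
    mapping the source latent to the high-dimensional latent space."""
    z, dt = z0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)
    return z

# Toy check: with the exact velocity field z1 - z0, Euler integration
# transports z0 exactly onto z1 (any learned field only approximates this).
rng = np.random.default_rng(0)
z0, z1 = rng.standard_normal(4), rng.standard_normal(4)
print(np.allclose(euler_sample(z0, lambda z, t: z1 - z0), z1))  # → True
```

Because the target velocity is constant along the interpolant, only a handful of ODE steps are needed at inference time, which is the efficiency argument the abstract makes for flow matching.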
Authors:
- Ming Gui
- Pingchuan Ma
- Nick Stracke
- Stefan A. Baumann
- Björn Ommer
- Johannes Schusterbauer
- Vincent Tao Hu