Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences (2410.18881v2)

Published 24 Oct 2024 in cs.CV, cs.AI, and cs.LG

Abstract: One-step text-to-image generator models offer advantages such as swift inference, flexible architectures, and state-of-the-art generation performance. In this paper, we study for the first time the problem of aligning one-step generator models with human preferences. Inspired by the success of reinforcement learning from human feedback (RLHF), we formulate the alignment problem as maximizing an expected human reward function while adding an Integral Kullback-Leibler divergence term to prevent the generator from diverging. By overcoming technical challenges, we introduce Diff-Instruct++ (DI++), the first fast-converging, image-data-free human-preference alignment method for one-step text-to-image generators. We also introduce novel theoretical insights, showing that using CFG for diffusion distillation is secretly doing RLHF with DI++. This finding brings understanding and potential contributions to future research involving CFG. In our experiments, we use DI++ to align both UNet-based and DiT-based one-step generators, which use Stable Diffusion 1.5 and PixArt-$\alpha$, respectively, as the reference diffusion processes. The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming other open-source models such as Stable Diffusion XL, DMD2, SD-Turbo, and PixArt-$\alpha$. Both theoretical contributions and empirical evidence indicate that DI++ is a strong human-preference alignment approach for one-step text-to-image models. The homepage of the paper is https://github.com/pkulwj1994/diff_instruct_pp.
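The abstract's alignment formulation — maximize the expected human reward while a KL-style regularizer keeps the one-step generator near a reference distribution — can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration, not the paper's implementation: the "generator" is a 1-D affine map, `reward_fn` is a hypothetical preference reward, and a closed-form Gaussian KL stands in for the Integral KL term against the reference diffusion process.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta):
    """One-step generator: an affine map as a stand-in for a neural network."""
    shift, scale = theta
    return shift + scale * z

def reward_fn(x):
    """Hypothetical human-preference reward: prefers samples near 1.0."""
    return -(x - 1.0) ** 2

def gaussian_kl(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)): a closed-form stand-in for the
    Integral KL regularizer against the reference process."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))

def alignment_objective(theta, beta=0.1, n=10_000):
    """RLHF-style objective: E[reward] minus beta times the KL penalty."""
    z = rng.standard_normal(n)
    x = generator(z, theta)
    expected_reward = reward_fn(x).mean()
    shift, scale = theta
    return expected_reward - beta * gaussian_kl(shift, abs(scale))

# A generator shifted toward the high-reward region scores better than the
# unaligned reference generator (shift=0, scale=1), while the KL term
# penalizes drifting too far from the reference distribution.
j_ref = alignment_objective((0.0, 1.0))
j_aligned = alignment_objective((0.9, 0.5))
print(j_aligned > j_ref)
```

In the actual method, the gradient of this kind of objective is backpropagated through the one-step generator, with the KL term estimated via score functions of the reference diffusion model rather than in closed form.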



Authors (1)
