Dynamic Prompt Optimizing for Text-to-Image Generation (2404.04095v1)
Abstract: Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts, where users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of such fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps, which requires significant manual intervention. To address this, we introduce the Prompt Auto-Editing (PAE) method. Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment. Code is available at https://github.com/Mowenyii/PAE.
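To make the abstract's notion of a dynamic fine-control prompt and the composite reward concrete, here is a minimal Python sketch. The per-word weight and injection interval follow the description above, but the rendering notation, type names (PromptToken, format_prompt, composite_reward), and scorer interfaces are illustrative assumptions, not the paper's actual implementation; see the linked repository for that.

```python
# Sketch only: how a dynamic fine-control prompt and the training reward
# described in the abstract could be represented. All names are hypothetical.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PromptToken:
    word: str                    # appended modifier word or phrase
    weight: float                # emphasis applied to the word
    steps: Tuple[float, float]   # normalized injection interval, e.g. (0.0, 0.5)


def format_prompt(base: str, tokens: List[PromptToken]) -> str:
    """Render a dynamic fine-control prompt in an illustrative notation,
    e.g. 'a castle on a hill, (dramatic lighting:1.30)[0.0-0.5]'."""
    mods = [
        f"({t.word}:{t.weight:.2f})[{t.steps[0]:.1f}-{t.steps[1]:.1f}]"
        for t in tokens
    ]
    return ", ".join([base] + mods)


def composite_reward(
    image,
    prompt: str,
    aesthetic_score: Callable,    # e.g. an aesthetic predictor
    semantic_score: Callable,     # e.g. CLIP text-image similarity
    preference_score: Callable,   # e.g. a learned human-preference model
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three reward terms named in the abstract:
    aesthetics, semantic consistency, and user preference."""
    w_aes, w_sem, w_pref = weights
    return (w_aes * aesthetic_score(image)
            + w_sem * semantic_score(image, prompt)
            + w_pref * preference_score(image, prompt))


# Example: a refined prompt with two modifiers injected over different step ranges.
tokens = [
    PromptToken("dramatic lighting", 1.3, (0.0, 0.5)),
    PromptToken("highly detailed", 1.1, (0.5, 1.0)),
]
print(format_prompt("a castle on a hill", tokens))
```

In this sketch, the reinforcement-learning policy would choose the modifier words, weights, and injection intervals, and the composite reward would score the resulting image; the specific scorers and their weighting are left as placeholders.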
Authors: Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen, Qing Yang