Capability-aware Prompt Reformulation Learning for Text-to-Image Generation (2403.19716v1)
Abstract: Text-to-image generation systems have emerged as revolutionary tools in the realm of artistic creation, offering unprecedented ease in transforming textual prompts into visual art. However, the efficacy of these systems is intricately linked to the quality of user-provided prompts, which often poses a challenge to users unfamiliar with prompt crafting. This paper addresses this challenge by leveraging user reformulation data from interaction logs to develop an automatic prompt reformulation model. Our in-depth analysis of these logs reveals that user prompt reformulation is heavily dependent on the individual user's capability, resulting in significant variance in the quality of reformulation pairs. To effectively use this data for training, we introduce the Capability-aware Prompt Reformulation (CAPR) framework. CAPR innovatively integrates user capability into the reformulation process through two key components: the Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). CRM reformulates prompts according to a specified user capability, as represented by CCF. The CCF, in turn, offers the flexibility to tune and guide the CRM's behavior. This enables CAPR to effectively learn diverse reformulation strategies across various user capacities and to simulate high-capability user reformulation during inference. Extensive experiments on standard text-to-image generation benchmarks showcase CAPR's superior performance over existing baselines and its remarkable robustness on unseen systems. Furthermore, comprehensive analyses validate the effectiveness of different components. CAPR can facilitate user-friendly interaction with text-to-image systems and make advanced artistic creation more achievable for a broader range of users.
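The conditioning idea in the abstract can be illustrated with a minimal sketch. It assumes the capability features are serialized as control tokens in front of the prompt; the abstract does not say how CCF is encoded, so this is an assumption. The feature name `quality`, the token format, and all function names are hypothetical. During training, each logged reformulation pair would be tagged with the capability observed for that user. At inference, the features would be set to a high target value to simulate a high-capability user.

```python
from typing import Dict, Tuple

# Hypothetical representation of Configurable Capability Features (CCF):
# a mapping from feature name to a score in [0, 1]. The real features
# used by the paper are not given in the abstract.
CapabilityFeatures = Dict[str, float]


def build_conditional_input(prompt: str, ccf: CapabilityFeatures) -> str:
    """Prefix the prompt with one control token per capability feature."""
    tokens = [f"<{name}={value:.1f}>" for name, value in sorted(ccf.items())]
    return " ".join(tokens + [prompt])


def make_training_example(
    original: str, reformulated: str, observed: CapabilityFeatures
) -> Tuple[str, str]:
    """Pair a logged reformulation with the capability observed for its user.

    The conditional model would learn to map (features, original prompt)
    to the reformulated prompt, so low-quality pairs are still usable.
    """
    return build_conditional_input(original, observed), reformulated


def make_inference_input(original: str, target: CapabilityFeatures) -> str:
    """At inference, condition on a chosen (typically high) capability."""
    return build_conditional_input(original, target)


if __name__ == "__main__":
    source, target_text = make_training_example(
        "a cat", "a cat, oil painting, soft light", {"quality": 0.3}
    )
    print(source)       # training input tagged with the observed capability
    print(target_text)  # training target taken from the interaction log
    print(make_inference_input("a cat", {"quality": 1.0}))
```

The point of the design is that the same model sees both weak and strong reformulations but is always told which is which, so the capability setting becomes a knob that can be turned at inference time.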
- Jingtao Zhan
- Qingyao Ai
- Yiqun Liu
- Jia Chen
- Shaoping Ma