
A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis (2402.12760v1)

Published 20 Feb 2024 in cs.MM, cs.AI, and cs.CV

Abstract: Well-designed prompts can guide text-to-image models to generate striking images. Although existing prompt engineering methods provide high-level guidance, novice users struggle to achieve the desired results by manually entering prompts, owing to the discrepancy between novice-user prompts and model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. CFP is a new text-to-image dataset that pairs coarse- and fine-grained prompts to facilitate the development of automated prompt generation methods. UF-FGTG automatically translates user-input prompts into model-preferred prompts: a prompt refiner continually rewrites prompts so that users can select results aligned with their unique needs; image-related loss functions from the text-to-image model are integrated into the training of the text generator so that it produces model-preferred prompts; and an adaptive feature extraction module ensures diversity in the generated results. Experiments demonstrate that our approach generates more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics.
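The prompt refiner's core idea, iteratively rewriting a coarse user prompt into progressively finer-grained candidates so the user can pick one at each step, can be sketched as a toy loop. This is an illustrative stand-in only: the actual UF-FGTG refiner is a trained text-generation model optimized with image-related losses, and the modifier bank and function names below are hypothetical.

```python
# Hypothetical stand-in for a learned prompt refiner: in UF-FGTG the
# rewrites come from a trained generator, not a fixed modifier list.
MODIFIER_BANK = [
    "highly detailed",
    "trending on artstation",
    "cinematic lighting",
    "8k resolution",
]

def refine_prompt(coarse_prompt: str, rounds: int = 2) -> list[str]:
    """Produce a chain of progressively fine-grained prompt candidates.

    Each round appends one modifier to the current prompt, mimicking the
    refiner's continual rewriting; the user would select the candidate
    that best matches their intent.
    """
    candidates = [coarse_prompt]
    current = coarse_prompt
    for i in range(rounds):
        current = f"{current}, {MODIFIER_BANK[i % len(MODIFIER_BANK)]}"
        candidates.append(current)
    return candidates
```

In the real framework, each rewriting step is guided during training by quality and aesthetic losses computed on images generated from the candidate prompts, which is what steers the refiner toward model-preferred phrasings rather than arbitrary modifier stacking.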

Authors (6)
  1. Nailei Hei
  2. Qianyu Guo
  3. Zihao Wang
  4. Yan Wang
  5. Haofen Wang
  6. Wenqiang Zhang