Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding (2404.11589v1)
Abstract: The rapid evolution of text-to-image diffusion models has opened the door to generative AI, enabling the translation of textual descriptions into visually compelling images of remarkable quality. However, a persistent challenge in this domain is optimizing prompts so that abstract concepts are conveyed through concrete objects. For example, text encoders can hardly express "peace" directly, but can easily illustrate olive branches and white doves. This paper introduces a novel approach named Prompt Optimizer for Abstract Concepts (POAC), specifically designed to enhance the performance of text-to-image diffusion models in interpreting and generating images from abstract concepts. We propose a Prompt LLM (PLM), which is initialized from a pre-trained LLM and then fine-tuned on a curated dataset of abstract-concept prompts. The dataset is created with GPT-4, which expands each abstract concept into a scene with concrete objects. Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between the images generated by a Stable Diffusion model and the optimized prompts. Through extensive experiments, we demonstrate that our proposed POAC significantly improves the accuracy and aesthetic quality of generated images, particularly in depicting abstract concepts and aligning with optimized prompts. We also present a comprehensive analysis of our model's performance across diffusion models under different settings, showcasing its versatility and effectiveness in enhancing abstract concept representation.
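The core idea — expanding an abstract concept into concrete-object prompts and selecting among candidates by an image-text alignment reward — can be sketched as follows. This is a minimal, hypothetical illustration only: the hard-coded expansions stand in for the fine-tuned Prompt LLM's outputs, and `alignment_reward` is a toy heuristic standing in for an alignment score (e.g. CLIP similarity) computed on images actually generated by a diffusion model; none of these names or details come from the paper.

```python
# Hypothetical sketch of reward-ranked prompt optimization for an
# abstract concept. Real system: a fine-tuned Prompt LLM proposes the
# expansions, and the reward scores diffusion-model outputs.

# Stand-in for Prompt LLM candidate expansions of an abstract concept.
EXPANSIONS = {
    "peace": [
        "a white dove carrying an olive branch over a calm sea",
        "children holding hands in a sunlit meadow",
        "a quiet mountain lake at dawn",
    ],
}

def alignment_reward(prompt: str) -> float:
    """Toy stand-in for an image-text alignment score (e.g. CLIP
    similarity on generated images): rewards concrete objects named
    in the prompt."""
    concrete = {"dove", "olive", "branch", "meadow", "lake"}
    words = set(prompt.lower().split())
    return len(words & concrete) / len(concrete)

def optimize_prompt(concept: str) -> str:
    """Return the candidate expansion with the highest reward,
    mimicking reward-ranked selection over Prompt LLM outputs."""
    candidates = EXPANSIONS.get(concept, [concept])
    return max(candidates, key=alignment_reward)

print(optimize_prompt("peace"))
```

In the paper's actual framework the reward signal would additionally drive RL fine-tuning of the Prompt LLM itself, rather than only ranking a fixed candidate set as in this sketch.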
- Zezhong Fan
- Xiaohan Li
- Chenhao Fang
- Topojoy Biswas
- Kaushiki Nag
- Jianpeng Xu
- Kannan Achan