Improving Text-to-Image Consistency via Automatic Prompt Optimization
The paper "Improving Text-to-Image Consistency via Automatic Prompt Optimization" presents OPT2I, a novel framework for enhancing the consistency between textual prompts and the images generated from them by text-to-image (T2I) models. Despite significant advances in T2I technologies, which now produce high-quality and aesthetically pleasing images, accurate prompt-image consistency remains a challenge. The paper addresses key limitations of current methods: the need for model fine-tuning, a focus on sampling only nearby prompt alternatives, and trade-offs among image quality, representation diversity, and prompt consistency.
Motivation and Challenges
Current T2I models often fail to reflect the nuances of input prompts precisely. Common consistency failures include incorrect object cardinality, misattributed features, and neglected spatial relationships. Existing solutions, while somewhat effective, often require direct access to model weights for fine-tuning, which rules them out for models reachable only through an API. Other methods, such as adjusting the guidance scale or selecting images post hoc, tend to reduce image quality and diversity.
Proposed Method: OPT2I
The paper introduces OPT2I, a framework that leverages large language models (LLMs) for automatic prompt optimization. OPT2I iteratively refines user prompts to maximize a consistency score, without modifying the T2I models themselves. The approach is inherently versatile, applicable across T2I models, LLMs, and consistency metrics.
The process begins by generating images from a user prompt with a T2I model and evaluating their consistency with predefined metrics, such as the decomposed CLIPScore (dCS) and the Davidsonian Scene Graph (DSG) score. These metrics assess how well the images represent each component of the prompt. The LLM then suggests alternative prompts based on past performance, continually refining its suggestions to maximize the consistency score. OPT2I thus serves as a plug-and-play, training-free solution, adaptable to different models and scoring methodologies.
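The optimization loop can be sketched as follows. This is my own toy illustration, not the paper's code: `consistency_score` and `revise_prompt` are deterministic stand-ins for what, in a real setup, would be a dCS/DSG evaluation of generated images and an LLM rewriting the best-scoring prompts.

```python
# Toy stand-ins (hypothetical): a real setup would call a T2I model,
# a consistency metric (dCS or DSG), and an LLM prompt reviser.
TARGET_ELEMENTS = {"dog", "red", "ball", "park"}  # elements the metric checks for

def consistency_score(prompt: str) -> float:
    """Toy metric: fraction of target elements mentioned in the prompt."""
    words = set(prompt.lower().split())
    return len(TARGET_ELEMENTS & words) / len(TARGET_ELEMENTS)

def revise_prompt(prompt: str, history) -> str:
    """Toy 'LLM': emphasize one element missing from the best prompt so far.
    (`history` is unused here; the real LLM conditions on past prompt/score pairs.)"""
    missing = TARGET_ELEMENTS - set(prompt.lower().split())
    if not missing:
        return prompt
    return prompt + " " + sorted(missing)[0]

def optimize_prompt(user_prompt: str, iterations: int = 10):
    """OPT2I-style loop: score candidates, keep a history, revise the best one."""
    history = [(consistency_score(user_prompt), user_prompt)]
    for _ in range(iterations):
        _, best_prompt = max(history)
        candidate = revise_prompt(best_prompt, history)
        history.append((consistency_score(candidate), candidate))
    return max(history)

best_score, best_prompt = optimize_prompt("a dog in a park")
```

The key structural points survive the simplification: the user prompt is never assumed optimal, candidates are scored rather than trusted, and each revision is conditioned on the best result found so far.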
Analysis and Results
The research demonstrates that OPT2I significantly enhances prompt-image consistency, achieving improvements of up to 24.9% on datasets such as MSCOCO and PartiPrompts, while preserving Fréchet Inception Distance (FID) and increasing the recall of generated data with respect to real data. The framework consistently outperforms paraphrasing baselines and is robust across LLMs and T2I models. Qualitatively, optimized prompts often emphasize elements the model initially overlooked, steering the generated images toward better alignment with the original prompt.
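For reference, FID compares Gaussian fits to the feature distributions of real and generated images. A minimal implementation of the standard formula (independent of the paper, using numpy and scipy) might look like:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID between two sets of feature vectors (rows = samples).

    FID = ||mu_a - mu_b||^2 + Tr(Sigma_a + Sigma_b - 2 (Sigma_a Sigma_b)^(1/2))

    In practice the features come from an Inception network; here any
    fixed feature extractor's outputs can be passed in.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Preserving FID while improving consistency is the paper's evidence that prompt optimization does not trade away image quality or distributional diversity.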
Moreover, the paper highlights the utility of fine-grained consistency metrics such as the DSG score, particularly for complex prompts that require a nuanced understanding of relationships and attributes within the scene. Optimized prompts typically improve consistency either by adding detail about initially neglected elements or by strategically reordering the elements of the prompt.
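The decomposed-scoring idea behind such metrics can be illustrated with a toy sketch (not the paper's implementation): split a prompt into parts, score each part against the image separately, and use the lowest-scoring parts to identify which elements the optimizer should emphasize. Here `part_image_similarity` is a hypothetical stand-in for a real per-fragment score such as a CLIPScore:

```python
import re

def decompose(prompt: str):
    """Naive decomposition: split on commas and 'and'.
    The real dCS extracts noun phrases with a parser."""
    parts = re.split(r",| and ", prompt)
    return [p.strip() for p in parts if p.strip()]

def decomposed_score(prompt: str, part_image_similarity):
    """Average per-part similarity, plus the per-part scores for diagnosis."""
    parts = decompose(prompt)
    scores = [part_image_similarity(p) for p in parts]
    return sum(scores) / len(scores), dict(zip(parts, scores))

# Example: a fake similarity that flags 'two dogs' as poorly depicted.
sim = lambda part: 0.4 if "dogs" in part else 0.9
avg, per_part = decomposed_score("a red car, two dogs and a tree", sim)
```

The per-part breakdown is what makes detailed metrics diagnostic: a single aggregate score says the image is imperfect, whereas the decomposition localizes the failure (here, the object count) for the next round of prompt revision.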
Implications and Future Directions
The implications of this research are significant, offering a method to refine T2I processes efficiently without altering model architectures. By improving prompt consistency, OPT2I enhances the utility and reliability of T2I models in practical applications, potentially influencing areas such as digital content creation, virtual reality, and human-computer interaction.
Future research could further explore robustness to diverse prompt complexities and enhance metrics for prompt-image consistency. Additionally, automation in the domain of prompt optimization opens pathways for even more sophisticated models and applications, potentially integrating feedback loops from user interactions or real-world applications. As T2I models continue to evolve, frameworks like OPT2I offer pivotal foundations for achieving greater synthesis between text and image modalities.