Improving Text-to-Image Consistency via Automatic Prompt Optimization
The paper "Improving Text-to-Image Consistency via Automatic Prompt Optimization" presents OPT2I, a novel framework for enhancing the consistency between textual prompts and the images generated from them by text-to-image (T2I) models. Despite significant advances in T2I technologies, which now produce high-quality and aesthetically pleasing images, accurate prompt-image consistency remains a challenge. The paper addresses key limitations of current methods: the need for model fine-tuning, a focus on sampling only nearby prompt alternatives, and trade-offs among image quality, representation diversity, and prompt consistency.
Motivation and Challenges
Current T2I models often fail to reflect the nuances of input prompts precisely. Common consistency failures include incorrect object cardinality, misattributed features, and neglected spatial relationships. Existing solutions, while somewhat effective, often require direct access to model weights for fine-tuning, which rules them out for models reachable only through an API. Other methods, such as adjusting the guidance scale or selecting images post hoc, tend to reduce image quality and diversity.
Proposed Method: OPT2I
The paper introduces OPT2I, a framework that leverages large language models (LLMs) for automatic prompt optimization. OPT2I iteratively refines user prompts to maximize a consistency score, without modifying the T2I models themselves. The approach is inherently versatile, applicable across T2I models, LLMs, and consistency metrics.
The process begins by generating images from a user prompt with a T2I model and evaluating their consistency with predefined metrics, such as the decomposed CLIPScore (dCS) and the Davidsonian Scene Graph (DSG) score. These metrics assess how well the images represent each component of the prompt. The LLM then suggests alternative prompts based on past performance, continually refining its suggestions to maximize the consistency score. OPT2I thus serves as a plug-and-play, training-free solution, adaptable to different models and scoring methodologies.
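The optimization loop can be sketched as follows. This is my own toy illustration, not the paper's code: `consistency_score` and `revise_prompt` are deterministic stand-ins for what, in a real setup, would be a dCS/DSG evaluation of generated images and an LLM rewriting the best-scoring prompts.

```python
# Toy stand-ins (hypothetical): a real setup would call a T2I model,
# a consistency metric (dCS or DSG), and an LLM prompt reviser.
TARGET_ELEMENTS = {"dog", "red", "ball", "park"}  # elements the metric checks for

def consistency_score(prompt: str) -> float:
    """Toy metric: fraction of target elements mentioned in the prompt."""
    words = set(prompt.lower().split())
    return len(TARGET_ELEMENTS & words) / len(TARGET_ELEMENTS)

def revise_prompt(prompt: str, history) -> str:
    """Toy 'LLM': emphasize one element missing from the best prompt so far.
    (`history` is unused here; the real LLM conditions on past prompt/score pairs.)"""
    missing = TARGET_ELEMENTS - set(prompt.lower().split())
    if not missing:
        return prompt
    return prompt + " " + sorted(missing)[0]

def optimize_prompt(user_prompt: str, iterations: int = 10):
    """OPT2I-style loop: score candidates, keep a history, revise the best one."""
    history = [(consistency_score(user_prompt), user_prompt)]
    for _ in range(iterations):
        _, best_prompt = max(history)
        candidate = revise_prompt(best_prompt, history)
        history.append((consistency_score(candidate), candidate))
    return max(history)

best_score, best_prompt = optimize_prompt("a dog in a park")
```

The key structural points survive the simplification: the user prompt is never assumed optimal, candidates are scored rather than trusted, and each revision is conditioned on the best result found so far.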
Analysis and Results
The research demonstrates that OPT2I significantly enhances prompt-image consistency, achieving improvements of up to 24.9% on datasets such as MSCOCO and PartiPrompts, while preserving Fréchet Inception Distance (FID) and increasing the recall of generated data with respect to real data. The framework consistently outperforms paraphrasing baselines and is robust across LLMs and T2I models. Qualitatively, optimized prompts often emphasize elements the model initially overlooked, steering the generated images toward better alignment with the original prompt.
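For reference, FID compares Gaussian fits to the feature distributions of real and generated images. A minimal implementation of the standard formula (independent of the paper, using numpy and scipy) might look like:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID between two sets of feature vectors (rows = samples).

    FID = ||mu_a - mu_b||^2 + Tr(Sigma_a + Sigma_b - 2 (Sigma_a Sigma_b)^(1/2))

    In practice the features come from an Inception network; here any
    fixed feature extractor's outputs can be passed in.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Preserving FID while improving consistency is the paper's evidence that prompt optimization does not trade away image quality or distributional diversity.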
Moreover, the paper highlights the utility of fine-grained consistency metrics such as the DSG score, particularly for complex prompts that require a nuanced understanding of relationships and attributes within the scene. Optimized prompts typically improve consistency either by adding detail about initially neglected elements or by strategically reordering the elements of the prompt.
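The decomposed-scoring idea behind such metrics can be illustrated with a toy sketch (not the paper's implementation): split a prompt into parts, score each part against the image separately, and use the lowest-scoring parts to identify which elements the optimizer should emphasize. Here `part_image_similarity` is a hypothetical stand-in for a real per-fragment score such as a CLIPScore:

```python
import re

def decompose(prompt: str):
    """Naive decomposition: split on commas and 'and'.
    The real dCS extracts noun phrases with a parser."""
    parts = re.split(r",| and ", prompt)
    return [p.strip() for p in parts if p.strip()]

def decomposed_score(prompt: str, part_image_similarity):
    """Average per-part similarity, plus the per-part scores for diagnosis."""
    parts = decompose(prompt)
    scores = [part_image_similarity(p) for p in parts]
    return sum(scores) / len(scores), dict(zip(parts, scores))

# Example: a fake similarity that flags 'two dogs' as poorly depicted.
sim = lambda part: 0.4 if "dogs" in part else 0.9
avg, per_part = decomposed_score("a red car, two dogs and a tree", sim)
```

The per-part breakdown is what makes detailed metrics diagnostic: a single aggregate score says the image is imperfect, whereas the decomposition localizes the failure (here, the object count) for the next round of prompt revision.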
Implications and Future Directions
The implications of this research are significant, offering a method to refine T2I processes efficiently without altering model architectures. By improving prompt consistency, OPT2I enhances the utility and reliability of T2I models in practical applications, potentially influencing areas such as digital content creation, virtual reality, and human-computer interaction.
Future research could further explore robustness to diverse prompt complexities and enhance metrics for prompt-image consistency. Additionally, automation in the domain of prompt optimization opens pathways for even more sophisticated models and applications, potentially integrating feedback loops from user interactions or real-world applications. As T2I models continue to evolve, frameworks like OPT2I offer pivotal foundations for achieving greater synthesis between text and image modalities.