Evaluation of Prompt Engineering for Text-to-Image Models: Methodologies and Outcomes
The paper "Best Prompts for Text-to-Image Models and How to Find Them" by Nikita Pavlichenko and Dmitry Ustalov addresses a significant challenge in the field of generative image models: the formulation of effective textual prompts for text-guided diffusion models such as Stable Diffusion. The authors introduce a human-in-the-loop methodology aided by a genetic algorithm to identify optimal combinations of keywords in textual prompts, optimizing the aesthetic quality of generated images.
Methodological Framework
The researchers tackle the problem of judging whether an image is aesthetically pleasing, a task that computational metrics alone handle poorly, by leveraging subjective human assessments. They propose a process that combines human judgment with the optimization capabilities of a genetic algorithm. The keywords accompanying a text prompt substantially influence the quality of generated images, yet identifying the most effective keywords has so far been largely ad hoc and intuition-driven.
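To make the setup concrete, the sketch below shows one plausible way a keyword set modifies a base prompt; the separator, template, and keyword ordering are assumptions for illustration, not details taken from the paper.

```python
# Illustrative only: the exact prompt template, separator, and keyword
# ordering used in the paper are assumptions here.

def build_prompt(base_prompt: str, keywords: list[str]) -> str:
    """Append a comma-separated keyword set to a base prompt."""
    return ", ".join([base_prompt, *keywords])

print(build_prompt(
    "a portrait of an old sea captain",
    ["highly detailed", "cinematic lighting", "trending on artstation"],
))
# a portrait of an old sea captain, highly detailed, cinematic lighting,
# trending on artstation
```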
To systematically evaluate and refine these keyword combinations, the authors employ crowd workers to perform pairwise comparisons of images generated from various keyword sets. The Bradley-Terry model then converts these pairwise judgments into a statistical ranking of the keyword sets, so that human aesthetic intuition directly guides the evolution of the candidate sets. The methodology is applied to Stable Diffusion, a prominent text-to-image model.
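The following is a minimal sketch of how such a ranking can be computed from pairwise outcomes, using the standard minorization-maximization (MM) updates for the Bradley-Terry model; the paper's exact fitting procedure and crowdsourcing aggregation details are not reproduced here.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 100) -> np.ndarray:
    """Fit Bradley-Terry strengths via MM updates.

    wins[i, j] = number of times keyword set i was preferred over set j.
    """
    n = wins.shape[0]
    p = np.ones(n)                       # initial strengths
    comparisons = wins + wins.T          # total duels between i and j
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = (comparisons / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / denom
        p /= p.sum()                     # normalize for identifiability
    return p

# Toy example: set 0 usually beats set 1, which usually beats set 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
strengths = bradley_terry(wins)
print(np.argsort(-strengths))            # ranking, best keyword set first
```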
Empirical Evaluation and Results
The authors conduct a rigorous experimental evaluation using a predefined set of keywords and prompts derived from popular online sources. The dataset spans diverse categories such as portraits, landscapes, and animals. The initial keyword sets are evaluated, and through iterative optimization guided by the human-derived rankings, the genetic algorithm refines these sets to improve image aesthetics.
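A schematic version of such a loop appears below. The population size and the crossover and mutation operators are illustrative assumptions, and the placeholder rank_sets function stands in for the crowd-sourced pairwise comparisons (aggregated with Bradley-Terry) that supply the fitness signal in the paper.

```python
import random

VOCAB = ["highly detailed", "cinematic lighting", "4k", "sharp focus",
         "trending on artstation", "oil painting", "bokeh", "volumetric light"]

def random_set(k: int = 4) -> frozenset:
    return frozenset(random.sample(VOCAB, k))

def crossover(a: frozenset, b: frozenset, k: int = 4) -> frozenset:
    """Draw a child keyword set from the union of two parents."""
    pool = list(a | b)
    return frozenset(random.sample(pool, min(k, len(pool))))

def mutate(s: frozenset, rate: float = 0.25) -> frozenset:
    """Occasionally swap one keyword for a random vocabulary entry."""
    out = set(s)
    if out and random.random() < rate:
        out.remove(random.choice(sorted(out)))
        out.add(random.choice(VOCAB))
    return frozenset(out)

def rank_sets(population: list[frozenset]) -> list[frozenset]:
    """Placeholder fitness: a dummy score. In the paper, this ranking comes
    from crowd workers' pairwise comparisons fitted with Bradley-Terry."""
    return sorted(population,
                  key=lambda s: len(s & {"highly detailed", "4k"}),
                  reverse=True)

population = [random_set() for _ in range(8)]
for generation in range(5):
    ranked = rank_sets(population)
    elite = ranked[: len(ranked) // 2]           # keep the top half
    children = [mutate(crossover(*random.sample(elite, 2)))
                for _ in range(len(population) - len(elite))]
    population = elite + children
print(rank_sets(population)[0])                  # best keyword set found
```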
The results indicate that the most popular community-derived keywords, despite being widely used, are not the most effective. A keyword set identified through the proposed optimization significantly enhances the visual appeal of generated images, achieving higher average aesthetic rankings than a baseline built from the most popular keywords.
Implications and Future Directions
This research elucidates the impact of structured prompt engineering on the output quality of text-to-image generation models. It systematically demonstrates the potential of genetic algorithms combined with crowd-sourced evaluations to optimize textual inputs for diffusion models. The findings imply that automated systems backed by human feedback can discover non-intuitive yet aesthetically superior keyword combinations, potentially applicable to a wide range of generative models beyond Stable Diffusion.
Practically, this work provides a foundation for developing more sophisticated prompt optimization techniques. The method can be extended to other generative scenarios, including text-to-text generation, by tailoring the human evaluation framework to the specific model outputs. Future research might focus on integrating real-time feedback mechanisms, improving the convergence robustness of the genetic algorithm, and evaluating scalability to larger sets of keywords and prompts.
Conclusion
In summary, the paper offers a novel approach to prompt engineering for text-to-image models, emphasizing the critical role of well-structured prompts in determining the quality of generated content. Through a combination of genetic algorithms and human evaluations, the paper not only provides a method for optimizing keyword selection but also contributes valuable insights into the nuanced interaction between text prompts and model outputs. As the field progresses, such methodologies will likely play a pivotal role in refining and enhancing the capabilities of generative AI applications.