Enhancing Image Generation Models with Quality-Tuning: A Study of Emu
The paper "Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack" introduces a novel approach to improve the aesthetics of generated images from text-to-image models. This work addresses a notable challenge in the field, where models trained on large-scale image-text datasets often do not inherently generate aesthetically appealing images. The authors propose a two-stage process: pre-training coupled with a unique fine-tuning technique called "quality-tuning," wherein a pre-trained model is further refined with a carefully curated small dataset of high-quality images. This strategy successfully enhances the aesthetic quality while preserving the model's ability to generate a diverse range of visual concepts.
Key Contributions and Methodology
- Quality-Tuning Process: The core innovation lies in fine-tuning the pre-trained model on a remarkably small dataset of only a few thousand manually selected, high-quality images. Drawing on principles from professional photography such as composition, lighting, and storytelling, the authors curated a dataset that steers the model toward generating images of superior aesthetic quality (see the training sketch after this list).
- Pre-Training Foundation: The models were initially trained using a latent diffusion approach on a substantial dataset of 1.1 billion image-text pairs. This large-scale pre-training phase ensures the model has a comprehensive understanding of diverse visual concepts.
- Evaluation and Performance: The resulting model, Emu, generates markedly more aesthetically pleasing images than its pre-trained counterpart. In pairwise human evaluations of visual appeal, Emu was preferred over the state-of-the-art SDXLv1.0 with win rates of 68.4% and 71.3% on the standard PartiPrompts benchmark and on open user prompts, respectively.
- Broad Applicability: The authors illustrate that quality-tuning is a flexible technique applicable across various architectures, including pixel diffusion and masked generative transformer models, not just latent diffusion models. This flexibility underscores quality-tuning as a generic strategy for enhancing image generative models.
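To make the recipe concrete, the sketch below shows what quality-tuning amounts to in code: a brief fine-tuning pass over the curated pairs using the standard denoising objective. It assumes a pre-trained latent diffusion model split into the usual components (VAE, text encoder, denoising UNet, noise scheduler) and a loader over the curated image-text pairs; the interfaces, hyperparameters, and step budget are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch of quality-tuning a pre-trained latent diffusion model.
# `unet`, `vae`, `text_encoder`, and `noise_scheduler` are assumed to come
# from the pre-trained checkpoint; `curated_loader` yields (pixel_values,
# token_ids) for the few thousand hand-picked images. All names and
# hyperparameters are illustrative, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def quality_tune(unet, vae, text_encoder, noise_scheduler, curated_loader,
                 max_steps=10_000, lr=1e-5, latent_scale=0.18215):
    """Fine-tune only the denoising UNet on a tiny, hand-curated dataset."""
    optimizer = torch.optim.AdamW(unet.parameters(), lr=lr)  # small LR: refine, don't retrain
    unet.train()
    step = 0
    while step < max_steps:  # hard cap: a tiny dataset overfits quickly
        for pixel_values, token_ids in curated_loader:
            with torch.no_grad():
                latents = vae.encode(pixel_values) * latent_scale  # image -> latent space
                text_emb = text_encoder(token_ids)                 # prompt conditioning

            noise = torch.randn_like(latents)
            t = torch.randint(0, noise_scheduler.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
            noisy = noise_scheduler.add_noise(latents, noise, t)

            # Standard denoising objective: predict the injected noise.
            pred = unet(noisy, t, encoder_hidden_states=text_emb)
            loss = F.mse_loss(pred, noise)

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            step += 1
            if step >= max_steps:
                break
    return unet
```

The conservative learning rate and the hard cap on steps reflect the central tension of quality-tuning: enough updates to shift the model's aesthetic prior, but few enough that the tiny dataset does not erode the visual concepts learned during pre-training (the paper likewise relies on early stopping).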
Implications and Future Work
This paper carries several implications for the development and enhancement of generative models. Practically, quality-tuning offers a cost-effective way to improve generation quality without requiring ever-larger datasets. It parallels findings in NLP, where fine-tuning on a small but high-quality dataset can significantly steer a model's output quality.
Theoretically, the results challenge the prevailing emphasis on data scale by demonstrating that a small, high-quality dataset can reshape a model's output aesthetics. This insight may prompt further research into data curation techniques and the role of human feedback in model refinement.
Future developments could explore how this methodology integrates with other domains of generative AI, such as text-to-video or 3D content creation. Research could also be directed at automating parts of the curation process to further scale the quality-tuning paradigm, as sketched below.
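As a concrete illustration of what partial automation could look like, the sketch below prunes a large candidate pool down to a human-reviewable shortlist using an automatic score. The scorer is a deliberately crude stand-in (edge-map variance as a sharpness proxy), and the file layout and keep count are hypothetical; the paper itself pairs automatic filters with multi-stage human review, so a sketch like this would only replace the first pass.

```python
# Illustrative first pass of automated curation: rank candidate images by a
# cheap automatic quality score and keep only the top few thousand for human
# review. A real pipeline would swap in a learned aesthetic predictor; the
# sharpness proxy, directory layout, and keep count here are hypothetical.
from pathlib import Path

import numpy as np
from PIL import Image, ImageFilter

def sharpness_score(image: Image.Image) -> float:
    """Variance of an edge map: a rough, stand-in proxy for image quality."""
    edges = image.convert("L").filter(ImageFilter.FIND_EDGES)
    return float(np.asarray(edges, dtype=np.float32).var())

def prefilter(candidate_dir: str, keep: int = 2000) -> list[Path]:
    """Score every candidate and return the best `keep` for manual curation."""
    paths = sorted(Path(candidate_dir).glob("*.jpg"))
    scored = [(sharpness_score(Image.open(p).convert("RGB")), p) for p in paths]
    scored.sort(key=lambda sp: sp[0], reverse=True)  # best first
    return [p for _, p in scored[:keep]]             # survivors go to human review
```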
Conclusion
The introduction of quality-tuning represents a significant advance in improving the aesthetic quality of text-to-image generative models. By prioritizing data quality over quantity, the authors have opened a new pathway for improving the visual appeal of generated content and underscored the critical role of curation in AI development. As the field advances, techniques like quality-tuning will likely be pivotal in bridging the gap between technical capability and artistic creativity.