TextCraftor: Your Text Encoder Can be Image Quality Controller (2403.18978v1)

Published 27 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations, numerous studies have endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing various technologies. Yet, amidst these efforts, a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other LLMs, we can enhance it through our proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly, our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet finetuning, and can be combined to further improve generative quality.


Authors (9)

Yanyu Li (31 papers)
Xian Liu (37 papers)
Anil Kag (16 papers)
Ju Hu (9 papers)
Yerlan Idelbayev (9 papers)
Dhritiman Sagar (2 papers)
Yanzhi Wang (197 papers)
Sergey Tulyakov (108 papers)
Jian Ren (97 papers)


Summary

TextCraftor: Enhancing Text-to-Image Diffusion Models via Text Encoder Fine-Tuning

The paper "TextCraftor: Your Text Encoder Can be Image Quality Controller" asks whether text-to-image diffusion models can be improved by fine-tuning the pre-trained text encoder rather than replacing it with a larger language model. The proposed framework, TextCraftor, targets both image quality and text-image alignment.

Framework and Techniques

The core contribution is TextCraftor, a framework that fine-tunes the existing CLIP text encoder in place instead of swapping in a model such as T5. The authors report that this alone yields significant improvements in image generation quality. Fine-tuning is guided by reward functions, such as aesthetic predictors and text-image alignment models, that score the generated images. Because the rewards are integrated in a differentiable manner, training needs only text prompts and avoids the need for large text-image paired datasets.
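The idea of updating only the text encoder from a differentiable reward can be illustrated with a deliberately tiny toy. Everything in the sketch below is a hypothetical stand-in and not the paper's implementation: `enc` plays the role of the trainable text encoder, `gen` the frozen image generator, and `reward` a differentiable scorer that prefers image features close to `target`. The gradients are derived by hand so the example needs no external library.

```python
# Toy sketch only: linear maps stand in for the text encoder and the generator.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def reward(image_feat, target):
    """Differentiable stand-in for a reward model: larger when closer to target."""
    return -sum((g - t) ** 2 for g, t in zip(image_feat, target))

def finetune_encoder(enc, gen, prompt, target, lr=0.05, steps=200):
    """Gradient ascent on the encoder weights; the generator is never updated."""
    enc = [row[:] for row in enc]  # copy so the pre-trained weights are untouched
    for _ in range(steps):
        emb = matvec(enc, prompt)  # trainable text encoder
        img = matvec(gen, emb)     # frozen generator
        # d(reward)/d(img) for the squared-distance reward above
        d_img = [-2.0 * (g - t) for g, t in zip(img, target)]
        # Backpropagate through the frozen generator: d_emb = gen^T @ d_img
        d_emb = [sum(gen[i][j] * d_img[i] for i in range(len(gen)))
                 for j in range(len(emb))]
        # d(reward)/d(enc[j][k]) = d_emb[j] * prompt[k]; step uphill on the reward
        for j in range(len(enc)):
            for k in range(len(prompt)):
                enc[j][k] += lr * d_emb[j] * prompt[k]
    return enc

pretrained = [[1.0, 0.0], [0.0, 1.0]]
generator = [[1.0, 0.5], [0.0, 1.0]]
prompt = [1.0, 0.5]
target = [2.0, -1.0]

before = reward(matvec(generator, matvec(pretrained, prompt)), target)
tuned = finetune_encoder(pretrained, generator, prompt, target)
after = reward(matvec(generator, matvec(tuned, prompt)), target)
print(before, after)
```

Note that only a prompt and a reward are required here; no reference image appears anywhere in the loop, which mirrors the prompt-only training described above.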

TextCraftor trains end to end, combining these reward functions with a novel alignment constraint. The constraint preserves the capabilities of the pre-trained CLIP text encoder, so the fine-tuned model stays general and continues to handle a broad range of prompts. Because the fine-tuned encoder takes the place of the original one, the approach adds no computational or storage overhead, which matters given the size of modern deep learning models.
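One way to picture how several rewards and an alignment term might be folded into a single training signal is a weighted sum. The sketch below is an assumption made for illustration: the function names, the weights, and the choice of cosine similarity between a text embedding and an image embedding as the alignment term are not taken from the summary above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def total_reward(reward_scores, reward_weights, text_emb, image_emb,
                 constraint_weight=1.0):
    """Weighted sum of reward scores plus a hypothetical alignment term."""
    weighted = sum(w * r for w, r in zip(reward_weights, reward_scores))
    alignment = cosine_similarity(text_emb, image_emb)
    return weighted + constraint_weight * alignment

# Two illustrative reward scores (e.g. aesthetics and alignment) with weights.
aligned = total_reward([0.5, 0.25], [1.0, 2.0], [1.0, 0.0], [1.0, 0.0])
misaligned = total_reward([0.5, 0.25], [1.0, 2.0], [1.0, 0.0], [0.0, 1.0])
print(aligned, misaligned)
```

With identical reward scores, the sample whose image embedding points away from the text embedding receives a lower total, which is the kind of pressure an alignment term is meant to apply.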

Experimental Results

In the reported experiments, TextCraftor improves on its baselines. On the Parti-Prompt and HPSv2 benchmark datasets, it outperforms not only the baseline Stable Diffusion models (SDv1.5 and SDv2.0) but also larger models such as SDXL Base 0.9 and DeepFloyd-XL. It also achieves better text alignment and image quality scores than methods based on automatic prompt engineering and than reinforcement learning approaches such as DDPO.

These improvements are measured with automated metrics across several benchmarks and confirmed by human assessments. The paper also shows that TextCraftor is complementary to UNet fine-tuning: the two can be combined to improve generative quality further.

Implications and Future Directions

TextCraftor has both theoretical and practical implications. Theoretically, it shifts attention from improving only the UNet to improving the text encoding stage as well, which opens a further direction for research on model optimization and architecture design. Practically, it shows that fine-tuning an existing component can deliver significant gains without additional computational burden, which suggests that similar techniques may apply in other machine learning domains.

As reward models continue to improve, more sophisticated and nuanced reward functions could be integrated to raise the quality of diffusion models further. In addition, interpolating between text encoders fine-tuned with different rewards offers a route to controllable and diverse image generation, giving users greater creative flexibility.
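A minimal sketch of what interpolating between two fine-tuned text encoders could look like follows. It assumes plain linear interpolation of matching weight tensors, represented here as flat lists keyed by hypothetical parameter names. The summary does not specify how the interpolation is performed, so this is only one plausible reading.

```python
def interpolate_encoders(weights_a, weights_b, alpha):
    """Blend two encoders' weights: alpha=0 returns A, alpha=1 returns B."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both encoders must have the same parameter names")
    blended = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return blended

# Hypothetical encoders fine-tuned with two different rewards.
aesthetic_encoder = {"layer0.weight": [0.0, 4.0]}
alignment_encoder = {"layer0.weight": [4.0, 0.0]}

mixed = interpolate_encoders(aesthetic_encoder, alignment_encoder, alpha=0.25)
print(mixed)
```

Sweeping `alpha` between 0 and 1 would then act as a single control knob that trades off the behaviour learned from one reward against the other.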
