An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation (2405.12914v2)

Published 21 May 2024 in cs.CV

Abstract: One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can only encode English, with a maximum token length of 77. Moreover, the capacity of CLIP's text encoder is relatively limited compared to LLMs, which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve language understanding in text-to-image generation. Unfortunately, training a text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates an existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual input but also longer input context, with superior image generation quality.
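
To make the 77-token limit mentioned in the abstract concrete, the following is a minimal sketch of how a fixed-length encoder window discards the tail of a long prompt. It is not the paper's code: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for CLIP's real subword tokenizer, and the `reserved=2` default is an assumption that two positions are set aside for start and end markers.

```python
CLIP_CONTEXT_LENGTH = 77  # maximum token length stated in the abstract


def toy_tokenize(prompt):
    # Hypothetical stand-in: whitespace split, NOT CLIP's real subword tokenizer.
    return prompt.split()


def truncate_to_context(tokens, context_length=CLIP_CONTEXT_LENGTH, reserved=2):
    """Split tokens into (kept, dropped) for a fixed-length encoder window.

    `reserved` is an assumption: positions set aside for start/end markers.
    """
    budget = max(context_length - reserved, 0)
    return tokens[:budget], tokens[budget:]


# A 100-word prompt: everything past the window never reaches the encoder.
prompt = " ".join(f"word{i}" for i in range(100))
kept, dropped = truncate_to_context(toy_tokenize(prompt))
print(f"kept {len(kept)} tokens, dropped {len(dropped)} tokens")
```

Under these assumptions, any detail that appears late in a long prompt is simply lost before encoding, which is the limitation that motivates using a longer-context LLM as the text encoder.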

Authors (8)
  1. Zhiyu Tan (26 papers)
  2. Mengping Yang (11 papers)
  3. Luozheng Qin (6 papers)
  4. Hao Yang (328 papers)
  5. Ye Qian (2 papers)
  6. Qiang Zhou (123 papers)
  7. Cheng Zhang (388 papers)
  8. Hao Li (803 papers)
Citations (1)