Expressive Text-to-Image Generation with Rich Text
The paper "Expressive Text-to-Image Generation with Rich Text" introduces a novel approach to text-to-image synthesis by leveraging rich text attributes to refine and control the generation process beyond the capabilities of plain text interfaces. The fundamental limitation of existing text-to-image models is their reliance solely on plain text prompts, which restricts user control over specific stylistic and qualitative aspects of the generated images. Addressing these limitations, this paper proposes utilizing rich-text editors, which provide formatting options that can encode more detailed information, such as font styles, colors, sizes, images, and footnotes, for precise image generation.
Methodology
The research introduces a two-step methodological framework to harness rich text attributes for enhanced image synthesis. The first step involves associating text with spatial image regions by extracting attention maps from pretrained diffusion models. These cross- and self-attention maps act as guides to segment the generated image into distinct regions, each corresponding to specific token spans in the text. The second step employs a region-based diffusion process to incorporate rich-text attributes, such as local style application, precise color rendering, detailed region descriptions, and token reweighting based on font sizes, to generate images faithfully representing the user's requirements.
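As a rough illustration of the first step, the following sketch (a minimal version, not the paper's exact procedure) averages cross-attention maps over heads, pools them into one score per rich-text token span, smooths the scores with a self-attention map, and assigns each spatial location to its highest-scoring span. The tensor shapes, the number of propagation steps, and the `token_spans` grouping are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def token_maps_from_attention(
    cross_attn,          # (heads, H*W, num_tokens): cross-attention probabilities
    self_attn,           # (heads, H*W, H*W): self-attention probabilities
    token_spans,         # list of (start, end) token index ranges, one per rich-text span
    self_attn_steps=2,   # number of smoothing passes (assumed, not from the paper)
):
    """Assign each spatial location to the token span it attends to most."""
    # Average attention over heads.
    ca = cross_attn.mean(dim=0)              # (H*W, num_tokens)
    sa = self_attn.mean(dim=0)               # (H*W, H*W)

    # Pool cross-attention columns into one score per token span.
    span_scores = torch.stack(
        [ca[:, s:e].mean(dim=1) for (s, e) in token_spans], dim=1
    )                                        # (H*W, num_spans)

    # Propagate scores with self-attention so that locations attending to each
    # other end up with similar span scores (a simple smoothing step).
    for _ in range(self_attn_steps):
        span_scores = sa @ span_scores

    # Hard-assign every location to its highest-scoring span.
    labels = span_scores.argmax(dim=1)       # (H*W,)
    hw = int(ca.shape[0] ** 0.5)             # assumes a square spatial grid
    masks = F.one_hot(labels, num_classes=len(token_spans)).float()
    return masks.T.reshape(len(token_spans), hw, hw)
```

Here the self-attention propagation acts as a smoothing prior: pixels that attend to one another tend to land in the same region, which generally yields cleaner masks than thresholding the cross-attention maps alone.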
Key Applications
- Local Style Control: Applying specific artistic styles to distinct image regions via font style attributes, enabling a heterogeneously stylized image, such as different painting styles for different objects in one scene.
- Precise Color Control: Using font colors, users can specify exact RGB values or descriptive color names to render accurate hues, addressing the shortcomings of text encoders in interpreting complex or non-standard color descriptions (a minimal color-guidance sketch follows this list).
- Region-based Detailed Description: Footnotes let users attach detailed descriptions to specific image regions without running into the prompt-length limits that hamper long plain-text prompts, enabling more effective synthesis of complex scenes with multiple objects and attributes.
- Explicit Token Reweighting: Font sizes adjust the prominence or scale of specific objects within the image by modulating the cross-attention maps, so that reweighting aligns with the desired aesthetic or thematic emphasis (see the reweighting sketch after this list).
- Embedded Image Guidance: Visual references can be embedded directly in the prompt, allowing users to steer generation toward a reference concept with high fidelity, which is particularly useful for personalized or custom image synthesis.
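To make the precise-color item above concrete, one way to realize it is gradient guidance: nudge the diffusion latent so that the average color of the target region in the decoded (or predicted-clean) image moves toward the requested RGB value. The sketch below is a minimal version of that idea; the differentiable `decode` function, the squared-error loss, and the step size are assumptions rather than the paper's exact formulation.

```python
import torch

def color_guidance_step(latent, decode, region_mask, target_rgb, step_size=0.1):
    """
    latent:      (1, C, h, w) current diffusion latent.
    decode:      differentiable function mapping latent -> image in [0, 1],
                 shape (1, 3, H, W) (assumed available, e.g. a VAE decoder).
    region_mask: (1, 1, H, W) soft or binary mask of the region to recolor.
    target_rgb:  (3,) tensor, target color in [0, 1].
    Returns an updated latent pushed toward the target region color.
    """
    latent = latent.detach().requires_grad_(True)
    image = decode(latent)

    # Average color of the masked region.
    mask_sum = region_mask.sum().clamp_min(1e-8)
    region_mean = (image * region_mask).sum(dim=(0, 2, 3)) / mask_sum  # (3,)

    # Squared distance to the requested color, backpropagated to the latent.
    loss = ((region_mean - target_rgb.to(image)) ** 2).sum()
    grad = torch.autograd.grad(loss, latent)[0]

    return (latent - step_size * grad).detach()
```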
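The font-size reweighting item can likewise be illustrated with a small helper that scales selected columns of a softmaxed cross-attention map and renormalizes so each row still sums to one. The per-token weight convention (1.0 = unchanged) and the tensor layout are assumptions for illustration.

```python
import torch

def reweight_cross_attention(attn_probs, token_weights):
    """
    attn_probs:    (batch_heads, H*W, num_tokens) softmaxed cross-attention.
    token_weights: (num_tokens,) multiplier per token, e.g. derived from font
                   size (1.0 = unchanged, >1 emphasizes, <1 de-emphasizes).
    Returns reweighted attention that still sums to 1 over the token axis.
    """
    weighted = attn_probs * token_weights.view(1, 1, -1)
    return weighted / weighted.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```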
Experimental Evaluation
The paper rigorously evaluates the proposed method against established baselines such as Prompt-to-Prompt and InstructPix2Pix across these applications, showing advantages in precise color matching, distinct local styles, and correct object representation in complex scenes. Quantitative metrics, including color distance and local CLIP scores, are used to assess fidelity and alignment between text attributes and the generated images, demonstrating that the method adheres to user specifications better than the baseline approaches.
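As a concrete reading of the two metrics mentioned above, the sketch below computes (a) a simple color distance between a masked region's average color and the target RGB, and (b) a "local CLIP score" as the cosine similarity between a cropped region and its region description, using the Hugging Face transformers CLIP interface. The exact metric definitions in the paper may differ; the checkpoint name and the crop-based region strategy here are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def color_distance(image, mask, target_rgb):
    """Euclidean RGB distance between a masked region's mean color and a target.
    image: (3, H, W) float tensor in [0, 1]; mask: (H, W) binary; target_rgb: (3,) tensor."""
    region_mean = (image * mask).sum(dim=(1, 2)) / mask.sum().clamp_min(1e-8)
    return torch.linalg.norm(region_mean - target_rgb).item()

def local_clip_score(image_pil, box, region_prompt,
                     model_name="openai/clip-vit-base-patch32"):
    """Cosine similarity between a cropped region and its region description."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    crop = image_pil.crop(box)  # box = (left, upper, right, lower)
    inputs = processor(text=[region_prompt], images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()
```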
Implications and Future Work
This research has significant implications for text-to-image synthesis, offering users greater accessibility and control across creative and practical domains. The integration of rich-text attributes points toward further exploration of formatting options such as hyperlinks, bullet points, and other text features that could provide even more granular control over image generation. Future work may extend these ideas to other content modalities and generative models beyond images, potentially influencing interactive design, virtual environment creation, and personalized media generation.
In conclusion, this paper establishes a transformative approach to text-to-image synthesis by embedding intuitive, rich-text controls into generative models, significantly enhancing user agency in digital content creation processes.