Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training
The research paper "Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training" introduces a set of methods for improving the ability of diffusion-based text-to-image models to generate legible visual text. These models excel at producing diverse, aesthetically pleasing images but struggle with precise text rendering, often producing illegible or misspelled visual text.
Key Contributions
The paper identifies two primary challenges constraining the performance of existing backbone models:
- Byte Pair Encoding (BPE) Tokenization: BPE splits words into subword tokens, so the model must assemble complete visual words from the embeddings of several subwords, which makes accurate text rendering substantially harder (see the sketch after this list).
- Insufficient Learning of Cross-Attention Modules: Under-trained cross-attention modules weaken the model's ability to bind rendered text regions to their corresponding text tokens.
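To illustrate the first challenge, the snippet below (not from the paper) shows how a CLIP-style BPE tokenizer, assumed here via Hugging Face `transformers`, splits words into subword pieces; the exact splits depend on the tokenizer's vocabulary.

```python
# Illustrative only: a CLIP-style BPE tokenizer breaks words into subword
# pieces, and the diffusion model must re-assemble them into one rendered word.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for word in ["hello", "typography", "ChineseDrawText"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r} -> {pieces}")
    # Rare or compound words typically split into several BPE pieces, so the
    # glyphs of one visual word are conditioned on embeddings of multiple tokens.
```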
To address these challenges, the authors propose several enhancements:
- Mixed Granularity Input Strategy: This strategy improves text representations by treating entire glyph words as units, avoiding the complications introduced by subword tokenization. Intermediate features extracted from an OCR model serve as additional text embeddings, so the input carries both BPE-level and glyph-level information.
- Glyph-Aware Training Losses: The paper introduces three novel losses (a minimal code sketch follows this list):
  - Attention Alignment Loss: Refines cross-attention maps so that visual texts are more tightly associated with their corresponding text tokens.
  - Local MSE Loss: Up-weights the reconstruction error inside visual-text regions, focusing training on glyph areas.
  - OCR Recognition Loss: Feeds OCR recognition feedback on the generated image back into training to encourage legible, correctly spelled text.
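The following PyTorch sketch shows one way the three losses could be implemented. The function names, tensor shapes, mask construction, and loss weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def local_mse_loss(pred_noise, target_noise, text_region_mask, weight=2.0):
    """Up-weight the denoising error inside visual-text regions.

    pred_noise / target_noise: (B, C, H, W) predicted and ground-truth noise.
    text_region_mask:          (B, 1, H, W) binary glyph-area mask (assumed to
                               come from OCR boxes projected into latent space).
    """
    per_pixel = (pred_noise - target_noise) ** 2
    weights = 1.0 + (weight - 1.0) * text_region_mask  # 1 outside text, `weight` inside
    return (weights * per_pixel).mean()


def attention_alignment_loss(cross_attn, text_token_mask, glyph_region_mask):
    """Encourage glyph tokens' cross-attention to concentrate on glyph pixels.

    cross_attn:        (B, T, H*W) attention of each text token over spatial
                       positions, assumed normalized over the spatial dimension.
    text_token_mask:   (B, T) 1 for tokens that correspond to visual text.
    glyph_region_mask: (B, H*W) 1 for positions inside the rendered text.
    """
    # Attention mass each token places inside the glyph region.
    inside = (cross_attn * glyph_region_mask.unsqueeze(1)).sum(dim=-1)  # (B, T)
    # Minimize the mass that glyph tokens place outside their glyph region.
    loss = (1.0 - inside) * text_token_mask
    return loss.sum() / text_token_mask.sum().clamp(min=1)


def ocr_recognition_loss(ocr_logits, target_char_ids, blank_id=0):
    """CTC-style recognition loss, assuming an OCR recognizer is run on the
    decoded prediction (e.g. the predicted x0 passed through the VAE decoder).

    ocr_logits:      (L, B, num_classes) recognizer logits over time steps.
    target_char_ids: (B, S) target character indices, padded with `blank_id`.
    """
    log_probs = F.log_softmax(ocr_logits, dim=-1)
    input_lengths = torch.full((ocr_logits.size(1),), ocr_logits.size(0), dtype=torch.long)
    target_lengths = (target_char_ids != blank_id).sum(dim=1)
    return F.ctc_loss(log_probs, target_char_ids, input_lengths, target_lengths, blank=blank_id)


# Illustrative combined objective (weights are placeholders, not the paper's):
# total = diffusion_mse + 0.5 * local_mse + 0.1 * attn_align + 0.01 * ocr_rec
```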
Experimental Evaluation
The proposed methods are evaluated on the ChineseDrawText benchmark and show measurable improvements over existing backbone models such as SDXL and SDXL-Turbo. Models trained with the authors' methods achieve higher OCR accuracy and CLIP scores, supporting the effectiveness of the approach.
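For context, a simplified sketch of the two reported metric types is shown below: OCR accuracy as exact-match between recognized and target strings (with a placeholder `recognize` callable standing in for the benchmark's OCR engine), and CLIP score via the Hugging Face CLIP model. The model name and the exact accuracy definition are assumptions for illustration, not the benchmark's official scripts.

```python
# Simplified metric sketch (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image: Image.Image, prompt: str) -> float:
    """Image-text similarity scaled by CLIP's logit scale."""
    inputs = clip_processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip_model(**inputs)
    return out.logits_per_image.item()


def ocr_accuracy(images, target_texts, recognize) -> float:
    """Fraction of images whose recognized text exactly matches the target.

    `recognize` is a placeholder callable wrapping an OCR engine
    (e.g. a Chinese text recognizer for ChineseDrawText).
    """
    correct = sum(
        recognize(img).strip() == tgt.strip()
        for img, tgt in zip(images, target_texts)
    )
    return correct / max(len(target_texts), 1)
```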
Implications and Future Directions
Practically, this research holds promise for applications that rely on visual text within generated images, such as advertising graphics and digital content creation tools. Theoretically, it points to ways of improving text-image alignment and text rendering in diffusion models.
Future explorations could consider the scalability of this approach across different languages and even more complex styles of text presentation. Additionally, integrating these methodologies into real-time image-to-image translation tasks could amplify their utility.
In conclusion, the paper provides a comprehensive strategy for mitigating the weaknesses of current diffusion-based models in visual text generation, contributing both improved training methodology and useful groundwork for subsequent research and development.