AnyText: Multilingual Visual Text Generation and Editing
The paper introduces AnyText, a diffusion-based approach to multilingual visual text generation and editing. While text-to-image synthesis has made significant strides, rendering legible, high-quality text within generated images remains a notable challenge. Current methods often fail to produce clear and accurate text, which is critical in applications such as advertising and digital art. AnyText addresses these shortcomings by emphasizing accurate text rendering, particularly in multilingual contexts.
Methodology
AnyText consists of two primary modules within its diffusion pipeline: an auxiliary latent module and a text embedding module. Together, these modules enable accurate text generation and editing, detailed as follows:
Auxiliary Latent Module: This module integrates information such as text glyphs, positions, and masked images into a latent representation. The glyphs are rendered using a uniform font, while the position markers specify the locations of text regions. The masked image denotes areas that should be preserved during the text generation process. The module employs a series of convolutional layers to transform this auxiliary information into a latent feature map that aids in the accurate rendering of text.
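To make this concrete, below is a minimal PyTorch sketch of how such a fusion might be wired up, assuming separate convolutional stems for the glyph image, position mask, and masked image whose outputs are concatenated and projected into a single latent feature map. The class name `AuxiliaryLatentModule`, channel counts, and layer depths are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AuxiliaryLatentModule(nn.Module):
    """Illustrative sketch: fuses glyph, position, and masked-image inputs
    into one latent feature map. Channel sizes and layer counts are
    assumptions, not the paper's exact architecture."""

    def __init__(self, latent_channels: int = 320):
        super().__init__()
        # One downsampling stem per auxiliary input: rendered glyphs,
        # position mask of text regions, and the masked source image.
        self.glyph_stem = self._make_stem(in_ch=1)
        self.position_stem = self._make_stem(in_ch=1)
        self.masked_img_stem = self._make_stem(in_ch=3)
        # Fuse the three streams into a single latent feature map.
        self.fuse = nn.Conv2d(3 * 64, latent_channels, kernel_size=3, padding=1)

    @staticmethod
    def _make_stem(in_ch: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
        )

    def forward(self, glyph, position, masked_image):
        feats = torch.cat(
            [self.glyph_stem(glyph),
             self.position_stem(position),
             self.masked_img_stem(masked_image)],
            dim=1,
        )
        return self.fuse(feats)


# Example: 512x512 inputs produce a 128x128 latent feature map.
module = AuxiliaryLatentModule()
z_aux = module(torch.randn(1, 1, 512, 512),   # rendered glyphs (uniform font)
               torch.randn(1, 1, 512, 512),   # position mask of text regions
               torch.randn(1, 3, 512, 512))   # masked image (regions to keep)
print(z_aux.shape)  # torch.Size([1, 320, 128, 128])
```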
Text Embedding Module: This module leverages an Optical Character Recognition (OCR) model to encode the rendered glyphs (stroke-level information) into embeddings, which are then fused with the image caption embeddings from the tokenizer. This helps the generated text blend seamlessly with the background while maintaining high readability and fidelity.
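A minimal sketch of the idea follows, assuming a frozen OCR backbone whose per-line features are linearly projected into the caption embedding space and written into the placeholder text-token positions. The names `TextEmbeddingModule` and `ocr_backbone`, as well as the dimensions, are hypothetical; the paper's actual OCR encoder and fusion details may differ.

```python
import torch
import torch.nn as nn

class TextEmbeddingModule(nn.Module):
    """Illustrative sketch: maps OCR features of rendered glyph images into
    the caption-embedding space, so the corresponding text tokens can carry
    glyph-aware embeddings. Backbone and dimensions are assumptions."""

    def __init__(self, ocr_backbone: nn.Module, ocr_dim: int = 512, emb_dim: int = 768):
        super().__init__()
        self.ocr_backbone = ocr_backbone          # frozen OCR feature extractor
        self.proj = nn.Linear(ocr_dim, emb_dim)   # project into caption space

    def forward(self, caption_embeddings, text_token_mask, glyph_images):
        # Encode each rendered text line with the OCR model.
        ocr_feats = self.ocr_backbone(glyph_images)     # (num_lines, ocr_dim)
        glyph_embeddings = self.proj(ocr_feats)          # (num_lines, emb_dim)
        # Overwrite the placeholder text-token positions in the caption
        # embeddings with the glyph-aware embeddings.
        out = caption_embeddings.clone()
        out[text_token_mask] = glyph_embeddings
        return out


# Example with a dummy OCR backbone standing in for a real recognizer.
dummy_ocr = nn.Sequential(nn.Flatten(), nn.Linear(1 * 32 * 128, 512))
module = TextEmbeddingModule(dummy_ocr)
caption_emb = torch.randn(1, 77, 768)             # caption token embeddings
mask = torch.zeros(1, 77, dtype=torch.bool)
mask[0, 5:7] = True                                # two placeholder text tokens
glyphs = torch.randn(2, 1, 32, 128)                # two rendered text lines
print(module(caption_emb, mask, glyphs).shape)     # torch.Size([1, 77, 768])
```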
Additionally, training combines a text-control diffusion loss with a novel text perceptual loss. The text-control diffusion loss is the usual denoising objective computed with the auxiliary text conditions in place, while the text perceptual loss further refines the rendered text by supervising the text regions of the generated image at the pixel level in image space.
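A hedged sketch of how the two terms might be combined is given below; it assumes the perceptual term compares OCR-encoder features of the masked text regions and uses a small weight `lambda_perceptual`, both of which are illustrative choices rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def training_loss(noise_pred, noise, x0_pred, x0_target, text_region_mask,
                  ocr_encoder, lambda_perceptual: float = 0.01):
    """Illustrative two-part objective: a standard noise-prediction
    (text-control diffusion) loss plus an OCR-feature-based text perceptual
    loss on the text regions. Weighting and feature choice are assumptions."""
    # Text-control diffusion loss: predict the added noise, with the
    # auxiliary latent and glyph-aware embeddings injected upstream.
    diffusion_loss = F.mse_loss(noise_pred, noise)

    # Text perceptual loss: compare OCR features of the predicted and target
    # images, restricted to the text regions, in image space.
    pred_text = x0_pred * text_region_mask
    target_text = x0_target * text_region_mask
    perceptual_loss = F.mse_loss(ocr_encoder(pred_text), ocr_encoder(target_text))

    return diffusion_loss + lambda_perceptual * perceptual_loss
```

Keeping the perceptual weight small is a common design choice: the auxiliary term sharpens the rendered strokes without destabilizing the underlying diffusion objective.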
Dataset and Benchmark
The authors also introduce AnyWord-3M, a large-scale multilingual text image dataset of 3 million image-text pairs annotated with OCR data, covering languages such as Chinese, English, Japanese, and Korean. Dataset preparation involved rigorous filtering rules to ensure high-quality annotations and diverse text conditions. The authors further propose AnyText-benchmark, an evaluation set designed to assess the accuracy and quality of generated visual text.
Evaluation and Results
The authors conducted extensive evaluations against existing methods like ControlNet, TextDiffuser, and GlyphControl. Notably, AnyText demonstrated superior performance across several metrics:
- Sentence Accuracy (Sen. ACC)
- Normalized Edit Distance (NED)
- Fréchet Inception Distance (FID)
The metrics revealed that AnyText substantially outperforms competing models in both English and Chinese text generation tasks. In particular, the model achieves over 66% Sentence Accuracy for Chinese text generation, a notable feat given the complexities involved in rendering Chinese characters.
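For readers unfamiliar with the first two metrics, the sketch below shows one common way to compute them from OCR readouts of the generated images: sentence accuracy counts exact matches against the prompted text, and NED is the Levenshtein distance normalized by string length (some papers report its complement as a similarity). This is a generic reference implementation, not the benchmark's official evaluation code.

```python
def normalized_edit_distance(pred: str, target: str) -> float:
    """Levenshtein distance normalized by the longer string's length."""
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 0.0
    dp = list(range(n + 1))  # distances for the empty prediction prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (pred[i - 1] != target[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, n)


def evaluate(ocr_predictions: list[str], ground_truths: list[str]):
    """Sentence accuracy (exact match of OCR output against the prompt text)
    and mean normalized edit distance over an evaluation set."""
    total = len(ground_truths)
    sen_acc = sum(p == t for p, t in zip(ocr_predictions, ground_truths)) / total
    mean_ned = sum(normalized_edit_distance(p, t)
                   for p, t in zip(ocr_predictions, ground_truths)) / total
    return sen_acc, mean_ned


print(evaluate(["hello world", "任意文本"], ["hello world", "任意文字"]))
```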
Implications and Future Developments
The development of AnyText is a significant step forward in the field of visual text generation, offering a solution that can render legible text in multiple languages with high fidelity. This model can be seamlessly integrated with existing diffusion models, broadening its applicability across various domains such as digital content creation, advertising, and interactive media.
The practical implications of this research are substantial. By resolving the persistent issue of poor text rendering in generated images, AnyText can enhance the utility of text-to-image models in real-world applications. It provides an improved foundation for further developments in the field, particularly in multilingual contexts where text complexity and diversity present significant challenges.
Future research directions could explore:
- Enhancements in text rendering for highly stylized or extremely small fonts.
- Expansion into additional languages and character sets.
- Further refinement of the perceptual loss to enhance text style consistency with complex backgrounds.
- Integration with more advanced OCR models to improve text recognition and generation accuracy.
In conclusion, AnyText marks a significant progression in the domain of visual text generation and editing, providing robust solutions to long-standing challenges in the field. The integration of sophisticated modules and the novel dataset AnyWord-3M underline the potential for future advancements and practical applications.