AnyText: Multilingual Visual Text Generation And Editing (2311.03054v5)

Published 6 Nov 2023 in cs.CV

Abstract: Diffusion-based text-to-image generation has made impressive progress recently. Although current techniques can synthesize images with high fidelity, the text regions of generated images can still give the show away. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model to encode stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate text that integrates seamlessly with the background. We employ a text-control diffusion loss and a text perceptual loss during training to further enhance writing accuracy. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community to render or edit text accurately. After extensive evaluation experiments, our method outperforms all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on the AnyWord-3M dataset, we propose AnyText-benchmark for evaluating visual text generation accuracy and quality. Our project will be open-sourced at https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.

AnyText: Multilingual Visual Text Generation and Editing

The paper introduces AnyText, a diffusion-based model for multilingual visual text generation and editing. While text-to-image synthesis has made significant strides, rendering legible, high-quality text within generated images remains a notable challenge. Current methods often fail to deliver clear and accurate text, which is critical in applications such as advertising and digital art. AnyText addresses these shortcomings by emphasizing accurate text rendering, particularly in multilingual contexts.

Methodology

AnyText consists of two primary modules within its diffusion pipeline: an auxiliary latent module and a text embedding module. Together, these modules provide the model's text generation and editing capabilities, detailed as follows:

Auxiliary Latent Module: This module integrates information such as text glyphs, positions, and masked images into a latent representation. The glyphs are rendered using a uniform font, while the position markers specify the locations of text regions. The masked image denotes areas that should be preserved during the text generation process. The module employs a series of convolutional layers to transform this auxiliary information into a latent feature map that aids in the accurate rendering of text.
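
A minimal sketch of how such an auxiliary branch could be wired up is shown below, assuming all three inputs have been resized to a common resolution; the class name, channel counts, and layer depths are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AuxiliaryLatentEncoder(nn.Module):
    """Fuses glyph, position, and masked-image inputs into one latent feature map.

    Illustrative sketch: channel counts and depths are assumptions, not the
    paper's exact configuration.
    """
    def __init__(self, latent_channels: int = 320):
        super().__init__()
        # Each auxiliary input gets its own lightweight convolutional stem.
        self.glyph_stem = self._stem(in_ch=1)     # rendered glyphs (grayscale)
        self.position_stem = self._stem(in_ch=1)  # binary mask marking text regions
        self.masked_stem = self._stem(in_ch=3)    # masked image (regions to preserve)
        # Fusion conv maps the concatenated features to the channel width expected
        # by a ControlNet-style conditioning branch of the diffusion model.
        self.fuse = nn.Conv2d(3 * 64, latent_channels, kernel_size=3, padding=1)

    @staticmethod
    def _stem(in_ch: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
        )

    def forward(self, glyph, position, masked_image):
        # Assumes the three inputs share the same spatial resolution, so the
        # stems produce spatially aligned feature maps.
        feats = [self.glyph_stem(glyph),
                 self.position_stem(position),
                 self.masked_stem(masked_image)]
        return self.fuse(torch.cat(feats, dim=1))
```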

Text Embedding Module: This module leverages an Optical Character Recognition (OCR) model to encode stroke data into embeddings. These embeddings are then integrated with image caption embeddings derived from a tokenizer. This approach ensures the generated text blends seamlessly with the background, maintaining high readability and fidelity.
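
The sketch below illustrates the general idea of slotting OCR-derived glyph features into the caption token embeddings; the `ocr_encoder`, the placeholder-token mechanism, and the dimensions are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class TextEmbeddingFusion(nn.Module):
    """Replaces placeholder tokens in the caption embedding with OCR-derived
    glyph embeddings, so the prompt carries stroke-level information.

    `ocr_encoder` is assumed to be a frozen recognition backbone whose pooled
    features we project to the caption embedding width.
    """
    def __init__(self, ocr_encoder: nn.Module, ocr_dim: int = 512, caption_dim: int = 768):
        super().__init__()
        self.ocr_encoder = ocr_encoder
        self.proj = nn.Linear(ocr_dim, caption_dim)

    def forward(self, caption_embeds, glyph_images, placeholder_mask):
        # caption_embeds:   (B, T, caption_dim) token embeddings of the image caption
        # glyph_images:     (N, 1, H, W) rendered glyph lines, one per placeholder
        # placeholder_mask: (B, T) boolean, True where a placeholder token sits;
        #                   the number of True entries is assumed to equal N.
        glyph_feats = self.ocr_encoder(glyph_images)   # (N, ocr_dim), assumed pooled output
        glyph_embeds = self.proj(glyph_feats)          # (N, caption_dim)
        fused = caption_embeds.clone()
        fused[placeholder_mask] = glyph_embeds         # slot glyph info into the prompt
        return fused
```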

Additionally, the training process incorporates text-control diffusion loss and a novel text perceptual loss. The text-control diffusion loss ensures coherence in the text generation, while the text perceptual loss further refines the accuracy of the text by focusing on pixel-level details in the image space.
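
How the two objectives might be combined is sketched below; the weighting factor and the choice of OCR feature layer are assumptions.

```python
import torch.nn.functional as F

def anytext_style_loss(eps_pred, eps_true, ocr_feats_pred, ocr_feats_gt, lam=0.01):
    """Combine the noise-prediction objective with an OCR-feature perceptual term.

    The text-control diffusion loss is the usual MSE between predicted and true
    noise, computed with the auxiliary conditions injected; the text perceptual
    loss compares OCR features of the text region in the (approximately) denoised
    image against those of the ground truth. `lam` is an assumed weighting factor.
    """
    diffusion_loss = F.mse_loss(eps_pred, eps_true)
    perceptual_loss = F.mse_loss(ocr_feats_pred, ocr_feats_gt)
    return diffusion_loss + lam * perceptual_loss
```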

Dataset and Benchmark

The paper also introduces AnyWord-3M, a large-scale multilingual text image dataset comprising 3 million image-text pairs annotated with OCR data. This dataset underpins AnyText's performance across multiple languages, including Chinese, English, Japanese, and Korean; its preparation involved rigorous filtering rules to ensure high-quality annotations and diverse text conditions. The authors also propose AnyText-benchmark, an evaluation set designed to assess the accuracy and quality of generated visual text.
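
For illustration, a record in such an OCR-annotated dataset and a simple filtering rule might look as follows; the field names and thresholds are hypothetical, not the released AnyWord-3M schema.

```python
# Hypothetical record layout for an OCR-annotated image-text pair.
sample = {
    "img_path": "images/000123.jpg",
    "caption": "a storefront sign photographed at dusk",
    "language": "zh",
    "annotations": [
        {
            "text": "咖啡",                      # transcription of one text line
            "polygon": [[120, 40], [260, 40],    # quadrilateral around the line
                        [260, 95], [120, 95]],
        },
    ],
}

def keep(record, min_box_height=30, max_lines=5):
    """Example filtering rule: drop images whose text is too small or too dense."""
    heights = [max(p[1] for p in a["polygon"]) - min(p[1] for p in a["polygon"])
               for a in record["annotations"]]
    return (len(record["annotations"]) <= max_lines
            and all(h >= min_box_height for h in heights))
```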

Evaluation and Results

The authors conducted extensive evaluations against existing methods like ControlNet, TextDiffuser, and GlyphControl. Notably, AnyText demonstrated superior performance across several metrics:

  • Sentence Accuracy (Sen. ACC)
  • Normalized Edit Distance (NED)
  • Fréchet Inception Distance (FID)

The metrics revealed that AnyText substantially outperforms competing models in both English and Chinese text generation tasks. In particular, the model achieves over 66% Sentence Accuracy for Chinese text generation, a notable feat given the complexities involved in rendering Chinese characters.
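
As a rough illustration of how the recognition-based metrics are computed from OCR read-backs of the generated text regions (FID is computed separately over image features), consider the sketch below; the exact normalization used in the benchmark may differ.

```python
def sentence_accuracy(predicted: list[str], target: list[str]) -> float:
    """Fraction of generated text lines that the OCR engine reads back exactly."""
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)


def normalized_edit_distance(pred: str, target: str) -> float:
    """Similarity score: 1 - Levenshtein(pred, target) / max(len(pred), len(target))."""
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 1.0
    dp = list(range(n + 1))                      # row for the empty prefix of `pred`
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,               # deletion
                        dp[j - 1] + 1,           # insertion
                        prev + (pred[i - 1] != target[j - 1]))  # substitution
            prev = cur
    return 1.0 - dp[n] / max(m, n)
```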

Implications and Future Developments

The development of AnyText is a significant step forward in the field of visual text generation, offering a solution that can render legible text in multiple languages with high fidelity. This model can be seamlessly integrated with existing diffusion models, broadening its applicability across various domains such as digital content creation, advertising, and interactive media.

The practical implications of this research are substantial. By resolving the persistent issue of poor text rendering in generated images, AnyText can enhance the utility of text-to-image models in real-world applications. It provides an improved foundation for further developments in the field, particularly in multilingual contexts where text complexity and diversity present significant challenges.

Future research directions could explore:

  • Enhancements in text rendering for highly stylized or extremely small fonts.
  • Expansion into additional languages and character sets.
  • Further refinement of the perceptual loss to enhance text style consistency with complex backgrounds.
  • Integration with more advanced OCR models to improve text recognition and generation accuracy.

In conclusion, AnyText marks a significant progression in the domain of visual text generation and editing, providing robust solutions to long-standing challenges in the field. The integration of sophisticated modules and the novel dataset AnyWord-3M underline the potential for future advancements and practical applications.

References (43)
  1. ArT. Icdar2019 robust reading challenge on arbitrary-shaped text. https://rrc.cvc.uab.es/?ch=14, 2019.
  2. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint, 2022.
  3. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint, abs/2211.09800, 2022.
  4. Muse: Text-to-image generation via masked generative transformers. arXiv preprint, abs/2301.00704, 2023.
  5. Diffute: Universal text editing diffusion model. arXiv preprint, abs/2305.10825, 2023a.
  6. Textdiffuser: Diffusion models as text painters. arXiv preprint, abs/2305.10855, 2023b.
  7. Dreamidentity: Improved editability for efficient face-identity preserved image generation. arXiv preprint, abs/2307.00300, 2023c.
  8. COCO-Text. A large-scale scene text dataset based on mscoco. https://bgshih.github.io/cocotext, 2016.
  9. DeepFloyd-Lab. Deepfloyd if. https://github.com/deep-floyd/IF, 2023.
  10. Diffusion models beat gans on image synthesis. In NeurIPS, pp.  8780–8794, 2021.
  11. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
  12. Wukong: 100 million large-scale chinese cross-modal pre-training dataset and A foundation framework. CoRR, abs/2202.06767, 2022.
  13. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  14. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint, abs/2302.09778, 2023.
  15. Midjourney Inc. Midjourney. https://www.midjourney.com/, 2022.
  16. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  17. Pp-ocrv3: More attempts for the improvement of ultra lightweight OCR system. CoRR, abs/2206.03001, 2022.
  18. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint, abs/2301.12597, 2023.
  19. Character-aware models improve visual text rendering. In ACL, pp.  16270–16297, 2023.
  20. LSVT. Icdar2019 robust reading challenge on large-scale street view text with partial labeling. https://rrc.cvc.uab.es/?ch=16, 2019.
  21. Glyphdraw: Learning to draw chinese characters in image synthesis models coherently. arXiv preprint, abs/2303.17870, 2023a.
  22. Unified multi-modal latent diffusion for joint subject and text conditional image generation. arXiv preprint, abs/2303.09319, 2023b.
  23. Sdedit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
  24. MLT. Icdar 2019 robust reading challenge on multi-lingual scene text detection and recognition. https://rrc.cvc.uab.es/?ch=15, 2019.
  25. ModelScope. Duguangocr. https://modelscope.cn/models/damo/cv_convnextTiny_ocr-recognition-general_damo/summary, 2023.
  26. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint, abs/2302.08453, 2023.
  27. MTWI. Icpr 2018 challenge on multi-type web images. https://tianchi.aliyun.com/dataset/137084, 2018.
  28. Improved denoising diffusion probabilistic models. In ICML, volume 139, pp.  8162–8171, 2021.
  29. PaddlePaddle. Pp-ocrv4. https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_ch/PP-OCRv4_introduction.md, 2023.
  30. Zero-shot text-to-image generation. In ICML, volume 139, pp.  8821–8831. PMLR, 2021.
  31. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint, abs/2204.06125, 2022.
  32. RCTW. Icdar2017 competition on reading chinese text in the wild. https://rctw.vlrlab.net/dataset, 2017.
  33. ReCTS. Icdar 2019 robust reading challenge on reading chinese text on signboard. https://rrc.cvc.uab.es/?ch=12, 2019.
  34. OCR-VQGAN: taming text-within-image generation. In WACV, pp.  3678–3687, 2023.
  35. High-resolution image synthesis with latent diffusion models. In CVPR, pp.  10684–10695, June 2022.
  36. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  37. LAION-400M: open dataset of clip-filtered 400 million image-text pairs. CoRR, abs/2111.02114, 2021.
  38. Laion-5b: An open large-scale dataset for training next generation image-text models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  25278–25294. Curran Associates, Inc., 2022.
  39. Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint, abs/2304.03411, 2023.
  40. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
  41. ELITE: encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint, abs/2302.13848, 2023.
  42. Glyphcontrol: Glyph conditional control for visual text generation. arXiv preprint, abs/2305.18259, 2023.
  43. Adding conditional control to text-to-image diffusion models. arXiv preprint, abs/2302.05543, 2023.
Authors (5)
  1. Yuxiang Tuo (3 papers)
  2. Wangmeng Xiang (19 papers)
  3. Jun-Yan He (27 papers)
  4. Yifeng Geng (30 papers)
  5. Xuansong Xie (69 papers)

GitHub

  1. GitHub - tyxsspa/AnyText (4,694 stars)

HackerNews

  1. AnyText (33 points, 0 comments)
  2. AnyText: Multilingual Visual Text Generation and Editing (1 point, 1 comment)