Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering (2403.09622v2)
Abstract: Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution is to craft a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than $20\%$ to nearly $90\%$ on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability to render text paragraphs, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, by fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we demonstrate a substantial improvement in scene text rendering in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks.
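The "character awareness" requirement above hinges on ByT5 operating directly on UTF-8 bytes rather than subword tokens, so every character of a word to be rendered is visible to the encoder. A minimal sketch of that byte-level tokenization scheme (an illustration of the encoder's input representation, not the authors' training code; ByT5 reserves ids 0-2 for special tokens and offsets each byte by 3):

```python
# ByT5-style byte-level "tokenization": one token id per UTF-8 byte,
# offset by 3 to reserve ids for <pad>=0, </s>=1, <unk>=2.
SPECIAL_TOKEN_OFFSET = 3

def byt5_encode(text: str) -> list[int]:
    """Map text to ByT5-style token ids: one id per UTF-8 byte."""
    return [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")]

def byt5_decode(ids: list[int]) -> str:
    """Invert the mapping, skipping special-token ids (< 3)."""
    return bytes(
        i - SPECIAL_TOKEN_OFFSET for i in ids if i >= SPECIAL_TOKEN_OFFSET
    ).decode("utf-8")
```

Because spelling is encoded losslessly at the byte level, the encoder can learn per-character glyph alignment, whereas subword tokenizers collapse whole words into opaque ids that hide character identity.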
- Zeyu Liu
- Weicong Liang
- Zhanhao Liang
- Chong Luo
- Ji Li
- Gao Huang
- Yuhui Yuan