GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models (2407.02252v2)
Abstract: Posters play a crucial role in marketing and advertising by enhancing visual communication and brand visibility, making significant contributions to industrial design. With the latest advancements in controllable T2I diffusion models, increasing research has focused on rendering text within synthesized images. Despite improvements in text rendering accuracy, the field of automatic poster generation remains underexplored. In this paper, we propose an automatic poster generation framework with text rendering capabilities that leverages LLMs, utilizing a triple-cross attention mechanism based on alignment learning. This framework aims to create precise poster text within a detailed contextual background. Additionally, the framework supports controllable fonts, adjustable image resolution, and the rendering of posters with descriptions and text in both English and Chinese. Furthermore, we introduce a high-resolution font dataset and a poster dataset with resolutions exceeding 1024 pixels. Our approach leverages the SDXL architecture. Extensive experiments validate our method's capability to generate poster images with complex and contextually rich backgrounds. Code is available at https://github.com/OPPO-Mente-Lab/GlyphDraw2.
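The abstract names a triple-cross attention mechanism grafted onto SDXL but gives no implementation details, so the following is a minimal PyTorch sketch under stated assumptions: the UNet feature tokens attend separately to three condition streams (prompt text, glyph, and layout embeddings), and the glyph and layout branches are blended in through zero-initialized learnable gates. All dimensions, stream names, and the gated fusion rule are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class TripleCrossAttention(nn.Module):
    """Image tokens attend to three condition streams; results are fused.

    Hypothetical sketch: the abstract only states that a triple-cross
    attention mechanism is used, so the dims and gating here are assumptions.
    """

    def __init__(self, dim: int, ctx_dim: int, num_heads: int = 8):
        super().__init__()

        def mha() -> nn.MultiheadAttention:
            return nn.MultiheadAttention(
                dim, num_heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True
            )

        self.attn_text, self.attn_glyph, self.attn_layout = mha(), mha(), mha()
        # Zero-initialized gates let the glyph/layout branches start as no-ops,
        # a common trick when adding new conditions to a pretrained UNet.
        self.gate_glyph = nn.Parameter(torch.zeros(1))
        self.gate_layout = nn.Parameter(torch.zeros(1))

    def forward(self, x, text_ctx, glyph_ctx, layout_ctx):
        # x: (B, N, dim) flattened UNet feature tokens; *_ctx: (B, S_i, ctx_dim)
        out_text, _ = self.attn_text(x, text_ctx, text_ctx)
        out_glyph, _ = self.attn_glyph(x, glyph_ctx, glyph_ctx)
        out_layout, _ = self.attn_layout(x, layout_ctx, layout_ctx)
        return (
            x
            + out_text
            + self.gate_glyph.tanh() * out_glyph
            + self.gate_layout.tanh() * out_layout
        )


block = TripleCrossAttention(dim=640, ctx_dim=2048)
x = torch.randn(2, 1024, 640)     # a 32x32 feature map flattened to tokens
text = torch.randn(2, 77, 2048)   # prompt embeddings (SDXL-sized, assumed)
glyph = torch.randn(2, 32, 2048)  # glyph/font condition tokens (assumed)
layout = torch.randn(2, 8, 2048)  # layout tokens from an LLM planner (assumed)
print(block(x, text, glyph, layout).shape)  # torch.Size([2, 1024, 640])
```

Zero-initializing the gates means the block initially reproduces plain text-conditioned cross-attention, so the extra glyph and layout streams can be trained in without disturbing the pretrained backbone; how GlyphDraw2 actually balances the three streams is not specified in the abstract.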
Authors: Jian Ma, Yonglin Deng, Chen Chen, Haonan Lu, Zhenyu Yang