
Towards Visual Text Design Transfer Across Languages

Published 24 Oct 2024 in cs.CV and cs.AI | (2410.18823v2)

Abstract: Visual text design plays a critical role in conveying themes, emotions, and atmospheres in multimodal formats such as film posters and album covers. Translating these visual and textual elements across languages extends the concept of translation beyond mere text, requiring the adaptation of aesthetic and stylistic features. To address this, we introduce the novel task of Multimodal Style Translation, together with MuST-Bench, a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems while preserving design intent. Our initial experiments on MuST-Bench reveal that existing visual text generation models struggle with the proposed task because textual descriptions are inadequate for conveying visual design. In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions. SIGIL enhances image generation models through three innovations: a glyph latent for multilingual settings, pretrained VAEs for stable style guidance, and an OCR model with reinforcement learning feedback for optimizing readable character generation. SIGIL outperforms existing baselines, achieving superior style consistency and legibility while maintaining visual fidelity, which sets it apart from traditional description-based approaches. We release MuST-Bench publicly for broader use and exploration at https://huggingface.co/datasets/yejinc/MuST-Bench.
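
The abstract does not spell out how OCR output is turned into a reinforcement learning signal. As a rough illustration only, one common way to score legibility is to compare the string an OCR model reads from a generated image against the intended target text. The sketch below assumes the OCR step has already produced a string; the function names `edit_distance` and `legibility_reward` are hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def legibility_reward(recognized: str, target: str) -> float:
    """Hypothetical reward in [0, 1]; 1.0 means the OCR output matches exactly.

    This is an assumed scoring rule for illustration, not SIGIL's actual reward.
    """
    if not recognized and not target:
        return 1.0
    dist = edit_distance(recognized, target)
    return 1.0 - dist / max(len(recognized), len(target))


# Example: an OCR model reads "abcx" from an image meant to show "abcd".
print(legibility_reward("abcx", "abcd"))
```

Because the comparison works on Unicode strings, the same scoring applies across writing systems, which is the setting MuST-Bench targets. How such a score would be fed back into a diffusion model is beyond what the abstract states.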

