Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support (2401.14688v3)

Published 26 Jan 2024 in cs.CL

Abstract: Recent advancements in text-to-image models have significantly enhanced image generation capabilities, yet a notable gap persists among open-source models in bilingual or Chinese language support. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model developed by extending the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training. This approach includes the efficient expansion of vocabulary by integrating the most frequently used Chinese characters into CLIP's tokenizer and embedding layers, coupled with an absolute position encoding expansion. Additionally, we enrich text prompts with a large vision-language model, leading to better image captions and higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval. Furthermore, the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass previous models. This research leads to the development and open-sourcing of the Taiyi-Diffusion-XL model, representing a notable advancement in the field of image generation, particularly for Chinese language applications. This contribution is a step forward in addressing the need for more diverse language support in multimodal research. The model and demonstration are made publicly available at https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/, fostering further research and collaboration in this domain.

Introduction

The field of text-to-image generation has recently witnessed the introduction of Taiyi-Diffusion-XL, a model that represents a significant leap in bilingual text-to-image synthesis. It departs from the prevailing English-centric norm, under which prompts in other languages must first be translated into English before advanced models can be used. Taiyi-Diffusion-XL removes this barrier by efficiently encoding and generating images from both Chinese and English text prompts, maintaining fidelity to each language's cultural and linguistic nuances.
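
To make the bilingual usage concrete, below is a minimal, hedged sketch of loading the open-sourced checkpoint with Hugging Face diffusers and generating an image from a Chinese prompt. It assumes the repository at IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B is compatible with the standard StableDiffusionXLPipeline interface; the prompt string is illustrative, and the model card should be consulted for the recommended invocation.

```python
# Minimal sketch, assuming the released checkpoint loads with the standard
# diffusers SDXL pipeline; consult the model card for the recommended usage.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B",
    torch_dtype=torch.float16,
).to("cuda")

# A Chinese prompt ("a cute cat under a cherry blossom tree, watercolor style");
# the same call accepts English prompts as well.
prompt = "一只可爱的猫在樱花树下，水彩画风格"
image = pipe(prompt=prompt, num_inference_steps=30).images[0]
image.save("taiyi_sample.png")
```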

Methodological Innovations

The development of Taiyi-Diffusion-XL involved multifaceted enhancements to the pre-training approaches traditionally seen in models such as CLIP. The two-phase methodology begins with dataset preparation, pairing images with high-quality, detailed text descriptions. For CLIP training, Taiyi-Diffusion-XL initializes from an English pre-trained checkpoint and adapts it on a bilingual dataset: the most frequently used Chinese characters are added to CLIP's tokenizer and embedding layers, and the absolute position encoding is expanded, notably improving retrieval ability in both languages. The subsequent Taiyi-XL training rests on a time-conditional UNet architecture and a loss function designed for multi-resolution denoising training. The implementation goes beyond the norm to address the complexities of bilingual datasets, resulting in a robust model that efficiently generates images from detailed textual prompts in both English and Chinese.
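
The vocabulary-expansion step can be illustrated with a short, hedged sketch: frequent Chinese characters are added to a pre-trained CLIP tokenizer and the text encoder's token embedding matrix is resized so the new rows can be learned during bilingual continuous pre-training. The base checkpoint and token list below are placeholders, not the authors' exact configuration, and the accompanying absolute position-encoding expansion is omitted.

```python
# Illustrative sketch of bilingual vocabulary expansion for CLIP's text encoder.
# Checkpoint name and token list are placeholders, not the paper's configuration.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# A handful of frequent Chinese characters stand in for the paper's full list.
new_tokens = ["的", "一", "是", "人", "山", "水", "猫", "花"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding layer so the new token ids get rows that are then trained
# during bilingual continuous pre-training (randomly initialized by default).
text_encoder.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```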

Empirical Validation

Extensive empirical analysis underscores Taiyi-Diffusion-XL's superiority over existing models. It achieves leading performance in image-text retrieval and image generation quality, according to evaluations such as CLIP similarity, Inception Score, and Fréchet Inception Distance. These results emerge from exhaustive comparisons with benchmark models in bilingual text-to-image synthesis.
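
As a concrete reference point for one of these metrics, the sketch below computes a CLIP similarity score between a generated image and its prompt. It is a hedged illustration only: the checkpoint is a stand-in (the paper evaluates with its own bilingual CLIP), and the exact evaluation protocol may differ.

```python
# Hedged sketch of a CLIP-similarity score between an image and a text prompt.
# The checkpoint is illustrative; the paper's evaluation uses its bilingual CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("taiyi_sample.png")   # a generated image
prompt = "一只可爱的猫在樱花树下，水彩画风格"  # its text prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
    # Cosine similarity between the projected image and text embeddings.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    score = (img * txt).sum(dim=-1).item()

print(f"CLIP similarity: {score:.4f}")
```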

Implications and Future Research

Taiyi-Diffusion-XL represents a substantial contribution to the field of AI and multimedia generation, emphasizing the importance of inclusivity in language support. It paves the way for further studies in areas necessitating deep comprehension of bilingual textual descriptions for accurate image generation. By making the Taiyi-Diffusion-XL model openly available to researchers and developers, it invites extensive collaboration, with potential ramifications across numerous domains where bilingual multi-modal AI can be leveraged.

Authors (9)
  1. Xiaojun Wu (94 papers)
  2. Dixiang Zhang (7 papers)
  3. Ruyi Gan (14 papers)
  4. Junyu Lu (32 papers)
  5. Ziwei Wu (19 papers)
  6. Renliang Sun (17 papers)
  7. Jiaxing Zhang (39 papers)
  8. Pingjian Zhang (9 papers)
  9. Yan Song (91 papers)