On the Cultural Gap in Text-to-Image Generation (2307.02971v1)
Abstract: One challenge in text-to-image (T2I) generation is the inadvertent reflection of cultural gaps in the training data, i.e., the disparity in generated image quality when the cultural elements of the input text are rarely represented in the training set. Although various T2I models have shown impressive but arbitrary examples, there is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images. To bridge this gap, we propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture. By analyzing the flawed images generated by the Stable Diffusion model on the C3 benchmark, we find that the model often fails to generate certain cultural objects. Accordingly, we propose a novel multi-modal metric that considers object-text alignment for filtering the fine-tuning data in the target culture, which is then used to fine-tune a T2I model to improve cross-cultural generation. Experimental results show that our multi-modal metric provides stronger data selection performance on the C3 benchmark than existing metrics, with the object-text alignment term being crucial. We release the benchmark, data, code, and generated images to facilitate future research on culturally diverse T2I generation (https://github.com/longyuewangdcu/C3-Bench).
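The abstract's core technical idea is a multi-modal filtering metric that scores each candidate fine-tuning pair by combining global image-text similarity with an object-text alignment term. Below is a minimal sketch of that idea using CLIP as the multi-modal backbone; the choice of CLIP, the interpolation weight `alpha`, and the assumption that the cultural object list comes from an external detector or parser are all illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: score (image, caption) pairs for data filtering by mixing
# global image-text similarity with object-text alignment (assumed form).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def _embed(image: Image.Image, texts: list[str]) -> tuple[torch.Tensor, torch.Tensor]:
    """Return L2-normalized CLIP image and text embeddings."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return img, txt

def image_text_score(image: Image.Image, caption: str) -> float:
    """Global similarity between the full image and its caption."""
    img, txt = _embed(image, [caption])
    return float((img @ txt.T).item())

def object_text_score(image: Image.Image, objects: list[str]) -> float:
    """Mean alignment between the image and the cultural objects
    mentioned in the caption (object list assumed to come from a
    detector or caption parser)."""
    if not objects:
        return 0.0
    img, txt = _embed(image, objects)
    return float((img @ txt.T).mean().item())

def filtering_score(image, caption, objects, alpha=0.5):
    """Hypothetical combined metric; pairs with the highest scores
    would be kept for fine-tuning in the target culture."""
    return alpha * image_text_score(image, caption) + (1 - alpha) * object_text_score(image, objects)
```

In this reading, the object-text term penalizes pairs whose images do not actually depict the culture-specific objects named in the caption, which matches the paper's finding that missing cultural objects are a dominant failure mode.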
- Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv preprint arXiv:1504.00325.
- PaLI: A Jointly-Scaled Multilingual Language-Image Model. In ICLR.
- AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities. arXiv preprint arXiv:2211.06679.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 4171–4186.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR.
- Language-agnostic BERT Sentence Embedding. In ACL.
- ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts. In CVPR.
- TranSmart: A Practical Interactive Machine Translation System. arXiv preprint.
- Translation-Enhanced Multilingual Text-to-Image Generation. In ACL.
- Decoupled Weight Decay Regularization. In ICLR.
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. arXiv preprint arXiv:2306.09093.
- Cultural Incongruencies in Artificial Intelligence. arXiv preprint arXiv:2211.13069.
- Learning Transferable Visual Models from Natural Language Supervision. In ICML.
- Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125.
- Zero-Shot Text-to-Image Generation. In ICML, 8821–8831.
- Generative Adversarial Text to Image Synthesis. In ICML, 1060–1069.
- High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 10684–10695.
- DAE-GAN: Dynamic Aspect-Aware GAN for Text-to-Image Synthesis. In ICCV, 13960–13969.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In NeurIPS.
- Multilingual Conceptual Coverage in Text-to-Image Models. In ACL.
- The Biased Artist: Exploiting Cultural Biases via Homoglyphs in Text-Guided Image Generation Models. arXiv preprint arXiv:2209.08891.
- GRiT: A Generative Region-to-Text Transformer for Object Understanding. arXiv preprint arXiv:2212.00280.
- AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In CVPR, 1316–1324.
- StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In ICCV, 5907–5915.
- BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.
- DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. In CVPR, 5802–5810.
Authors: Bingshuai Liu, Longyue Wang, Chenyang Lyu, Yong Zhang, Jinsong Su, Shuming Shi, Zhaopeng Tu