An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance (2404.01247v3)

Published 1 Apr 2024 in cs.CL and cs.CV

Abstract: Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image, and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. The best pipelines can only translate 5% of images for some countries in the easier concept dataset, and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data are released here: https://github.com/simran-khanuja/image-transcreation

On Translating Images for Cultural Relevance: A Preliminary Exploration

Introduction

Transcreation, the process of adapting content to maintain its essence across cultures, has become increasingly relevant in our multimedia-rich world. This paper introduces a novel task aimed at transcreating images, making visual content culturally relevant. Despite advancements in LLMs and generative AI, the automatic cultural adaptation of visual content remains a largely unexplored frontier. This paper presents three pipelines using state-of-the-art generative models for image transcreation, a comprehensive evaluation dataset, and an extensive human evaluation to gauge the success of these models in culturally adapting images.

Pipelines for Image Transcreation

The task involves translating images to make them culturally relevant without losing their original essence. Three distinct pipelines are proposed (a code sketch of the modular cap-edit flow follows the list):

  1. e2e-instruct: This pipeline leverages instruction-based image editing models to adapt images directly following natural language instructions, aiming for a one-step transformation process.
  2. cap-edit (caption -> LLM edit -> image edit): A modular approach that first generates a caption for the image, then modifies this caption to reflect cultural relevance using an LLM, and finally edits the original image based on this culturally adapted caption.
  3. cap-retrieve (caption -> LLM edit -> image retrieval): Similar to cap-edit in its initial steps but diverges by retrieving a relevant image from a country-specific dataset instead of editing the original image. This pipeline aims to find naturally occurring images that match the culturally adapted caption, potentially bypassing the limitations of direct image editing.
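
The modular pipelines compose off-the-shelf components. Below is a minimal Python sketch of the cap-edit flow, assuming BLIP-2 for captioning and InstructPix2Pix for editing (plausible choices given the models the paper cites; the exact checkpoints here are illustrative). The `call_llm` hook for the caption-adaptation step is a hypothetical placeholder, not the paper's actual interface.

```python
# Minimal sketch of the cap-edit pipeline (caption -> LLM edit -> image edit).
# Model checkpoints are illustrative, not necessarily the paper's exact ones.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from diffusers import StableDiffusionInstructPix2PixPipeline

device = "cuda"  # assumes a GPU; fp16 weights are used below

# Step 1: caption the source image with BLIP-2.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to(device)

def caption_image(image: Image.Image) -> str:
    inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
    ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(ids[0], skip_special_tokens=True).strip()

# Step 2: culturally adapt the caption with an LLM. `call_llm` is a
# hypothetical hook; wire in any instruction-following LLM client here.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client of your choice")

def adapt_caption(caption: str, target_country: str) -> str:
    prompt = (
        f"Rewrite this image caption so the depicted scene is culturally "
        f"relevant to {target_country}, preserving the original meaning: {caption}"
    )
    return call_llm(prompt)

# Step 3: edit the original image toward the adapted caption.
editor = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to(device)

def cap_edit(image: Image.Image, target_country: str) -> Image.Image:
    adapted = adapt_caption(caption_image(image), target_country)
    return editor(adapted, image=image).images[0]
```

The cap-retrieve variant would reuse steps 1 and 2 but replace step 3 with nearest-neighbor retrieval (e.g., CLIP similarity) over a country-specific image pool, returning a naturally occurring image rather than an edited one.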

Evaluation Dataset

Given the novel nature of this task, a new evaluation dataset consisting of two parts was created (an illustrative record layout follows the list):

  • Concept Dataset: This part contains 600 images that are inherently cross-culturally coherent. These images focus on a single concept and are categorized into universal categories like food, beverages, and celebrations, allowing for cross-cultural comparison.
  • Application Dataset: Comprising 100 images curated from real-world applications such as educational worksheets and children's literature, this dataset is meant to ground the task in practical applications.
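
The paper does not prescribe a file format here, but a record-level view helps make the two parts concrete. The layout below is purely illustrative; the field names are assumptions for exposition, not the schema of the released dataset.

```python
# Illustrative records for the two evaluation parts (field names assumed).
concept_example = {
    "part": "concept",           # 600 single-concept, cross-culturally coherent images
    "category": "food",          # universal category (food, beverages, celebrations, ...)
    "image_path": "concept/food/0042.jpg",
    "target_country": "brazil",  # culture to transcreate the image for
}

application_example = {
    "part": "application",       # 100 images from real-world applications
    "source": "children's literature",  # e.g. educational worksheets, storybooks
    "image_path": "application/story_017.png",
    "target_country": "nigeria",
}
```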

Human Evaluation and Findings

A multi-faceted human evaluation was conducted to assess the cultural relevance and meaning preservation of the translated images. The findings reveal significant challenges (a toy aggregation of such human judgments follows the list):

  • Limited Success in Cultural Transcreation: Even the best pipelines successfully translate only 5% of images for some countries in the easier concept dataset, highlighting the task's difficulty; in the application dataset, some countries saw no successful translations at all.
  • Model Limitations: Current generative models, especially those focused on direct image editing, struggle to grasp and incorporate cultural context effectively. However, leveraging LLMs for textual guidance shows promise in improving outcomes.
  • Importance of Evaluation Dataset: The developed evaluation framework provides a starting point for assessing progress in this nascent area, revealing that the task requires significant further research to achieve satisfactory results.
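
To make the headline numbers concrete, the snippet below shows one hypothetical way to aggregate per-image human judgments into per-country success rates like those quoted above. The judgment fields and the "both criteria must pass" rule are illustrative assumptions, not the paper's exact protocol.

```python
from collections import defaultdict

# Hypothetical annotation records: one per (image, country) pair, with binary
# human judgments along the two axes the paper evaluates.
annotations = [
    {"country": "brazil", "culturally_relevant": True,  "meaning_preserved": True},
    {"country": "brazil", "culturally_relevant": True,  "meaning_preserved": False},
    {"country": "japan",  "culturally_relevant": False, "meaning_preserved": True},
]

def success_rate_by_country(records):
    """Count a translation as successful only if it is judged both
    culturally relevant and meaning-preserving (an illustrative rule)."""
    wins, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["country"]] += 1
        if r["culturally_relevant"] and r["meaning_preserved"]:
            wins[r["country"]] += 1
    return {c: wins[c] / totals[c] for c in totals}

print(success_rate_by_country(annotations))
# e.g. {'brazil': 0.5, 'japan': 0.0}
```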

Implications and Future Directions

This work highlights the complexity of culturally adapting visual content using AI models. The limited success rate underscores the current limitations of generative models in understanding and applying cultural nuances. Future research could explore more sophisticated models that can better grasp cultural contexts, possibly through enhanced training datasets or more advanced multimodal understanding. Additionally, exploring the balance between direct image editing and retrieval-based approaches may yield more effective strategies for image transcreation.

In conclusion, the use of AI to culturally transcreate images is promising but still in its infancy. The findings of this paper outline both the potential and the pitfalls of current methodologies, setting the stage for further exploration at this intersection of AI, culture, and visual content adaptation.

Authors (4)
  1. Simran Khanuja
  2. Sathyanarayanan Ramamoorthy
  3. Yueqi Song
  4. Graham Neubig