Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models (2407.02067v2)
Abstract: We present a comprehensive three-phase study examining (1) the cultural understanding of Large Multimodal Models (LMMs), by introducing DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9,935 images across 67 countries and 10 concept classes; (2) the models' underlying implicit and potentially stereotypical cultural associations, via a cultural artifact extraction task; and (3) an approach to adapting the cultural representation of an image based on extracted associations, using a modular pipeline, CultureAdapt. We find disparities in cultural understanding at the geographic sub-region level for both open-source (LLaVA) and closed-source (GPT-4V) models on DalleStreet and other existing benchmarks, which we analyze using over 18,000 artifacts that we identify in association with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems. Dataset and code are available at https://github.com/iamshnoo/crossroads
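The abstract describes CultureAdapt only at a high level: extract culture-linked artifacts from an image, map each to a target-culture counterpart, and rewrite the corresponding image region. A minimal sketch of what such a modular pipeline could look like is below; this is not the authors' implementation, and every function name (`extract`, `substitute`, `inpaint`) is a hypothetical stage passed in by the caller (in practice these might be backed by an open-set detector and a diffusion inpainting model, as in the works cited below).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Artifact:
    """One culture-linked object found in an image (hypothetical schema)."""
    label: str                       # e.g. "sari"
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) region in the image

def adapt_image(image,
                target_country: str,
                extract: Callable[[object], List[Artifact]],
                substitute: Callable[[str, str], str],
                inpaint: Callable[[object, Tuple[int, int, int, int], str], object]):
    """Sketch of a three-stage adaptation loop:
    (1) extract artifacts, (2) pick a target-culture replacement concept,
    (3) regenerate each artifact's region with the replacement."""
    for artifact in extract(image):
        replacement = substitute(artifact.label, target_country)
        image = inpaint(image, artifact.box, replacement)
    return image

# Toy run with stub stages, just to show the data flow:
result = adapt_image(
    "base-image",
    "Japan",
    extract=lambda img: [Artifact("sari", (0, 0, 10, 10))],
    substitute=lambda label, country: f"{country}-equivalent of {label}",
    inpaint=lambda img, box, concept: f"{img} | repainted {box} as {concept}",
)
print(result)
```

The design point the abstract emphasizes is modularity: because each stage is an interchangeable component, the detector, the association source, and the image editor can each be swapped independently.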
- Towards measuring and modeling "culture" in LLMs: A survey.
- Stability AI. 2024. Stable diffusion 3 released. https://stability.ai/news/stable-diffusion-3.
- Inspecting the geographical representativeness of images from text-to-image models.
- Exploring visual culture awareness in GPT-4V: A comprehensive probing.
- Revolt: Collaborative crowdsourcing for labeling machine learning datasets. In CHI ’17, pages 2334–2346, New York, NY, USA. Association for Computing Machinery.
- Zero-shot image editing with reference imitation.
- BertaQA: How much do language models know about local culture? Preprint, arXiv:2406.07302.
- Massively multi-cultural knowledge acquisition & LM benchmarking.
- How culture shapes what people want from AI. ArXiv preprint, abs/2403.05104.
- Geoguessr. 2024. Geoguessr: A geography game. https://www.geoguessr.com.
- DIG In: Evaluating disparities in image generations with indicators for geographic diversity.
- Challenges and strategies in cross-cultural NLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6997–7013, Dublin, Ireland. Association for Computational Linguistics.
- CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Tag2Text: Guiding vision-language model via image tagging.
- Cross-cultural inspiration detection and analysis in real and LLM-generated social media data.
- ViSAGe: A global-scale analysis of visual stereotypes in text-to-image generation.
- An image speaks a thousand words, but can everyone listen? on image transcreation for cultural relevance.
- The PRISM alignment project: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models.
- A. L. Kroeber. 1952. Culture: A Critical Review of Concepts and Definitions. The Museum, Cambridge, Mass. Retrieved from https://nrs.lib.harvard.edu/urn-3:fhcl:30362985. Accessed 11 June 2024.
- Quantifying the carbon emissions of machine learning.
- CultureLLM: Incorporating cultural differences into large language models.
- CulturePark: Boosting cross-cultural understanding in large language models.
- Zhi Li and Yin Zhang. 2023. Cultural concept adaptation on multimodal reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 262–276, Singapore. Association for Computational Linguistics.
- Culturally aware and adapted nlp: A taxonomy and a survey of the state of the art.
- Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Visual instruction tuning.
- Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection.
- Modeling color terminology across thousands of languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2241–2250, Hong Kong, China. Association for Computational Linguistics.
- OpenAI. 2024. DALL·E 3 technical report. https://cdn.openai.com/papers/dall-e-3.pdf. [Accessed: June 9, 2024].
- GPT-4 technical report.
- TRAK: Attributing model behavior at scale.
- No filter: Cultural and socioeconomic diversity in contrastive vision-language models.
- VALOR-EVAL: Holistic coverage and faithfulness evaluation of large vision-language models.
- Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763.
- Grounded SAM: Assembling open-world models for diverse visual tasks.
- The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world. In Neural Information Processing Systems.
- High-resolution image synthesis with latent diffusion models.
- LAION-5B: An open large-scale dataset for training next generation image-text models.
- Coimagining the future of voice assistants with cultural sensitivity. ArXiv preprint, abs/2403.17599.
- Generalized people diversity: Learning a human perception-aligned diversity representation for people images.
- Label Studio: Data labeling software. Open source software available from https://github.com/heartexlabs/label-studio.
- UN. 2024. Methodology: Standard country or area codes for statistical use (M49).
- GIVL: Improving geographical inclusivity of vision-language models with pre-training methods.
- Open-vocabulary object detection using captions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 14393–14402. Computer Vision Foundation / IEEE.
- Recognize anything: A strong image tagging model.
Authors: Anjishnu Mukherjee, Ziwei Zhu, Antonios Anastasopoulos