Computer Vision Datasets and Models Exhibit Cultural and Linguistic Diversity in Perception (2310.14356v3)
Abstract: Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli. In this paper, we study how these differences manifest themselves in vision-language datasets and models, using language as a proxy for culture. By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression. When datasets are multilingual as opposed to monolingual, descriptions have higher semantic coverage on average, where coverage is measured using scene graphs, model embeddings, and linguistic taxonomies. For example, multilingual descriptions have on average 29.9% more objects, 24.5% more relations, and 46.0% more attributes than a set of monolingual captions. When prompted to describe images in different languages, popular models (e.g. LLaVA) inherit this bias and describe different parts of the image. Moreover, finetuning models on captions from one language performs best on corresponding test data from that language, while finetuning on multilingual data performs consistently well across all test data compositions. Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.
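To make the coverage comparison concrete, here is a minimal sketch of how a relative gain such as "29.9% more objects" could be computed once each caption has been reduced to a set of scene-graph objects. This is an illustration under stated assumptions, not the paper's pipeline: the function name `relative_gain`, the toy object sets, and the choice to take a plain set union over captions are all hypothetical, and the scene-graph parsing step is assumed to have happened already.

```python
def relative_gain(mono_sets, multi_sets):
    """Percent increase in unique items when captions come from several
    languages (multi_sets) rather than one language (mono_sets).

    Each argument is a list of sets, one set of object names per caption.
    This is a hypothetical helper, not the paper's exact metric.
    """
    mono = set().union(*mono_sets)
    multi = set().union(*multi_sets)
    if not mono:
        raise ValueError("monolingual captions contain no items")
    return 100.0 * (len(multi) - len(mono)) / len(mono)


# Toy data (invented for illustration): objects mentioned in three captions
# of the same image, written in one language vs. three different languages.
mono_captions = [{"dog", "grass"}, {"dog", "ball"}, {"dog", "fence"}]
multi_captions = [{"dog", "grass"}, {"dog", "tree", "ball"}, {"dog", "fence"}]

gain = relative_gain(mono_captions, multi_captions)
print(f"{gain:.1f}% more objects")  # 4 unique objects vs. 5, so 25.0%
```

The same function would apply unchanged to sets of relations or attributes, provided both groups contain the same number of captions so that the comparison is not driven by caption count alone.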
Authors: Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna