Visually Grounded Reasoning across Languages and Cultures (2109.13238v2)

Published 28 Sep 2021 in cs.CL, cs.AI, and cs.CV

Abstract: The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.

Authors (6)
  1. Fangyu Liu (59 papers)
  2. Emanuele Bugliarello (27 papers)
  3. Edoardo Maria Ponti (24 papers)
  4. Siva Reddy (82 papers)
  5. Nigel Collier (83 papers)
  6. Desmond Elliott (53 papers)
Citations (152)

Summary

Visually Grounded Reasoning across Languages and Cultures

This paper addresses notable limitations in current vision-and-language benchmarks, particularly their linguistic and cultural biases. The authors introduce a new dataset, Multicultural Reasoning over Vision and Language (MaRVL), designed to assess visually grounded reasoning across a typologically diverse set of languages and cultural contexts: Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. The authors acknowledge both the influence and the shortcomings of ImageNet-derived datasets, which are grounded primarily in a Western, English-language context.

Dataset Construction

To overcome these biases, the authors present a systematic protocol for constructing a culturally and linguistically varied ImageNet-style hierarchy. Unlike previous datasets built through automated scraping, MaRVL's concepts and images are selected by native speakers to ensure cultural relevance, incorporating concepts and imagery specific to each culture. The resulting task asks whether a native-speaker statement about a pair of images is true or false, requiring fine-grained linguistic understanding and cross-modal integration; a sketch of this instance format follows below.
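For concreteness, here is a minimal sketch of how a MaRVL-style instance might be loaded, assuming a JSONL release; the file name and field names (`caption`, `left_img`, `right_img`, `label`) are hypothetical, so consult the official MaRVL release for the exact schema:

```python
import json

def load_marvl(path):
    """Yield (statement, left_image, right_image, label) tuples.

    Each example pairs two images with a native-speaker statement
    that is either true (1) or false (0). Field names here are
    illustrative, not the official schema.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            yield ex["caption"], ex["left_img"], ex["right_img"], ex["label"]

# Hypothetical file name for the Swahili split.
for caption, left, right, label in load_marvl("marvl-sw.jsonl"):
    print(f"[{label}] {caption} ({left}, {right})")
    break
```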

Experimental Evaluation

The paper evaluates state-of-the-art vision-and-language models on MaRVL and finds a significant drop in cross-lingual transfer performance. Models trained on English data struggle with shifts in concepts, images, and languages, revealing their limitations when moving beyond English contexts. In both the zero-shot and translation-based cross-lingual transfer scenarios, accuracy falls considerably below supervised English performance.
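As a rough illustration of the evaluation setup, the sketch below scores binary true/false predictions as accuracy; `model.predict` is a placeholder for any vision-and-language model, not an API from the paper, and `load_marvl` is the hypothetical loader sketched earlier:

```python
def evaluate(model, examples):
    """Accuracy of true/false predictions on MaRVL-style examples.

    `model.predict` is a placeholder mapping (statement, image_pair)
    to a boolean; chance level on this binary task is 50%.
    """
    correct = total = 0
    for caption, left, right, label in examples:
        pred = model.predict(caption, (left, right))  # hypothetical API
        correct += int(pred == bool(label))
        total += 1
    return correct / total

# Zero-shot cross-lingual transfer: fine-tune on English NLVR2-style
# data, then evaluate the same checkpoint directly on each MaRVL
# language, e.g.:
# acc_sw = evaluate(model, load_marvl("marvl-sw.jsonl"))
```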

Implications and Future Directions

The findings prompt a reevaluation of current visual-linguistic models' robustness and adaptability in diverse cultural and linguistic contexts, and they point to the challenges of building genuinely multicultural and multilingual AI systems. The results underscore the need for models that can generalize to unseen cultures and languages, a crucial factor for real-world applicability in global settings.

Future developments in AI could explore more sophisticated transfer learning methodologies that account for the nuanced variations introduced by linguistic and cultural diversity. Enhancing models with cross-cultural cognitive capabilities might involve leveraging datasets like MaRVL in combination with innovations in multimodal learning and broader linguistic resources.

Conclusion

The paper makes a significant contribution to the assessment of visual-linguistic reasoning at a culturally and linguistically diverse scale. By providing a challenging benchmark in MaRVL, the researchers not only expose the limitations of existing models but also pave the way for future research aimed at fostering truly global AI technologies. The inclusion of diverse linguistic and cultural perspectives promises more equitable and representative technological advances.