- The paper presents MaRVL, a new dataset designed to evaluate vision-and-language models on culturally and linguistically diverse reasoning tasks.
- The experimental evaluation reveals a significant drop in cross-lingual transfer performance when models are applied beyond English contexts.
- The dataset was constructed with native speakers to ensure cultural relevance, highlighting the need for truly global and adaptable AI systems.
Visually Grounded Reasoning across Languages and Cultures
This paper addresses notable limitations of current vision-and-language benchmarks, particularly their linguistic and cultural biases. The authors introduce a novel dataset, Multicultural Reasoning over Vision and Language (MaRVL), designed to assess visually grounded reasoning across a diverse set of languages and cultural contexts: Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. The authors acknowledge both the influence and the shortcomings of ImageNet-derived datasets, which are grounded primarily in Western, and especially English-speaking, contexts.
Dataset Construction
To overcome these biases, the authors present a systematic protocol for constructing a culturally and linguistically varied ImageNet-style concept hierarchy. Unlike previous datasets built through automated pipelines, MaRVL's concept and image selection involve native speakers throughout, ensuring cultural relevance and incorporating concepts and imagery specific to each consulted culture. The resulting task is true/false visual reasoning: a native speaker writes a statement about a pair of images, and a model must judge whether the statement holds for that pair, which demands both deep linguistic understanding and cross-modal integration.
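To make the task format concrete, here is a minimal sketch of what a MaRVL-style instance might look like. The field names, file paths, and the Swahili example are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MarvlInstance:
    """One MaRVL-style example: a native-speaker statement about a pair
    of culturally grounded images, labeled true or false.
    Field names are illustrative, not the dataset's actual schema."""
    left_image: str   # path to the first image of the pair
    right_image: str  # path to the second image of the pair
    statement: str    # statement written by a native speaker
    language: str     # ISO code of the target language, e.g. "sw" for Swahili
    label: bool       # True iff the statement holds for the image pair

# A hypothetical Swahili example (for illustration only):
# "Picha zote mbili zinaonyesha ngoma za asili." ~ "Both pictures show traditional drums."
example = MarvlInstance(
    left_image="images/sw/0001-left.jpg",
    right_image="images/sw/0001-right.jpg",
    statement="Picha zote mbili zinaonyesha ngoma za asili.",
    language="sw",
    label=True,
)
```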
Experimental Evaluation
The paper evaluates state-of-the-art multilingual vision-and-language models on MaRVL and finds a significant drop in cross-lingual transfer performance. Models struggle with the simultaneous shift in concepts, images, and language, exposing the limits of current systems once they move beyond English contexts. Performance declines considerably in both transfer settings: zero-shot, where the model is fine-tuned only on English data and tested directly on the target language, and translation-based, where the target-language statements are machine-translated into English before evaluation, as sketched below.
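A minimal sketch of how the two transfer settings differ at evaluation time, assuming the `MarvlInstance` structure above. Both `model_predict` (an English-fine-tuned binary classifier) and `translate` (a machine-translation function) are hypothetical stand-ins, not the paper's actual code:

```python
from typing import Callable, Iterable, Optional

def evaluate(model_predict: Callable[[str, str, str], bool],
             instances: Iterable,
             translate: Optional[Callable[[str], str]] = None) -> float:
    """Compute accuracy of a binary visual-reasoning classifier on MaRVL-style data.

    Zero-shot transfer: the model, fine-tuned on English data only, sees the
    target-language statement as-is (translate=None).
    Translation-based transfer: the statement is machine-translated into English
    before being fed to the same English-fine-tuned model."""
    correct, total = 0, 0
    for ex in instances:
        statement = translate(ex.statement) if translate else ex.statement
        pred = model_predict(ex.left_image, ex.right_image, statement)
        correct += int(pred == ex.label)
        total += 1
    return correct / total

# zero_shot_acc = evaluate(model_predict, marvl_sw)                      # zero-shot
# translate_acc = evaluate(model_predict, marvl_sw, translate=sw_to_en)  # translate-test
```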
Implications and Future Directions
The findings prompt a reevaluation of how robust and adaptable current visual-linguistic models are in diverse cultural and linguistic contexts, and they point to the broader challenge of building genuinely multicultural and multilingual AI systems. The results underscore the need for models that generalize to unseen cultures and languages, a crucial factor for real-world applicability in global settings.
Future developments in AI could explore more sophisticated transfer learning methodologies that account for the nuanced variations introduced by linguistic and cultural diversity. Enhancing models with cross-cultural cognitive capabilities might involve leveraging datasets like MaRVL in combination with innovations in multimodal learning and broader linguistic resources.
Conclusion
The paper makes a significant contribution to the assessment of visual-linguistic reasoning at a culturally and linguistically diverse scale. By providing a challenging benchmark in MaRVL, the researchers not only expose the limitations of existing models but also pave the way for future research aimed at fostering truly global AI technologies. Including diverse linguistic and cultural perspectives promises more equitable and representative technological advances.