A Study on the CVQA Benchmark: Culturally-Diverse Multilingual Visual Question Answering
The CVQA (Culturally-diverse Multilingual Visual Question Answering) benchmark introduces a dataset for evaluating multimodal AI models on culturally diverse, multilingual visual question answering. Assembled by an international consortium of authors, including David Romero, Chenyang Lyu, and Alham Fikri Aji, among others, the dataset spans 28 countries and 26 languages, capturing cultural nuance across 10 categories with over 9,000 annotated questions.
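As a quick orientation, the sketch below shows how a CVQA-style release might be loaded and inspected with the Hugging Face datasets library. The Hub identifier, split name, and field names are assumptions for illustration; consult the official CVQA release for the exact ID and schema.

```python
# Sketch: loading and inspecting a CVQA-style multiple-choice VQA dataset.
# The Hub ID, split, and field names are illustrative assumptions, not the
# confirmed CVQA schema.
from collections import Counter

from datasets import load_dataset

DATASET_ID = "afaji/cvqa"  # assumed Hub ID; substitute the official identifier

ds = load_dataset(DATASET_ID, split="test")
print(f"{len(ds)} annotated questions")

# Each entry is expected to carry an image, a local-language question, answer
# options, a gold label, a category, and a country-language subset tag.
example = ds[0]
print(sorted(example.keys()))

# Rough per-subset breakdown (the "Subset" field name is an assumption).
if "Subset" in example:
    for subset, n in Counter(ds["Subset"]).most_common(10):
        print(f"{subset}: {n}")
```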
Motivation and Construction
Prevailing VQA benchmarks are built largely from Western-centric images and are predominantly in English. As AI systems reach a global audience, they need more globally representative data for training and evaluation across varied cultural contexts. CVQA was constructed to address this gap by collecting culturally representative images and questions from underrepresented languages and regions. Data collection followed a grassroots approach, relying on local native speakers and cultural experts to ensure a high degree of cultural fidelity in both the questions and the imagery.
Key Construction Details:
- Data Collection: Contributors selected culturally relevant images across diverse categories, drawing on local cultural knowledge and context; the visual corpus consists of personal photographs and openly licensed images.
- Question Annotation: Questions and their answer choices were written to align closely with both the visual content and the cultural context, minimizing bias and maximizing cultural coverage.
- Validation: Each entry underwent careful validation to ensure coherent, accurate annotations; a minimal sketch of the kind of structural checks involved follows this list.
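To make the validation step concrete, here is a minimal sketch of the structural checks such a pass might apply to each annotated entry. The CVQAEntry schema is an assumed, illustrative format rather than the official one; cultural and linguistic accuracy still require human review.

```python
# Sketch of entry-level consistency checks that a validation pass might apply.
# The CVQAEntry schema below is an illustrative assumption, not the official format.
from dataclasses import dataclass
from typing import List


@dataclass
class CVQAEntry:
    image_path: str      # path or URL of the culturally relevant image
    question: str        # question in the local language
    question_en: str     # English translation of the question
    options: List[str]   # multiple-choice answer options
    label: int           # index of the correct option
    language: str        # e.g. "id" for Indonesian
    country: str         # e.g. "ID" for Indonesia
    category: str        # e.g. "Cooking and Food"


def validate_entry(entry: CVQAEntry, num_options: int = 4) -> List[str]:
    """Return human-readable problems; an empty list means the entry passes."""
    problems = []
    if not entry.question.strip() or not entry.question_en.strip():
        problems.append("question or its English translation is empty")
    if len(entry.options) != num_options:
        problems.append(f"expected {num_options} options, got {len(entry.options)}")
    if len(set(entry.options)) != len(entry.options):
        problems.append("duplicate answer options")
    if not 0 <= entry.label < len(entry.options):
        problems.append("label index out of range")
    if not (entry.language and entry.country and entry.category):
        problems.append("missing language, country, or category metadata")
    return problems
```

Checks like these catch structural slips (missing options, out-of-range labels) early, so human reviewers can focus on cultural fidelity rather than formatting.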
The benchmark's categories span vehicles, daily life, food, geography, and pop culture, among others, providing comprehensive coverage of global imagery and culture.
Implementation and Evaluation
CVQA was evaluated with a range of models that support multilingual input, including LLaVA-1.5-7B, M-CLIP, and mBLIP, alongside English-focused models such as InstructBLIP; a minimal sketch of this multiple-choice evaluation setup follows the list of findings below. The evaluation surfaced several challenges for current MLLMs, notably:
- Performance Disparity: There is a substantial gap between proprietary closed models and open models, with GPT-4o in particular outperforming open alternatives by significant margins.
- Language Sensitivity: Models answered questions more accurately when prompted in English than in the local languages, indicating room for improvement in multilingual processing.
- Cultural Representation: Many models struggled with culturally grounded content, especially in categories such as cooking and food or public figures, which require culturally specific context and common-sense knowledge.
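As referenced above, the following sketch illustrates a multiple-choice evaluation loop of the kind that could produce such per-language results: each question is rendered as an A-D prompt, the model's reply is parsed back to a letter, and accuracy is aggregated by language. The entry field names and the answer_fn interface are assumptions for illustration, not the paper's actual harness.

```python
# Sketch of a multiple-choice VQA evaluation loop with per-language accuracy.
# answer_fn stands in for any model call (LLaVA, mBLIP, GPT-4o, ...); its
# interface and the entry field names are assumptions for illustration.
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

LETTERS = "ABCD"


def build_prompt(question: str, options: List[str]) -> str:
    """Render a question and its options as an A-D multiple-choice prompt."""
    lines = [question] + [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def evaluate(entries: Iterable[dict],
             answer_fn: Callable[[str, str], str]) -> Dict[str, float]:
    """Compute accuracy per language; answer_fn(image_path, prompt) returns model text."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for e in entries:
        prompt = build_prompt(e["question"], e["options"])
        reply = answer_fn(e["image_path"], prompt).strip().upper()
        letter = reply[0] if reply and reply[0] in LETTERS else None
        total[e["language"]] += 1
        if letter is not None and LETTERS.index(letter) == e["label"]:
            correct[e["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}
```

Running the same loop twice, once with English prompts and once with local-language prompts, is one simple way to quantify the language-sensitivity gap described above.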
Implications and Future Prospects
The introduction of CVQA provides a framework for advancing the cultural and linguistic inclusivity of AI models, breaking away from traditionally Western-centric datasets. In doing so, CVQA not only supports technical progress in model development but also encourages the alignment of AI systems with diverse global perspectives.
Looking ahead, future advancements should focus on:
- Enhancing models' perceptual abilities in culturally specific scenarios.
- Improving multilingual comprehension and performance parity across languages.
- Exploring open-ended question formats without multiple-choice options to mimic real-world VQA scenarios more closely.
Conclusion
CVQA shifts the narrative of visual question answering toward embracing global cultural diversity. Its careful construction and in-depth evaluation encourage AI models that are not only robust in performance but also better attuned to the diverse tapestry of global cultures, paving the way for more universally capable AI systems.