An Insightful Overview of "BertaQA: How Much Do LLMs Know About Local Culture?"
The presented work, "BertaQA: How Much Do LLMs Know About Local Culture?", addresses a significant gap in the evaluation of LLMs by focusing on culturally specific knowledge, particularly that of minority cultures, exemplified by Basque culture. This focus counters the predominantly English-centric evaluations prevalent in existing research. Unlike traditional benchmarks, which center on global or Anglocentric subjects, the BertaQA dataset presents a balanced collection of local and global topics, making it a comprehensive tool for assessing the cultural competence of LLMs.
Contributions and Findings
Dataset Introduction and Composition:
The BertaQA dataset is unique in that it is both parallel and bilingual, containing each question in Basque and English. It is divided into local questions pertinent to Basque culture and global questions of broader interest, covering diverse categories such as Geography, Literature, Society, and Science. With 4,756 multiple-choice questions of varying difficulty, the dataset provides a well-rounded evaluation tool.
LLM Performance on Local vs. Global Topics:
The evaluation of state-of-the-art LLMs such as GPT-4 Turbo and Claude 3 Opus revealed that these models perform significantly better on global topics, achieving around 91.7% accuracy on the global subset compared to 72.2% on the local subset. This stark contrast underlines a significant limitation in the models' cultural adaptability and shows how current LLMs are skewed toward broader, more universally represented topics, likely due to the predominance of English in their training data.
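The local/global gap described above comes from scoring each subset separately. A minimal sketch of such a per-subset accuracy computation is shown below; the item format and function name are hypothetical, not the paper's actual evaluation code.

```python
# Minimal sketch (hypothetical data format) of computing per-subset accuracy,
# as used to compare local vs. global performance on BertaQA-style items.

def subset_accuracy(items):
    """Return accuracy per subset ('local' / 'global').

    Each item is a dict with keys: 'subset', 'gold', 'pred'.
    """
    totals, correct = {}, {}
    for it in items:
        s = it["subset"]
        totals[s] = totals.get(s, 0) + 1
        if it["pred"] == it["gold"]:
            correct[s] = correct.get(s, 0) + 1
    return {s: correct.get(s, 0) / totals[s] for s in totals}

# Toy predictions: 3 global items (all correct), 2 local items (1 correct).
items = [
    {"subset": "global", "gold": "A", "pred": "A"},
    {"subset": "global", "gold": "B", "pred": "B"},
    {"subset": "global", "gold": "C", "pred": "C"},
    {"subset": "local", "gold": "A", "pred": "A"},
    {"subset": "local", "gold": "B", "pred": "D"},
]
print(subset_accuracy(items))  # {'global': 1.0, 'local': 0.5}
```

On the real dataset, this per-subset breakdown is what exposes the gap (e.g. roughly 91.7% global vs. 72.2% local for GPT-4 Turbo) that a single aggregate accuracy would hide.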
Knowledge Transfer from Basque to English:
A remarkable finding is that continued pretraining on Basque data substantially improves performance on local topics, even when the models are queried in English. For instance, continued pretraining of Llama 2 70B resulted in a 13.46-point improvement on local knowledge queries, with only minor degradation in global knowledge accuracy. This finding is pivotal because it challenges the belief that training models on low-resource languages detracts from their performance in high-resource languages, presenting solid evidence of effective knowledge transfer from a low-resource language to English.
Intrinsic Language and Knowledge Relationship:
The comparative results from the Basque and English versions emphasize that LLMs do not encode knowledge in a completely language-agnostic manner. A model trained extensively on Basque data, such as Latxa, performs better on local topics when queried in Basque than in English. This suggests that for culturally specific knowledge, models benefit from being queried in the language in which the knowledge was originally encoded.
Evaluation of Translation-based Techniques:
When applying translation-based methods such as translate-test and self-translate, results indicated that these approaches are more effective for global questions than local ones. For example, translating Basque questions into English generally improved model performance on global topics but was less effective, or even detrimental, for local questions. This points to a limitation of translation-based techniques: they struggle to preserve the cultural context and nuances critical to understanding local subjects.
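The two techniques differ only in who performs the translation: translate-test relies on an external machine-translation system, while self-translate has the LLM translate its own input before answering. The sketch below illustrates this distinction with hypothetical callables; it is a schematic under stated assumptions, not the paper's implementation.

```python
# Schematic of translate-test vs. self-translate (all callables hypothetical).

def translate_test(question_eu, mt_translate, answer_in_english):
    """Translate-test: an external MT system renders the Basque question
    into English, then the model answers the translated question."""
    question_en = mt_translate(question_eu)
    return answer_in_english(question_en)

def self_translate(question_eu, llm):
    """Self-translate: the same LLM first translates its own input into
    English, then answers the translated question."""
    question_en = llm("Translate to English: " + question_eu)
    return llm("Answer this question: " + question_en)
```

Both pipelines help when the needed knowledge is well represented in English (global questions), but for local questions the translation step can lose culturally specific terms and context, which is consistent with the weaker or negative results reported on the local subset.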
Implications and Future Directions
The implications of these findings are multi-faceted:
- For Model Training and Evaluation: This paper highlights the necessity of including local and culturally diverse datasets in LLM evaluations to provide a more accurate reflection of a model’s global applicability and cultural competence. Future research could expand on building similar datasets for other minority cultures and languages to diversify the evaluation benchmarks.
- For Knowledge Transfer Techniques: The successful transfer of knowledge from Basque to English opens new avenues for improving LLMs' performance on low-resource languages without detracting from their proficiency in high-resource languages. This could guide future pretraining techniques and curriculum learning strategies that incorporate diverse linguistic datasets.
- For Multilingual NLP Applications: Recognizing the imperfect nature of knowledge transfer between languages, developers of multilingual NLP applications may need to reconsider how they structure and query their models depending on the cultural and linguistic contexts they operate within. Greater emphasis on native language data during training could yield more culturally competent and context-aware models.
Conclusion
The BertaQA dataset and the accompanying findings significantly contribute to the ongoing discourse on LLM evaluation by underscoring the importance of cultural knowledge and local contexts. This work not only reveals critical gaps in current LLM capabilities but also offers pathways to enhance model performance through targeted pretraining and evaluation practices. Future research building on these insights could pave the way for more culturally sensitive and globally applicable LLMs.