BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
The paper "BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages" addresses a gap in the current landscape of LLMs: their limited grasp of culture-specific, everyday knowledge, particularly for underrepresented cultures and languages. The benchmark, BLEnD, evaluates LLMs' ability to understand and generate culturally relevant content across a diverse set of cultures and languages.
Key Contributions
The paper makes several significant contributions:
- Dataset Creation: BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, covering 13 languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. Questions span six categories: food, sports, family, education, holidays/celebrations/leisure, and work-life.
- Evaluation Protocols: BLEnD offers two evaluation formats: short-answer questions and multiple-choice questions. The dataset construction involves rigorous steps including question collection, filtering, translation, and answer annotation to ensure quality and diversity.
- Experimental Results: The evaluation of 16 LLMs on BLEnD reveals significant performance disparities, particularly highlighting the models' biases towards highly represented cultures and languages.
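The short-answer format described above can be scored by checking whether any human-annotated answer appears in the model's response. The sketch below illustrates one plausible way to do this; the function names and the lenient string-matching rule are assumptions for illustration, not the paper's released evaluation code:

```python
def normalize(text: str) -> str:
    """Lowercase and drop punctuation so matching tolerates surface variation."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def score_short_answer(model_output: str, gold_answers: list[str]) -> bool:
    """Count a response correct if any annotated gold answer appears in it."""
    out = normalize(model_output)
    return any(normalize(ans) in out for ans in gold_answers if ans.strip())

# One question may have several valid annotations from different annotators.
golds = ["jollof rice", "fried rice"]
score_short_answer("A popular dish is jollof rice.", golds)  # True
```

Because annotators contribute multiple valid answers per question, matching against the full answer set (rather than a single reference) is what lets culturally varied responses count as correct.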
Methodology
The benchmark dataset is constructed through a multi-step process:
- Question Collection and Filtering: Native annotators from various countries generate culturally relevant questions. These questions are filtered to eliminate duplicates or overly specific items, resulting in a diverse set of question templates that are then localized by replacing "your country" with the respective country names.
- Answer Annotation: Native speakers annotate the answers in their local languages and English, ensuring cultural accuracy. The process is rigorous, with mechanisms to replace "I don't know" responses and aggregate multiple answers to account for variations.
- Answer Aggregation: Annotators review and translate annotations to English, ensuring the final dataset is comprehensive and consistent across languages.
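The localization step above, where "your country" in each template is replaced with a concrete country name, can be sketched as follows (the placeholder string and helper are illustrative assumptions based on the description):

```python
# Hypothetical sketch: instantiating shared question templates per country.
TEMPLATE_PLACEHOLDER = "your country"

def localize(template: str, country: str) -> str:
    """Produce a country-specific question from a shared template."""
    return template.replace(TEMPLATE_PLACEHOLDER, country)

templates = ["What is a common breakfast food in your country?"]
countries = ["Indonesia", "Ethiopia"]
questions = [localize(t, c) for t in templates for c in countries]
```

Keeping a single template pool and localizing it per country is what makes the resulting questions directly comparable across all 16 countries/regions.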
Experimental Findings
The experimental evaluation presents several key insights:
- Performance Disparities: State-of-the-art models like GPT-4 exhibit stark performance differences across cultures. For example, performance drops significantly from highly represented cultures like the U.S. and Spain to underrepresented ones like Northern Nigeria and Ethiopia.
- Language Proficiency: For cultures represented by mid-to-high-resource languages, LLMs perform better in local languages than in English. However, for low-resource languages, models perform better when prompted in English.
- Region-Centric Models: LLMs developed in non-Western countries, such as Qwen1.5-72B (Alibaba) and HyperCLOVA-X (NAVER), demonstrate higher performance for their respective locales, showcasing the importance of localized training data.
Implications and Future Work
The findings from BLEnD carry several implications:
- Model Training: There's a critical need for more diverse and representative training datasets to reduce biases and improve LLM performance in underrepresented cultures and languages.
- Contextual Understanding: Evaluations highlight the necessity for models to understand not just linguistic nuances but also cultural contexts, which can be particularly challenging for subjective topics like food and leisure.
- Performance Metrics: The disparity in performance across different question categories suggests that future benchmarks should include various types of questions to comprehensively assess cultural adaptiveness.
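Category-level disparities like those mentioned above are straightforward to surface once each scored item is tagged with its question category. A minimal aggregation sketch (the data layout is an assumption for illustration):

```python
from collections import defaultdict

def accuracy_by_category(results):
    """Compute per-category accuracy from (category, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, correct in results:
        totals[category][0] += int(correct)
        totals[category][1] += 1
    return {cat: correct / total for cat, (correct, total) in totals.items()}

scored = [("food", True), ("food", False), ("sports", True)]
accuracy_by_category(scored)  # {'food': 0.5, 'sports': 1.0}
```

Reporting accuracy per category, rather than a single aggregate score, is what reveals whether a model's cultural knowledge is uneven across domains such as food versus work-life.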
Conclusion
BLEnD stands as a pivotal benchmark, shedding light on the cultural biases embedded in current LLMs and underscoring the need for culturally nuanced models. Future developments in AI should focus on incorporating diverse cultural knowledge to make LLMs more universally applicable and equitable. The research community, through BLEnD, now has a robust tool to measure and enhance the cultural sensitivity of LLMs, paving the way for more inclusive AI applications.