BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
The paper "BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages" addresses a gap in the current landscape of LLMs: their limited grasp of culture-specific, everyday knowledge, particularly for underrepresented cultures and languages. The benchmark, BLEnD, evaluates LLMs' ability to understand and generate culturally relevant content across a diverse set of cultures and languages.
Key Contributions
The paper makes several significant contributions:
- Dataset Creation: BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, covering 13 languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. Questions span six categories: food, sports, family, education, holidays/celebrations/leisure, and work-life.
- Evaluation Protocols: BLEnD offers two evaluation formats: short-answer questions and multiple-choice questions. The dataset construction involves rigorous steps including question collection, filtering, translation, and answer annotation to ensure quality and diversity.
- Experimental Results: The evaluation of 16 LLMs on BLEnD reveals significant performance disparities, particularly highlighting the models' biases towards highly represented cultures and languages.
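The short-answer format described above can be scored by checking whether any human-annotated answer appears in the model's response. The sketch below illustrates one plausible way to do this; the function names and the lenient string-matching rule are assumptions for illustration, not the paper's released evaluation code:

```python
def normalize(text: str) -> str:
    """Lowercase and drop punctuation so matching tolerates surface variation."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def score_short_answer(model_output: str, gold_answers: list[str]) -> bool:
    """Count a response correct if any annotated gold answer appears in it."""
    out = normalize(model_output)
    return any(normalize(ans) in out for ans in gold_answers if ans.strip())

# One question may have several valid annotations from different annotators.
golds = ["jollof rice", "fried rice"]
score_short_answer("A popular dish is jollof rice.", golds)  # True
```

Because annotators contribute multiple valid answers per question, matching against the full answer set (rather than a single reference) is what lets culturally varied responses count as correct.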
Methodology
The benchmark dataset is constructed through a multi-step process:
- Question Collection and Filtering: Native annotators from various countries generate culturally relevant questions. These questions are filtered to eliminate duplicates or overly specific items, resulting in a diverse set of question templates that are then localized by replacing "your country" with the respective country names.
- Answer Annotation: Native speakers annotate the answers in their local languages and English, ensuring cultural accuracy. The process is rigorous, with mechanisms to replace "I don't know" responses and aggregate multiple answers to account for variations.
- Answer Aggregation: Annotators review and translate annotations to English, ensuring the final dataset is comprehensive and consistent across languages.
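The localization step above, where "your country" in each template is replaced with a concrete country name, can be sketched as follows (the placeholder string and helper are illustrative assumptions based on the description):

```python
# Hypothetical sketch: instantiating shared question templates per country.
TEMPLATE_PLACEHOLDER = "your country"

def localize(template: str, country: str) -> str:
    """Produce a country-specific question from a shared template."""
    return template.replace(TEMPLATE_PLACEHOLDER, country)

templates = ["What is a common breakfast food in your country?"]
countries = ["Indonesia", "Ethiopia"]
questions = [localize(t, c) for t in templates for c in countries]
```

Keeping a single template pool and localizing it per country is what makes the resulting questions directly comparable across all 16 countries/regions.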
Experimental Findings
The experimental evaluation presents several key insights:
- Performance Disparities: State-of-the-art models like GPT-4 exhibit stark performance differences across cultures. For example, performance drops significantly from highly represented cultures like the U.S. and Spain to underrepresented ones like Northern Nigeria and Ethiopia.
- Language Proficiency: For cultures represented by mid-to-high-resource languages, LLMs perform better in local languages than in English. However, for low-resource languages, models perform better when prompted in English.
- Region-Centric Models: LLMs developed in non-Western countries, such as Qwen1.5-72B (Alibaba) and HyperCLOVA-X (NAVER), demonstrate higher performance for their respective locales, showcasing the importance of localized training data.
Implications and Future Work
The findings from BLEnD carry several implications:
- Model Training: There's a critical need for more diverse and representative training datasets to reduce biases and improve LLM performance in underrepresented cultures and languages.
- Contextual Understanding: Evaluations highlight the necessity for models to understand not just linguistic nuances but also cultural contexts, which can be particularly challenging for subjective topics like food and leisure.
- Performance Metrics: The disparity in performance across different question categories suggests that future benchmarks should include various types of questions to comprehensively assess cultural adaptiveness.
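Category-level disparities like those mentioned above are straightforward to surface once each scored item is tagged with its question category. A minimal aggregation sketch (the data layout is an assumption for illustration):

```python
from collections import defaultdict

def accuracy_by_category(results):
    """Compute per-category accuracy from (category, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, correct in results:
        totals[category][0] += int(correct)
        totals[category][1] += 1
    return {cat: correct / total for cat, (correct, total) in totals.items()}

scored = [("food", True), ("food", False), ("sports", True)]
accuracy_by_category(scored)  # {'food': 0.5, 'sports': 1.0}
```

Reporting accuracy per category, rather than a single aggregate score, is what reveals whether a model's cultural knowledge is uneven across domains such as food versus work-life.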
Conclusion
BLEnD stands as a pivotal benchmark, shedding light on the cultural biases embedded in current LLMs and underscoring the need for culturally nuanced models. Future developments in AI should focus on incorporating diverse cultural knowledge to make LLMs more universally applicable and equitable. The research community, through BLEnD, now has a robust tool to measure and enhance the cultural sensitivity of LLMs, paving the way for more inclusive AI applications.