MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs (2505.21693v2)

Published 27 May 2025 in cs.CL

Abstract: LLMs are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata's multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.

Summary

MAKIEval: A Multilingual Framework for Cultural Awareness Evaluation

The paper "MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs" introduces a framework that addresses a critical challenge: assessing the cultural awareness expressed by LLMs. As LLMs are deployed in applications worldwide, there is a pressing need to evaluate how well these models recognize and articulate cultural diversity, especially since most are pretrained primarily on English-centric data.

Framework Overview

MAKIEval moves beyond existing benchmarks by focusing on multilingual evaluation while avoiding the pitfalls of translation errors. The framework leverages Wikidata's multilingual structure to automatically identify cultural entities in LLM outputs and link them to structured knowledge, enabling scalable, language-agnostic evaluation. This automated approach removes the need for manually curated datasets, which typically scale poorly and adapt badly to diverse cultural nuances.
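
The cross-lingual anchoring idea can be illustrated with a minimal sketch: Wikidata assigns each entity a language-independent identifier (QID), so surface forms written in different languages can resolve to the same item. The lookup table and `link_entities` helper below are hypothetical stand-ins for a real entity-linking step against Wikidata, and the QIDs shown are illustrative placeholders, not verified identifiers.

```python
# Hypothetical multilingual label -> QID table standing in for a real
# entity-linking step against Wikidata (QIDs are illustrative placeholders).
LABEL_TO_QID = {
    "sushi": "Q46383",    # English label
    "寿司": "Q46383",      # Japanese label for the same Wikidata item
    "kimchi": "Q219910",  # English label
    "김치": "Q219910",     # Korean label for the same Wikidata item
}

def link_entities(tokens):
    """Map surface forms found in a model's output to Wikidata QIDs."""
    return {LABEL_TO_QID[t] for t in tokens if t in LABEL_TO_QID}

# Outputs in different languages resolve to the same language-agnostic anchors,
# so they can be compared without any translation step.
qids_en = link_entities(["sushi", "kimchi", "rice"])
qids_multi = link_entities(["寿司", "김치"])
print(qids_en == qids_multi)  # True: same entities despite different scripts
```

Because comparison happens at the QID level rather than the string level, the same pipeline works for any language Wikidata covers, which is what makes the evaluation language-agnostic.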

To quantify cultural awareness, the authors propose four complementary metrics: granularity, diversity, cultural specificity, and consensus. Granularity assesses the level of detail in cultural references; diversity measures the range of cultural entities a model mentions; cultural specificity evaluates how well outputs align with the cultural context given in the prompt; and consensus quantifies the agreement in outputs across different languages.
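
As a rough illustration, three of these metrics can be sketched over sets of linked Wikidata QIDs. These are simplified formulations assumed for exposition; the paper's exact definitions may differ, and granularity is omitted because it would require traversing Wikidata's subclass hierarchy to measure how specific each entity is.

```python
# Illustrative (not the paper's exact) metric definitions over sets of
# Wikidata QIDs extracted from model outputs.

def diversity(entity_sets):
    """Diversity: number of distinct cultural entities across all outputs."""
    return len(set().union(*entity_sets))

def cultural_specificity(entities, country_entities):
    """Specificity: fraction of mentioned entities tied to the prompted country."""
    return len(entities & country_entities) / len(entities) if entities else 0.0

def consensus(entities_by_language):
    """Consensus: Jaccard overlap of entities mentioned across languages."""
    sets = list(entities_by_language.values())
    inter, union = set.intersection(*sets), set.union(*sets)
    return len(inter) / len(union) if union else 0.0

# Hypothetical entity sets for outputs about the same topic in two languages.
by_lang = {"en": {"Q1", "Q2", "Q3"}, "ko": {"Q2", "Q3", "Q4"}}
print(consensus(by_lang))                          # 2 shared of 4 total -> 0.5
print(cultural_specificity({"Q1", "Q2"}, {"Q2"}))  # 0.5
```

Working over QID sets means every metric is computed identically regardless of the prompt language, which is what lets the framework compare cultural awareness across languages on equal footing.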

Experimental Setup and Results

The framework was evaluated on seven LLMs, including both open-source and proprietary models, across 13 languages, 19 countries and regions, and six culturally salient topics such as food and music. The findings reveal discernible variations in cultural awareness across languages, with English prompts generally eliciting more culturally grounded knowledge. This suggests that models articulate cultural knowledge more effectively when operating in English.

A key observation is that models exhibit language-specific baselines for cultural granularity that remain stable across cultural contexts. Models also tend to align better with culturally related languages and with their regions of origin. Notably, differences in cultural diversity and specificity are shaped by a combination of factors, including the model itself, the language of the prompt, and the country mentioned in the context.

Implications and Future Directions

The introduction of MAKIEval marks a significant step towards understanding and improving the cultural competencies of LLMs, a domain that is increasingly critical given their widespread applications. By facilitating scalable multilingual evaluations, the framework provides a pathway to recognize and mitigate cultural biases inherent in LLMs.

Practically, the insights offered by this framework can inform the development of more culturally sensitive applications, potentially enhancing user trust and interaction quality across diverse cultural settings. Theoretically, MAKIEval enriches the discourse on cultural representation in AI by offering quantifiable methods to systematically dissect cultural awareness in LLMs.

Future research could extend MAKIEval to low-resource languages and probe more abstract cultural concepts such as values and ideologies. As AI models continue to evolve, frameworks like MAKIEval will become indispensable tools for fostering inclusivity and cultural adaptability in machine learning. The release of the code and data ensures that other researchers can build upon these findings, paving the way for more culturally aware AI technologies.
