MultiLoKo: A Multilingual Local Knowledge Benchmark for Evaluating LLMs
In this paper, the researchers introduce MultiLoKo, a benchmark for assessing the multilingual capabilities of LLMs across 31 languages. Existing multilingual benchmarks often fail to capture locally relevant knowledge because their content is predominantly translated from English. MultiLoKo addresses this limitation by incorporating locally sourced data, emphasizing local knowledge in each language.
Overview and Design of MultiLoKo
MultiLoKo consists of 500 locally sourced questions per language, designed to probe the knowledge that is salient in each cultural and linguistic context. This distinguishes it from benchmarks that impose an English-centric perspective on every language, and it allows more precise insight into how well LLMs understand and generate text across a wide range of languages.
The benchmark is partitioned into three sets: the locally sourced questions in each language, human translations of those questions into English, and human translations of the English data into each of the other languages. Machine translations are provided alongside the human ones for comparison. This setup supports evaluation of multilingual ability, cross-language knowledge transfer, and the effect of the translation method on benchmark scores.
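To make the setup concrete, here is a minimal sketch of how a model could be scored per language on one partition. It assumes a simple exact-match metric and a `generate(question) -> str` model callable; both the metric details and the interface are illustrative assumptions, not the benchmark's actual harness.

```python
from collections import defaultdict

def exact_match(prediction: str, gold: str) -> bool:
    """Lightly normalize and compare; the real metric may be more forgiving."""
    return prediction.strip().lower() == gold.strip().lower()

def score_partition(examples, generate):
    """Return per-language accuracy (0-100) for one partition.

    `examples` is an iterable of (language, question, gold_answer) triples;
    `generate` is the model's text-generation callable. Both are hypothetical
    stand-ins for whatever data format and model API one actually uses.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for language, question, gold in examples:
        prediction = generate(question)
        hits[language] += exact_match(prediction, gold)
        totals[language] += 1
    return {lang: 100.0 * hits[lang] / totals[lang] for lang in totals}
```

Running this once per partition (local, human-translated, machine-translated) yields the per-language score tables on which the analyses below operate.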
Evaluation and Findings
Eleven models marketed as multilingual were evaluated on MultiLoKo, including both base and chat models. The results highlight several challenges in achieving true multilingual proficiency. Notably, no model demonstrated parity across languages: significant gaps between the highest- and lowest-scoring languages were observed for models such as Gemini 2.0 Flash, Llama 3.1, and GPT-4o.
Strong Numerical Results and Claims
- Model Scores: The best-performing model, Gemini 2.0 Flash, achieved an average score of 34.4 points, with a gap of around 35 points between its best and worst languages. Llama 3.1 and GPT-4o had slightly lower average scores but even larger discrepancies across languages.
- Mother Tongue Effect (MTE): Models showed a significant MTE, scoring better when questions were posed in the language in which the knowledge is locally relevant than when the same questions were posed in English. This points to inefficient cross-language knowledge transfer, a key factor in multilingual evaluation.
- Translation Impact: Human translations generally produced higher scores than machine translations, indicating that creating multilingual benchmark data via machine translation can measurably distort results. A short sketch after this list shows how these three analyses could be computed from per-language scores.
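The following is a minimal sketch of the three analyses above, assuming per-language scores (on a 0-100 scale) are already available for each partition; the variable names and toy numbers are illustrative, not figures from the paper.

```python
# Toy per-language scores for one model; none of these are real paper data.
local_scores = {"en": 55.0, "fr": 41.0, "sw": 20.0}   # locally sourced partition
in_english   = {"fr": 33.0, "sw": 12.0}               # same questions posed in English
human_mt     = {"fr": 41.0, "sw": 20.0}               # human-translated partition
machine_mt   = {"fr": 37.0, "sw": 16.0}               # machine-translated partition

# 1) Average score and the best/worst-language gap (language parity).
avg = sum(local_scores.values()) / len(local_scores)
gap = max(local_scores.values()) - min(local_scores.values())

# 2) Mother Tongue Effect: score with questions in the local language minus
#    the score on the same questions posed in English; positive values mean
#    knowledge does not transfer fully across languages.
mte = {lang: local_scores[lang] - in_english[lang] for lang in in_english}

# 3) Translation impact: human- minus machine-translated scores per language.
mt_delta = {lang: human_mt[lang] - machine_mt[lang] for lang in machine_mt}

print(f"avg={avg:.1f}, gap={gap:.1f}, MTE={mte}, human-vs-MT delta={mt_delta}")
```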
Implications and Future Developments
The paper underscores the need for models that handle locally relevant content better and transfer knowledge between languages more effectively. MultiLoKo's approach of locally sourcing content in many languages lays bare the deficiencies of current LLMs and highlights the need for development that goes beyond mere translation. The benchmark offers crucial insights for researchers and developers aiming to close the language-parity gap and refine multilingual models.
Future research could explore integrating locally relevant content not only into evaluation but also into LLM training data to improve cross-language understanding. Drawing on non-traditional sources or adapting existing monolingual resources may also prove advantageous.
Conclusion
MultiLoKo is an important step forward in multilingual evaluation, prompting the research community to rethink evaluation strategies for LLMs. By prioritizing local relevance and establishing a robust framework for measuring multilingual proficiency, the benchmark stands to contribute to both theoretical advances and practical applications in AI language technologies.