MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages (2504.10356v2)

Published 14 Apr 2025 in cs.CL

Abstract: We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best and worst scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs English-translated data can result in differences of more than 20 points for the best performing models, drastically changing the estimated difficulty of some languages. For using machine instead of human translations, we find a weaker effect on the ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.

Summary

MultiLoKo: A Multilingual Local Knowledge Benchmark for Evaluating LLMs

In this paper, the researchers introduce MultiLoKo, a novel benchmark aimed at assessing the multilingual capabilities of LLMs across 31 languages. Existing multilingual benchmarks often fail to capture local relevance, as they predominantly translate from English. MultiLoKo addresses this limitation by incorporating locally-sourced data, emphasizing local knowledge in diverse languages.

Overview and Design of MultiLoKo

MultiLoKo consists of 500 locally sourced questions per language, intended to probe the knowledge that is salient in each cultural and linguistic context. This distinguishes it from benchmarks that impose an English-centric perspective. The benchmark's importance lies in its potential to offer more precise insights into how well LLMs understand and generate text across a multitude of languages.

The benchmark is partitioned into three sets: locally-relevant questions, questions translated from the local language to English, and vice versa. Additionally, machine translations are provided alongside human ones for comparative analysis. This setup enables rigorous evaluation of multilingual capabilities, knowledge transfer, and the impact of translation methods on the benchmark scores.
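As a concrete illustration of this layout, the sketch below shows one plausible way to represent a single benchmark item in code. It is a minimal sketch under stated assumptions: the field names, partition labels, and helper function are hypothetical, not the actual MultiLoKo release format.

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of one benchmark record; all field names are hypothetical.
@dataclass
class MultiLoKoItem:
    language: str              # one of the 31 covered languages
    question: str              # locally sourced or translated question text
    answer: str                # gold answer
    partition: str             # "local", "local_to_english", or "english_to_local"
    translator: Optional[str]  # "human" or "machine" for translated partitions, else None
    split: str                 # "dev" or the blind, out-of-distribution "test" split

def count_per_language(items: list[MultiLoKoItem], partition: str) -> dict[str, int]:
    """Count items per language within a partition, e.g., to check that the
    main (local) partition holds 500 questions per language."""
    counts: dict[str, int] = {}
    for item in items:
        if item.partition == partition:
            counts[item.language] = counts.get(item.language, 0) + 1
    return counts
```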

Evaluation and Findings

Eleven models marketed as multilingual, spanning both base and chat models, were evaluated on MultiLoKo. The results highlight several challenges in achieving true multilingual proficiency. Notably, no model demonstrated parity across languages, with significant gaps between the highest- and lowest-scoring languages observed in models such as Gemini 2.0 Flash, Llama 3.1, and GPT-4o.

Strong Numerical Results and Claims

  • Model Scores: The highest-performing model, Gemini 2.0 Flash, achieved an average score of 34.4 points, with a gap of around 35 points between its best and worst language performance. Llama 3.1 and GPT-4o had slightly lower average scores but demonstrated even larger discrepancies across languages.
  • Mother Tongue Effect (MTE): There was a significant MTE: models performed better when questions were posed in the local language than when the same questions were asked in English. This indicates inefficient cross-language knowledge transfer, a central concern in multilingual evaluation (see the sketch after this list).
  • Translation Impact: Human translations generally produced higher scores than machine translations, emphasizing the limitations of current machine translation technology in fully capturing nuances and maintaining consistency in multilingual contexts.
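
To make these aggregate statistics concrete, here is a minimal sketch, assuming per-language scores on a 0-100 point scale. The score values, language codes, and function names are invented placeholders for illustration, not data or code from the paper.

```python
# Hypothetical per-language scores (0-100 points), invented for illustration.
local_scores = {"fr": 48.0, "hi": 22.5, "sw": 13.0}    # questions in the local language
english_scores = {"fr": 41.0, "hi": 15.5, "sw": 9.0}   # same questions asked in English

def average_and_gap(scores: dict[str, float]) -> tuple[float, float]:
    """Average score and language-parity gap (best minus worst language)."""
    avg = sum(scores.values()) / len(scores)
    gap = max(scores.values()) - min(scores.values())
    return avg, gap

def mother_tongue_effect(local: dict[str, float], english: dict[str, float]) -> float:
    """Mean per-language drop when questions are asked in English rather than
    the local language; a positive value means the model answers better in
    the native language."""
    return sum(local[lang] - english[lang] for lang in local) / len(local)

avg, gap = average_and_gap(local_scores)
mte = mother_tongue_effect(local_scores, english_scores)
print(f"average={avg:.1f} points, parity gap={gap:.1f}, MTE={mte:.1f}")
```

In this framing, the roughly 35-point spread reported for Gemini 2.0 Flash corresponds to the parity gap, and a positive MTE reflects the sub-optimal cross-language knowledge transfer the authors describe.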

Implications and Future Developments

The paper underscores the urgent need for models that handle locally relevant content better and transfer knowledge more effectively between languages. MultiLoKo's approach of locally sourcing content in multiple languages lays bare the deficiencies in current LLM capabilities and highlights the need for development beyond mere translation. The benchmark offers crucial insights for researchers and developers aiming to close the language-parity gap and refine multilingual models.

Future research could explore integrating locally relevant content not just for evaluation but also into LLM training data to enhance cross-language understanding. Exploring non-traditional sources or adapting existing monolingual resources may also prove advantageous.

Conclusion

MultiLoKo represents a significant step forward in multilingual evaluation, compelling the research community to rethink evaluation strategies for LLMs. By prioritizing local relevance and establishing a robust framework for measuring multilingual proficiency, the benchmark stands to contribute to both theoretical advances and practical applications in AI language technologies.
