- The paper critically examines over 2,000 multilingual NLP benchmarks, revealing significant imbalances favoring English and high-resource languages.
- Analysis shows current benchmarks suffer from language imbalance, reliance on translation, skewed task distribution, and poor correlation with human judgments for non-localized content.
- The authors propose developing future benchmarks that are accurate, challenging, practical, diverse, and culturally authentic, while suggesting research directions like focusing on low-resource languages and localized evaluation.
The Challenges and Future Directions of Multilingual Benchmarks in NLP
The paper "The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks" offers a comprehensive examination of multilingual evaluation practices for LLMs. Its central argument is that, despite considerable financial investment in the field, current multilingual benchmarking remains heavily skewed towards English and high-resource languages (HRLs) and therefore fails to represent linguistic diversity adequately. The paper provides both a critique of existing approaches and recommendations for future improvement.
Analysis of Current Practices
The authors analyzed more than 2,000 multilingual benchmarks originating from 148 countries, published between 2021 and 2024, and identified several key shortcomings in current multilingual benchmarks:
- Language Imbalance: Although English-only benchmarks were excluded from the data collection, English remains the most frequently represented language, followed by other HRLs such as Chinese, Spanish, French, and German, while low-resource languages (LRLs) remain significantly underrepresented.
- Translation Issues: Most benchmarks are built from original-language texts rather than translations, but a notable proportion still rely on machine translation tools such as Google Translate or GPT-series models. The paper argues that translation alone cannot capture the cultural and linguistic nuances required for meaningful evaluation.
- Task and Domain Distribution: Discriminative tasks dominate, accounting for 66.5% of benchmarks versus 23.5% for generative tasks, an imbalance that underserves emerging generative applications of NLP. Domains such as news and social media are overrepresented, while areas like healthcare and law receive little coverage.
- Institutional Contributions: The G5 countries (China, India, Germany, UK, and USA) dominate the creation of multilingual benchmarks. Most efforts originate from academic institutions, illustrating a gap between research and practical application.
Correlation with Human Judgments
A pivotal aspect of the paper is its analysis of how benchmark performance correlates with human judgments; a minimal correlation sketch follows the list below. The findings indicate that:
- STEM-related Tasks: Benchmarks centered on scientific reasoning and mathematical problem solving (e.g., ARC and MGSM) show strong correlations with human judgments across multiple languages.
- Localized Benchmarks: Benchmarks specifically tailored to a culture or language (e.g., CMMLU for Chinese) exhibit higher alignment with human assessments than translated ones.
- Cross-Linguistic Differences: The correlation strengths for similar benchmarks vary widely across languages, underscoring the need for authentic, culturally relevant benchmarking.
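As a concrete illustration of this kind of analysis, the sketch below rank-correlates per-model benchmark scores with human-judgment scores. The model names and numbers are placeholders, not values from the paper, and the paper's exact methodology may differ.

```python
# Minimal sketch: rank-correlating benchmark scores with human-judgment scores
# across a set of models. All names and numbers are illustrative placeholders.
from scipy.stats import spearmanr

# Hypothetical per-model accuracy on a multilingual benchmark
benchmark_scores = {"model_a": 71.2, "model_b": 64.8, "model_c": 58.3, "model_d": 49.7}
# Hypothetical human-judgment scores (e.g., preference ratings) for the same models
human_scores = {"model_a": 1210.0, "model_b": 1175.0, "model_c": 1020.0, "model_d": 995.0}

models = sorted(benchmark_scores)
rho, p_value = spearmanr(
    [benchmark_scores[m] for m in models],
    [human_scores[m] for m in models],
)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```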
Recommendations for Future Multilingual Benchmarks
In addressing the identified gaps, the paper proposes several important characteristics for developing successful multilingual benchmarks:
- Accurate and Contamination-Free: Benchmarks must contain precise annotations and be free from data contamination to ensure genuine evaluation of model capabilities (a simple contamination-check sketch follows this list).
- Challenging and Practically Relevant: To drive meaningful advancements, benchmarks should challenge current models’ capabilities and reflect real-world usage scenarios.
- Diverse and Culturally Authentic: Effective multilingual benchmarking must ensure diversity in languages and cultural contexts.
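One common way to screen for the contamination mentioned above is n-gram overlap against training corpora. The sketch below is a minimal illustration under that assumption; the n-gram size and threshold are arbitrary demonstration choices, not values prescribed by the paper.

```python
# Minimal sketch of an n-gram overlap contamination check: flag benchmark items
# whose word n-grams overlap heavily with a training corpus.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(item: str, corpus_ngrams: set, n: int = 8) -> float:
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)

# Hypothetical usage: build corpus_ngrams once from training documents, then screen items.
corpus_ngrams = set()
for doc in ["training document text ..."]:  # placeholder corpus
    corpus_ngrams |= ngrams(doc)

suspect_items = [q for q in ["benchmark question text ..."] if overlap_ratio(q, corpus_ngrams) > 0.5]
```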
Proposed Research Directions
Given the limitations of current practices, the authors suggest several avenues for future research:
- Enhancing NLG Tasks: Balance the focus between discriminative and generative tasks in multilingual settings.
- Focusing on LRLs: Develop specialized benchmarks for LRLs to break the cycle of underrepresentation and poor performance.
- Localized Benchmarks: Develop culturally and linguistically specific benchmarks that better correlate with human judgments.
- Leveraging LLMs as Evaluators: Use LLMs for scalable, cross-linguistic evaluation while addressing their inherent biases (a minimal judge-loop sketch follows this list).
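To make the last direction concrete, the sketch below outlines an LLM-as-judge loop. Here, `call_llm` is a hypothetical stand-in for whatever model API is available, and the prompt and 1-5 rubric are illustrative choices rather than the paper's protocol.

```python
# Minimal sketch of an LLM-as-judge loop for cross-lingual evaluation.
def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model or API call here.
    raise NotImplementedError

JUDGE_PROMPT = (
    "You are evaluating a model answer written in {language}.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer from 1 (poor) to 5 (excellent). Reply with the number only."
)

def judge(question: str, answer: str, language: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(language=language, question=question, answer=answer))
    return int(reply.strip())
```

Bias mitigation, which the paper flags as necessary, might include randomizing answer order in pairwise comparisons, aggregating over multiple judge models, and spot-checking judge scores against human raters.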
Conclusion
The paper concludes with a call to action for the NLP community to collaborate globally in creating multilingual benchmarks that better reflect human judgments and real-world applications. The pursuit of equitable and comprehensive multilingual evaluation is vital for ensuring that language technologies are inclusive and effective for all users. Such collaborations could pave the way for a new generation of benchmarks that not only enhance the evaluation landscape but also guide the development of multilingual language technologies.