The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants (2308.16884v2)

Published 31 Aug 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.

An Analytical Overview of the Belebele Benchmark

The paper presents Belebele, an extensive parallel reading comprehension dataset designed to evaluate natural language understanding (NLU) across 122 language variants. This dataset significantly enhances the ability to gauge multilingual capabilities of NLP models beyond traditional high-resource languages. By facilitating direct performance comparisons across languages, Belebele addresses a crucial gap in existing multilingual benchmarks.

Dataset Composition and Methodology

Belebele comprises 900 unique multiple-choice questions, each based on a short passage from the FLoRes-200 dataset and paired with four candidate answers, providing a robust framework for evaluating text comprehension across languages at varying resource levels. By targeting high-, medium-, and low-resource languages alike, the dataset offers a common yardstick for assessing both multilingual masked language models (MLMs) and large language models (LLMs).
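For concreteness, here is a minimal sketch of loading one language variant and rendering an item as a prompt. It assumes the Hugging Face release at facebook/belebele and its documented field names (flores_passage, question, mc_answer1 through mc_answer4, correct_answer_num); verify these against the actual dataset card before relying on them.

```python
from datasets import load_dataset

# Load one language variant; Belebele config names follow FLoRes-200
# codes such as "eng_Latn" (assumed here; check the dataset card).
dataset = load_dataset("facebook/belebele", "eng_Latn", split="test")

def format_prompt(example: dict) -> str:
    """Render one Belebele item as a four-way multiple-choice prompt."""
    options = "\n".join(
        f"{letter}) {example[f'mc_answer{i}']}"
        for i, letter in enumerate("ABCD", start=1)
    )
    return (
        f"Passage: {example['flores_passage']}\n\n"
        f"Question: {example['question']}\n{options}\nAnswer:"
    )

item = dataset[0]
print(format_prompt(item))
# correct_answer_num is assumed to be 1-indexed.
print("Gold:", "ABCD"[int(item["correct_answer_num"]) - 1])
```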

A significant strength of Belebele lies in its meticulous question curation process, which ensures that the questions discriminate between models with different levels of language comprehension without requiring extrinsic knowledge. Because the content is fully parallel across all 122 variants, model performance can be compared precisely across languages, exposing both disparities and strengths.

Evaluation of Multilingual Models

Through extensive evaluations, Belebele yields notable insights into multilingual NLP systems. The paper finds that much smaller MLMs pretrained on balanced multilingual data often exhibit superior comprehension across languages compared with English-centric LLMs such as GPT-3.5 and Llama. This finding underscores the vital role of balanced pretraining data in achieving broad multilingual proficiency.
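To make the evaluation setup tangible, the sketch below shows one common zero-shot protocol for multiple-choice benchmarks: score each candidate answer by its length-normalized log-likelihood under a causal language model and pick the argmax. This is a generic recipe rather than the paper's exact procedure, and the model name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def answer_logprob(context: str, answer: str) -> float:
    """Mean log-probability of the answer tokens given the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + " " + answer, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = full_ids[0, 1:]
    # Positions whose next-token prediction falls inside the answer span
    # (boundary is approximate, since tokenization can shift at the join).
    span = range(ctx_len - 1, full_ids.shape[1] - 1)
    return float(torch.stack([log_probs[t, targets[t]] for t in span]).mean())

def predict(passage: str, question: str, answers: list[str]) -> int:
    """Return the 0-based index of the highest-scoring candidate answer."""
    context = f"Passage: {passage}\nQuestion: {question}\nAnswer:"
    scores = [answer_logprob(context, a) for a in answers]
    return max(range(len(answers)), key=scores.__getitem__)
```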

Interestingly, the results also substantiate the role of vocabulary design: larger vocabulary size and deliberate vocabulary construction correlate with better performance, particularly on low-resource languages. This insight into vocabulary dynamics could inform pretraining strategies and guide more efficient multilingual model development.
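One quick way to see why vocabulary construction matters is to measure tokenizer "fertility", the average number of subword tokens per word, which tends to be higher in languages the vocabulary covers poorly. The sketch below uses an arbitrary multilingual tokenizer and illustrative sentences, not the paper's methodology.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Arbitrary example sentences; language codes follow FLoRes-200 conventions.
samples = {
    "eng_Latn": "The committee approved the proposal after a long debate.",
    "swh_Latn": "Kamati iliidhinisha pendekezo hilo baada ya mjadala mrefu.",
}

for lang, sentence in samples.items():
    n_words = len(sentence.split())
    n_tokens = len(tokenizer.tokenize(sentence))
    # Higher tokens-per-word suggests weaker vocabulary coverage.
    print(f"{lang}: {n_tokens / n_words:.2f} tokens per word")
```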

Implications and Future Directions

Belebele opens new avenues for analyzing how LLMs handle language diversity and comprehension tasks. By providing a diverse linguistic benchmark, the dataset encourages further exploration into cross-lingual transfer, script variations, and the effectiveness of different pretraining regimes.

The research emphasizes the importance of extending LLM capabilities into less-studied languages, advocating for equitable NLP systems. The findings present foundational elements for advancing AI technologies that are inclusive of linguistic diversity.

In conclusion, Belebele represents a significant step in enhancing the evaluation breadth of NLP systems, offering crucial insights for both practical model development and theoretical exploration. Future investigations might focus on improving pretraining strategies to boost performance in low-resource languages, potentially leading to a new generation of more inclusive LLMs.

Authors (10)
  1. Lucas Bandarkar
  2. Davis Liang
  3. Benjamin Muller
  4. Mikel Artetxe
  5. Satya Narayan Shukla
  6. Donald Husa
  7. Naman Goyal
  8. Abhinandan Krishnan
  9. Luke Zettlemoyer
  10. Madian Khabsa
Citations (93)