Insights into the SIB-200 Dataset for Multilingual Topic Classification
The paper "SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects" introduces a comprehensive dataset aimed at addressing the paucity of evaluation resources for natural language understanding (NLU) across a diverse linguistic spectrum. SIB-200 provides an essential framework for evaluating multilingual pre-trained language models (PLMs) through topic classification over a corpus covering 205 languages and dialects.
Dataset Development and Characteristics
The SIB-200 dataset is derived from Flores-200, a human-translated machine-translation evaluation corpus with parallel texts in 205 languages. The English portion was annotated with topic labels, and this annotation was extended to all other languages through the corpus's parallel structure. The dataset spans seven topic categories: science/technology, travel, politics, sports, health, entertainment, and geography.
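Because Flores-200 is sentence-aligned across languages, annotating the English side once is enough to label every language. A minimal sketch of this label projection (the data layout, sentence IDs, and language codes here are illustrative, not the actual SIB-200 release format):

```python
# Project topic labels annotated on English sentences onto their
# parallel counterparts in other languages via shared sentence IDs.
# Field names and contents are made up for illustration.

english_annotations = {
    "sent_001": "science/technology",
    "sent_002": "travel",
}

parallel_corpus = {
    "hau_Latn": {"sent_001": "Hausa sentence 1", "sent_002": "Hausa sentence 2"},
    "yor_Latn": {"sent_001": "Yoruba sentence 1", "sent_002": "Yoruba sentence 2"},
}

def project_labels(annotations, corpus):
    """Attach each English label to the aligned sentence in every language."""
    labelled = {}
    for lang, sentences in corpus.items():
        labelled[lang] = [
            {"text": text, "label": annotations[sid]}
            for sid, text in sentences.items()
            if sid in annotations  # skip sentences without an English label
        ]
    return labelled

dataset = project_labels(english_annotations, parallel_corpus)
print(dataset["hau_Latn"][0]["label"])  # science/technology
```

The projection is what keeps annotation cost constant as the language count grows: one pass of English labelling yields a classification dataset in every language of the parallel corpus.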
This extensive coverage marks a significant step toward including low-resource languages in NLU evaluation. However, because the non-English portions are translations rather than originally authored text, the dataset may exhibit translationese effects, potentially impacting model performance due to biases inherent in translated text.
Evaluation and Experimental Findings
The evaluation of multilingual PLMs using SIB-200 highlights several challenges and opportunities:
- Performance Disparities: There is a notable performance gap between high-resource languages and their low-resource counterparts. In particular, languages not seen during pre-training, languages from under-represented families such as Nilotic and Atlantic-Congo, and languages from Africa, the Americas, and Oceania show lower classification accuracy.
- Model Performance: Massively multilingual models such as XLM-R and Glot-500 were evaluated. XLM-R outperformed Glot-500, whose weaker results are plausibly due to a domain mismatch between its heavily religious training corpus and the diverse topics in SIB-200.
- Effectiveness of Multilingual Adaptive Fine-Tuning (MAFT): MAFT showed potential for bridging the performance gap, particularly for African languages, where adapting models on modest amounts of monolingual data, including synthetic data, improved accuracy by up to 5 percentage points.
- Robustness of Models: Region-specific PLMs such as IndicBERTv2 and AfroXLMR performed better on their respective language groups, underscoring the efficacy of geographically and linguistically tailored pre-training.
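The disparity analysis above amounts to grouping per-language scores and comparing group means. A small sketch of that aggregation (the accuracy numbers and the seen/unseen split are invented for illustration, not results from the paper):

```python
# Aggregate per-language accuracies into group means to quantify the
# gap between languages seen and unseen during pre-training.
# All numbers below are fabricated for illustration only.

per_language_accuracy = {
    "eng_Latn": 0.92, "fra_Latn": 0.90, "deu_Latn": 0.91,  # seen in pre-training
    "hau_Latn": 0.71, "lug_Latn": 0.55, "dik_Latn": 0.48,  # unseen / low-resource
}
seen = {"eng_Latn", "fra_Latn", "deu_Latn"}

def group_mean(scores, languages):
    """Mean accuracy over the languages present in `scores`."""
    vals = [scores[lang] for lang in languages if lang in scores]
    return sum(vals) / len(vals)

seen_acc = group_mean(per_language_accuracy, seen)
unseen_acc = group_mean(per_language_accuracy, set(per_language_accuracy) - seen)
gap = seen_acc - unseen_acc
print(f"seen: {seen_acc:.3f}  unseen: {unseen_acc:.3f}  gap: {gap:.3f}")
```

The same grouping can be applied along any axis the paper slices on, such as language family, region, or script, which is how family- and region-level disparities are surfaced from raw per-language scores.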
Implications and Future Directions
The creation of SIB-200 allows for a more inclusive evaluation of multilingual models. The paper underscores the critical need to expand pre-training coverage to unseen languages and scripts, and it reveals the limitations of zero-shot prompting of LLMs when applied to minority languages. The findings advocate for further research into domain-general pre-training techniques that incorporate diverse linguistic features and topic variations.
The SIB-200 dataset provides a foundation for understanding the inherent challenges in multilingual NLU tasks and emphasizes the need for more inclusive datasets. By facilitating improved evaluations across a broad array of low-resource languages, the work aims to bridge the current language performance divide in NLP applications, paving the way for more equitable language technology development across the globe.