- The paper introduces the largest annotated NER dataset for 20 African languages, enabling robust cross-lingual transfer and zero-shot learning improvements.
- It demonstrates that multilingual models adapted to African corpora, such as AfroXLM-R, deliver substantial performance gains over general-purpose multilingual PLMs in both fine-tuned and zero-shot settings.
- It shows that choosing a typologically and geographically similar African language as the transfer source, rather than English, improves zero-shot transfer by around 14 F1 points on average, a key lever for low-resource settings.
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
Introduction
The MasakhaNER 2.0 paper presents a broad effort to address the longstanding underrepresentation of African languages in NLP research, focusing on Named Entity Recognition (NER). The effort culminated in MasakhaNER 2.0, the largest human-annotated NER dataset for African languages, spanning 20 of them. These languages cover a diverse range of linguistic properties and geographic regions, making the collection a pivotal benchmark for evaluating cross-lingual transfer methods in NLP.
Corpus Creation and Characteristics
To fill the existing gaps, the research team created a dataset covering 20 languages, doubling the scope of MasakhaNER 1.0, which included only ten. The expansion notably adds Southern African languages, which exhibit syntactic and morphological features absent from the MasakhaNER 1.0 corpus. Each language's data consists of annotated news articles, drawn from local monolingual news sources or obtained through translation, to ensure linguistic diversity and rich named entity annotations (Figure 1).
Figure 1: Top-10 countries with the largest number of entities in MasakhaNER 2.0.
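To make the corpus concrete, here is a minimal loading sketch in Python. It assumes the dataset is published on the Hugging Face Hub under the identifier masakhane/masakhaner2 with per-language configurations (e.g. "hau" for Hausa); adjust these names to wherever you obtain the data.

```python
# Minimal sketch: loading one language of MasakhaNER 2.0.
# The Hub identifier "masakhane/masakhaner2" and the per-language config
# "hau" (Hausa) are assumptions to adapt to your setup.
from datasets import load_dataset

dataset = load_dataset("masakhane/masakhaner2", "hau")

# Each example pairs tokens with integer NER tags in the BIO scheme.
example = dataset["train"][0]
print(example["tokens"])
print(example["ner_tags"])

# The string label names are recoverable from the dataset features.
label_names = dataset["train"].features["ner_tags"].feature.names
print(label_names)  # e.g. ["O", "B-PER", "I-PER", "B-ORG", ...]
```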
A meticulous annotation process involving native speakers was adopted to ensure high-quality annotations across four entity categories: PER (person names), LOC (locations), ORG (organizations), and DATE (dates and times). The annotation effort also preserved language-specific features such as tonal diacritics and noun-class prefixes, which are widespread in African languages.
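To illustrate the four-category scheme, the following invented example (not a sentence from the corpus) shows the usual BIO encoding, where multi-token entities continue with I- tags:

```python
# Invented Swahili-like example of the four-category BIO scheme;
# not a sentence taken from the corpus.
tokens = ["Uhuru", "Kenyatta", "alizuru", "Nairobi", "Jumatatu", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "B-DATE", "O"]

# Tokens outside any entity are tagged O; multi-token entities
# start with B- and continue with I-.
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```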
Evaluation and Baseline Models
The paper evaluates several multilingual PLMs by fine-tuning them for NER across the MasakhaNER 2.0 corpus. Baseline experiments show that AfroXLM-R, which was adapted to African-language corpora, outperforms general-purpose multilingual PLMs (Figure 2). The F1 gains are largest when models are fine-tuned on languages typologically similar to the target language.
Figure 2: Sample-efficiency results for 100 and 500 samples in the target language, with models fine-tuned either from a PLM (e.g. FT-100: trained on 100 target-language samples) or from the best transfer-language NER model (e.g. BT-Lang-0: trained on 0 target-language samples, i.e. zero-shot).
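As a sketch of the fine-tuning recipe, the following uses the standard Hugging Face token-classification setup. The checkpoint name Davlan/afro-xlmr-base is assumed to be the public AfroXLM-R release, and the hyperparameters are illustrative, not the paper's exact configuration.

```python
# Fine-tuning sketch for NER with AfroXLM-R. The checkpoint name and
# hyperparameters are assumptions, not the authors' exact setup.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

dataset = load_dataset("masakhane/masakhaner2", "hau")
label_names = dataset["train"].features["ner_tags"].feature.names

checkpoint = "Davlan/afro-xlmr-base"  # assumed public AfroXLM-R checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(label_names))

def tokenize_and_align(batch):
    # Subword tokenization splits words; only a word's first subword keeps
    # its label, the rest get -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            row.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        all_labels.append(row)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="afroxlmr-ner",
                           num_train_epochs=5,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```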
In these evaluations, AfroXLM-R-large achieved the highest F1 scores among the multilingual PLMs tested, a substantial improvement that also makes it a strong starting point for zero-shot transfer to African languages.
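Zero-shot transfer can then be scored by evaluating the source-language model directly on another language's test set. The sketch below continues from the fine-tuning sketch above (reusing trainer, label_names, and tokenize_and_align) and computes entity-level F1 with seqeval; the target config "ibo" (Igbo) is an illustrative choice.

```python
# Zero-shot evaluation sketch: score the Hausa-trained model on Igbo.
# Continues from the fine-tuning sketch above (trainer, label_names,
# tokenize_and_align); "ibo" is an illustrative target config.
import numpy as np
from datasets import load_dataset
from seqeval.metrics import f1_score

target = load_dataset("masakhane/masakhaner2", "ibo")["test"]
target_tok = target.map(tokenize_and_align, batched=True,
                        remove_columns=target.column_names)

output = trainer.predict(target_tok)
preds = np.argmax(output.predictions, axis=-1)

# Drop -100 positions (special tokens, continuation subwords) before scoring.
y_true, y_pred = [], []
for pred_row, label_row in zip(preds, output.label_ids):
    keep = label_row != -100
    y_true.append([label_names[l] for l in label_row[keep]])
    y_pred.append([label_names[p] for p in pred_row[keep]])

print("Zero-shot entity-level F1:", f1_score(y_true, y_pred))
```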
Cross-Lingual Transfer Analysis
A substantial part of the research analyzes cross-lingual transfer effectiveness, primarily in zero-shot settings. Given the typological and geographical variability among African languages, selecting the optimal source language for transfer is essential. The experiments reveal that geographic distance and entity overlap significantly influence transfer efficacy, and LangRank-style models that combine these factors can predict the best source languages for a given target (Figure 3).
Figure 3: Correlation between data overlap and F1 transfer performance.
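One of the signals behind such rankings, entity overlap, is easy to approximate. The sketch below computes the fraction of a target language's entity surface forms that also appear in a candidate source language's training data; this is a simplified stand-in for the paper's exact features, and entity_overlap is a hypothetical helper with illustrative config names.

```python
# Hedged sketch of one transfer-language selection signal: entity overlap
# between a candidate source language and the target. A simplified stand-in
# for the features a LangRank-style ranker combines, not the paper's exact
# formulation; entity_overlap is a hypothetical helper.
from datasets import load_dataset

def entity_strings(split, label_names):
    """Collect lowercase entity surface forms from a BIO-tagged split."""
    entities, current = set(), []
    for ex in split:
        for token, tag_id in zip(ex["tokens"], ex["ner_tags"]):
            tag = label_names[tag_id]
            if tag.startswith("B-"):
                if current:
                    entities.add(" ".join(current).lower())
                current = [token]
            elif tag.startswith("I-") and current:
                current.append(token)
            else:
                if current:
                    entities.add(" ".join(current).lower())
                current = []
        if current:  # flush the last entity of each example
            entities.add(" ".join(current).lower())
            current = []
    return entities

def entity_overlap(source_cfg, target_cfg):
    src = load_dataset("masakhane/masakhaner2", source_cfg)["train"]
    tgt = load_dataset("masakhane/masakhaner2", target_cfg)["train"]
    src_ents = entity_strings(src, src.features["ner_tags"].feature.names)
    tgt_ents = entity_strings(tgt, tgt.features["ner_tags"].feature.names)
    return len(src_ents & tgt_ents) / max(1, len(tgt_ents))

# Rank candidate sources for Igbo by this single signal (illustrative configs).
candidates = ["swa", "hau", "yor"]
scores = {c: entity_overlap(c, "ibo") for c in candidates}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```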
MasakhaNER 2.0 shows that choosing a typologically and geographically proximate source language yields significant improvements in NER transfer performance for low-resource African languages. In particular, although English is often used as the source language, its divergence from African languages frequently results in suboptimal transfer. Using another African language instead, when feasible, improves zero-shot transfer by around 14 F1 points on average, underscoring the advantage of Africa-centric source languages.
Conclusion
MasakhaNER 2.0 is a crucial resource for advancing Africa-centric NLP research, addressing a critical gap in annotated datasets for African languages. With coverage of 20 languages and strong baseline results, it offers a reliable benchmark for evaluating and improving NER models. The cross-lingual transfer analysis highlights the value of leveraging typological and geographical similarities to optimize transfer learning in heterogeneous linguistic settings. These findings lay a foundation for future work extending multilingual models and algorithms to better support the linguistic diversity of African languages.