MasakhaNER: African NER Benchmark for Low-Resource Languages
- MasakhaNER is a comprehensive suite of datasets and benchmarks for NER in African languages, emphasizing both monolingual and cross-lingual evaluation.
- It features detailed annotation protocols with high inter-annotator agreement (Fleiss’ κ ≥ 0.93) using newswire corpora for diverse languages.
- The suite benchmarks transfer learning techniques, revealing significant gains from Africa-centric models and multi-source co-training in low-resource settings.
MasakhaNER denotes a set of publicly released human-annotated datasets, benchmarks, and accompanying experiments for named entity recognition (NER) in African languages. Addressing the severe underrepresentation of African languages in computational linguistics, MasakhaNER supports both monolingual and cross-lingual NER research across a typologically, genealogically, and geographically diverse set of languages. The project comprises two major dataset releases (MasakhaNER 1.0 and MasakhaNER 2.0), detailed annotation protocols, and extensive empirical evaluations targeting supervised and transfer learning settings. MasakhaNER has also become a de facto NER evaluation suite for scalable multilingual pre-trained language models, including work addressing vocabulary bottlenecks in low-resource settings (Adelani et al., 2022, Adelani et al., 2021, Liang et al., 2023).
1. Dataset Construction, Annotation Protocols, and Linguistic Scope
MasakhaNER 1.0 features ten languages—Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá—selected for linguistic diversity and coverage. MasakhaNER 2.0 expands to twenty languages, including additional Bantu languages (isiZulu, isiXhosa, chiShona, Setswana, Chichewa), a Nilo-Saharan language (Luo), an Afro-Asiatic language (Hausa), and an English-lexifier creole (Naija). Supported scripts include Latin (with language-specific extensions), Arabic (Hausa), and hybrid N’Ko/Roman (Bambara).
Data sourcing primarily derives from African newswire corpora and translations of news stories in low-resource scenarios. Annotation utilizes the ELISA tool and MUC-6 guidelines, designating four entity categories: PERSON (PER), LOCATION (LOC), ORGANIZATION (ORG), and DATE/TIME (DATE). Quality control involves three native-speaking annotators per language, a language coordinator, and adjudication of disagreements, achieving post-coordination Fleiss’ κ ≥ 0.93 for all languages (Adelani et al., 2022).
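Fleiss' κ over token-level labels from three annotators could be computed as in the minimal sketch below. This is an illustration of the agreement statistic itself, not MasakhaNER's actual tooling; the item granularity (per-token labels) and the toy data are assumptions.

```python
from collections import Counter

def fleiss_kappa(annotations):
    """Fleiss' kappa for items each labelled by the same number of raters.

    `annotations` is a list of items, each a list of one label per rater.
    """
    n_raters = len(annotations[0])
    categories = {lab for item in annotations for lab in item}
    # Per-category marginal counts and mean per-item agreement P_bar.
    p_j = Counter()
    P_bar = 0.0
    for item in annotations:
        counts = Counter(item)
        p_j.update(counts)
        P_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    P_bar /= len(annotations)
    total = len(annotations) * n_raters
    # Chance agreement P_e from the marginal label distribution.
    P_e = sum((p_j[c] / total) ** 2 for c in categories)
    return (P_bar - P_e) / (1 - P_e)

# Three annotators labelling five tokens; one disagreement on the last token.
items = [
    ["B-PER", "B-PER", "B-PER"],
    ["O", "O", "O"],
    ["B-LOC", "B-LOC", "B-LOC"],
    ["O", "O", "O"],
    ["B-ORG", "B-ORG", "O"],
]
print(round(fleiss_kappa(items), 3))  # ≈ 0.805
```

A single three-way disagreement out of five tokens already drops κ below the ≥ 0.93 threshold the project reports, which illustrates how strict that bar is.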
Entity density in the corpora ranges from approximately 6% to 16% of tokens, with each language’s train/dev/test split conforming to a 70/10/20 ratio and yielding between 4.3K–7.8K training sentences and 120K–345K tokens per language.
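Both statistics above are easy to derive from BIO-tagged data; the following toy sketch (hypothetical tag sequences, not MasakhaNER data) shows entity density as the fraction of non-O tokens and the 70/10/20 sentence split.

```python
def entity_density(tagged_sentences):
    """Fraction of tokens inside a named entity (any non-O BIO tag)."""
    inside = sum(1 for sent in tagged_sentences for tag in sent if tag != "O")
    total = sum(len(sent) for sent in tagged_sentences)
    return inside / total

def split_sizes(n_sentences, ratios=(0.7, 0.1, 0.2)):
    """Train/dev/test sentence counts for a 70/10/20 split; remainder to train."""
    dev = int(n_sentences * ratios[1])
    test = int(n_sentences * ratios[2])
    return n_sentences - dev - test, dev, test

# Toy corpus: two sentences with BIO tags.
tags = [
    ["B-PER", "I-PER", "O", "O", "B-LOC"],
    ["O", "O", "B-DATE", "I-DATE", "O"],
]
print(entity_density(tags))   # 5 entity tokens / 10 total = 0.5
print(split_sizes(5000))      # (3500, 500, 1000)
```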
2. Linguistic Phenomena and NER Challenges
MasakhaNER’s coverage exposes several challenges inherent to African language NER:
- Script and Orthographic Diversity: Amharic’s Fidel script lacks case distinction, Western Bantu languages capitalize roots post-prefix, and Nigerian Pidgin has no standardized spelling.
- Diacritics and Tonal Marking: Languages such as Yorùbá and Igbo utilize diacritics and tone marks, many of which are omitted in casual writing or inconsistently rendered in digital corpora.
- Agglutination and Morphological Richness: Agglutinative morphology observed in languages like Igbo and Wolof leads to a high rate of out-of-vocabulary (OOV) tokens for standard tokenizers.
- Lexical Divergence and Entity Variation: Entities such as "Nigeria" may appear with multiple spellings across or within languages, and numerals are frequently rendered in non-standard ways.
- Domain and Register Variation: Source text heterogeneity (online news vs. Wikipedia) complicates entity normalization and model generalization (Adelani et al., 2021, Adelani et al., 2022).
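The diacritics issue above can be made concrete: the same entity may surface with and without tone marks, so surface-form matching fails unless forms are normalized. The snippet below strips combining marks after Unicode NFD decomposition; this is a generic illustration of the phenomenon, not a normalization step from the MasakhaNER pipeline.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks (tone marks, diacritics) via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# A tone-marked Yorùbá form and its casually written variant collide
# only after diacritic stripping.
print(strip_diacritics("Yorùbá"))                          # Yoruba
print(strip_diacritics("Èkó") == strip_diacritics("Eko"))  # True
```

Note that stripping diacritics can also merge genuinely distinct words, which is exactly why inconsistent diacritization makes African-language NER harder rather than easier.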
3. NER Modeling, Cross-Lingual Transfer, and Evaluation
MasakhaNER benchmarks make use of several state-of-the-art pre-trained language models (PLMs):
- General Multilingual PLMs: mBERT, XLM-R (base and large), RemBERT, mDeBERTaV3.
- Africa-centric PLMs: AfriBERTa (pre-trained on 11 African languages), AfroXLM-R (language-adapted XLM-R on 17 African languages plus English, French, Arabic).
- Classical Architectures: CNN-BiLSTM-CRF, MeanE-BiLSTM.
Fine-tuning employs sequence lengths up to 200 tokens, batch sizes of 16–32, and standard fine-tuning learning rates on the order of 10⁻⁵, using HuggingFace Transformers and AdamW (Adelani et al., 2022, Adelani et al., 2021). Standard evaluation adopts the micro-averaged F₁ score, F₁ = 2·P·R / (P + R), where P and R are entity-level precision and recall and an exact match on both span and entity type is required.
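Entity-level micro-averaged F₁ with exact-match spans can be sketched as below (in practice the seqeval library is the usual tool; this minimal reimplementation is for illustration only).

```python
def extract_entities(tags):
    """Spans (type, start, end) from one BIO sequence; the exact-match units."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
                etype = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype != tag[2:]:
            # Ill-formed I- tag (no matching B-): treat as a new entity start.
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
    return set(spans)

def micro_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1: pool TP/FP/FN over all sentences, then combine."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(micro_f1(gold, pred))  # one of two gold entities found exactly
```

Because matching is at the entity level, a prediction that gets the span but not the type (or truncates the span by one token) counts as both a false positive and a false negative.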
Monolingual baselines indicate that AfroXLM-R-large achieves the highest average F₁ (87.0), outperforming XLM-R-large (85.1) and mDeBERTaV3 (85.7). Language-adaptive fine-tuning consistently yields improvements, especially for under-resourced scripts and morphologies.
4. Cross-Lingual and Few-Shot Transfer: Patterns and Predictors
Cross-lingual transfer—zero-shot or few-shot—is a central focus. Zero-shot transfer fine-tunes a PLM on a source language and directly evaluates on a target. Across 20 languages, the best source language is typically geographically or typologically proximate to the target; transferring from English yields considerably lower average zero-shot F₁ (56.9) compared to African source languages (average +14 F₁ improvement by choosing the optimal African source).
A heatmap analysis demonstrates that specific pairs (e.g., Shona→Xhosa, Yoruba→Igbo, Swahili→Kinyarwanda) yield optimal transfer. Two-source co-training further raises zero-shot F₁ by 2–4 points; gains of 3–7 F₁ are typical beyond the best single-source baseline.
Few-shot transfer is highly sample-efficient. Zero-shot transferred models fine-tuned on only 100 or 500 target language sentences reach F₁ scores of ≈80–90, vastly outperforming English-based transfer and pure few-shot baselines. This underlines that targeted transfer followed by minimal in-language tuning is highly effective even in low-resource settings.
Key predictors of transfer success, formalized via an adaptation of LangRank, are:
- Geographic distance: overlap in person, location, and organization entities increases with proximity between source and target languages.
- Entity-type token overlap: the fraction of unique entity tokens shared between the source and target training sets, which correlates strongly (Spearman) with transfer F₁. Other features such as genetic, phonological, or syntactic distances, as well as training set size, show lesser and inconsistent predictive utility for NER (Adelani et al., 2022).
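The entity-token-overlap feature can be sketched as follows. The exact definition in the LangRank adaptation may differ; lowercasing, whitespace tokenization, and using the target side as denominator are assumptions here, and the entity lists are toy data.

```python
def entity_token_overlap(src_entities, tgt_entities):
    """Fraction of the target's unique entity tokens also seen in the source.

    A simple proxy for the overlap feature: split each entity mention into
    tokens, deduplicate per side, and measure coverage of the target set.
    """
    src = {tok.lower() for ent in src_entities for tok in ent.split()}
    tgt = {tok.lower() for ent in tgt_entities for tok in ent.split()}
    return len(src & tgt) / len(tgt) if tgt else 0.0

# Hypothetical entity mentions from two training sets.
src = ["Nairobi", "United Nations", "Kenya"]
tgt = ["Kenya", "Nairobi County", "Kampala"]
print(entity_token_overlap(src, tgt))  # 2 of 4 target tokens covered = 0.5
```

Geographically close language pairs tend to report the same people, places, and organizations in their news, which is why this simple lexical feature tracks transfer quality so well.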
5. Impact and Benchmarking with Large Multilingual Models
MasakhaNER data establishes itself as the principal NER benchmark for African language support in scalable PLMs. XLM-V—the first multilingual PLM with a one-million-token vocabulary—uses MasakhaNER for low-resource evaluation (Liang et al., 2023). XLM-V’s token allocation strategy assigns explicit token budgets via cluster-specific SentencePiece vocabularies, dramatically reducing over-tokenization and OOV rates compared to XLM-R’s 250K shared vocabulary. On zero-shot English-to-African cross-lingual NER, XLM-V outperforms XLM-R by +11.2 F₁ (32.1 vs. 20.9 average), achieving especially large increases on agglutinative languages (e.g., Igbo: 11.6→45.9, Luganda: 9.5→48.7, Yoruba: 10.0→35.8). This demonstrates that resource allocation at the subword level is critical for African language NER benchmarks (Liang et al., 2023).
| Language | XLM-R (250K) F₁ | XLM-V (1M) F₁ |
|---|---|---|
| Amharic | 25.1 | 20.6 |
| Hausa | 43.5 | 35.9 |
| Igbo | 11.6 | 45.9 |
| Kinyarwanda | 9.4 | 25.0 |
| Luganda | 9.5 | 48.7 |
| Luo | 8.4 | 10.4 |
| Nigerian Pidgin | 36.8 | 38.2 |
| Swahili | 48.9 | 44.0 |
| Wolof | 5.3 | 16.7 |
| Yoruba | 10.0 | 35.8 |
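The per-language scores in the table are consistent with the cited averages; a small sanity computation (no new results, just the column means):

```python
# Per-language zero-shot F1 scores, in the table's row order
# (Amharic ... Yoruba).
xlm_r = [25.1, 43.5, 11.6, 9.4, 9.5, 8.4, 36.8, 48.9, 5.3, 10.0]
xlm_v = [20.6, 35.9, 45.9, 25.0, 48.7, 10.4, 38.2, 44.0, 16.7, 35.8]

avg_r = sum(xlm_r) / len(xlm_r)  # ≈ 20.85, i.e. the cited 20.9
avg_v = sum(xlm_v) / len(xlm_v)  # ≈ 32.12, i.e. the cited 32.1
print(f"XLM-R avg {avg_r:.2f}, XLM-V avg {avg_v:.2f}")
```

Note that a few rows (Amharic, Hausa, Swahili) actually favor XLM-R; the headline gain is driven by the very large improvements on Igbo, Luganda, Wolof, and Yoruba.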
6. Recommendations, Limitations, and Future Directions
Best practices established by MasakhaNER research are:
- Prefer Africa-centric transfer source languages over defaulting to English.
- Utilize heuristics or models like LangRank to select the most appropriate source language(s).
- Even minimal in-language annotation (100–500 sentences) post-transfer significantly enhances downstream performance.
- Two-source co-training is frequently advantageous.
- Domain and register coverage expansion, rapid annotation pipelines, and inclusion of social media or legal texts are vital for future datasets.
- Explore multi-source transfer, adapter/prompt-based methods, continual learning, and social-linguistic network features as predictors of transferability.
The MasakhaNER suite and its extensions provide a rigorous foundation for Africa-centric NLP research, revealing typological and geographic alignment as primary drivers of transfer efficiency and, by extension, advancing the study and application of NER in truly low-resource languages (Adelani et al., 2022, Adelani et al., 2021, Liang et al., 2023).