AfroXLMR-Social: African NLP for Social Media
- AfroXLMR-Social is a multilingual encoder-based model tailored for African social media and news domains, enhancing tasks like sentiment analysis, emotion classification, and hate speech detection.
- It employs continual pre-training using Domain-Adaptive Pre-Training (DAPT) on the expansive AfriSocial corpus, with optional Task-Adaptive Pre-Training (TAPT) for refined task-specific performance.
- The model achieves significant performance boosts (+1% to +30% Macro-F1 per language) over leading instruction-tuned LLMs, proving effective in low-resource African language settings.
AfroXLMR-Social is a multilingual encoder-only language model specialized for African languages in the social media and news domains. It extends AfroXLMR, which itself builds on XLM-RoBERTa-Large, using two continual pre-training techniques, Domain-Adaptive Pre-Training (DAPT) and Task-Adaptive Pre-Training (TAPT), applied to a newly constructed corpus, AfriSocial. AfroXLMR-Social demonstrates substantial performance gains (+1% to +30% Macro-F1 per language, +4–7 F1 on average) on subjective NLP tasks including sentiment analysis, emotion classification, and hate speech detection across 19 African languages, outperforming prominent instruction-tuned LLMs in these specialized low-resource settings (Belay et al., 24 Mar 2025).
1. Model Structure and Pre-training Objectives
AfroXLMR-Social maintains the architecture of XLM-RoBERTa-Large: 24 transformer layers, hidden size 1024, feed-forward size 4096, 16 self-attention heads, and a vocabulary of approximately 250,000 tokens derived via SentencePiece. AfroXLMR initializes parameters from XLM-R and conducts additional language-adaptive pre-training (LAPT) on a multilingual African corpus spanning 76 languages. AfroXLMR-Social further adapts the model through DAPT using AfriSocial data, with optional subsequent TAPT on unlabeled data for task-specific adaptation.
Pre-training employs the masked language modeling (MLM) objective. For an input token sequence $x = (x_1, \ldots, x_n)$, a random masking set $M \subset \{1, \ldots, n\}$ is selected, the tokens at those positions are corrupted to form $\tilde{x}$, and the loss is
$$\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in M} \log p_{\theta}\left(x_i \mid \tilde{x}\right).$$
DAPT minimizes this loss over the domain corpus $D_{\mathrm{AfriSocial}}$:
$$\theta_{\mathrm{DAPT}} = \arg\min_{\theta}\; \mathbb{E}_{x \sim D_{\mathrm{AfriSocial}}}\big[\mathcal{L}_{\mathrm{MLM}}(\theta)\big].$$
TAPT employs the same loss over the unlabeled data $D_T$ of each downstream task $T$:
$$\theta_{\mathrm{TAPT}} = \arg\min_{\theta}\; \mathbb{E}_{x \sim D_T}\big[\mathcal{L}_{\mathrm{MLM}}(\theta)\big].$$
No architecture changes or adapter modules are introduced; continual pre-training relies solely on new domain/task data.
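The MLM objective above can be exercised directly with the Hugging Face masked-LM head. The snippet below is a minimal sketch rather than the authors' training code; the `Davlan/afro-xlmr-large` starting checkpoint, the 15% masking rate, and the toy sentences are assumptions.

```python
# Minimal illustration of the MLM loss used for DAPT/TAPT.
# Assumes the public "Davlan/afro-xlmr-large" checkpoint as the AfroXLMR base.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

base = "Davlan/afro-xlmr-large"          # assumed AfroXLMR starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Dynamic masking: 15% of tokens per batch are selected into the mask set M.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

sentences = ["Habari za asubuhi!", "Ina kwana?"]   # toy AfriSocial-style inputs
encoded = [tokenizer(s, truncation=True, max_length=128) for s in sentences]
batch = collator(encoded)                # masked input_ids plus -100-padded labels

with torch.no_grad():
    out = model(**batch)                 # cross-entropy over masked positions only
print(f"MLM loss on this batch: {out.loss.item():.3f}")
```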
2. AfriSocial Corpus Design and Properties
AfriSocial is a large-scale unlabeled corpus combining social media (X, formerly Twitter) and news texts across 14 African languages. Collection emphasizes alignment with the data sources and language distributions of the evaluation tasks (sentiment analysis, emotion classification, hate speech detection).
Data selection and pre-processing involve:
- Sentence-level language identification using pycld3 (Latin scripts) and GeezSwitch (Ethiopic).
- Sentence segmentation: NLTK (Latin), amseg (Ethiopic).
- Removal of user handles, URLs, and short sentences (<3 tokens), plus de-duplication against the evaluation splits (a filtering sketch follows this list).
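The following is a minimal sketch of this filtering pipeline for Latin-script text, assuming the pycld3 (`cld3`) and NLTK APIs; the Ethiopic-script branch (GeezSwitch for language identification, amseg for segmentation) is only indicated in a comment, and the example input is invented.

```python
# Sketch of an AfriSocial-style cleaning pipeline for Latin-script text.
# Ethiopic-script text would instead go through GeezSwitch (language ID) and
# amseg (sentence segmentation); those steps are omitted here.
import re
import cld3                      # pycld3 language identifier
import nltk

nltk.download("punkt", quiet=True)

HANDLE_OR_URL = re.compile(r"(@\w+|https?://\S+|www\.\S+)")

def clean_sentences(raw_texts, target_lang, eval_sentences=frozenset()):
    """Yield cleaned sentences in `target_lang`, skipping eval-split duplicates."""
    seen = set(eval_sentences)
    for text in raw_texts:
        text = HANDLE_OR_URL.sub("", text).strip()
        for sent in nltk.sent_tokenize(text):        # approximate segmentation
            sent = sent.strip()
            if len(sent.split()) < 3:                # drop very short sentences
                continue
            pred = cld3.get_language(sent)           # sentence-level language ID
            if pred is None or not pred.is_reliable or pred.language != target_lang:
                continue
            if sent in seen:                         # de-duplicate against eval splits
                continue
            seen.add(sent)
            yield sent

# Example: keep only reliably identified Swahili sentences.
for s in clean_sentences(["Habari yako? Karibu sana @rafiki https://t.co/xyz"], "sw"):
    print(s)
```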
The corpus comprises 1.82M social media and 1.74M news sentences (51%/49%). The table below lists language-specific sentence counts.
| Lang. | X sentences | News sentences | Total |
|---|---:|---:|---:|
| amh | 588,154 | 45,480 | 633,634 |
| arq | 9,219 | 156,494 | 165,712 |
| hau | 640,737 | 30,935 | 671,672 |
| ibo | 15,436 | 38,231 | 53,667 |
| kin | 16,928 | 72,583 | 89,511 |
| orm | 33,587 | 59,429 | 93,016 |
| pcm | 106,577 | 7,781 | 116,358 |
| som | 144,862 | 24,473 | 169,335 |
| swa | 46,588 | — | 46,834 |
| tir | 167,139 | 45,033 | 212,172 |
| twi | 8,681 | — | 8,681 |
| yor | 26,560 | 49,591 | 76,151 |
| xho | — | 354,959 | 354,959 |
| zul | 12,102 | 854,587 | 866,689 |
| Total | 1.82M | 1.74M | 3.56M |
3. Continual Pre-training Procedures and Parameters
Domain-Adaptive Pre-Training (DAPT)
- Data: Full AfriSocial (≈3.56M sentences).
- MLM objective as described above.
- Training parameters: batch size 8, AdamW optimizer (weight decay 0.01), approximately 10 epochs, SentencePiece tokenizer. Training takes approximately 3 days on 3 GPUs.
Task-Adaptive Pre-Training (TAPT)
- For each task (AfriSenti, AfriEmo, AfriHate), MLM pre-training is performed on the unlabeled train split $D_T$ of that task.
- Parameters: batch size 8, 3–5 epochs (<1 hour on 1 GPU).
Combining DAPT and TAPT
- Sequential procedure: initialize from AfroXLMR, apply DAPT on AfriSocial, then TAPT on the task's unlabeled data $D_T$.
- This two-phase update generally yields the strongest downstream results, but overfitting on small TAPT splits can cause catastrophic forgetting of the domain knowledge acquired during DAPT (see the sketch below).
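A compact sketch of the DAPT and sequential DAPT+TAPT procedures with the Hugging Face `Trainer` is given below. The learning rate, masking rate, file paths, and the `Davlan/afro-xlmr-large` starting checkpoint are assumptions; the remaining hyperparameters follow the values listed above (batch size 8, AdamW with weight decay 0.01, ~10 DAPT epochs, 3–5 TAPT epochs).

```python
# Sketch of continual pre-training: DAPT on AfriSocial, then optional TAPT on a
# task's unlabeled split. Paths, the starting checkpoint, the learning rate and
# the masking rate are illustrative assumptions, not values from the paper.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def continue_mlm(checkpoint: str, text_file: str, out_dir: str, epochs: int) -> str:
    """Continue MLM pre-training of `checkpoint` on a one-sentence-per-line file."""
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)
    ds = load_dataset("text", data_files={"train": text_file})["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=out_dir,
            per_device_train_batch_size=8,      # batch size 8, as described above
            num_train_epochs=epochs,
            learning_rate=5e-5,                 # assumed; not specified in this summary
            weight_decay=0.01,                  # AdamW weight decay, as described above
        ),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
    )
    trainer.train()
    trainer.save_model(out_dir)
    tok.save_pretrained(out_dir)
    return out_dir

# Phase 1 (DAPT): ~10 epochs over the full AfriSocial corpus.
dapt = continue_mlm("Davlan/afro-xlmr-large", "afrisocial.txt",
                    "afroxlmr-social-dapt", epochs=10)
# Phase 2 (TAPT, optional): 3-5 epochs over one task's unlabeled train split.
dapt_tapt = continue_mlm(dapt, "afrisenti_unlabeled_train.txt",
                         "afroxlmr-social-dapt-tapt-senti", epochs=3)
```

TAPT alone corresponds to skipping Phase 1 and running the same helper directly from the AfroXLMR checkpoint on the task's unlabeled data.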
4. Downstream Tasks, Datasets, and Evaluation Metrics
AfroXLMR-Social is evaluated on three social-domain tasks for African languages, with Macro-F1 as the principal metric.
Task summary:
| Task | Languages | Categories | Instances | Data Sources |
|---|---|---|---|---|
| AfriSenti | 14 | positive, neutral, negative | 107,694 | X (Twitter), news |
| AfriEmo | 17 | 6 emotions (multi-label) | 70,859 | X, Reddit, YouTube, news |
| AfriHate | 15 | abuse, hate, neutral | 90,455 | X |
All evaluations are performed on original train/dev/test splits as defined by the source datasets.
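For illustration, the sketch below shows the evaluation recipe in outline: fine-tune the adapted encoder on a labeled task and report Macro-F1 on the test split. The checkpoint path, the CSV files with `text`/`label` columns, the three-class label set (AfriSenti-style), and the fine-tuning hyperparameters are assumptions, not the released evaluation scripts.

```python
# Sketch of the downstream evaluation recipe: fine-tune the adapted encoder on a
# labeled task (AfriSenti-style 3-class sentiment here) and report Macro-F1.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

ckpt = "afroxlmr-social-dapt"          # adapted checkpoint (placeholder path)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

# CSV files with a "text" column and an integer "label" column (0/1/2).
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
            batched=True)

def macro_f1(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"macro_f1": f1_score(eval_pred.label_ids, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments("finetune-afrisenti", per_device_train_batch_size=8,
                           num_train_epochs=3, learning_rate=2e-5),  # assumed values
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=macro_f1,
)
trainer.train()
print(trainer.evaluate())              # includes eval_macro_f1 on the test split
```

For the multi-label AfriEmo setting, the same recipe would load the classification head with `problem_type="multi_label_classification"` and score thresholded sigmoid outputs with Macro-F1.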
5. Empirical Performance and Comparative Results
Performance improvements due to DAPT, TAPT, and their combination are reported per-task and per-language.
DAPT Results
Average F1 improvements (Table 2 excerpt):
| Dataset | Base (AfroXLMR) | +DAPT | ΔF1 |
|---|---:|---:|---:|
| AfriSenti | 51.39 | 58.21 | +6.82 |
| AfriEmo | 44.25 | 51.45 | +7.20 |
| AfriHate | 65.17 | 69.83 | +4.66 |
Per-language F1 gains range from +1% to +30%, with low-resource languages showing the largest improvements.
TAPT and Cross-Task Transfer
Using unlabeled sibling task data (Table 3 excerpt):
| Eval Task | Base | TAPT from sibling task | DAPT+TAPT |
|---|---:|---|---:|
| AfriSenti | 52.62 | 57.09 (Emo), 57.34 (Hate) | 59.35 |
| AfriEmo | 45.74 | 49.22 (Senti), 49.22 (Hate) | 52.87 |
| AfriHate | 66.72 | 67.46 (Senti), 67.14 (Emo) | 67.77 |
Fine-tuning AfroXLMR on AfriEmo unlabeled data (TAPT) yields +5.65 F1 on AfriSenti; combined DAPT+TAPT increases this to +8.7.
DAPT+TAPT Detailed Comparison
Best combinations per task (Table 4 excerpt):
| Task | Best single step | Best DAPT+TAPT |
|---|---|---|
| AfriSenti | +DAPT (56.85) | +DAPT+TAPT (Emo) (57.73) |
| AfriEmo | +DAPT (51.48) | +DAPT+TAPT (Senti) (49.84) † |
| AfriHate | +DAPT (70.56) | +DAPT+TAPT (Emo) (67.18) † |
† TAPT or DAPT alone occasionally matches or exceeds the combined step, reflecting forgetting effects.
Comparison with LLMs
Fine-tuned AfroXLMR-Social is competitive or superior to 7–70B parameter instruction-tuned LLMs (Llama-3, Gemma, GPT-4o) on these African language tasks, underlining the effectiveness of encoder-only models tailored via continual pre-training.
6. Analysis, Error Patterns, and Future Directions
Continual domain and task adaptation—via DAPT and TAPT—yields consistent F1 improvements, substantiating the value of focused pre-training even for strong base multilingual models. Cross-task TAPT with unlabeled sibling data provides further gains (+0.6% to +15.1% F1), particularly where evaluation and adaptation tasks are closely related (e.g., sentiment for emotion).
A notable limitation is “catastrophic forgetting” during TAPT on small datasets, which can partially overwrite the broader domain knowledge gained via DAPT. Future work may therefore benefit from better blending of domain and task data or from adapter-based architectures.
Encoder-only models like AfroXLMR-Social retain state-of-the-art status on low-resource, domain-specific African NLP tasks compared to zero-shot LLMs, particularly when fine-tuned with relevant local data. Residual errors are concentrated on code-switched expressions, highly dialectal forms, sarcasm, and ambiguous emotional contexts—points of difficulty that may respond to expanded corpora or multi-task objectives.
7. Summary and Resources
AfroXLMR-Social establishes a strong paradigm for African language NLP: continual domain and task adaptation, without architectural changes, efficiently leverages social/news sentences (3.5M in AfriSocial) to augment model capability for three major subjective tasks across 14–17 languages. The approach delivers +4–7 Macro-F1 improvements and, via cross-task TAPT, can further boost performance. The model surpasses or matches advanced instruction-tuned LLMs in this context.
All data, models, and code are publicly accessible via HuggingFace: https://huggingface.co/tadesse/AfroXLMR-Social
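A minimal usage sketch follows, assuming the identifier in the link above resolves to a standard transformers masked-LM checkpoint.

```python
# Minimal usage sketch; assumes the id from the link above is a standard
# transformers masked-LM checkpoint (it can also be fine-tuned as in Section 4).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "tadesse/AfroXLMR-Social"   # identifier taken from the link above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "NLP for African languages is <mask>."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top5))   # top-5 candidates for the masked slot
```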