AfroXLMR-Social: African NLP for Social Media
- AfroXLMR-Social is a multilingual encoder-based model tailored for African social media and news domains, enhancing tasks like sentiment analysis, emotion classification, and hate speech detection.
- It employs continual pre-training using Domain-Adaptive Pre-Training (DAPT) on the expansive AfriSocial corpus, with optional Task-Adaptive Pre-Training (TAPT) for refined task-specific performance.
- The model achieves significant performance boosts (+1% to +30% Macro-F1 per language) over leading instruction-tuned LLMs, proving effective in low-resource African language settings.
AfroXLMR-Social is a multilingual encoder-only language model specialized for African languages in the social media and news domains. It extends AfroXLMR, which itself builds on XLM-RoBERTa-Large, using two continual pre-training techniques, Domain-Adaptive Pre-Training (DAPT) and Task-Adaptive Pre-Training (TAPT), applied to a newly constructed corpus, AfriSocial. AfroXLMR-Social demonstrates substantial performance gains (+1% to +30% Macro-F1 per language, +4–7 F1 on average) on subjective NLP tasks including sentiment analysis, emotion classification, and hate speech detection across 19 African languages, outperforming prominent instruction-tuned LLMs in these specialized low-resource settings (Belay et al., 24 Mar 2025).
1. Model Structure and Pre-training Objectives
AfroXLMR-Social maintains the architecture of XLM-RoBERTa-Large: 24 transformer layers, hidden size 1024, feed-forward size 4096, 16 self-attention heads, and a vocabulary of approximately 250,000 tokens derived via SentencePiece. AfroXLMR initializes parameters from XLM-R and conducts additional language-adaptive pre-training (LAPT) on a multilingual African corpus spanning 76 languages. AfroXLMR-Social further adapts the model through DAPT using AfriSocial data, with optional subsequent TAPT on unlabeled data for task-specific adaptation.
Pre-training employs the masked language modeling (MLM) objective. For an input token sequence $x = (x_1, \ldots, x_n)$, a random masking set $M \subset \{1, \ldots, n\}$ is selected, the tokens at those positions are corrupted to form $\tilde{x}$, and the loss is
$$\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in M} \log p_{\theta}\left(x_i \mid \tilde{x}\right).$$
DAPT minimizes this loss over the domain corpus $D_{\mathrm{AfriSocial}}$:
$$\theta_{\mathrm{DAPT}} = \arg\min_{\theta}\; \mathbb{E}_{x \sim D_{\mathrm{AfriSocial}}}\big[\mathcal{L}_{\mathrm{MLM}}(\theta)\big].$$
TAPT employs the same loss over the unlabeled data $D_T$ of each downstream task $T$:
$$\theta_{\mathrm{TAPT}} = \arg\min_{\theta}\; \mathbb{E}_{x \sim D_T}\big[\mathcal{L}_{\mathrm{MLM}}(\theta)\big].$$
No architecture changes or adapter modules are introduced; continual pre-training relies solely on new domain/task data.
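The MLM objective above can be exercised directly with the Hugging Face masked-LM head. The snippet below is a minimal sketch rather than the authors' training code; the `Davlan/afro-xlmr-large` starting checkpoint, the 15% masking rate, and the toy sentences are assumptions.

```python
# Minimal illustration of the MLM loss used for DAPT/TAPT.
# Assumes the public "Davlan/afro-xlmr-large" checkpoint as the AfroXLMR base.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

base = "Davlan/afro-xlmr-large"          # assumed AfroXLMR starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Dynamic masking: 15% of tokens per batch are selected into the mask set M.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

sentences = ["Habari za asubuhi!", "Ina kwana?"]   # toy AfriSocial-style inputs
encoded = [tokenizer(s, truncation=True, max_length=128) for s in sentences]
batch = collator(encoded)                # masked input_ids plus -100-padded labels

with torch.no_grad():
    out = model(**batch)                 # cross-entropy over masked positions only
print(f"MLM loss on this batch: {out.loss.item():.3f}")
```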
2. AfriSocial Corpus Design and Properties
AfriSocial is a large-scale unlabeled corpus combining social media (X, formerly Twitter) and news texts across 14 African languages. Collection emphasizes alignment with the data sources and language distributions of the evaluation tasks (sentiment analysis, emotion classification, hate speech detection).
Data selection and pre-processing involve:
- Sentence-level language identification using pycld3 (Latin scripts) and GeezSwitch (Ethiopic).
- Sentence segmentation: NLTK (Latin), amseg (Ethiopic).
- Removal of user handles, URLs, and short sentences (<3 tokens), plus de-duplication against the evaluation splits (a filtering sketch follows this list).
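The following is a minimal sketch of this filtering pipeline for Latin-script text, assuming the pycld3 (`cld3`) and NLTK APIs; the Ethiopic-script branch (GeezSwitch for language identification, amseg for segmentation) is only indicated in a comment, and the example input is invented.

```python
# Sketch of an AfriSocial-style cleaning pipeline for Latin-script text.
# Ethiopic-script text would instead go through GeezSwitch (language ID) and
# amseg (sentence segmentation); those steps are omitted here.
import re
import cld3                      # pycld3 language identifier
import nltk

nltk.download("punkt", quiet=True)

HANDLE_OR_URL = re.compile(r"(@\w+|https?://\S+|www\.\S+)")

def clean_sentences(raw_texts, target_lang, eval_sentences=frozenset()):
    """Yield cleaned sentences in `target_lang`, skipping eval-split duplicates."""
    seen = set(eval_sentences)
    for text in raw_texts:
        text = HANDLE_OR_URL.sub("", text).strip()
        for sent in nltk.sent_tokenize(text):        # approximate segmentation
            sent = sent.strip()
            if len(sent.split()) < 3:                # drop very short sentences
                continue
            pred = cld3.get_language(sent)           # sentence-level language ID
            if pred is None or not pred.is_reliable or pred.language != target_lang:
                continue
            if sent in seen:                         # de-duplicate against eval splits
                continue
            seen.add(sent)
            yield sent

# Example: keep only reliably identified Swahili sentences.
for s in clean_sentences(["Habari yako? Karibu sana @rafiki https://t.co/xyz"], "sw"):
    print(s)
```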
The corpus comprises 1.82M social media and 1.74M news sentences (51%/49%). The table below lists language-specific sentence counts.
| Lang. | X sentences | News sentences | Total |
|---|---:|---:|---:|
| amh | 588,154 | 45,480 | 633,634 |
| arq | 9,219 | 156,494 | 165,712 |
| hau | 640,737 | 30,935 | 671,672 |
| ibo | 15,436 | 38,231 | 53,667 |
| kin | 16,928 | 72,583 | 89,511 |
| orm | 33,587 | 59,429 | 93,016 |
| pcm | 106,577 | 7,781 | 116,358 |
| som | 144,862 | 24,473 | 169,335 |
| swa | 46,588 | — | 46,834 |
| tir | 167,139 | 45,033 | 212,172 |
| twi | 8,681 | — | 8,681 |
| yor | 26,560 | 49,591 | 76,151 |
| xho | — | 354,959 | 354,959 |
| zul | 12,102 | 854,587 | 866,689 |
| Total | 1.82M | 1.74M | 3.56M |
3. Continual Pre-training Procedures and Parameters
Domain-Adaptive Pre-Training (DAPT)
- Data: Full AfriSocial (≈3.56M sentences).
- MLM objective as described above.
- Training parameters: batch size 8, AdamW optimizer (weight decay 0.01), approximately 10 epochs, SentencePiece tokenizer. Training takes approximately 3 days on 3 GPUs.
Task-Adaptive Pre-Training (TAPT)
- For each task (AfriSenti, AfriEmo, AfriHate), MLM pre-training is performed on the unlabeled train split $D_T$ of that task.
- Parameters: batch size 8, 3–5 epochs (<1 hour on 1 GPU).
Combining DAPT and TAPT
- Sequential procedure: initialize from AfroXLMR, apply DAPT on AfriSocial, then TAPT on the task's unlabeled data $D_T$.
- This two-phase update generally yields the strongest downstream results, but overfitting on small TAPT splits can cause catastrophic forgetting of the domain knowledge acquired during DAPT (see the sketch below).
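A compact sketch of the DAPT and sequential DAPT+TAPT procedures with the Hugging Face `Trainer` is given below. The learning rate, masking rate, file paths, and the `Davlan/afro-xlmr-large` starting checkpoint are assumptions; the remaining hyperparameters follow the values listed above (batch size 8, AdamW with weight decay 0.01, ~10 DAPT epochs, 3–5 TAPT epochs).

```python
# Sketch of continual pre-training: DAPT on AfriSocial, then optional TAPT on a
# task's unlabeled split. Paths, the starting checkpoint, the learning rate and
# the masking rate are illustrative assumptions, not values from the paper.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def continue_mlm(checkpoint: str, text_file: str, out_dir: str, epochs: int) -> str:
    """Continue MLM pre-training of `checkpoint` on a one-sentence-per-line file."""
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)
    ds = load_dataset("text", data_files={"train": text_file})["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=out_dir,
            per_device_train_batch_size=8,      # batch size 8, as described above
            num_train_epochs=epochs,
            learning_rate=5e-5,                 # assumed; not specified in this summary
            weight_decay=0.01,                  # AdamW weight decay, as described above
        ),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
    )
    trainer.train()
    trainer.save_model(out_dir)
    tok.save_pretrained(out_dir)
    return out_dir

# Phase 1 (DAPT): ~10 epochs over the full AfriSocial corpus.
dapt = continue_mlm("Davlan/afro-xlmr-large", "afrisocial.txt",
                    "afroxlmr-social-dapt", epochs=10)
# Phase 2 (TAPT, optional): 3-5 epochs over one task's unlabeled train split.
dapt_tapt = continue_mlm(dapt, "afrisenti_unlabeled_train.txt",
                         "afroxlmr-social-dapt-tapt-senti", epochs=3)
```

TAPT alone corresponds to skipping Phase 1 and running the same helper directly from the AfroXLMR checkpoint on the task's unlabeled data.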
4. Downstream Tasks, Datasets, and Evaluation Metrics
AfroXLMR-Social is evaluated on three social-domain tasks for African languages, with Macro-F1 as the principal metric.
Task summary:
| Task | Languages | Categories | Instances | Data Sources |
|---|---|---|---|---|
| AfriSenti | 14 | positive, neutral, negative | 107,694 | X (Twitter), news |
| AfriEmo | 17 | 6 emotions (multi-label) | 70,859 | X, Reddit, YouTube, news |
| AfriHate | 15 | abuse, hate, neutral | 90,455 | X |
All evaluations are performed on original train/dev/test splits as defined by the source datasets.
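For illustration, the sketch below shows the evaluation recipe in outline: fine-tune the adapted encoder on a labeled task and report Macro-F1 on the test split. The checkpoint path, the CSV files with `text`/`label` columns, the three-class label set (AfriSenti-style), and the fine-tuning hyperparameters are assumptions, not the released evaluation scripts.

```python
# Sketch of the downstream evaluation recipe: fine-tune the adapted encoder on a
# labeled task (AfriSenti-style 3-class sentiment here) and report Macro-F1.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

ckpt = "afroxlmr-social-dapt"          # adapted checkpoint (placeholder path)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

# CSV files with a "text" column and an integer "label" column (0/1/2).
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
            batched=True)

def macro_f1(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"macro_f1": f1_score(eval_pred.label_ids, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments("finetune-afrisenti", per_device_train_batch_size=8,
                           num_train_epochs=3, learning_rate=2e-5),  # assumed values
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=macro_f1,
)
trainer.train()
print(trainer.evaluate())              # includes eval_macro_f1 on the test split
```

For the multi-label AfriEmo setting, the same recipe would load the classification head with `problem_type="multi_label_classification"` and score thresholded sigmoid outputs with Macro-F1.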
5. Empirical Performance and Comparative Results
Performance improvements due to DAPT, TAPT, and their combination are reported per-task and per-language.
DAPT Results
Average F1 improvements (Table 2 excerpt):
| Dataset | Base (AfroXLMR) | +DAPT | ΔF1 |
|---|---:|---:|---:|
| AfriSenti | 51.39 | 58.21 | +6.82 |
| AfriEmo | 44.25 | 51.45 | +7.20 |
| AfriHate | 65.17 | 69.83 | +4.66 |
Per-language F1 gains range from +1% to +30%, with low-resource languages showing the largest improvements.
TAPT and Cross-Task Transfer
Using unlabeled sibling task data (Table 3 excerpt):
| Eval Task | Base | TAPT from sibling task | DAPT+TAPT |
|---|---:|---|---:|
| AfriSenti | 52.62 | 57.09 (Emo), 57.34 (Hate) | 59.35 |
| AfriEmo | 45.74 | 49.22 (Senti), 49.22 (Hate) | 52.87 |
| AfriHate | 66.72 | 67.46 (Senti), 67.14 (Emo) | 67.77 |
Fine-tuning AfroXLMR on AfriEmo unlabeled data (TAPT) yields +5.65 F1 on AfriSenti; combined DAPT+TAPT increases this to +8.7.
DAPT+TAPT Detailed Comparison
Best combinations per task (Table 4 excerpt):
| Task | Best single step | Best DAPT+TAPT |
|---|---|---|
| AfriSenti | +DAPT (56.85) | +DAPT+TAPT (Emo) (57.73) |
| AfriEmo | +DAPT (51.48) | +DAPT+TAPT (Senti) (49.84) † |
| AfriHate | +DAPT (70.56) | +DAPT+TAPT (Emo) (67.18) † |
† TAPT or DAPT alone occasionally matches or exceeds the combined step, reflecting forgetting effects.
Comparison with LLMs
Fine-tuned AfroXLMR-Social is competitive or superior to 7–70B parameter instruction-tuned LLMs (Llama-3, Gemma, GPT-4o) on these African language tasks, underlining the effectiveness of encoder-only models tailored via continual pre-training.
6. Analysis, Error Patterns, and Future Directions
Continual domain and task adaptation—via DAPT and TAPT—yields consistent F1 improvements, substantiating the value of focused pre-training even for strong base multilingual models. Cross-task TAPT with unlabeled sibling data provides further gains (+0.6% to +15.1% F1), particularly where evaluation and adaptation tasks are closely related (e.g., sentiment for emotion).
A notable limitation is “catastrophic forgetting” during TAPT on small datasets, which can partially overwrite the broader domain knowledge gained via DAPT. Future work may therefore benefit from better blending of domain and task data or from adapter-based architectures.
Encoder-only models like AfroXLMR-Social retain state-of-the-art status on low-resource, domain-specific African NLP tasks compared to zero-shot LLMs, particularly when fine-tuned with relevant local data. Residual errors are concentrated on code-switched expressions, highly dialectal forms, sarcasm, and ambiguous emotional contexts—points of difficulty that may respond to expanded corpora or multi-task objectives.
7. Summary and Resources
AfroXLMR-Social establishes a strong paradigm for African language NLP: continual domain and task adaptation, without architectural changes, efficiently leverages social/news sentences (3.5M in AfriSocial) to augment model capability for three major subjective tasks across 14–17 languages. The approach delivers +4–7 Macro-F1 improvements and, via cross-task TAPT, can further boost performance. The model surpasses or matches advanced instruction-tuned LLMs in this context.
All data, models, and code are publicly accessible via HuggingFace: https://huggingface.co/tadesse/AfroXLMR-Social
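A minimal usage sketch follows, assuming the identifier in the link above resolves to a standard transformers masked-LM checkpoint.

```python
# Minimal usage sketch; assumes the id from the link above is a standard
# transformers masked-LM checkpoint (it can also be fine-tuned as in Section 4).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "tadesse/AfroXLMR-Social"   # identifier taken from the link above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "NLP for African languages is <mask>."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top5))   # top-5 candidates for the masked slot
```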