IBOM Dataset for Nigerian Coastal Languages
- IBOM is a curated multilingual resource comprising parallel corpora and topic labels for four coastal Nigerian minority languages, supporting machine translation (MT) and topic classification (TC) research.
- It features balanced data collected from Wikipedia and religious texts, validated by expert linguists with consistent normalization and review processes.
- Evaluation using metrics like BLEU, ChrF++, COMET, and classification accuracy highlights both current performance gaps and the potential for African-centric NLP improvements.
The IBOM dataset is a curated multilingual resource designed for machine translation (MT) and topic classification (TC) involving four coastal minority languages of Nigeria—Anaang, Efik, Ibibio, and Oro—spoken in Akwa Ibom State. These languages are absent from major MT and classification benchmarks such as Flores-200 and SIB-200 and are unsupported by commercial systems like Google Translate. IBOM extends benchmark coverage to these under-resourced varieties, which previously lacked systematically collected and validated textual corpora (Kalejaiye et al., 9 Nov 2025).
1. Data Collection and Composition
IBOM contains parallel corpora for each of the four target languages. All parallel data were obtained through professional translation of established resources:
- Sources: Flores-200 DEV (997 sentences), DEVTEST (1012 sentences)—both sampled from English Wikipedia covering varied topics—and 1000 sentences from NLLB-SEED training data.
- Translators: For each language, a team of three trained linguists (minimum BA in Linguistics, some with PhDs), led by a designated reviewer, produced and corrected translations over four months with weekly collaborative reviews.
- Additional Resources: JW300 provides an extra 331,000 en↔efi (English–Efik) parallel sentences focused on religious content, supporting cross-lingual transfer for MT system training.
- Monolingual Data: Although filtered web-crawled corpora (e.g., GlotCC, FineWeb-2) include some Ibom-language material, IBOM as released contains no new monolingual data; the focus remains on parallel data.
- Preprocessing: Only minimal normalization was applied: Unicode NFC normalization, sentence-level tokenization, and orthographic consistency checks by the language leads.
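The minimal preprocessing described above can be sketched as follows. The sentence splitter here is a naive stand-in (the paper does not specify the exact tokenizer), and the manual orthography review by language leads is not reproducible in code:

```python
import re
import unicodedata

def normalize_sentence(text: str) -> str:
    """Minimal normalization in the spirit of IBOM's pipeline:
    Unicode NFC plus whitespace cleanup. The orthography checks
    were done manually by language leads, not programmatically."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence-level tokenization on terminal punctuation;
    an illustrative stand-in for the actual tokenizer used."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [normalize_sentence(p) for p in parts if p]
```

NFC matters for these languages because combining diacritics (e.g. `e` + U+0301) and precomposed characters (U+00E9) would otherwise count as distinct strings during evaluation.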
The parallel corpus counts per language and split are as follows:
| Split | Anaang | Efik | Ibibio | Oro | Total |
|---|---|---|---|---|---|
| Train | 1000 | 1000 | 1000 | 1000 | 4000 |
| Dev | 997 | 997 | 997 | 997 | 3988 |
| Test | 1012 | 1012 | 1012 | 1012 | 4048 |
| Total | 3009 | 3009 | 3009 | 3009 | 12036 |
2. Topic Alignment and Classification Taxonomy
IBOM-TC augments the parallel corpus with automatic SIB-200 topic labels for each sentence in the DEV and DEVTEST splits. The alignment process used the SIB-200 alignment script (Adelani et al., 2024) to assign one out of 32 categories, such as Politics, Health, Sports, Science, Entertainment, and Religion.
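Because IBOM's dev/test sentences are translations of Flores-200 DEV/DEVTEST, each translated sentence can inherit the SIB-200 topic label of the Flores sentence at the same index. The sketch below illustrates that join; the field names (`index`, `text`, `topic`) are illustrative, not the authors' actual schema:

```python
def attach_topic_labels(ibom_sentences, sib200_labels):
    """Transfer SIB-200 topic labels onto IBOM translations.

    ibom_sentences: list of dicts with a Flores 'index' and 'text'.
    sib200_labels: dict mapping Flores index -> topic label.
    Sentences without a label are dropped, which is why IBOM-TC
    is smaller than IBOM-MT (not every Flores sentence is labeled).
    """
    labeled = []
    for sent in ibom_sentences:
        label = sib200_labels.get(sent["index"])
        if label is not None:
            labeled.append({**sent, "topic": label})
    return labeled
```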
An abridged example of aggregated counts per topic is:
| Topic | Train | Dev | Test | Total |
|---|---|---|---|---|
| Health | 30 | 4 | 8 | 42 |
| Politics | 45 | 7 | 12 | 64 |
| Sports | 28 | 5 | 10 | 43 |
| ... | ... | ... | ... | ... |
| All 32 | 701 | 99 | 204 | 1004 |
No further stratification was performed; splits mirror the underlying Flores-200 partitions. For all four languages, the numbers of IBOM-MT and IBOM-TC examples are:
| Split | #IBOM-MT | #IBOM-TC |
|---|---|---|
| Train | 1000 | 701 |
| Dev | 997 | 99 |
| Test | 1012 | 204 |
| Total | 3009 | 1004 |
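The per-language totals in the two tables above are internally consistent, as a quick arithmetic check confirms:

```python
# Split sizes per language, as reported in the IBOM tables.
mt_splits = {"train": 1000, "dev": 997, "test": 1012}
tc_splits = {"train": 701, "dev": 99, "test": 204}

mt_per_language = sum(mt_splits.values())   # 3009 parallel sentences
tc_per_language = sum(tc_splits.values())   # 1004 labeled sentences
mt_grand_total = mt_per_language * 4        # 12036 across the four languages
```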
3. Evaluation Metrics
For quantifying model performance, IBOM employs established automatic and human-centered metrics:
- Machine Translation:
  - BLEU: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\big(\sum_{n=1}^{N} w_n \log p_n\big)$, where $p_n$ is the modified n-gram precision, $w_n$ the n-gram weight, and $\mathrm{BP}$ the brevity penalty.
  - ChrF++: $\mathrm{chrF{+}{+}} = (1+\beta^2)\,\frac{\mathrm{chrP}\cdot\mathrm{chrR}}{\beta^2\,\mathrm{chrP}+\mathrm{chrR}}$, using character-level (and word-level) precision $\mathrm{chrP}$ and recall $\mathrm{chrR}$.
  - COMET: a learned metric $\mathrm{COMET}(s,h,r) = f\big(\phi(s),\phi(h),\phi(r)\big)$, where $\phi$ is a multilingual encoder over source $s$, hypothesis $h$, and reference $r$, and $f$ is a learned regression head.
- Topic Classification:
  - Accuracy: $\mathrm{Accuracy} = \frac{\#\,\text{correct predictions}}{\#\,\text{examples}}$
  - Precision, Recall, and F1 using standard classification definitions.
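To make the ChrF formula above concrete, here is a simplified, character-n-gram-only sketch of the chrF family (actual ChrF++ also adds word 1- and 2-grams, and published scores use the sacrebleu implementation):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Character n-gram counts, ignoring whitespace (as chrF does by default)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average char n-gram precision/recall, combined with F_beta.
    beta=2 weights recall twice as much as precision, per the chrF convention."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # strings shorter than n contribute no n-grams
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

A perfect match scores 1.0 and fully disjoint strings score 0.0; reported benchmark numbers are this quantity scaled to 0–100.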
4. Experimental Benchmarks
4.1 Machine Translation
Evaluation of both fine-tuned (M2M-100, NLLB-200) and LLM-based systems (Gemini 2.0/2.5, GPT-4.1) was conducted for en↔X directions.
- ChrF++ scores (averaged across language pairs and directions):
  - Fine-tuned M2M-100 (2-stage): 28.5
  - Fine-tuned NLLB-200 (2-stage): 28.0
  - Gemini 2.5 (5-shot): 30.2
- BLEU scores (averaged):
  - M2M-100 (2-stage): 6.8
  - NLLB-200 (2-stage): 6.1
  - Gemini 2.5 (0-shot): 10.8
- SSA-COMET scores (averaged):
  - NLLB-200 (2-stage): 39.9
  - Gemini 2.5 (0-shot): 43.0
- Human Direct Assessment (50 test sentences, en←X):
  - Gemini 2.5 (10-shot): Anaang 9.4, Efik 51.1, Ibibio 16.8
  - M2M-100 (2-stage): Anaang 31.0, Efik 71.3, Ibibio 9.8
Human assessment indicates that current LLMs still translate these languages poorly, even where their automatic scores match or exceed those of the fine-tuned baselines; their few-shot performance on topic classification (see below), by contrast, is competitive.
4.2 Topic Classification
Accuracy results highlight the importance of African-centric encoders:
| Model | Anaang | Efik | Ibibio | Oro | Eng | Avg. |
|---|---|---|---|---|---|---|
| XLM-R | 65.0 | 57.5 | 54.9 | 46.1 | 91.8 | 52.8 |
| AfroXLMR-61L | 69.6 | 71.3 | 66.6 | 66.5 | 90.4 | 68.1 |
| Gemini 2.5 (0-shot) | 70.1 | 76.5 | 74.0 | 51.0 | 87.8 | 67.9 |
| Gemini 2.5 (20-shot) | 73.5 | 79.4 | 80.9 | 65.2 | 88.7 | 74.8 |
The African-centric encoder AfroXLMR-61L outperforms generic XLM-R across all four languages; Gemini 2.5 with 20-shot prompting matches or exceeds the supervised baselines on average, though Oro remains its weakest language.
5. Limitations and Challenges
IBOM provides the most comprehensive parallel corpora to date for these four minority languages, but significant coverage and methodological limitations remain:
- Only four out of more than 500 Nigerian languages are included; generalization is untested.
- The primary domains are Wikipedia and religious content (Efik via JW300); conversational or social-media texts are absent.
- No new monolingual corpora are released, so tasks such as language-model pretraining remain unexplored.
- LLM evaluations were restricted to four proprietary models, limiting transferability and replicability.
A plausible implication is that, while IBOM reduces some barriers for low-resource NLP in Nigeria, many linguistic and domain-specific representation gaps remain unaddressed.
6. Extensions and Future Directions
Immediate research opportunities highlighted by the creators include:
- Expansion of domain coverage and corpus diversity (e.g., newswire, social media, oral-to-text resources).
- Mining and OCR-based augmentation of monolingual text for Ibom languages.
- Systematic study of phenomena such as zero-pronoun occurrence and code-mixing within topic classification.
- Development or adaptation of African-centric small LLMs, pre-trained on Ibom textual data.
- Increased use of human-rated MT evaluation (such as MQM and DA) to refine and calibrate learned metrics.
All data, evaluation scripts, and models are publicly released (https://huggingface.co/collections/howard-nlp/ibom-nlp) to foster further research and downstream applications in inclusive NLP for underrepresented Nigerian languages (Kalejaiye et al., 9 Nov 2025).