IBOM Dataset for Nigerian Coastal Languages
- IBOM is a curated multilingual resource comprising parallel corpora and topic labels for four coastal Nigerian minority languages, supporting machine translation (MT) and topic classification (TC) research.
- It features balanced data collected from Wikipedia and religious texts, validated by expert linguists with consistent normalization and review processes.
- Evaluation using metrics like BLEU, ChrF++, COMET, and classification accuracy highlights both current performance gaps and the potential for African-centric NLP improvements.
The IBOM dataset is a curated multilingual resource designed for machine translation (MT) and topic classification (TC) involving four coastal minority languages of Nigeria—Anaang, Efik, Ibibio, and Oro—spoken in Akwa Ibom State. These languages are absent from major MT and classification benchmarks such as Flores-200 and SIB-200 and are unsupported by commercial systems like Google Translate. IBOM extends benchmark coverage to these under-resourced varieties, which previously lacked systematically collected and validated textual corpora (Kalejaiye et al., 9 Nov 2025).
1. Data Collection and Composition
IBOM contains parallel corpora for each of the four target languages. All parallel data were obtained through professional translation of established resources:
- Sources: Flores-200 DEV (997 sentences), DEVTEST (1012 sentences)—both sampled from English Wikipedia covering varied topics—and 1000 sentences from NLLB-SEED training data.
- Translators: For each language, a team of three trained linguists (minimum BA in Linguistics, some with PhDs), led by a designated reviewer, produced and corrected translations over four months with weekly collaborative reviews.
- Additional Resources: JW300 provides an extra 331,000 en↔efi (English–Efik) parallel sentences focused on religious content, supporting cross-lingual transfer for MT system training.
- Monolingual Data: Although filtered web-crawled corpora (e.g., GlotCC, FineWeb-2) include some Ibom-language material, IBOM as released contains no new monolingual data; the focus remains on parallel data.
- Preprocessing: Only minimal normalization was applied: Unicode NFC normalization, sentence-level tokenization, and orthographic consistency checks by the language leads.
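The minimal preprocessing described above can be sketched as follows. The sentence splitter here is a naive stand-in (the paper does not specify the exact tokenizer), and the manual orthography review by language leads is not reproducible in code:

```python
import re
import unicodedata

def normalize_sentence(text: str) -> str:
    """Minimal normalization in the spirit of IBOM's pipeline:
    Unicode NFC plus whitespace cleanup. The orthography checks
    were done manually by language leads, not programmatically."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence-level tokenization on terminal punctuation;
    an illustrative stand-in for the actual tokenizer used."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [normalize_sentence(p) for p in parts if p]
```

NFC matters for these languages because combining diacritics (e.g. `e` + U+0301) and precomposed characters (U+00E9) would otherwise count as distinct strings during evaluation.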
The parallel corpus counts per language and split are as follows:
| Split | Anaang | Efik | Ibibio | Oro | Total |
|---|---|---|---|---|---|
| Train | 1000 | 1000 | 1000 | 1000 | 4000 |
| Dev | 997 | 997 | 997 | 997 | 3988 |
| Test | 1012 | 1012 | 1012 | 1012 | 4048 |
| Total | 3009 | 3009 | 3009 | 3009 | 12036 |
2. Topic Alignment and Classification Taxonomy
IBOM-TC augments the parallel corpus with automatic SIB-200 topic labels for each sentence in the DEV and DEVTEST splits. The alignment process used the SIB-200 alignment script (Adelani et al., 2024) to assign one out of 32 categories, such as Politics, Health, Sports, Science, Entertainment, and Religion.
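Because IBOM's dev/test sentences are translations of Flores-200 DEV/DEVTEST, each translated sentence can inherit the SIB-200 topic label of the Flores sentence at the same index. The sketch below illustrates that join; the field names (`index`, `text`, `topic`) are illustrative, not the authors' actual schema:

```python
def attach_topic_labels(ibom_sentences, sib200_labels):
    """Transfer SIB-200 topic labels onto IBOM translations.

    ibom_sentences: list of dicts with a Flores 'index' and 'text'.
    sib200_labels: dict mapping Flores index -> topic label.
    Sentences without a label are dropped, which is why IBOM-TC
    is smaller than IBOM-MT (not every Flores sentence is labeled).
    """
    labeled = []
    for sent in ibom_sentences:
        label = sib200_labels.get(sent["index"])
        if label is not None:
            labeled.append({**sent, "topic": label})
    return labeled
```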
An abridged example of aggregated counts per topic is:
| Topic | Train | Dev | Test | Total |
|---|---|---|---|---|
| Health | 30 | 4 | 8 | 42 |
| Politics | 45 | 7 | 12 | 64 |
| Sports | 28 | 5 | 10 | 43 |
| ... | ... | ... | ... | ... |
| All 32 | 701 | 99 | 204 | 1004 |
No further stratification was performed; splits mirror the underlying Flores-200 partitions. For all four languages, the numbers of IBOM-MT and IBOM-TC examples are:
| Split | #IBOM-MT | #IBOM-TC |
|---|---|---|
| Train | 1000 | 701 |
| Dev | 997 | 99 |
| Test | 1012 | 204 |
| Total | 3009 | 1004 |
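The per-language totals in the two tables above are internally consistent, as a quick arithmetic check confirms:

```python
# Split sizes per language, as reported in the IBOM tables.
mt_splits = {"train": 1000, "dev": 997, "test": 1012}
tc_splits = {"train": 701, "dev": 99, "test": 204}

mt_per_language = sum(mt_splits.values())   # 3009 parallel sentences
tc_per_language = sum(tc_splits.values())   # 1004 labeled sentences
mt_grand_total = mt_per_language * 4        # 12036 across the four languages
```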
3. Evaluation Metrics
For quantifying model performance, IBOM employs established automatic and human-centered metrics:
- Machine Translation:
  - BLEU: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\big(\sum_{n=1}^{N} w_n \log p_n\big)$, where $p_n$ is the modified n-gram precision, $w_n$ the n-gram weight, and $\mathrm{BP}$ the brevity penalty.
  - ChrF++: $\mathrm{chrF{+}{+}} = (1+\beta^2)\,\frac{\mathrm{chrP}\cdot\mathrm{chrR}}{\beta^2\,\mathrm{chrP}+\mathrm{chrR}}$, using character-level (and word-level) precision $\mathrm{chrP}$ and recall $\mathrm{chrR}$.
  - COMET: a learned metric $\mathrm{COMET}(s,h,r) = f\big(\phi(s),\phi(h),\phi(r)\big)$, where $\phi$ is a multilingual encoder over source $s$, hypothesis $h$, and reference $r$, and $f$ is a learned regression head.
- Topic Classification:
  - Accuracy: $\mathrm{Accuracy} = \frac{\#\,\text{correct predictions}}{\#\,\text{examples}}$
  - Precision, Recall, and F1 using standard classification definitions.
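To make the ChrF formula above concrete, here is a simplified, character-n-gram-only sketch of the chrF family (actual ChrF++ also adds word 1- and 2-grams, and published scores use the sacrebleu implementation):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Character n-gram counts, ignoring whitespace (as chrF does by default)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average char n-gram precision/recall, combined with F_beta.
    beta=2 weights recall twice as much as precision, per the chrF convention."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # strings shorter than n contribute no n-grams
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

A perfect match scores 1.0 and fully disjoint strings score 0.0; reported benchmark numbers are this quantity scaled to 0–100.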
4. Experimental Benchmarks
4.1 Machine Translation
Evaluation of both fine-tuned (M2M-100, NLLB-200) and LLM-based systems (Gemini 2.0/2.5, GPT-4.1) was conducted for en↔X directions.
- ChrF++ scores (averaged across language pairs and directions):
  - Fine-tuned M2M-100 (2-stage): 28.5
  - Fine-tuned NLLB-200 (2-stage): 28.0
  - Gemini 2.5 (5-shot): 30.2
- BLEU scores (averaged):
  - M2M-100 (2-stage): 6.8
  - NLLB-200 (2-stage): 6.1
  - Gemini 2.5 (0-shot): 10.8
- SSA-COMET scores (averaged):
  - NLLB-200 (2-stage): 39.9
  - Gemini 2.5 (0-shot): 43.0
- Human Direct Assessment (50 test sentences, en←X):
  - Gemini 2.5 (10-shot): Anaang 9.4, Efik 51.1, Ibibio 16.8
  - M2M-100 (2-stage): Anaang 31.0, Efik 71.3, Ibibio 9.8
Human assessment indicates that current LLMs still translate these languages poorly, even where their automatic scores match or exceed those of the fine-tuned baselines; their few-shot performance on topic classification (see below), by contrast, is competitive.
4.2 Topic Classification
Accuracy results highlight the importance of African-centric encoders:
| Model | Anaang | Efik | Ibibio | Oro | Eng | Avg. |
|---|---|---|---|---|---|---|
| XLM-R | 65.0 | 57.5 | 54.9 | 46.1 | 91.8 | 52.8 |
| AfroXLMR-61L | 69.6 | 71.3 | 66.6 | 66.5 | 90.4 | 68.1 |
| Gemini 2.5 (0-shot) | 70.1 | 76.5 | 74.0 | 51.0 | 87.8 | 67.9 |
| Gemini 2.5 (20-shot) | 73.5 | 79.4 | 80.9 | 65.2 | 88.7 | 74.8 |
The African-centric encoder AfroXLMR-61L outperforms generic XLM-R across all four languages; Gemini 2.5 with 20-shot prompting matches or exceeds the supervised baselines on average, though Oro remains its weakest language.
5. Limitations and Challenges
IBOM provides the most comprehensive parallel corpora to date for these four minority languages, but significant coverage and methodological limitations remain:
- Only four out of more than 500 Nigerian languages are included; generalization is untested.
- The primary domains are Wikipedia and religious content (Efik via JW300); conversational or social-media texts are absent.
- No new monolingual corpora are released, so tasks such as language-model pretraining remain unexplored.
- LLM evaluations were restricted to four proprietary models, limiting transferability and replicability.
A plausible implication is that, while IBOM reduces some barriers for low-resource NLP in Nigeria, many linguistic and domain-specific representation gaps remain unaddressed.
6. Extensions and Future Directions
Immediate research opportunities highlighted by the creators include:
- Expansion of domain coverage and corpus diversity (e.g., newswire, social media, oral-to-text resources).
- Mining and OCR-based augmentation of monolingual text for Ibom languages.
- Systematic study of phenomena such as zero-pronoun occurrence and code-mixing within topic classification.
- Development or adaptation of African-centric small LLMs, pre-trained on Ibom textual data.
- Increased use of human-rated MT evaluation (such as MQM and DA) to refine and calibrate learned metrics.
All data, evaluation scripts, and models are publicly released (https://huggingface.co/collections/howard-nlp/ibom-nlp) to foster further research and downstream applications in inclusive NLP for underrepresented Nigerian languages (Kalejaiye et al., 9 Nov 2025).