
IBOM Dataset for Nigerian Coastal Languages

Updated 11 January 2026
  • The IBOM dataset is a curated multilingual resource comprising parallel corpora and topic labels for four coastal Nigerian minority languages, enabling MT and TC research.
  • It features balanced data collected from Wikipedia and religious texts, validated by expert linguists with consistent normalization and review processes.
  • Evaluation using metrics like BLEU, ChrF++, COMET, and classification accuracy highlights both current performance gaps and the potential for African-centric NLP improvements.

The IBOM dataset is a curated multilingual resource designed for machine translation (MT) and topic classification (TC) involving four coastal minority languages of Nigeria—Anaang, Efik, Ibibio, and Oro—spoken in Akwa Ibom State. These languages are notably absent from major machine translation and classification benchmarks such as Flores-200 and SIB-200 and are unrepresented in commercial systems like Google Translate. IBOM enables research that extends the coverage and benchmarking of NLP models to these under-resourced linguistic varieties, which have previously lacked systematically collected and validated textual corpora (Kalejaiye et al., 9 Nov 2025).

1. Data Collection and Composition

IBOM contains parallel corpora for each of the four target languages. All parallel data were obtained through professional translation of established resources:

  • Sources: Flores-200 DEV (997 sentences), DEVTEST (1012 sentences)—both sampled from English Wikipedia covering varied topics—and 1000 sentences from NLLB-SEED training data.
  • Translators: For each language, a team of three trained linguists (minimum BA in Linguistics, some with PhDs), led by a designated reviewer, produced and corrected translations over four months with weekly collaborative reviews.
  • Additional Resources: JW300 provides an extra 331,000 en↔efi (English–Efik) parallel sentences focused on religious content, supporting cross-lingual transfer for MT system training.
  • Monolingual Data: Although filtered web-crawled corpora (e.g., GlotCC, FineWeb-2) include some Ibom-language material, IBOM as released contains no new monolingual data; the focus remains on parallel data.
  • Preprocessing: Only minimal normalization was applied: Unicode NFC normalization, sentence-level tokenization, and orthographic consistency checks reviewed by the language leads.
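The normalization step above can be illustrated with a minimal sketch using Python's standard library; the function name and whitespace handling are our own illustrative choices, not the released pipeline:

```python
import unicodedata

def normalize_sentence(text: str) -> str:
    """Apply Unicode NFC normalization and collapse stray whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Join on single spaces to remove runs of whitespace left by extraction.
    return " ".join(text.split())

# A decomposed "i" + combining acute accent becomes the precomposed form.
decomposed = "Efi\u0301k"                    # 'i' followed by U+0301
assert normalize_sentence(decomposed) == "Ef\u00EDk"
assert normalize_sentence("  a   b ") == "a b"
```

NFC matters for these orthographies because the same accented character can be encoded in composed or decomposed form, which would otherwise split token counts and n-gram matches.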

The parallel corpus counts per language and split are as follows:

| Split | Anaang | Efik | Ibibio | Oro | Total |
|-------|--------|------|--------|-----|-------|
| Train | 1000 | 1000 | 1000 | 1000 | 4000 |
| Dev | 997 | 997 | 997 | 997 | 3988 |
| Test | 1012 | 1012 | 1012 | 1012 | 4048 |
| Total | 3009 | 3009 | 3009 | 3009 | 12036 |

2. Topic Alignment and Classification Taxonomy

IBOM-TC augments the parallel corpus with automatic SIB-200 topic labels for each sentence in the DEV and DEVTEST splits. The alignment process used the SIB-200 alignment script (Adelani et al., 2024) to assign one out of 32 categories, such as Politics, Health, Sports, Science, Entertainment, and Religion.
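Because the IBOM translations are sentence-aligned with the English Flores-200 source, English-side SIB-200 labels can be carried over to each translation by shared sentence index. A minimal sketch of that transfer (the data layout and function name here are hypothetical, not the released alignment script):

```python
def transfer_topic_labels(english_labels, translations):
    """Carry SIB-200 topic labels from English sentences to their
    sentence-aligned translations.

    english_labels: dict mapping Flores-200 sentence id -> topic label
                    (only ids inside the SIB-200 subset are present)
    translations:   dict mapping the same sentence ids -> translated text
    Returns (translated_sentence, label) pairs; unlabeled ids are skipped.
    """
    labelled = []
    for sent_id, sentence in translations.items():
        label = english_labels.get(sent_id)
        if label is not None:
            labelled.append((sentence, label))
    return labelled

# Toy example with made-up ids and labels:
labels = {0: "health", 2: "politics"}
efik = {0: "sentence A", 1: "sentence B", 2: "sentence C"}
pairs = transfer_topic_labels(labels, efik)
assert pairs == [("sentence A", "health"), ("sentence C", "politics")]
```

This also explains why the TC splits are smaller than the MT splits: only the sentences covered by the SIB-200 subset receive labels.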

An abridged example of aggregated counts per topic is:

| Topic | Train | Dev | Test | Total |
|-------|-------|-----|------|-------|
| Health | 30 | 4 | 8 | 42 |
| Politics | 45 | 7 | 12 | 64 |
| Sports | 28 | 5 | 10 | 43 |
| ... | ... | ... | ... | ... |
| All 32 | 701 | 99 | 204 | 1004 |

No further stratification was performed; splits mirror the underlying Flores-200 partitions. For all four languages, the numbers of IBOM-MT and IBOM-TC examples are:

| Split | #IBOM-MT | #IBOM-TC |
|-------|----------|----------|
| Train | 1000 | 701 |
| Dev | 997 | 99 |
| Test | 1012 | 204 |
| Total | 3009 | 1004 |

3. Evaluation Metrics

For quantifying model performance, IBOM employs established automatic and human-centered metrics:

  • Machine Translation:
    • BLEU:

      $$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Bigl(\sum_{n=1}^{N} w_n \log p_n\Bigr)$$

      where $p_n$ is the modified n-gram precision, $w_n$ are the n-gram weights, and $\mathrm{BP}$ is the brevity penalty.
    • ChrF++:

      $$\mathrm{ChrF\text{++}} = \frac{1 + \beta^2}{\frac{\beta^2}{R_{\mathrm{char}}} + \frac{1}{P_{\mathrm{char}}}}$$

      using character-level precision $P_{\mathrm{char}}$ and recall $R_{\mathrm{char}}$ (the ++ variant additionally includes word n-grams).
    • COMET:

      $$\mathrm{COMET}(s, t) = \mathrm{FFN}(\mathrm{Enc}(s), \mathrm{Enc}(t))$$

      where $\mathrm{Enc}$ is a multilingual encoder and $\mathrm{FFN}$ is a feed-forward regression head.

  • Topic Classification:
    • Accuracy:

      $$\mathrm{Accuracy} = \frac{1}{m}\sum_{i=1}^{m}\mathbf{1}(y_i = \hat y_i)$$

    • Precision, Recall, and F1, using standard classification definitions.
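As a concrete illustration of the BLEU and accuracy formulas above, here is a minimal pure-Python sketch (function names are ours; released evaluations presumably use standard toolkits such as sacreBLEU rather than this code):

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU per the formula: geometric mean of clipped
    n-gram precisions p_n with uniform weights w_n = 1/N, scaled by the
    brevity penalty BP. Returns 0.0 if any p_n is zero (no smoothing)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = _ngrams(hyp, n), _ngrams(ref, n)
        total = sum(hyp_counts.values())
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_prec_sum += (1.0 / max_n) * math.log(clipped / total)
    # BP = 1 if the hypothesis is at least as long as the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec_sum)

def accuracy(gold, predicted) -> float:
    """Fraction of examples whose predicted topic matches the gold topic."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# An exact translation match scores BLEU 1.0; 3 of 4 correct labels -> 0.75.
sent = "the dataset covers four coastal languages"
assert abs(bleu(sent, sent) - 1.0) < 1e-9
assert accuracy(["health", "sports", "politics", "health"],
                ["health", "sports", "politics", "travel"]) == 0.75
```

The unsmoothed geometric mean is why short hypotheses with a missing 4-gram collapse to 0; production scorers apply smoothing and corpus-level aggregation instead.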

4. Experimental Benchmarks

4.1 Machine Translation

Evaluation of both fine-tuned (M2M-100, NLLB-200) and LLM-based systems (Gemini 2.0/2.5, GPT-4.1) was conducted for en↔X directions.

  • ChrF++ scores (averaged across language pairs and directions):

    • Fine-tuned M2M-100 2-stage: 28.5
    • Fine-tuned NLLB-200 2-stage: 28.0
    • Gemini 2.5 (5-shot): 30.2
  • BLEU scores (averaged):
    • M2M-100 2-stage: 6.8
    • NLLB-200 2-stage: 6.1
    • Gemini 2.5 (0-shot): 10.8
  • SSA-COMET (averaged):
    • NLLB-200 2-stage: 39.9
    • Gemini 2.5 (0-shot): 43.0
  • Human Direct Assessment (50 test sentences, en←X):
    • Gemini 2.5 (10-shot): [Anaang: 9.4, Efik: 51.1, Ibibio: 16.8]
    • M2M-100 2-stage: [Anaang: 31.0, Efik: 71.3, Ibibio: 9.8]

Human assessment indicates that current LLMs still perform poorly on MT for these languages despite their competitive automatic-metric scores, although LLM few-shot performance on topic classification (see below) is competitive.

4.2 Topic Classification

Accuracy results highlight the importance of African-centric encoders:

| Model | Anaang | Efik | Ibibio | Oro | Eng | Avg. |
|-------|--------|------|--------|-----|-----|------|
| XLM-R | 65.0 | 57.5 | 54.9 | 46.1 | 91.8 | 52.8 |
| AfroXLMR-61L | 69.6 | 71.3 | 66.6 | 66.5 | 90.4 | 68.1 |
| Gemini 2.5 (0-shot) | 70.1 | 76.5 | 74.0 | 51.0 | 87.8 | 67.9 |
| Gemini 2.5 (20-shot) | 73.5 | 79.4 | 80.9 | 65.2 | 88.7 | 74.8 |

African-centric encoders (AfroXLMR-61L) outperform generic XLM-R, and Gemini 2.5 with 20-shot prompting is competitive with, and on average exceeds, these supervised baselines.

5. Limitations and Challenges

IBOM provides the most comprehensive parallel corpora to date for these four minority languages, but significant coverage and methodological limitations remain:

  • Only four out of more than 500 Nigerian languages are included; generalization is untested.
  • The primary domains are Wikipedia and religious content (Efik via JW300); conversational or social-media texts are absent.
  • No new monolingual corpora; tasks like language modeling pretraining remain unexplored.
  • LLM evaluations were restricted to four proprietary models, limiting transferability and replicability.

A plausible implication is that, while IBOM reduces some barriers for low-resource NLP in Nigeria, many linguistic and domain-specific representation gaps remain unaddressed.

6. Extensions and Future Directions

Immediate research opportunities highlighted by the creators include:

  • Expansion of domain coverage and corpus diversity (e.g., newswire, social media, oral-to-text resources).
  • Mining and OCR-based augmentation of monolingual text for Ibom languages.
  • Systematic study of phenomena such as zero-pronoun occurrence and code-mixing within topic classification.
  • Development or adaptation of African-centric small LLMs, pre-trained on Ibom textual data.
  • Increased use of human-rated MT evaluation (such as MQM and DA) to refine and calibrate learned metrics.

All data, evaluation scripts, and models are publicly released (https://huggingface.co/collections/howard-nlp/ibom-nlp) to foster further research and downstream applications in inclusive NLP for underrepresented Nigerian languages (Kalejaiye et al., 9 Nov 2025).
