Coastal Nigerian Languages: Classification & NLP Benchmarks

Updated 16 November 2025
  • Coastal Nigerian languages are a coherent group within the Lower Cross division of the New Benue-Congo family, including Ibibio, Efik, Anaang, and Oro.
  • They share typological features such as Latin-based orthographies with diacritics, agglutinative morphology, and flexible SVO syntax, despite regional variations.
  • The IBOM dataset provides the first parallel machine translation and topic classification benchmarks, enhancing NLP resources for these under-represented languages.

Coastal Nigerian languages, as typified by Anaang, Efik, Ibibio, and Oro, constitute a linguistically and typologically coherent set within the Lower Cross division of the New Benue-Congo branch of the Niger-Congo family. Predominantly spoken in Nigeria’s Akwa Ibom State and adjoining regions of Cross River State, these languages combine considerable speaker populations (Ibibio and Efik each exceed three million speakers) with a history of limited computational and NLP resource development, despite their regional sociolinguistic significance. The IBOM dataset is the first publicly available resource providing parallel machine translation and topic classification benchmarks for these languages (Kalejaiye et al., 9 Nov 2025).

1. Genealogical and Typological Classification

All four languages are classified under the New Benue-Congo branch of Niger-Congo, specifically within the Lower Cross division:

  • Efik, Ibibio, and Anaang: Efik-Ibibio sub-branch.
  • Oro: Nsàng sub-branch.

Ibibio, functioning as the lingua franca of Akwa Ibom State, has approximately 3.7 million speakers. Efik has a comparable population (about 3.5 million speakers) distributed across the coastal local government areas (LGAs) of Akwa Ibom, adjacent Cross River State, and portions of southwest Cameroon. Anaang, with 1.4 million speakers, is concentrated in the northwestern LGAs, while Oro has about 400,000 speakers in the southeastern LGAs clustered around Oron.

The languages exhibit a shared typological profile:

  • Latin-based orthographies with under-dot diacritics (e.g., ị, ọ, ụ).
  • Five phonemic tones: High (H), Low (L), Downstep (D), Rising (R), Falling (F). Tone marking is not standard in practical orthography.
  • Agglutinative morphology with prefixing inflectional mechanisms.
  • Flexible SVO syntax (e.g., "I am going to school": Ibibio—ami n-ka ufọk-nwed; Efik—ami n-ka ufọk-nwed; Anaang—ami n-ka ufọk-ŋwed; Oro—ami n-ga uvọk-nwid).

Phonological Inventory, Syllable Structure, and Tone

| Language | Vowels | Representative Consonants | Syllable Types | Tone Inventory |
|---|---|---|---|---|
| Ibibio | a, e, i, ị, o, ọ, u, ụ, ʌ, ə | b, d, f, gh, h, k, kp, m, n, ŋ, ŋw, ny, p, s, t, w, y | N, V, CV, CVC, CVV, CVVC, CGV | H, L, D, R, F |
| Efik | a, e, i, ị, o, ọ, u, ʌ | p, b, d, f, g, h, k, kp, kw, m, n, ŋ, ny, r, s, t, w, y | N, V, CV, CVC, CVV, CVVC, CGV | H, L, D, R, F |
| Anaang | a, e, i, o, ọ, u, ụ | b, ch, d, f, gh, gw, j, k, kp, kw, l, m, n, ŋ, ŋw, ny, p, r, s, t, w, y | N, V, CV, CVV, CVC, CVVC | H, L, D, R, F |
| Oro | a, e, ẹ, i, ị, o, ọ, u | b, d, f, g, gb, gh, gw, j, k, kp, kw, l, m, n, ŋ, ŋw, ny, r, s, t, v, w, y, z | N, V, CV, CVC, CVV, CVVC, CGV | H, L, D, R, F |

2. Dataset Construction Workflows

The IBOM dataset extends machine translation and topic classification resources to these four languages for the first time.

Corpus Composition and Preprocessing

  • Source data: English sentences from Flores-200 (Wikipedia-based; 997 dev, 1,012 test) and 1,000 train sentences from NLLB-SEED.
  • Each language: 1,000 train, 997 dev, 1,012 test sentences—total 3,009 per language for IBOM-MT.
  • Topic labels: Inherited from SIB-200 (15 categories), mapped onto translation pairs for the IBOM-TC task (701 train, 99 dev, 204 test).
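
A minimal sketch of how line-aligned source files could be assembled into the IBOM-MT splits described above; the directory layout and language codes are assumptions, not taken from the paper's tooling:

```python
import csv
from pathlib import Path

# Hypothetical layout: one sentence per line, same order across languages,
# mirroring the Flores-200 dev/test splits and the NLLB-SEED-derived train split.
DATA = Path("ibom_mt")                 # assumed local directory, not from the paper
LANGS = ["ibb", "efi", "anw", "orx"]   # illustrative ISO-639-3-style codes for Ibibio, Efik, Anaang, Oro
SPLITS = {"train": 1000, "dev": 997, "test": 1012}

def load_lines(path: Path) -> list[str]:
    """Read one sentence per line, stripping the trailing newline."""
    return path.read_text(encoding="utf-8").splitlines()

for split, expected in SPLITS.items():
    eng = load_lines(DATA / f"eng.{split}.txt")
    assert len(eng) == expected, f"unexpected size for eng {split}"
    for lang in LANGS:
        tgt = load_lines(DATA / f"{lang}.{split}.txt")
        assert len(tgt) == len(eng), f"misaligned {lang} {split}"
        # Write strictly line-aligned English–target pairs as TSV.
        out = DATA / f"eng-{lang}.{split}.tsv"
        with out.open("w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            writer.writerows(zip(eng, tgt))
```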

Human Annotation Pipeline

  • Three B.A.-qualified linguists per language, with at least one lead reviewer.
  • Assignment: one main translator per split; the lead reviewer checks translations for consistency and accuracy.
  • Quality assurance: Weekly meetings over four months for annotation harmonization.

Preprocessing Protocols

  • Orthography standardization (Latin script with under-dot diacritics).
  • Markup removal, punctuation normalization.
  • Raw sentence-level alignment; no automatic tokenization or segmentation.
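
A minimal normalization sketch consistent with these protocols, assuming Unicode NFC composition for the under-dot diacritics and simple punctuation cleanup; the annotators' exact rules are not specified here:

```python
import re
import unicodedata

def normalize_sentence(text: str) -> str:
    """Light normalization: NFC-compose diacritics, unify punctuation, trim spaces."""
    # Compose base letter + combining dot below into single code points (e.g. ọ, ị, ụ).
    text = unicodedata.normalize("NFC", text)
    # Normalize curly quotes and the ellipsis character to plain ASCII equivalents.
    text = text.translate(str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'", "…": "..."}))
    # Collapse runs of whitespace; no tokenization or segmentation is applied.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_sentence("ami  n-ka ufọk-nwed"))  # -> "ami n-ka ufọk-nwed"
```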

Parallel Alignment

  • IBOM-MT: parallel alignment is maintained strictly by split and sentence order.
  • IBOM-TC: SIB-200 scripts reused for label–sentence alignment.

Dataset Size Table

| Split | IBOM-MT (parallel sentences) | IBOM-TC (labelled sentences) |
|---|---|---|
| Train | 1,000 | 701 |
| Dev | 997 | 99 |
| Test | 1,012 | 204 |
| Total | 3,009 | 1,004 |

3. Machine Translation and Topic Classification Benchmarks

Machine Translation (IBOM-MT)

  • Models: M2M-100 (418M parameters), NLLB-200 (600M parameters).
  • Two-stage fine-tuning (a sketch follows this list): 1) pre-fine-tuning on English↔Efik parallel religious text (331K sentences); 2) further fine-tuning on the IBOM-MT training splits (1,000 sentences per language).
  • LLMs: GPT-4.1, o4-mini, Gemini 2.0 Flash, and Gemini 2.5 Flash; evaluated in zero-shot and 5-, 10-, and 20-shot configurations, with in-context examples drawn from the training split.
  • Metrics:
    • BLEU: $\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$, where $p_n$ are the modified $n$-gram precisions, $w_n$ their weights, and BP the brevity penalty
    • ChrF++ (character n-gram F-score)
    • SSA-COMET (Africa-centric embedding-based metric)
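
A minimal sketch of the second fine-tuning stage referenced above, assuming Hugging Face transformers with M2M-100; the target-language code, data loading, and hyperparameters are illustrative, and the religious-text pre-fine-tuning stage would run the same loop on the larger English↔Efik corpus first:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/m2m100_418M"   # 418M-parameter baseline named in the benchmark setup
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# pairs: (english, target) sentences from an IBOM-MT train split (placeholder example below).
pairs = [("I am going to school", "ami n-ka ufọk-nwed")]
tokenizer.src_lang = "en"
tokenizer.tgt_lang = "ig"             # proxy code for illustration only; how the paper handles
                                      # language codes unseen by M2M-100 is not specified here

def collate(batch):
    src, tgt = zip(*batch)
    return tokenizer(list(src), text_target=list(tgt), padding=True,
                     truncation=True, max_length=128, return_tensors="pt")

loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for epoch in range(3):                # illustrative number of epochs
    for batch in loader:
        loss = model(**batch).loss    # cross-entropy over target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```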

Key MT Results (ChrF++ average, en↔X)

  • Baseline fine-tune: Avg ≈ 12
  • Two-stage fine-tune: en→X ≈ 27, X→en ≈ 30
  • Gemini 2.0 Flash zero-shot: ≈ 29 avg
  • Gemini 2.5 Flash (10-shot): en→X 30.1, X→en 29.0

BLEU trends are consistent with ChrF++; best BLEU ≈ 11 (en→Efik, two-stage). SSA-COMET indicates comparable effectiveness for two-stage and few-shot LLMs on Efik and Anaang. Human direct assessment on 50-sample batches confirms two-stage M2M-100 outperforms LLMs for Anaang and Efik, while Gemini 2.5 Flash marginally surpasses other models for Ibibio.
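
The automatic scores above can be computed with standard tooling; the following is a minimal sketch using sacrebleu for BLEU and ChrF++ (SSA-COMET scoring is omitted, and the file paths are hypothetical):

```python
from sacrebleu.metrics import BLEU, CHRF

# One hypothesis and one reference per line, aligned with an IBOM-MT test split.
hyps = open("efi.test.hyp", encoding="utf-8").read().splitlines()   # hypothetical path
refs = open("efi.test.ref", encoding="utf-8").read().splitlines()   # hypothetical path

bleu = BLEU()
chrf = CHRF(word_order=2)   # word_order=2 yields ChrF++

print("BLEU  ", round(bleu.corpus_score(hyps, [refs]).score, 1))
print("ChrF++", round(chrf.corpus_score(hyps, [refs]).score, 1))
```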

Topic Classification (IBOM-TC)

  • Fine-tuned encoders: XLM-R, Glot500, AfriBERTa, Serengeti, AfroXLMR, AfroXLMR-61L (a fine-tuning sketch follows this list).
  • LLMs: Same as above, tested in zero- and few-shot setups.
  • Metric: Accuracy (with F1 for class imbalance).
  • Label taxonomy: 14 topics (as per SIB-200; e.g., politics, health, sports, technology).
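
A minimal sketch of the encoder fine-tuning setup referenced above, assuming Hugging Face transformers; the checkpoint identifier, label count, and in-memory data are illustrative rather than taken from the paper's released code:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "Davlan/afro-xlmr-large"   # illustrative AfroXLMR checkpoint
num_labels = 14                          # topic label count as listed for IBOM-TC above

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Hypothetical in-memory train split: sentences in one of the four languages plus integer topic ids.
train = Dataset.from_dict({"text": ["..."], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ibom-tc", num_train_epochs=5,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=train,
    tokenizer=tokenizer,
)
trainer.train()
```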

Key Topic Classification Results (accuracy %)

  • AfroXLMR-61L fine-tuned: average 68.1% (Ibibio 66.6%, Efik 71.3%, Anaang 69.6%, Oro 66.5%; English 90.4%)
  • Gemini 2.5 Flash zero-shot: Avg 67.9% (highest Efik 76.5%, lowest Oro 50.0%)
  • Gemini 2.5 Flash few-shot (20): Avg 74.8% (Oro 65.2%)
  • GPT-4.1 and o4-mini: minor improvements with more shots, but the fine-tuned African-centric encoders outperform them overall.

4. Analytical Challenges and Model Performance

Primary bottlenecks include data scarcity: no pre-existing parallel corpora exist for Anaang or Oro, and monolingual web-crawl data are limited. Orthographic inconsistency, notably diacritic omission in crawled web text, necessitates additional normalization protocols. Domain mismatch between the Wikipedia-derived corpus material and spoken or informal registers limits generalizability and robustness.

Metric reliability poses further problems: BLEU and ChrF++ fail to consistently rank outputs, especially for low-resource, typologically distinct languages; SSA-COMET and human direct assessment offer more granular calibration for African languages. Oro translation performance lags due to greater genealogical distance from Efik, from which some transfer learning protocols derive benefit.

In zero-shot LLM settings, models commonly drop functional elements or misrender under-dot diacritics. Few-shot prompting yields consistent improvements for topic classification, with less marked impact for translation fidelity.
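
A sketch of how few-shot topic-classification prompts might be assembled from the training split; the template wording and label subset are illustrative, not the paper's exact prompts:

```python
import random

LABELS = ["politics", "health", "sports", "technology"]  # illustrative subset of the SIB-200-style labels

def build_prompt(train_examples, query_sentence, n_shots=10):
    """Assemble a few-shot classification prompt from labelled (sentence, label) training pairs."""
    shots = random.sample(train_examples, n_shots)
    lines = ["Classify each Ibibio sentence into one of: " + ", ".join(LABELS) + "."]
    for sent, label in shots:
        lines.append(f"Sentence: {sent}\nTopic: {label}")
    lines.append(f"Sentence: {query_sentence}\nTopic:")
    return "\n\n".join(lines)

# train_examples would be (sentence, label) pairs drawn from the IBOM-TC train split.
```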

5. Proposed Solutions and Future Methodological Expansion

Proposed interventions center on data expansion, including crowdsourcing for IBOM-MT and dedicated collection of domain-specific text. Monolingual corpora are prioritized for subword vocabulary adaptation and robust language modeling. Script normalization and automatic diacritic restoration pipelines are integral for input consistency.
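
As an illustration of the last point, a word-level, dictionary-based diacritic restoration sketch is shown below; it assumes a frequency lexicon built from clean IBOM text, whereas a production pipeline would likely use contextual models:

```python
import unicodedata
from collections import Counter

def strip_diacritics(word: str) -> str:
    """Remove combining marks (e.g. ọ -> o) to form a lookup key."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_lexicon(clean_sentences):
    """Map each diacritic-stripped form to its most frequent fully-marked spelling."""
    counts = Counter(w for s in clean_sentences for w in s.split())
    lexicon = {}
    for word, freq in counts.most_common():
        lexicon.setdefault(strip_diacritics(word), word)
    return lexicon

def restore(sentence: str, lexicon: dict) -> str:
    """Replace each diacritic-less word with its best-known marked form, if any."""
    return " ".join(lexicon.get(strip_diacritics(w), w) for w in sentence.split())

lex = build_lexicon(["ami n-ka ufọk-nwed"])
print(restore("ami n-ka ufok-nwed", lex))  # -> "ami n-ka ufọk-nwed"
```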

Metric refinement and expansion of human evaluation remain priorities. Pretraining or adapting large multilingual encoders (e.g., AfroXLMR-61L) to Lower Cross data is expected to facilitate cross-lingual transfer.

Future work emphasizes the extension of parallel and classification resources to additional Coastal Nigerian and minority languages, such as Ibeno and Mbo. Task diversity is expected to increase, incorporating named entity recognition, sentiment analysis, and question answering, as well as audio corpora for speech recognition (speech-to-text). Community-driven annotation frameworks and collaboration with Nigerian language agencies are projected to support sustainable and standardized resource development.

6. Research Impact and Inclusive NLP

The IBOM dataset establishes a foundational public benchmark for machine translation and topic classification in Anaang, Efik, Ibibio, and Oro, previously excluded from major resources such as Google Translate, Flores-200, or SIB-200. Empirical evaluations highlight both the improvements achieved via two-stage fine-tuning and African-centric evaluation, and persistent gaps for under-represented groups (notably Oro). Incremental progress in translation and classification should continue with dataset size augmentation and model adaptation.

A plausible implication is that the IBOM resource serves not only as an empirical testbed for model evaluation, but also as a methodological blueprint for extending NLP research to other Nigerian and West African languages with parallel typological and sociolinguistic characteristics. This development constitutes a material advance toward truly inclusive NLP in Nigeria (Kalejaiye et al., 9 Nov 2025).
