Building a Functional Machine Translation Corpus for Kpelle
The paper "Building a Functional Machine Translation Corpus for Kpelle" makes a significant contribution to machine translation and NLP by constructing the first publicly accessible English-Kpelle dataset. The dataset comprises over 2,000 translation pairs drawn from several domains, including everyday communication, religious texts, and educational materials. The work addresses the challenges facing low-resource languages like Kpelle, which remain largely underrepresented on digital platforms and in computational linguistics.
Dataset Construction and Methodology
The paper outlines a meticulous process of data collection from sources relevant to the cultural and linguistic context of Kpelle. Data was extracted from travel-related phrases, religious texts, and educational materials. Each piece of collected text was translated, verified by native Kpelle speakers, cleaned, normalized, and segmented into manageable units for subsequent processing. The final dataset consists of 14,790 words in Kpelle and 15,231 in English, spanning 4,369 sentences across different domains.
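The cleaning, normalization, and segmentation steps described above can be sketched in a few lines. This is a minimal illustration of common corpus-preparation filters, not the authors' actual pipeline; the function names and the length-ratio threshold are hypothetical.

```python
import re
import unicodedata

def clean_pair(src: str, tgt: str):
    """Normalize Unicode, collapse whitespace, and drop empty or
    badly length-mismatched pairs (a common corpus-cleaning filter)."""
    def normalize(text: str) -> str:
        text = unicodedata.normalize("NFC", text)  # canonical form for diacritics
        return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

    src, tgt = normalize(src), normalize(tgt)
    if not src or not tgt:
        return None
    # A large length ratio often signals a misaligned pair (threshold assumed).
    ratio = len(src.split()) / max(len(tgt.split()), 1)
    if ratio > 3 or ratio < 1 / 3:
        return None
    return src, tgt

def segment(text: str):
    """Split a passage into sentence-like units on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

Segmenting into such units is what yields a sentence-level parallel corpus like the 4,369-sentence dataset described in the paper.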
Experimental Validation using the NLLB Model
To evaluate the dataset's utility, the authors fine-tuned Meta's No Language Left Behind (NLLB) model on two versions of their dataset. Fine-tuning relied on a SentencePiece model to handle out-of-vocabulary tokens and enriched the translation model with dedicated language tokens. The model was evaluated at multiple training-step counts, reaching BLEU scores of up to 30 for Kpelle-to-English translation at the higher step counts. These results indicate the quality improvements achievable with data augmentation for low-resource languages.
Findings and Comparative Analysis
The BLEU scores and other metrics, including chrF2++, 1- to 4-gram precision, and the brevity penalty, show that translation quality for Kpelle has reached benchmarks comparable with other African languages supported by NLLB-200. For a language that lacked any publicly available dataset before this work, a BLEU score of 30 is noteworthy within the machine translation community. Translation quality from English into Kpelle, however, trails slightly, underscoring the challenges posed by Kpelle's non-standardized orthography and limited resources.
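chrF-family metrics complement BLEU by scoring character n-gram overlap, which is more forgiving of morphological variation and spelling differences, a relevant property given Kpelle's non-standardized orthography. The sketch below implements plain chrF2 (character n-grams only, recall-weighted F-score with beta = 2); the chrF2++ variant reported in the paper additionally mixes in word n-grams, which this simplified version omits.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    s = text.replace(" ", "")  # chrF ignores spaces by default
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average character n-gram F_beta over n = 1..max_n (beta = 2
    weights recall twice as heavily as precision, as in chrF2)."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not h or not r:
            continue  # strings too short for this n
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Because it operates below the word level, a metric like this can still reward a system that produces a valid spelling variant of the reference word, which whole-word BLEU would count as a complete miss.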
Implications and Future Directions
The implications of this research are both practical and theoretical. Practically, the dataset provides a foundation for developing NLP applications that enhance communication for Kpelle speakers, including machine translation, language modeling, and speech recognition. Theoretically, the research affords an opportunity to explore the linguistic structures and unique challenges presented by Kpelle, contributing to a broader understanding of low-resource language processing.
Looking ahead, further expansion of the dataset, inclusion of more dialect-specific data, and benchmarking against diverse model architectures could unlock even higher translation performance and broaden applicability across NLP tasks. As AI continues to evolve, initiatives like this are pivotal in ensuring linguistic inclusivity and diversity.
In conclusion, the paper represents a substantial step toward bridging the gap between low-resource languages and modern language technology. By providing a replicable framework for dataset creation and demonstrating significant translation performance potential, it sets the stage for the continued integration of underrepresented languages into the global digital ecosystem.