Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks (2208.12081v2)

Published 25 Aug 2022 in cs.CL

Abstract: Indigenous African languages are categorized as under-served in Natural Language Processing. They therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data that is good enough for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya. Data collection was done by researchers from communities, schools, media, and publishers. The Kencorpus dataset has a collection of 5,594 items - 4,442 texts (5.6M words) and 1,152 speech files (177hrs). Based on this data, Part of Speech tagging sets for Dholuo and Luhya (50,000 and 93,000 words respectively) were developed. We developed 7,537 Question-Answer pairs for Swahili and created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. We also developed two proof-of-concept systems: one for Kiswahili speech-to-text and a machine learning system for the Question Answering task, with results of 18.87% word error rate and 80% Exact Match (EM) respectively. These initial results give great promise to the usability of Kencorpus to the machine learning community. Kencorpus is one of the few public domain corpora for these three low resource languages and forms a basis of learning and sharing experiences for similar works especially for low resource languages.

Summary

The paper introduces a comprehensive Kenyan language corpus with 5.6M words and 177 hours of speech to boost NLP for Swahili, Dholuo, and Luhya.
The study employs extensive fieldwork and multi-source data collection, overcoming pandemic challenges to assemble unique text and speech datasets.
The paper demonstrates practical applications through proof-of-concept systems: an 18.87% word error rate for Swahili speech-to-text and, for QA, a 59.4% F1 score with XLM-RoBERTa and an 80% exact match with a semantic network-based method.

An Analysis of Kencorpus: A Kenyan Language Corpus for NLP Tasks

The Kencorpus project builds language resources for under-served languages, a prerequisite for advancing NLP applications in Africa. The paper, by Wanjawa et al., focuses on Swahili, Dholuo, and Luhya, three languages predominantly spoken in Kenya, and addresses the shortage of data needed for NLP tasks such as machine translation, question answering (QA), and speech-to-text (STT).

Data Compilation and Methodology

The researchers set out to collect both text and speech data across three languages, which required extensive fieldwork with communities, educational institutions, media outlets, and publishers. The result is a dataset of 5,594 items: 4,442 text documents totalling 5.6 million words and 1,152 speech files totalling 177 hours. Given the project's limited timeline and the constraints imposed by the COVID-19 pandemic, collecting this volume of data reflects considerable dedication and planning.
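To give a sense of scale, the reported totals can be turned into rough per-item averages. The sketch below is illustrative only: the `CorpusStats` class and its property names are hypothetical, and because the word count (5.6M) and speech duration (177 hours) are rounded figures, the averages are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    """Headline corpus figures (hypothetical container, not part of Kencorpus)."""
    text_files: int
    speech_files: int
    words: float          # rounded figure as reported
    speech_hours: float   # rounded figure as reported

    @property
    def total_items(self) -> int:
        # Text documents plus speech recordings.
        return self.text_files + self.speech_files

    @property
    def mean_words_per_text(self) -> float:
        return self.words / self.text_files

    @property
    def mean_minutes_per_recording(self) -> float:
        return self.speech_hours * 60 / self.speech_files


# Figures as reported for Kencorpus.
KENCORPUS = CorpusStats(text_files=4442, speech_files=1152,
                        words=5.6e6, speech_hours=177)

print(KENCORPUS.total_items)
print(round(KENCORPUS.mean_words_per_text))
print(round(KENCORPUS.mean_minutes_per_recording, 1))
```

On these figures a text document averages roughly 1,260 words and a speech file roughly 9 minutes, although the actual distribution across languages and sources may be far from uniform.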

The corpus includes task-specific annotations. A total of 143,000 words were tagged for parts of speech (POS): 50,000 for Dholuo and 93,000 for Luhya. The team also built a QA dataset of 7,537 question-answer pairs from Swahili text and a translation set of 13,400 sentences from Dholuo and Luhya into Swahili. These curated datasets directly target the lack of digital inclusivity that these languages face.

Demonstrating Practical Applications

To validate the corpus's utility, the team developed two proof-of-concept systems. The first was a Swahili speech-to-text system that achieved an 18.87% word error rate, a notable result given the modest volume of training data available. The second was a question answering system. A deep learning approach using the XLM-RoBERTa model reached a 59.4% F1 score, while a semantic network-based method achieved an 80% exact match on the QA task. The latter result points to approaches that may sidestep the need for large datasets, a frequent limitation for low-resource languages.
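For readers unfamiliar with the three metrics quoted above, the following is a minimal sketch of how word error rate, exact match, and token-level F1 are commonly computed. It is not the paper's evaluation code: the function names are hypothetical, the normalization (lowercasing and whitespace splitting) is a simplifying assumption, and the Swahili strings are made-up examples rather than Kencorpus data.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def _tokens(text: str) -> list[str]:
    # Simplified normalization (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return _tokens(prediction) == _tokens(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = _tokens(prediction), _tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

As a usage example, `word_error_rate("habari ya asubuhi rafiki yangu", "habari za asubuhi rafiki")` counts one substitution and one deletion against five reference words, giving 0.4. For QA, a prediction of `"mji wa nairobi"` against the gold answer `"nairobi"` fails exact match but earns a token F1 of 0.5. This also shows why the 59.4% F1 and 80% exact match figures measure different things and are not directly comparable.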

Implications and Future Developments

Kencorpus is a significant contribution to computational linguistics for low-resource languages. By providing open-access datasets under the CC BY 4.0 license, the project invites further research and development within the machine learning community. The proof-of-concept results suggest that deep learning models designed for high-resource languages can underperform on small datasets (59.4% F1 for XLM-RoBERTa), whereas alternative approaches such as semantic networks can yield effective results with limited input data (80% exact match).

Moving forward, the corpus can be augmented with new data over time. This would broaden its usability for training machine learning models and should, in principle, improve performance on the tasks it supports. The project's execution also lays the groundwork for similar initiatives by highlighting key considerations in data collection and processing in under-resourced linguistic environments.

Conclusion

The Kencorpus project effectively bridges a critical gap in the availability of language resources for Swahili, Dholuo, and Luhya. The resulting corpus not only facilitates immediate applications within question answering and speech-to-text systems but also fosters further NLP research. Moreover, it sets a precedent for future corpus-building initiatives, encouraging a scalable approach to accommodating the diverse linguistic landscape of Africa and other regions with low-resource languages.
