Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects (2406.19564v1)

Published 27 Jun 2024 in cs.CL

Abstract: Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus, YORÙLECT, across three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release the YORÙLECT dataset and models publicly under an open license.

Overview of "Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects"

This paper addresses a significant gap in NLP resources for Yorùbá, an African language encompassing a rich continuum of dialects. Although Yorùbá has approximately 47 million speakers, most NLP efforts for the language have focused on its standard dialect. This work counters the resulting disparities by introducing the YORÙLECT corpus, a comprehensive, high-quality parallel text and speech dataset spanning three domains and four distinct regional Yorùbá dialects.

YORÙLECT Corpus and Experiments

The publication of the YORÙLECT corpus aims to alleviate dialectal disparities by providing a resource that reflects the linguistic diversity of Yorùbá. The authors collected and annotated 1,506 sentences in four dialects (Standard Yorùbá, Ifè, Ìjèbú, and Ìlàje) covering three key domains: religious texts, news articles, and TED talks. Using this corpus, they conducted experiments on machine translation (MT), automatic speech recognition (ASR), and speech-to-text translation (S2TT). The results show that contemporary models fail to handle dialectal variation adequately, though performance improved after dialect-adaptive finetuning.
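For orientation, here is a minimal sketch of how parallel data of this shape might be represented and tallied in Python; the file name and column layout are illustrative assumptions, not the released format:

```python
# Hypothetical layout for a parallel dialect corpus: each row pairs a dialect
# sentence with its Standard Yorùbá and English counterparts, tagged by domain.
# The TSV file name and column names are assumptions for illustration only.
import csv
from collections import Counter

with open("yorulect_sample.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))
    # assumed columns: dialect, domain, dialect_text, standard_text, english_text

# Tally sentences per dialect and per (dialect, domain) to inspect corpus balance
print(Counter(r["dialect"] for r in rows))
print(Counter((r["dialect"], r["domain"]) for r in rows))
```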

Numerical Results and Findings

Zero-shot evaluation of several machine translation systems, including M2M100, NLLB, GMNMT, Menyo, MT0, and Aya, revealed substantial underperformance on the non-standard dialects compared with Standard Yorùbá. Notably, Google Translate achieved the strongest zero-shot results but still showed large dialectal gaps, especially for Ìlàje, as indicated by its BLEU scores.
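As a concrete illustration of this kind of evaluation, the following sketch scores per-dialect translations with the sacrebleu library; the file paths and dialect identifiers are hypothetical, not taken from the paper's release:

```python
# Score zero-shot MT output separately for each dialect with sacrebleu,
# making per-dialect gaps visible. Paths and identifiers are hypothetical.
import sacrebleu

DIALECTS = ["standard", "ife", "ijebu", "ilaje"]

for dialect in DIALECTS:
    with open(f"hyps/{dialect}.en", encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(f"refs/{dialect}.en", encoding="utf-8") as f:
        references = [line.strip() for line in f]
    # corpus_bleu takes a list of reference streams, hence the nested list
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"{dialect}: BLEU = {bleu.score:.1f}")
```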

After dialect-adaptive finetuning, BLEU scores for the non-standard dialects improved substantially, by 14 points for MT and 5 points for S2TT, while ASR error rates decreased by 20 points. Nonetheless, these tasks remain challenging, particularly S2TT, which showed only modest gains from finetuning.
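A hedged sketch of what dialect-adaptive finetuning can look like in practice: continue training a pretrained multilingual MT model (here M2M100, one of the systems evaluated above) on a dialect's parallel data. The dataset fields and hyperparameters below are illustrative assumptions, not the paper's configuration:

```python
# Dialect-adaptive finetuning sketch: adapt a pretrained multilingual MT model
# to one dialect by continuing training on its parallel data. Field names and
# hyperparameters are assumptions, not the paper's settings.
from transformers import (M2M100ForConditionalGeneration, M2M100Tokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
tokenizer.src_lang, tokenizer.tgt_lang = "yo", "en"  # Yorùbá -> English

def preprocess(batch):
    # batch is assumed to carry parallel dialect/English sentence pairs
    return tokenizer(batch["dialect_text"], text_target=batch["english_text"],
                     truncation=True, max_length=128)

# train_dataset = raw_dataset.map(preprocess, batched=True)  # hypothetical dataset
args = Seq2SeqTrainingArguments(
    output_dir="m2m100-dialect-ft",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```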

Implications and Future Directions

The establishment of a resource such as YORÙLECT is a pivotal step toward NLP advances for underrepresented African languages and dialects. The corpus not only improves the quality and inclusivity of language technologies for Yorùbá but also serves as a template for similar efforts in other low-resource languages. The work is an explicit call for the research community to adopt dialect-inclusive approaches that promote fairness and equity in language technology.

The work underscores the necessity of expanding datasets and leveraging dialect-specific data to better model linguistic phenomena and variations. Furthermore, practical tools developed from such resources could play a crucial role in educational, cultural, and communication spheres, particularly in regions where these dialects are predominant.

Speculation on Future Developments

Long-term developments in this field may include building models that natively handle dialectal variation, or integrating the YORÙLECT corpus with additional sources to form larger pretraining datasets. Collaboration with linguists to characterize authentic dialect features, and the application of techniques such as parameter-efficient finetuning, could further improve the precision and adaptability of NLP systems.
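For instance, here is a minimal sketch of parameter-efficient finetuning via LoRA with the peft library, applied to the same M2M100 model as above; the adapter rank and target modules are assumptions, not settings from the paper:

```python
# Parameter-efficient finetuning sketch: wrap the base model in LoRA adapters
# so only a small number of new weights are trained per dialect.
# Rank, alpha, and target modules below are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import M2M100ForConditionalGeneration

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projection layers
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter weights train
```

Training one such adapter per dialect would keep per-dialect storage small while sharing a single frozen base model.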

Overall, this paper contributes a valuable resource to the often-overlooked study of dialectal NLP, providing an impetus for future research to treat within-language diversity as a primary consideration, rather than an afterthought, in system development and deployment.

Authors (8)
  1. Orevaoghene Ahia (23 papers)
  2. Anuoluwapo Aremu (16 papers)
  3. Diana Abagyan (2 papers)
  4. Hila Gonen (30 papers)
  5. David Ifeoluwa Adelani (59 papers)
  6. Daud Abolade (4 papers)
  7. Noah A. Smith (224 papers)
  8. Yulia Tsvetkov (142 papers)
Citations (1)