Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages (2010.02353v2)

Published 5 Oct 2020 in cs.CL, cs.AI, and cs.LG

Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under https://github.com/masakhane-io/masakhane-mt.



This paper presents a case study of participatory research as a way to address the challenges of low-resourced machine translation (MT), with a specific focus on African languages. NLP research often lacks geographic diversity and concentrates on a few high-resourced languages. The work proposes participatory research as a methodology for involving all necessary agents in the MT development process for low-resourced languages, and demonstrates its feasibility and scalability through the Masakhane community initiative.

Overview

The core issue discussed is the "low-resourced" nature of many languages, particularly those in Africa, which stems not only from a lack of available data but also from systemic societal problems. These languages often have fewer digital resources because of historical inequalities and limited academic or linguistic research opportunities in the regions where they are predominant.

The authors propose a participatory research model in which non-traditional participants, such as native speakers without formal research training, are deeply involved in all aspects of the MT process: defining the problem, collecting data, developing models, and evaluating results. This approach aims to leverage local expertise and foster inclusivity by lowering the traditional barriers to entry into NLP research.

Experimental Implementation

The paper details the implementation of this initiative through the Masakhane project, which brought together over 400 participants across 20 African countries. The participants were involved in various capacities including data collection, model training using the JoeyNMT framework on Google Colab, and translation evaluations.

One of the notable achievements of this participatory approach is the compilation of novel translation datasets and establishment of MT benchmarks for over 30 African languages. These benchmarks and resources have been made publicly available for further research and development.

Numerical and Evaluation Results

The paper reports 46 benchmarks for neural translation models on African languages, with BLEU scores for various language pairs. Human evaluations are reported for some of the languages and show discrepancies between automated BLEU scores and actual translation quality in out-of-domain settings such as COVID-19 surveys and TED talks.

The results indicate substantial differences between translation quality as judged by human evaluators and as measured by automated metrics, with particular challenges observed in languages with varying dialects and terminologies, such as Igbo.
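
For readers unfamiliar with BLEU, the automated metric discussed in this section, the following is a minimal sketch of a corpus-level BLEU computation. It assumes whitespace tokenization, one reference per hypothesis, n-grams up to length 4, and no smoothing. The function names `ngram_counts` and `corpus_bleu` are illustrative only; this is not the evaluation code used by the authors, and scores from it are not comparable with the published benchmarks.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale.

    Illustrative assumptions: whitespace tokenization, exactly one
    reference per hypothesis, uniform n-gram weights, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-grams, per order
    hyp_len = 0
    ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = hyp.split()
        ref_tokens = ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Counter intersection keeps the minimum count per n-gram,
            # which is the "clipping" step of modified precision.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a zero match count at any order gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n

    # Brevity penalty: penalize output shorter than the reference.
    if hyp_len > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)

    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Toy sentences for illustration only; not data from the paper.
    hyps = ["the cat sat on the mat today"]
    refs = ["the cat sat on the mat"]
    print(f"BLEU: {corpus_bleu(hyps, refs):.2f}")
```

Because BLEU only counts surface n-gram overlap with a reference, a valid translation that uses a different dialect or terminology can score poorly, which is one reason automated scores and human judgments can diverge. In practice, reported scores are usually computed with a standardized tool so that tokenization is consistent across papers.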

Implications and Future Directions

The participatory approach demonstrated in this paper has both practical and theoretical implications. Practically, it allows for the development of MT systems that are more culturally and linguistically aligned with the low-resourced languages they aim to support. Theoretically, it demonstrates the potential effectiveness and scalability of participatory methodologies in addressing resource scarcity in NLP.

Furthermore, the Masakhane project serves as a prototype for future initiatives aimed at incorporating low-resourced languages into the broader NLP landscape. It emphasizes the importance of inclusivity and local expertise in the development of language technologies, suggesting that such models could be adapted to other geographic contexts facing similar challenges.

This paper encourages a rethinking of how the "low-resourced" language problem is approached, integrating community-driven efforts to improve the accessibility and efficacy of MT systems globally. The authors advocate expanding this collaborative format, highlighting the value of involving stakeholders from inception to implementation in NLP research.


Authors (48)

Wilhelmina Nekoto (1 paper)
Vukosi Marivate (47 papers)
Tshinondiwa Matsila (1 paper)
Timi Fasubaa (2 papers)
Tajudeen Kolawole (1 paper)
Taiwo Fagbohungbe (1 paper)
Solomon Oluwole Akinola (2 papers)
Shamsuddeen Hassan Muhammad (42 papers)
Salomon Kabongo (10 papers)
Salomey Osei (21 papers)
Sackey Freshia (3 papers)
Rubungo Andre Niyongabo (4 papers)
Ricky Macharm (2 papers)
Perez Ogayo (12 papers)
Orevaoghene Ahia (23 papers)
Musie Meressa (2 papers)
Mofe Adeyemi (2 papers)
Masabata Mokgesi-Selinga (1 paper)
Lawrence Okegbemi (1 paper)
Laura Jane Martinus (1 paper)
