Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Published 7 May 2025 in cs.CL and cs.AI | (2505.04531v1)

Abstract: Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in NLP. This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.

Summary

Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages

The paper "Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review" presents a comprehensive assessment of strategies aimed at mitigating data scarcity challenges in the development of generative LLMs for low-resource languages (LRLs). Given the crucial role of generative LLMs in enhancing communication and preserving linguistic diversity, it is imperative to address the disparity between high-resource and low-resource languages. This review synthesizes findings from 54 studies, presenting a critical examination of technical methods that have been employed to augment data availability and improve model performance for LRLs.

Technical Approaches to Data Scarcity

The paper identifies several key technical approaches frequently adopted in the literature to overcome data scarcity in LRL modeling. These include:

  • Monolingual Data Augmentation: Techniques such as paraphrasing, grammatical transformations, and data enrichment are widely used to generate synthetic data from existing corpora. These approaches are praised for their ability to enhance data diversity and reduce the bias inherent in small datasets.
  • Back-Translation: This method creates synthetic parallel data by translating monolingual target-language text into the source language, pairing the machine-translated output with the original sentences as additional training examples. Despite its promise, the efficacy of back-translation is closely tied to the quality of the machine translation system employed.
  • Multilingual Training: Training models on data from multiple languages can leverage shared linguistic structures, facilitating cross-lingual transfer and improving performance in languages with limited data availability.
  • Prompt Engineering: Existing models are prompted with carefully constructed queries and instructions to test their ability to generate coherent outputs for LRLs, sometimes supported by cross-lingual examples and task-specific instructions.
  • Adaptive Learning: Fine-tuning pre-trained models on specific LRL datasets demonstrates potential for leveraging prior knowledge embedded in LLMs to boost LRL performance.

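The back-translation pipeline described above can be sketched as follows. This is a minimal illustration of the pipeline shape only: the toy word-level lexicons (`LRL_TO_EN`, `EN_TO_LRL`) are hypothetical stand-ins for trained MT models, and the example "LRL" words are invented for demonstration.

```python
# Sketch of back-translation data augmentation. The lexicons below are
# hypothetical placeholders for real MT models; only the pipeline shape
# (monolingual target text -> synthetic source -> training pairs) matters.

# Hypothetical toy lexicons standing in for trained translation systems.
LRL_TO_EN = {"moni": "hello", "dziko": "world"}
EN_TO_LRL = {v: k for k, v in LRL_TO_EN.items()}

def translate(sentence: str, lexicon: dict) -> str:
    """Word-by-word 'translation' via a toy lexicon (placeholder for MT)."""
    return " ".join(lexicon.get(w, w) for w in sentence.split())

def back_translate(monolingual_target: list) -> list:
    """Build synthetic (source, target) pairs from monolingual target text.

    Standard back-translation: translate each monolingual target-language
    sentence into the source language, then pair the synthetic source with
    the original target sentence as extra parallel training data.
    """
    pairs = []
    for tgt in monolingual_target:
        synthetic_src = translate(tgt, LRL_TO_EN)  # target -> source direction
        pairs.append((synthetic_src, tgt))
    return pairs

# Monolingual LRL corpus becomes synthetic EN->LRL training pairs.
augmented = back_translate(["moni dziko"])
```

In practice the placeholder `translate` step would be a trained target-to-source MT model, and the resulting pairs are mixed with genuine parallel data when training the source-to-target system.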
Quantitative Results and Evaluation

The paper underscores the importance of systematic evaluation using standardized metrics to judge the effectiveness of various augmentation techniques. BLEU scores are predominantly employed across studies to measure translation accuracy, highlighting back-translation's success in enhancing LRL model performance for machine translation tasks. Additionally, qualitative assessments through human evaluations provide insights into models' ability to capture nuanced linguistic and cultural aspects of LRLs.
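To make the dominant metric concrete, a sentence-level BLEU score can be computed from scratch as below. This is an illustrative sketch with uniform n-gram weights and a brevity penalty; published studies typically rely on standard tooling (e.g. sacreBLEU) for comparable corpus-level scores, and this simplified version is not a drop-in replacement for those.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        if total == 0 or overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        precisions.append(overlap / total)
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bp * geo_mean
```

A perfect match scores 1.0, while a hypothesis sharing no n-grams with the reference scores 0.0; real evaluations aggregate clipped counts over a whole test corpus rather than averaging sentence scores.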

Implications and Future Directions

This systematic review indicates substantial progress in addressing data scarcity for LRLs. However, it also reveals considerable disparities in the representation and resources among LRLs themselves. The review identifies the necessity for universal reporting standards for data availability to facilitate accurate cross-study comparisons.

Looking forward, the paper suggests that further research should focus on developing versatile models that can adapt to multiple downstream tasks beyond translation, such as dialogue generation and question answering. This diversification could help democratize access to AI-driven language technologies for speakers of underrepresented languages, enhancing linguistic inclusion in the digital age.

The findings from this review serve as a valuable resource for researchers and developers striving to build equitable AI systems capable of preserving linguistic diversity. Leveraging advanced models and innovative data augmentation techniques can play a pivotal role in empowering LRL speakers and fostering global accessibility to AI technologies.

Authors (2)
