AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages (2211.03263v2)

Published 7 Nov 2022 in cs.CL, cs.AI, and cs.LG

Abstract: In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires a lot of training data, which is not available for African languages. Active learning is a semi-supervised learning algorithm in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially the pretraining of multilingual language models, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than those of existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various downstream NLP tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM is able to generalize well across various domains. We release our source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.

Overview of AfroLM: A Multilingual Pretrained Language Model for African Languages

The paper "AfroLM: A Self-Active Learning-based Multilingual Pretrained LLM for 23 African Languages" presents an innovative approach towards enhancing NLP capabilities in low-resource language settings. Central to this paper is AfroLM, a multilingual LLM designed to support 23 African languages, making it the most extensive model of its kind to date. The research introduces a self-active learning framework that significantly improves the data efficiency and performance of LLMs, particularly for languages where large-scale training datasets are unavailable.

Key Contributions

  • Self-Active Learning Framework: The novel self-active learning framework allows the model to iteratively expand its training data from a small initial dataset. Unlike traditional active learning, where a separate oracle model selects samples, AfroLM employs the same model for both learning and querying, simplifying the training process and reducing computational cost (a minimal sketch of this loop follows this list).
  • Robust Performance on Downstream Tasks: AfroLM demonstrated superior performance on various NLP tasks, such as named entity recognition (NER), text classification, and sentiment analysis, outperforming earlier models like AfriBERTa, XLM-RoBERTa (base), and mBERT. This is remarkable given that AfroLM was pretrained on a dataset roughly 14 times smaller than those used by these baselines.
  • Diverse Language Coverage: The model's support for 23 African languages, including widely spoken languages like Swahili and less-resourced ones like Fon and Ghomálá', addresses the linguistic diversity of the African continent. This supports more inclusive technology development, recognizing the linguistic needs of millions of speakers.
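
The loop below is a minimal sketch of such a self-active learning round, not the authors' implementation: `train_fn` and `score_fn` are hypothetical caller-supplied helpers, and using the model's own per-example masked-LM loss as the acquisition score is an assumption about one plausible "most beneficial sample" criterion.

```python
from typing import Callable, List, Sequence, Tuple


def self_active_learning(
    model,
    seed_data: Sequence,
    unlabeled_pool: Sequence,
    train_fn: Callable,   # train_fn(model, data) -> model
    score_fn: Callable,   # score_fn(model, example) -> float, e.g. MLM loss
    rounds: int = 5,
    per_round: int = 1000,
) -> Tuple[object, List]:
    """Grow the training set from a small seed: the same model both learns
    and selects the next samples to train on (no separate oracle model)."""
    train_data, pool = list(seed_data), list(unlabeled_pool)
    for _ in range(rounds):
        model = train_fn(model, train_data)        # learning step
        # Query step: rank the remaining pool by the model's own loss and
        # keep the hardest examples, assumed to be the most informative.
        pool.sort(key=lambda ex: score_fn(model, ex), reverse=True)
        train_data.extend(pool[:per_round])
        pool = pool[per_round:]
    return model, train_data
```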

Experimental Evaluation and Results

In terms of technical accomplishments, the model showcases its effectiveness through a series of robust experiments:

  • NER Performance: With active learning, AfroLM outperformed AfriBERTa and performed competitively against models trained on significantly larger datasets. Its ability to generalize across languages and perform well on out-of-domain sentiment analysis tasks further underscores the utility of the self-active learning framework (a fine-tuning sketch follows this list).
  • Data Efficiency: The research provides empirical evidence that the self-active learning framework leads to substantial performance improvements even with limited data availability, which is a critical consideration for many low-resource languages.
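
As one concrete illustration of how such a compact pretrained encoder is reused downstream, the snippet below sketches fine-tuning it for token classification (NER) with Hugging Face `transformers`. This is not the authors' evaluation script: the checkpoint identifier, label count, hyperparameters, and the `ner_train`/`ner_dev` datasets are placeholders; the released weights should be obtained as described in the paper's GitHub repository.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder checkpoint name -- check the AfroLM repository for the
# actual released weights and their identifier.
CHECKPOINT = "path/or/hub-id/of-afrolm-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForTokenClassification.from_pretrained(
    CHECKPOINT, num_labels=9  # e.g. the 9 BIO tags used in MasakhaNER
)

args = TrainingArguments(
    output_dir="afrolm-ner",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
)

# ner_train / ner_dev: token-classification datasets prepared by the user
# (e.g. MasakhaNER splits tokenized and aligned to subword tokens).
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ner_train,
    eval_dataset=ner_dev,
    tokenizer=tokenizer,
)
trainer.train()
```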

Implications and Future Work

The demonstrated efficacy of AfroLM has profound implications for the development of NLP applications tailored to African languages. By enhancing data efficiency and reducing the reliance on large datasets, this framework could facilitate broader adoption and development of language technology tools for underrepresented languages globally.

For future research, the authors suggest exploring the relationship between the number of active learning rounds and model performance to further improve the framework's efficiency. Additionally, incorporating a weighted loss function and diverse sample generation techniques could further stabilize and enrich the training process.
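
A toy illustration of one such weighting scheme is shown below; the inverse-frequency class weights used here are only one plausible choice, since the paper proposes a weighted loss as future work without fixing its form.

```python
import torch
import torch.nn as nn

# Toy class frequencies (e.g. examples per language or per NER tag).
label_counts = torch.tensor([900.0, 80.0, 20.0])
# Inverse-frequency weights: rare classes contribute more to the gradient.
weights = label_counts.sum() / (len(label_counts) * label_counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(4, 3)                   # (batch, num_classes)
targets = torch.tensor([0, 2, 1, 2])
print(loss_fn(logits, targets).item())
```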

Conclusion

This paper makes a significant stride in multilingual language modeling by addressing challenges associated with low-resource settings. The AfroLM model and its underlying self-active learning framework offer a promising pathway for improving NLP tasks across diverse languages with minimal data requirements. Consequently, it represents a vital step towards more equitable access to language technologies for speakers of African languages.

Authors (8)
  1. Bonaventure F. P. Dossou (30 papers)
  2. Atnafu Lambebo Tonja (27 papers)
  3. Oreen Yousuf (8 papers)
  4. Salomey Osei (21 papers)
  5. Abigail Oppong (8 papers)
  6. Iyanuoluwa Shode (11 papers)
  7. Oluwabusayo Olufunke Awoyomi (2 papers)
  8. Chris Chinenye Emezue (15 papers)
Citations (41)