Adapting LLMs for African Languages: An Overview of Lugha-Llama
The paper "Lugha-Llama: Adapting LLMs for African Languages" by Buzaaba et al. presents a comprehensive paper on the adaptation of LLMs to cater to the linguistic diversity inherent in African languages, which are often underrepresented in conventional LLM training corpora. This research is pivotal considering that Africa hosts a substantial proportion of the world's linguistic diversity, yet its languages remain low-resource in the context of available digital data.
Methodological Framework
The authors introduce a tailored approach to improving LLM performance on African languages through the Lugha-Llama models. These models extend the Llama-3.1-8B architecture via continued pretraining on 10 billion tokens, drawn from a strategic selection of multilingual data in the WURA corpus, a collection of African and some high-resource languages, augmented with high-quality English texts. The data composition employs UniMax sampling to ensure balanced representation across languages.
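To make the sampling strategy concrete, the sketch below implements the UniMax procedure (Chung et al., 2023): a fixed token budget is split as evenly as possible across languages, with each language capped at a fixed number of epochs over its corpus so that small languages are not endlessly repeated. The corpus sizes and the four-epoch cap are illustrative assumptions, not the paper's exact settings.

```python
def unimax_allocation(tokens_per_language, total_budget, max_epochs=4):
    """Allocate a training-token budget across languages as uniformly as
    possible, capping each language at `max_epochs` passes over its data
    (the UniMax procedure of Chung et al., 2023)."""
    # Visit languages from smallest to largest corpus, so that capped
    # languages release their unused share to the remaining ones.
    langs = sorted(tokens_per_language, key=tokens_per_language.get)
    allocation, remaining_budget = {}, float(total_budget)
    for i, lang in enumerate(langs):
        fair_share = remaining_budget / (len(langs) - i)  # uniform split of what is left
        cap = max_epochs * tokens_per_language[lang]      # epoch limit for this language
        allocation[lang] = min(fair_share, cap)
        remaining_budget -= allocation[lang]
    return allocation

# Hypothetical corpus sizes (in tokens), for illustration only.
corpus = {"swa": 1.2e9, "hau": 0.9e9, "yor": 0.3e9, "eng": 50e9}
print(unimax_allocation(corpus, total_budget=10e9))
```

In this toy run the small Yoruba corpus hits its epoch cap and its leftover budget flows to the larger languages, which is exactly the behavior that keeps low-resource languages represented without over-repeating them.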
Three variants were trained: Lugha-Llama, using the WURA corpus alone; Lugha-Llama-edu, which mixes in high-quality English educational texts; and Lugha-Llama-math, which adds English mathematical documents. The English content is included to leverage the richer, better-curated data available in English resources, as sketched below.
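A minimal sketch of such a data mixture, using Hugging Face datasets to interleave an African-language split with English educational text. The dataset IDs, the "text" column name, and the 90/10 sampling ratio are assumptions for illustration; the paper's exact sources and proportions may differ.

```python
from datasets import load_dataset, interleave_datasets

# Dataset IDs and column names below are illustrative assumptions.
wura = load_dataset("castorini/wura", "swa", split="train", streaming=True)
edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Interleave African-language data with high-quality English educational
# text at a fixed sampling ratio for continued pretraining.
mixed = interleave_datasets(
    [wura.select_columns(["text"]), edu.select_columns(["text"])],
    probabilities=[0.9, 0.1],  # assumed ratio, not the paper's setting
    seed=42,
)

for example in mixed.take(3):
    print(example["text"][:80])
```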
Empirical Insights
The Lugha-Llama models were evaluated on IrokoBench, AfriQA, and other benchmarks spanning a range of linguistic and reasoning tasks. They consistently outperformed the baselines, particularly on tasks requiring deep linguistic comprehension such as AfriMMLU and AfriQA. Notably, adding high-quality English data further improved performance, indicating that even in multilingual models, the quality of the training data can matter more than the language of the data per se.
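For benchmarks like AfriMMLU, the standard evaluation scores each answer choice by its log-likelihood under the model and picks the argmax. The sketch below shows this recipe with transformers; the model ID and the Swahili example item are assumptions for illustration, and the prompt/continuation split is a simplification that assumes the prompt tokenization is a prefix of the full sequence.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID is assumed for illustration; substitute the released checkpoint.
model_id = "Lugha-Llama/Lugha-Llama-8B-wura"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def choice_logprob(question: str, choice: str) -> float:
    """Sum the model's log-probability of `choice` conditioned on `question`,
    the usual scoring rule for multiple-choice benchmarks like AfriMMLU."""
    prompt_ids = tok(question, return_tensors="pt").input_ids
    full_ids = tok(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits.log_softmax(-1)
    # Score only the continuation tokens (everything after the prompt);
    # the logits at position pos-1 predict the token at position pos.
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += logits[0, pos - 1, full_ids[0, pos]].item()
    return total

question = "Swali: Mji mkuu wa Tanzania ni upi?"  # illustrative Swahili item
choices = ["Dodoma", "Nairobi", "Kampala", "Accra"]
print(max(choices, key=lambda c: choice_logprob(question, c)))
```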
The empirical findings are illuminating: the Lugha-Llama models showed marked improvements over baselines such as Llama-3.1 and other African-centric models like AfroLlama and InkubaLM. In particular, the gains on IrokoBench suggest that supplementing with curated English data improves the models' handling of knowledge-intensive queries in African languages.
Theoretical and Practical Implications
Buzaaba et al.'s research contributes to crucial discussions about multilingual and cross-lingual transfer in LLMs. The observed gains carry implications for both the methodology and practice of language modeling. Theoretically, the paper supports the view that curated multilingual corpora, augmented with high-quality data in a major language such as English, can lift performance on lower-resource languages. It also underlines the challenges posed by data scarcity and quality disparities across languages, reinforcing the need for methodical data curation.
Practically, the release of the Lugha-Llama models and newly created datasets could catalyze research into African language processing, thus promoting linguistic inclusivity in AI technologies. These models provide a foundation for developing applications that cater to the native languages spoken by millions across Africa, thereby directly impacting information accessibility and computational linguistics research in underrepresented languages.
Future Directions
The research underscores promising avenues for future work, particularly large-scale machine translation to mitigate data-quality gaps and further adaptation techniques that better support low-resource languages in multilingual models. The authors also propose comparative analyses against high-resource LLMs such as GPT-4 to sharpen insights into cross-lingual generalization.
In summary, the paper offers a pertinent, data-driven approach to improving LLM performance on African languages and advocates for equitable representation of low-resource languages in the NLP landscape. The success of Lugha-Llama sets a precedent for leveraging high-resource data to strengthen language technologies for underrepresented languages, a significant stride toward linguistic inclusivity in AI.