InkubaLM: A Model for Low-Resource African Languages
This essay provides an overview of the paper "InkubaLM: A small language model for low-resource African languages," outlining the research's context, methodology, and results, and discussing its implications for future work in NLP.
Introduction
LLMs have redefined the landscape of NLP, enabling groundbreaking advances in applications such as machine translation, sentiment analysis, and conversational AI. However, most LLMs are trained predominantly on high-resource languages such as English and Mandarin, leaving low-resource languages, particularly many African languages, underrepresented. The primary challenge lies in the scarcity of quality datasets and of the computational resources needed to develop effective LLMs for these languages.
"InkubaLM," a 0.4 billion parameter multilingual model, addresses this disparity by providing a small yet effective LLM that outperforms larger ones in certain NLP tasks. This paper explores the development, evaluation, and implications of InkubaLM for low-resource African languages.
Background and Related Work
Traditional approaches to NLP in low-resource settings often involve multilingual models and cross-lingual transfer learning. Models such as Multilingual BERT (mBERT) and XLM-R leverage shared linguistic representations to enhance performance. Furthermore, specialized models like AfriBERTa have demonstrated that "small data" approaches can outperform more generalized models. Despite these efforts, challenges like linguistic bias and ethical considerations persist.
InkubaLM builds on this line of work on smaller models, incorporating techniques such as Flash Attention for efficiency and contributing to the body of research focused on making NLP models more inclusive and effective for underrepresented languages.
Methodology
Dataset Construction
Two datasets were constructed: Inkuba-Mono for monolingual pre-training and Inkuba-Instruct for fine-tuning on instruction-based tasks.
- Inkuba-Mono: This dataset consists of 2.4 billion tokens across five African languages (Hausa, isiZulu, isiXhosa, Swahili, and Yoruba) alongside English and French. The data were normalized and tokenized with Byte Pair Encoding (BPE); a minimal tokenizer-training sketch follows this list.
- Inkuba-Instruct: This dataset facilitated instruction fine-tuning for tasks like machine translation and sentiment analysis, combining multilingual sources to support task-specific performance.
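The paper does not reproduce its tokenizer-training script, so the following is only a minimal sketch of how a shared multilingual BPE tokenizer might be trained with the Hugging Face `tokenizers` library. The file paths, vocabulary size, and special tokens are illustrative assumptions, not the paper's actual settings.

```python
# Hypothetical sketch: training a shared BPE tokenizer over the
# Inkuba-Mono languages. Paths, vocab size, and special tokens are
# placeholders, not the paper's published configuration.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32_000,                        # illustrative size
    special_tokens=["<unk>", "<s>", "</s>"],
)

# One normalized text file per language (hypothetical paths).
langs = ["hau", "zul", "xho", "swa", "yor", "eng", "fra"]
tokenizer.train([f"inkuba_mono/{l}.txt" for l in langs], trainer)
tokenizer.save("inkuba_bpe.json")
```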
Model Architecture
InkubaLM uses a decoder-only architecture with 0.4 billion parameters. Enhanced with Flash Attention and trained with Fully Sharded Data Parallel (FSDP), the model emphasizes computational efficiency. Training covered the five African languages plus English and French, with the multilingual corpus informing tokenizer and vocabulary design.
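As a rough illustration of this setup (not the authors' actual training code), the sketch below loads a decoder-only causal LM with Flash Attention enabled and wraps it in PyTorch FSDP; the checkpoint name, dtype, and learning rate are assumptions.

```python
# Illustrative Flash Attention + FSDP training setup.
# Checkpoint id, dtype, and learning rate are assumptions,
# not InkubaLM's published configuration.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")                      # one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "lelapa/InkubaLM-0.4B",                          # assumed checkpoint id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",         # requires the flash-attn package
).cuda()

model = FSDP(model)                                  # shard params, grads, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```

FSDP shards parameters, gradients, and optimizer state across devices, which is what keeps the memory footprint small enough to train efficiently on modest hardware.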
Evaluation
The model was evaluated across several benchmarks:
Sentiment Analysis
In zero-shot settings, InkubaLM exhibited superior performance in Swahili (42.47 F1) and competitive results in Hausa and Yoruba, reinforcing its capability in sentiment analysis relative to larger models such as SmolLM and MobiLlama.
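A minimal sketch of how such a zero-shot evaluation can be run is shown below: each candidate label is scored as a continuation of a prompt, and the highest-scoring label is taken as the prediction. The checkpoint name, prompt template, and example data are assumptions, not the paper's exact protocol.

```python
# Hypothetical zero-shot sentiment scoring loop; prompt format,
# checkpoint id, and example data are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import f1_score

model_name = "lelapa/InkubaLM-0.4B"                  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LABELS = ["positive", "negative", "neutral"]

def predict(text: str) -> str:
    scores = []
    for label in LABELS:
        ids = tok(f"Text: {text}\nSentiment: {label}", return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss       # mean token negative log-likelihood
        scores.append(-loss.item())                  # higher score = more likely label
    return LABELS[scores.index(max(scores))]

texts = ["Huduma ilikuwa nzuri sana."]               # placeholder Swahili example
golds = ["positive"]
preds = [predict(t) for t in texts]
print(f1_score(golds, preds, average="weighted"))
```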
Machine Translation
Evaluations were conducted in both directions (English to the African languages and vice versa). InkubaLM achieved strong results for isiZulu, underscoring its effectiveness, though performance varied across languages, indicating room for further refinement.
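Translation quality in such evaluations is typically reported with automatic metrics; the sketch below scores system output against references using sacrebleu (BLEU and chrF). The example sentences are placeholders, and the paper's exact metric configuration may differ.

```python
# Illustrative scoring of translation output with sacrebleu.
# Hypothesis and reference sentences are placeholders.
import sacrebleu

hypotheses = ["Ngiyabonga ngosizo lwakho."]            # model outputs (e.g. English -> isiZulu)
references = [["Ngiyabonga kakhulu ngosizo lwakho."]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}  chrF = {chrf.score:.2f}")
```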
AfriMMLU and AfriXNLI
InkubaLM achieved notable F1 scores across AfriMMLU and AfriXNLI tasks, demonstrating consistent and balanced performance across African languages. While larger models like Gemma-7B and BLOOMZ-7B surpassed it in absolute numbers, InkubaLM held its ground, showcasing the robustness of its architecture.
Implications and Future Work
InkubaLM’s development underscores the potential to create efficient, resource-effective LLMs for low-resource settings, setting a precedent for further research in this domain. Several key implications emerge:
- Resource Efficiency: InkubaLM’s performance relative to its size shows that smaller, well-tuned models can strike a practical balance between efficiency and capability.
- Fairness and Bias: By incorporating representations of low-resource languages, InkubaLM contributes to reducing linguistic bias inherent in many high-resource-based models.
- Scalability and Accessibility: The open-source release of InkubaLM and its datasets paves the way for wider adoption and further research, encouraging development tailored to local contexts.
Future work on expanding the model's linguistic coverage and refining its architectural efficiency could push the boundaries of what is achievable with limited resources. Advances in transfer learning, data augmentation, and more tailored fine-tuning strategies promise further gains.
Conclusion
The paper presents InkubaLM as a viable means of empowering African communities in NLP tasks through a smaller, efficient LLM. By addressing the challenges of low-resource settings and introducing efficient training and architectural methodologies, InkubaLM sets a new benchmark in multilingual language modeling, opening avenues for future research and practical applications in low-resource languages.