AfroBench: How Good are Large Language Models on African Languages?

Published 14 Nov 2023 in cs.CL, cs.AI, and cs.LG | (2311.07978v5)

Abstract: Large-scale multilingual evaluations, such as MEGA, often include only a handful of African languages due to the scarcity of high-quality evaluation data and the limited discoverability of existing African datasets. This lack of representation hinders comprehensive LLM evaluation across a diverse range of languages and tasks. To address these challenges, we introduce AfroBench -- a multi-task benchmark for evaluating the performance of LLMs across 64 African languages, 15 tasks and 22 datasets. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task. We present results comparing the performance of prompting LLMs to fine-tuned baselines based on BERT and T5-style models. Our results suggest large gaps in performance between high-resource languages, such as English, and African languages across most tasks; but performance also varies based on the availability of monolingual data resources. Our findings confirm that performance on African languages continues to remain a hurdle for current LLMs, underscoring the need for additional efforts to close this gap. https://mcgill-nlp.github.io/AfroBench/




Summary

The paper presents an analytical evaluation comparing four popular LLMs on 60 African languages across six NLP tasks, revealing significant performance disparities.
The methodology assesses models like mT0, Aya, LLaMa 2, and GPT-4 using tasks such as translation, summarization, and question answering, with mT0 and Aya showing strengths in specific areas.
The findings underscore the need for more inclusive pre-training data and custom tuning to bridge performance gaps in low-resource African languages.

Evaluative Overview of the Performance of LLMs in African Languages

Recent advances in large language models (LLMs) have significantly improved their ability to perform in-context learning across tasks and languages. However, their performance on African languages has been studied far less than their performance on high-resource languages. The paper under discussion provides an analytical overview of four popular LLMs (mT0, Aya, LLaMa 2, and GPT-4) across a diverse set of tasks in African languages spanning several language families and geographical regions.

Performance Analysis across Tasks and Languages

The study evaluates the models on six distinct tasks: topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition, across a total of 60 African languages. The results show a notable and persistent performance gap between African languages and high-resource languages, which suggests that additional research and development efforts are needed for low-resource languages.

According to the findings, the models perform worse on African languages overall, particularly in generative tasks such as machine translation and summarization. Specifically, GPT-4 displays average to good performance on classification tasks, while its capabilities lag significantly on generative tasks. Interestingly, the mT0 model outperformed both GPT-4 and fine-tuned mT5 models in cross-lingual question answering for African languages. Similarly, the recently introduced Aya model achieved results comparable to mT0 on most tasks and was notably stronger in topic classification. Conversely, LLaMa 2 consistently demonstrated the weakest performance, likely because its pre-training corpus is predominantly English and code.
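The disparity described in this section can be made concrete as the difference between a model's score on a high-resource reference language and its average score over the African languages evaluated. The following is a minimal sketch of that calculation, not the paper's evaluation code: the function name `performance_gap`, the language codes, and all scores are hypothetical placeholders chosen for illustration.

```python
from statistics import mean


def performance_gap(scores, reference="eng"):
    """Compare a reference language against the average of all other languages.

    `scores` maps a language code to one task metric (for example accuracy or F1).
    Returns (reference score, mean score of the other languages, gap).
    """
    if reference not in scores:
        raise ValueError(f"no score for reference language {reference!r}")
    others = [score for lang, score in scores.items() if lang != reference]
    if not others:
        raise ValueError("need at least one non-reference language")
    reference_score = scores[reference]
    average_other = mean(others)
    return reference_score, average_other, reference_score - average_other


# Placeholder numbers for illustration only; these are NOT results from the paper.
toy_scores = {"eng": 90.0, "swa": 70.0, "hau": 60.0, "yor": 50.0}

reference_score, average_other, gap = performance_gap(toy_scores)
print(f"reference={reference_score:.1f} average={average_other:.1f} gap={gap:.1f}")
```

Computing the gap per task rather than pooling all tasks keeps metrics with different scales (for example F1 for classification and a translation metric for generation) from being averaged together.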

Implications and Future Directions

The paper highlights actionable insights for further research and development on LLMs for African languages. On the practical side, it calls for concerted efforts to improve LLM performance on African languages by representing them more prominently in pre-training datasets. This approach could narrow the performance gap relative to high-resource languages.

Theoretical implications include exploring new methodologies for instruction fine-tuning and multitask learning that better cater to African languages. Future research could explore customizing LLMs to be more inclusive of diverse dialects and linguistic nuances native to the African continent.

In conclusion, the research underscores the need for a more inclusive approach to developing and evaluating LLMs, with a focus on the challenges that low-resource African languages pose for current models. Closing the existing performance gaps will require ongoing evaluation and iterative improvement so that these models benefit users across all linguistic communities. The paper provides a foundation for future initiatives and research in this domain.
