MasakhaNEWS: News Topic Classification for African languages (2304.09972v2)
Abstract: African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several LLMs. Furthermore, we explore several alternatives to full fine-tuning of LLMs that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting LLMs (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieved more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.
- David Ifeoluwa Adelani
- Marek Masiak
- Israel Abebe Azime
- Jesujoba Alabi
- Atnafu Lambebo Tonja
- Christine Mwase
- Odunayo Ogundepo
- Bonaventure F. P. Dossou
- Akintunde Oladipo
- Doreen Nixdorf
- Chris Chinenye Emezue
- Blessing Sibanda
- Davis David
- Lolwethu Ndolela
- Jonathan Mukiibi
- Tunde Ajayi
- Tatiana Moteu
- Brian Odhiambo
- Abraham Owodunni
- Nnaemeka Obiefuna