Cheetah: Natural Language Generation for 517 African Languages (2401.01053v3)

Published 2 Jan 2024 in cs.CL

Abstract: Low-resource African languages pose unique challenges for NLP tasks, including natural language generation (NLG). In this paper, we develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties, allowing us to address the scarcity of NLG resources and provide a solution to foster linguistic diversity. We demonstrate the effectiveness of Cheetah through comprehensive evaluations across six downstream generation tasks. In five of the six tasks, Cheetah significantly outperforms other models, showcasing its remarkable performance in generating coherent and contextually appropriate text in a wide range of African languages. We additionally conduct a detailed human evaluation to delve deeper into the linguistic capabilities of Cheetah. The introduction of Cheetah has far-reaching benefits for linguistic diversity. By leveraging pretrained models and adapting them to specific languages, our approach facilitates the development of practical NLG applications for African communities. The findings of this study contribute to advancing NLP research in low-resource settings, enabling greater accessibility and inclusion for African languages in a rapidly expanding digital landscape. We publicly release our models for research.
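
The abstract states that the models are publicly released. Below is a minimal sketch of running inference with such a checkpoint via Hugging Face Transformers, assuming a T5-style encoder-decoder architecture; the Hub identifier "UBC-NLP/cheetah-base" and the Yoruba example input are assumptions for illustration, not details confirmed by the text above.

```python
# Minimal sketch: inference with a released Cheetah checkpoint.
# Assumptions (not confirmed by the abstract): the weights are hosted on
# the Hugging Face Hub as "UBC-NLP/cheetah-base" and use a T5-style
# encoder-decoder, so AutoModelForSeq2SeqLM applies.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "UBC-NLP/cheetah-base"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Encode an example input (here: a Yoruba greeting) and generate text.
inputs = tokenizer("Ẹ káàárọ̀, báwo ni?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```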

Authors (3)
  1. Ife Adebara (12 papers)
  2. AbdelRahim Elmadany (33 papers)
  3. Muhammad Abdul-Mageed (102 papers)
Citations (1)