EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation (2403.13737v3)

Published 20 Mar 2024 in cs.CL

Abstract: Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream natural language processing (NLP) tasks. However, low-resource languages still lag behind current state-of-the-art (SOTA) developments in NLP because there are insufficient resources to train LLMs for them. Ethiopian languages exhibit remarkable linguistic diversity, encompass a wide array of scripts, and carry profound religious and cultural significance. This paper introduces EthioLLM, a set of multilingual LLMs for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark, a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual LLMs, the new benchmark datasets, and the task-specific fine-tuned LLMs, and we discuss the performance of the models. Our datasets and models are available in the EthioNLP repository at https://huggingface.co/EthioNLP.

Authors (13)
  1. Atnafu Lambebo Tonja (27 papers)
  2. Israel Abebe Azime (16 papers)
  3. Tadesse Destaw Belay (12 papers)
  4. Mesay Gemeda Yigezu (8 papers)
  5. Moges Ahmed Mehamed (3 papers)
  6. Abinew Ali Ayele (17 papers)
  7. Ebrahim Chekol Jibril (3 papers)
  8. Michael Melese Woldeyohannis (1 paper)
  9. Olga Kolesnikova (24 papers)
  10. Philipp Slusallek (27 papers)
  11. Dietrich Klakow (114 papers)
  12. Shengwu Xiong (31 papers)
  13. Seid Muhie Yimam (41 papers)