EthioMT: Parallel Corpus for Low-resource Ethiopian Languages (2403.19365v1)

Published 28 Mar 2024 in cs.CL

Abstract: Recent research in NLP has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT -- a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.

Authors (4)
  1. Atnafu Lambebo Tonja
  2. Olga Kolesnikova
  3. Alexander Gelbukh
  4. Jugal Kalita