
OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal Conversations on Online Social Media (2309.12137v1)

Published 21 Sep 2023 in cs.CL and cs.AI

Abstract: While resources for the English language are fairly sufficient for understanding content on social media, similar resources for Arabic are still immature. The main reason is that Arabic has many dialects in addition to the standard version (MSA). Arabs do not use MSA in their daily communications; rather, they use dialectal versions. Unfortunately, social media users carry this habit over to social media platforms, which in turn has raised an urgent need for building suitable AI models for language-dependent applications. Existing machine translation (MT) systems designed for MSA fail to work well with Arabic dialects. In light of this, it is necessary to adapt to the informal nature of communication on social networks by developing MT systems that can effectively handle the various dialects of Arabic. Unlike MSA, for which MT systems have made advanced progress, Arabic dialects have received little attention in MT research. While a few attempts have been made to build translation datasets for dialectal Arabic, they are domain dependent and are not suited to the culture and language of online social networks (OSNs). In this work, we attempt to alleviate these limitations by proposing an online social network-based multidialect Arabic dataset that is crafted by contextually translating English tweets into four Arabic dialects: Gulf, Yemeni, Iraqi, and Levantine. To perform the translation, we followed our proposed guideline framework for content translation, which could be universally applicable for translation between foreign languages and local dialects. We validated the authenticity of our proposed dataset by developing neural MT (NMT) models for the four Arabic dialects. Our results show superior performance of the NMT models trained on our dataset. We believe that our dataset can reliably serve as an Arabic multidialectal translation dataset for informal MT tasks.

Authors
  1. Fatimah Alzamzami
  2. Abdulmotaleb El Saddik
