
A Context-Contrastive Inference Approach To Partial Diacritization (2401.08919v3)

Published 17 Jan 2024 in cs.CL and cs.LG

Abstract: Diacritization plays a pivotal role in improving readability and disambiguating the meaning of Arabic texts. Efforts have so far focused on marking every eligible character (Full Diacritization). Comparatively overlooked, Partial Diacritization (PD) is the selection of a subset of characters to be marked to aid comprehension where needed. Research has indicated that excessive diacritic marks can hinder skilled readers, reducing reading speed and accuracy. We conduct a behavioral experiment and show that partially marked text is often easier to read than fully marked text, and sometimes easier than plain text. In this light, we introduce Context-Contrastive Partial Diacritization (CCPD), a novel approach to PD which integrates seamlessly with existing Arabic diacritization systems. CCPD processes each word twice, once with context and once without, and diacritizes only the characters with disparities between the two inferences. Further, we introduce novel indicators for measuring partial diacritization quality, essential for establishing this as a machine learning task. Lastly, we introduce TD2, a Transformer variant of an established model which offers a markedly different performance profile on our proposed indicators compared to all other known systems.
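
The two-pass idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Diacritizer` interface, the `ccpd` function, and the `toy_model` stand-in are all hypothetical names introduced here, and the sketch assumes that a "disparity" means the predicted label for a character differs between the two passes. The toy model uses Latin placeholders rather than a real Arabic diacritizer.

```python
from typing import Callable, List

# Hypothetical interface (an assumption, not from the paper): a diacritizer
# takes a list of words and returns, for each word, one diacritic label per
# character, where "" means "no mark".
Diacritizer = Callable[[List[str]], List[List[str]]]


def ccpd(words: List[str], model: Diacritizer) -> List[List[str]]:
    """Keep a contextual diacritic only where it differs from the
    context-free prediction for the same character."""
    with_context = model(words)  # pass 1: each word sees the whole sentence
    partial = []
    for word, ctx_marks in zip(words, with_context):
        solo_marks = model([word])[0]  # pass 2: the word in isolation
        partial.append(
            [c if c != s else "" for c, s in zip(ctx_marks, solo_marks)]
        )
    return partial


def toy_model(words: List[str]) -> List[List[str]]:
    """Stand-in diacritizer for illustration only: marks the last character
    of each word with "a", or with "u" when the previous word is "the"."""
    out = []
    for i, w in enumerate(words):
        marks = [""] * len(w)
        if w:
            marks[-1] = "u" if i > 0 and words[i - 1] == "the" else "a"
        out.append(marks)
    return out


result = ccpd(["the", "cat", "sat"], toy_model)
```

In this toy run only the last character of "cat" keeps its mark, because that is the only position where the contextual prediction departs from the isolated one. Any existing diacritizer that can be called on a full sentence and on a single word could in principle be plugged in as `model`, which is consistent with the abstract's claim that the approach builds on existing systems.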
