Automatic Restoration of Diacritics for Speech Data Sets (2311.10771v2)
Abstract: Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.
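The abstract evaluates systems by diacritic error rate without defining it here. As a rough illustration only, the sketch below computes one common variant: the fraction of non-space base characters whose attached diacritics differ between a reference and a hypothesis. The function names, the choice of the Unicode range U+064B to U+0652 as the diacritic set, and the requirement that both strings share the same undiacritized text are assumptions made for this example; the paper's exact scoring protocol may differ.

```python
# Arabic diacritic marks (tanween, short vowels, shadda, sukun): U+064B..U+0652.
# Assumption: this set is what counts as a "diacritic" for scoring.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Remove all diacritic marks, leaving only base characters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def split_diacritics(text):
    """Return a list of (base_char, marks) pairs, marks being a string."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
            # a mark with no preceding base character is ignored
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference, hypothesis):
    """Fraction of non-space base characters with mismatched diacritics."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(hypothesis)
    if [base for base, _ in ref] != [base for base, _ in hyp]:
        raise ValueError("reference and hypothesis differ in base characters")
    # Compare marks as sorted strings so that mark order does not matter.
    scored = [
        (sorted(ref_marks), sorted(hyp_marks))
        for (base, ref_marks), (_, hyp_marks) in zip(ref, hyp)
        if not base.isspace()
    ]
    if not scored:
        return 0.0
    errors = sum(1 for ref_marks, hyp_marks in scored if ref_marks != hyp_marks)
    return errors / len(scored)


# Three-letter word with fatha on every letter vs. kasra on the second letter.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
hypothesis = "\u0643\u064E\u062A\u0650\u0628\u064E"
der = diacritic_error_rate(reference, hypothesis)
```

In this toy pair one of the three letters carries a different mark, so `der` is one third. A speech-based system such as the one described would be scored the same way once its output is aligned to the reference text; that alignment step is not shown here.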