Normalization of Lithuanian Text Using Regular Expressions
Abstract: Text normalization is an integral part of any text-to-speech synthesis system. Natural language text contains elements such as numbers, dates, and abbreviations that belong to other semiotic classes. These elements are called non-standard words (NSWs) and must be expanded into ordinary words. To do this, the semiotic class of each NSW must first be identified. This work presents a taxonomy of semiotic classes adapted to the Lithuanian language. Sets of rules based on regular expressions are created for detecting and expanding NSWs. Experiments were performed on three completely different data sets, and the accuracy was assessed. The causes of errors are explained, and recommendations are given for developing text normalization rules.
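To illustrate the general idea of detecting and expanding NSWs with regular expressions, here is a minimal Python sketch. It is not the rule set developed in the paper: the class names, the patterns, and the one-entry abbreviation table are assumptions chosen only for illustration, and the sketch ignores the inflection that a real Lithuanian expansion would need.

```python
import re

# Toy expansion table (assumed for illustration only).
ABBREVIATIONS = {"pvz.": "pavyzdžiui"}

# Hypothetical rules: (semiotic class, compiled pattern).
# Order matters: more specific patterns are tried first.
RULES = [
    ("DATE", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("TIME", re.compile(r"\d{1,2}:\d{2}")),
    ("NUMBER", re.compile(r"\d+(?:,\d+)?")),  # decimal comma
    ("ABBREVIATION", re.compile(
        "|".join(re.escape(key) for key in ABBREVIATIONS), re.IGNORECASE)),
]


def classify(token):
    """Return the first semiotic class whose pattern matches the whole token."""
    for label, pattern in RULES:
        if pattern.fullmatch(token):
            return label
    return "WORD"


def detect(text):
    """Split on whitespace and label every token with a semiotic class."""
    return [(token, classify(token)) for token in text.split()]


def expand_abbreviations(text):
    """Replace known abbreviations with their full form; leave the rest as is."""
    expanded = []
    for token, label in detect(text):
        if label == "ABBREVIATION":
            expanded.append(ABBREVIATIONS[token.lower()])
        else:
            expanded.append(token)
    return " ".join(expanded)
```

Calling `detect("pvz. 2022-10-04 15:30")` labels the three tokens as an abbreviation, a date, and a time, and `expand_abbreviations` rewrites only the abbreviation. Tokens with attached punctuation (for example a date followed by a comma) fall through to `WORD` in this sketch, which is one reason practical rule sets need more careful tokenization and context checks.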