
Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines (2404.06101v1)

Published 9 Apr 2024 in cs.CL

Abstract: Kurdish libraries hold many historical publications that were printed in the early days of printing in Kurdistan. A good Optical Character Recognition (OCR) system would help process these publications and contribute to Kurdish language resources, which is crucial because Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents because the documents are often damaged, very fragile, covered in marks, and printed in non-standard fonts, among other issues. This is a major obstacle, since processing these documents currently requires manual typing, which is very time-consuming. In this study, we adopt Tesseract version 5.0, an open-source OCR framework by Google that has been used to extract text for various languages. No public dataset currently exists, so we developed our own by collecting historical documents from the Zheen Center for Documentation and Research, all printed before 1950; the result is a dataset of 1233 line images, each with its transcription. We then used the Arabic model as our base model and trained it on this dataset. We evaluated the model with different methods: Tesseract's built-in evaluator, lstmeval, indicated a Character Error Rate (CER) of 0.755%, and Ocreval reported an average character accuracy of 84.02%. Finally, we developed a web application that provides an easy-to-use interface for end-users, allowing them to input an image of a page and extract its text. An extensive dataset is crucial for developing OCR systems with reasonable accuracy, and the lack of public datasets for historical Kurdish documents posed a significant challenge in our work. The unaligned spaces between characters and words proved to be another challenge.
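For readers unfamiliar with the metric, the Character Error Rate mentioned in the abstract is commonly defined as the character-level edit distance between the OCR output and the ground-truth transcription, divided by the length of the ground truth. The sketch below illustrates that common definition only; it is not the implementation used by lstmeval or Ocreval, whose exact counting rules may differ, and the function names `levenshtein` and `character_error_rate` are hypothetical names chosen for this example.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # skip a reference character
                curr[j - 1] + 1,     # skip a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER as edit distance divided by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a four-character ground-truth line with one wrongly recognized character has a CER of 0.25, and character accuracy is often reported as one minus the CER. Because evaluation tools can normalize whitespace and count errors differently, figures from different tools are not always directly comparable.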

Authors (2)
  1. Blnd Yaseen (1 paper)
  2. Hossein Hassani (26 papers)
