A Novel Pipeline for Improving Optical Character Recognition through Post-processing Using Natural Language Processing (2307.04245v1)
Abstract: Optical Character Recognition (OCR) is used to digitize books and unstructured documents, and also in domains such as mobility statistics, law enforcement, traffic, and security systems. State-of-the-art OCR methods work well on printed text such as license plates and shop names. However, existing techniques have limited accuracy on printed textbooks and handwritten text, which may be attributed to similar-looking characters and to variations in handwritten characters. Since these issues are challenging to address with OCR technology alone, we propose a post-processing approach using NLP tools. This work presents an end-to-end pipeline that first performs OCR on handwritten or printed text and then improves the accuracy of the recognized text using NLP.
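To make the idea of NLP post-processing concrete, here is a minimal sketch of one possible correction step. It is not the authors' pipeline: it assumes the OCR stage has already produced a plain string, and it uses a simple frequency-based, single-edit spelling corrector. The vocabulary `WORD_COUNTS` and the function names are hypothetical and chosen only for illustration; a real system would draw word frequencies from a large corpus.

```python
from collections import Counter

# Hypothetical toy vocabulary with made-up counts (assumption for illustration).
WORD_COUNTS = Counter({
    "the": 50, "quick": 5, "brown": 5, "fox": 3,
    "jumps": 2, "over": 8, "lazy": 2, "dog": 4,
})

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word):
    """Keep known words; otherwise pick the most frequent one-edit neighbour."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing plausible found, leave the OCR output untouched
    return max(candidates, key=WORD_COUNTS.__getitem__)


def postprocess(ocr_text):
    """Lowercase the OCR output and correct it word by word."""
    return " ".join(correct_word(w) for w in ocr_text.lower().split())


# Simulated OCR output with look-alike confusions (b/h, 1/i, 0/o).
print(postprocess("Tbe qu1ck brown f0x"))
```

A dictionary-based corrector like this handles only isolated single-character confusions and ignores context; it is shown here only because it is small enough to read at a glance.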
Authors: Aishik Rakshit, Samyak Mehta, Anirban Dasgupta