Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts (2304.03427v2)
Abstract: Scholars in the humanities rely heavily on ancient manuscripts to study past history, religion, and socio-political structures. Many efforts have been devoted to digitizing these precious manuscripts with Optical Character Recognition (OCR) technology, but most manuscripts have been blemished over the centuries, so an OCR program cannot be expected to read faded graphs or pages marred by stains. This work presents a neural spelling correction model, built on Google OCR-ed Tibetan manuscripts, that auto-corrects noisy OCR output. The paper is divided into four sections: dataset, model architecture, training, and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we integrated a Confidence Score mechanism into the Transformer architecture to perform the spelling correction task. Measured by loss and Character Error Rate (CER), our Transformer + Confidence Score architecture proves superior to the plain Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the robustness of the model, we analyzed erroneous tokens and visualized the Attention and Self-Attention heatmaps.
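
The abstract does not spell out how the Confidence Score mechanism enters the Transformer. As a rough, purely illustrative sketch (every class and variable name below is hypothetical, not taken from the paper), one common way to use per-token OCR confidence is to project the scalar confidence into the model dimension and add it to the token embeddings, so that downstream attention can learn to distrust low-confidence characters:

```python
# Hypothetical sketch only: one plausible reading of a "Confidence Score
# mechanism", NOT the paper's confirmed design. All names are illustrative.
import torch
import torch.nn as nn

class ConfidenceAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        # Maps a scalar confidence in [0, 1] to a d_model-sized bias,
        # letting later attention layers down-weight unreliable OCR tokens.
        self.conf = nn.Linear(1, d_model)

    def forward(self, token_ids: torch.Tensor, confidences: torch.Tensor):
        # token_ids:   (batch, seq_len) int64 token indices
        # confidences: (batch, seq_len) floats in [0, 1], e.g. from the OCR engine
        return self.tok(token_ids) + self.conf(confidences.unsqueeze(-1))

emb = ConfidenceAwareEmbedding(vocab_size=8000, d_model=512)
ids = torch.randint(0, 8000, (2, 16))
conf = torch.rand(2, 16)
x = emb(ids, conf)  # (2, 16, 512), ready to feed an nn.TransformerEncoder
```

Other readings are possible (for instance, scaling attention logits by confidence); the paper's actual mechanism should be taken from the full text.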
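
Character Error Rate, one of the two reported metrics, has a standard definition: the Levenshtein edit distance from the model's output to the reference, divided by the reference length. A minimal self-contained implementation for context (not the paper's evaluation script):

```python
# Character Error Rate (CER): Levenshtein edit distance between hypothesis
# and reference, normalized by reference length. Standard definition; the
# paper's own evaluation code is not reproduced here.
def cer(hypothesis: str, reference: str) -> float:
    m, n = len(hypothesis), len(reference)
    # prev[j] holds the edit distance between hypothesis[:i-1] and reference[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (hypothesis[i - 1] != reference[j - 1])
            cur[j] = min(prev[j] + 1,     # deletion
                         cur[j - 1] + 1,  # insertion
                         sub)             # substitution (or match)
        prev = cur
    return prev[n] / max(n, 1)

# A near-correct OCR output differing by one trailing tsheg yields a low CER.
print(cer("བཀྲ་ཤིས", "བཀྲ་ཤིས་"))
```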