Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction (2110.01661v5)
Abstract: Each iteration of a new and improved OCR solution forces a decision about which candidates to target for reprocessing. This applies especially when the underlying data collection is large and diverse in fonts, languages, and periods of publication, and consequently in OCR quality. This article describes the efforts of the National Library of Luxembourg to support those targeting decisions, which are crucial for keeping computational overhead low and reducing the risk of quality degradation while making OCR improvement more quantifiable. In particular, this work explains the library's methodology for quality assessment at the text block level. As an extension of this technique, it also presents a regression model that takes into account the enhancement potential of a new OCR engine. Both are promising approaches, especially for cultural institutions dealing with historical data of lower quality.
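To make the idea of targeting reprocessing candidates at the text block level concrete, here is a minimal sketch in Python. It is not the library's method: the abstract does not name the features or model used, so the dictionary-match ratio, the function names `dictionary_ratio` and `select_for_reprocessing`, and the `threshold` value are all illustrative assumptions.

```python
import re


def dictionary_ratio(text, vocabulary):
    """Hypothetical quality proxy: share of alphabetic tokens found in a word list."""
    # Split on anything that is not a letter, so "qu1ck" yields "qu" and "ck".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


def select_for_reprocessing(blocks, vocabulary, threshold=0.8):
    """Return (block_id, score) pairs scoring below the assumed threshold, worst first."""
    scored = [
        (block_id, dictionary_ratio(text, vocabulary))
        for block_id, text in blocks.items()
    ]
    candidates = [(block_id, score) for block_id, score in scored if score < threshold]
    return sorted(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    vocabulary = {"the", "quick", "brown", "fox"}
    blocks = {
        "block-a": "The quick brown fox",
        "block-b": "Tbe qu1ck brovvn fox",
    }
    for block_id, score in select_for_reprocessing(blocks, vocabulary):
        print(block_id, round(score, 2))
```

A score like this only estimates current quality. The regression model mentioned in the abstract goes further by predicting how much a new OCR engine could improve a block, which this sketch does not attempt.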