
EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge (2310.10050v1)

Published 16 Oct 2023 in cs.CV, cs.CL, econ.GN, and q-fin.EC

Abstract: Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into LLM training. Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets. Existing OCR engines, largely designed for small-scale commercial applications in high-resource languages, often fall short of these requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets both the computational and sample efficiency requirements for liberating texts at scale by abandoning the sequence-to-sequence architecture typically used for OCR, which feeds representations from a learned vision model into a learned language model. Instead, EffOCR models OCR as a character- or word-level image retrieval problem. EffOCR is cheap and sample-efficient to train, as the model only needs to learn characters' visual appearance, not how they are used in sequence to form language. Models in the EffOCR model zoo can be deployed off-the-shelf with only a few lines of code. Importantly, EffOCR's sample efficiency also enables easy customization, via a simple model training interface and minimal labeling requirements. We illustrate the utility of EffOCR by cheaply and accurately digitizing 20 million historical U.S. newspaper scans, evaluating zero-shot performance on randomly selected documents from the U.S. National Archives, and accurately digitizing Japanese documents on which all other OCR solutions failed.
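The core idea of OCR as image retrieval can be sketched in a few lines. This is a minimal illustration, not the EffOCR API: `embed` stands in for a learned vision encoder, and recognition is nearest-neighbor lookup over one labeled reference embedding per character, which is what makes the approach sample-efficient.

```python
# Illustrative sketch of character-level OCR as image retrieval.
# embed() is a stand-in for a learned vision encoder; a real system
# would use trained embeddings and an approximate nearest-neighbor index.
import numpy as np

def embed(crop: np.ndarray) -> np.ndarray:
    """Stand-in encoder: flatten and L2-normalize a glyph crop."""
    v = crop.astype(np.float64).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

class RetrievalOCR:
    def __init__(self):
        self.labels = []   # character label per reference
        self.index = []    # reference embedding per character

    def add_reference(self, char: str, crop: np.ndarray) -> None:
        """Register a labeled example crop for one character."""
        self.labels.append(char)
        self.index.append(embed(crop))

    def recognize(self, crop: np.ndarray) -> str:
        """Return the label of the most similar reference embedding."""
        q = embed(crop)
        sims = np.array([q @ r for r in self.index])  # cosine similarity
        return self.labels[int(np.argmax(sims))]

# Toy 3x3 "glyphs": a vertical bar and a horizontal bar.
bar_v = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
bar_h = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=float)

ocr = RetrievalOCR()
ocr.add_reference("|", bar_v)
ocr.add_reference("-", bar_h)

# A noisy vertical bar still retrieves the "|" reference.
noisy = bar_v.copy()
noisy[1, 2] = 0.3
print(ocr.recognize(noisy))  # prints "|"
```

Because the model only matches visual appearance, adapting to a new script or font amounts to adding a handful of labeled reference crops, rather than retraining a sequence model on large amounts of transcribed text.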

Authors (4)
  1. Tom Bryan (4 papers)
  2. Jacob Carlson (6 papers)
  3. Abhishek Arora (12 papers)
  4. Melissa Dell (17 papers)
Citations (7)