OCR-IDL: OCR Annotations for Industry Document Library Dataset (2202.12985v1)

Published 25 Feb 2022 in cs.CV and cs.AI

Abstract: Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain models that are only later finetuned on downstream tasks. One of the problems of pretraining approaches is the inconsistent usage of pretraining data with different OCR engines, leading to incomparable results between models. In other words, it is not obvious whether a performance gain comes from differences in the amount of data and the OCR engines used or from the proposed models. To remedy the problem, we make public the OCR annotations for IDL documents using a commercial OCR engine, given its superior performance over open-source OCR models. The contributed dataset (OCR-IDL) has an estimated monetary value of over 20K US$. It is our hope that OCR-IDL can be a starting point for future works on Document Intelligence. All of our data and its collection process with the annotations can be found at https://github.com/furkanbiten/idl_data.

Authors

Ali Furkan Biten
Rubèn Tito
Lluis Gomez
Ernest Valveny
Dimosthenis Karatzas

Summary

OCR-IDL: Enhancing Document Intelligence with a Large-Scale OCR Annotated Dataset

The paper "OCR-IDL: OCR Annotations for Industry Document Library Dataset" by Biten et al. introduces a large-scale dataset of OCR annotations for industry documents. The dataset, termed OCR-IDL, aims to standardize the material used for pretraining Document Intelligence models and so reduce the inconsistencies in reported results that arise from differing data sources and OCR engines. This summary reviews the dataset and its implications for the field.

Introduction

The analysis of complex, varied documents matters in many sectors, including law, intelligence, knowledge management, and historical research. Traditional document processing methods, which often require manual customization, are time-consuming and costly. This challenge has driven the development of Document Intelligence, a multidisciplinary field that seeks to automate the analysis and understanding of documents with models that combine Optical Character Recognition (OCR), document structure analysis, and natural language processing (NLP).

Dataset Overview and Motivation

OCR-IDL comprises OCR annotations for 26 million pages drawn from the Industry Document Library (IDL) provided by the University of California, San Francisco (UCSF). The annotations were generated with Amazon Textract, a commercial OCR engine that the authors chose for its superior performance over open-source alternatives. The authors argue that the use of different OCR engines and different amounts of data across studies makes it hard to compare model architectures fairly. By providing one common set of annotations, OCR-IDL allows fairer comparisons and a clearer view of what new architectures and pretraining strategies actually contribute.
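
To make the annotation format more concrete, the sketch below reads a Textract-style response. It assumes the standard layout returned by Amazon Textract: a `Blocks` list whose entries carry `BlockType`, `Text`, `Confidence`, and a `Geometry.BoundingBox` expressed as ratios of the page size. Whether the files released with OCR-IDL follow exactly this layout is an assumption, and the helper name `extract_words`, the page dimensions, and the sample record are hypothetical.

```python
from typing import Dict, List


def extract_words(response: Dict, page_width: int, page_height: int) -> List[Dict]:
    """Collect WORD blocks and convert normalized boxes to pixel coordinates."""
    words = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue  # skip PAGE and LINE blocks
        box = block["Geometry"]["BoundingBox"]  # ratios in [0, 1]
        left = box["Left"] * page_width
        top = box["Top"] * page_height
        words.append({
            "text": block["Text"],
            "confidence": block["Confidence"],
            # (x0, y0, x1, y1) in pixels
            "bbox": (
                round(left),
                round(top),
                round(left + box["Width"] * page_width),
                round(top + box["Height"] * page_height),
            ),
        })
    return words


def make_block(block_type: str, text: str, left: float, top: float,
               width: float, height: float) -> Dict:
    """Build a hypothetical Textract-style block for the example below."""
    return {
        "BlockType": block_type,
        "Text": text,
        "Confidence": 99.0,
        "Geometry": {"BoundingBox": {"Left": left, "Top": top,
                                     "Width": width, "Height": height}},
    }


# Made-up sample: one line containing two words (not real OCR-IDL data).
sample = {
    "Blocks": [
        make_block("LINE", "Annual Report", 0.25, 0.5, 0.5, 0.125),
        make_block("WORD", "Annual", 0.25, 0.5, 0.25, 0.125),
        make_block("WORD", "Report", 0.5, 0.5, 0.25, 0.125),
    ]
}

words = extract_words(sample, page_width=1000, page_height=2000)
for word in words:
    print(word["text"], word["bbox"])
```

Pixel boxes of this kind are the usual input to layout-aware models, which is why the conversion from normalized coordinates is shown explicitly.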

Comparison to Existing Datasets

The paper compares OCR-IDL with other prominent Document Intelligence datasets, such as IIT-CDIP, RVL-CDIP, PubLayNet, DocBank, and DocVQA. OCR-IDL is among the largest OCR-annotated datasets currently available, and its use of a commercial OCR engine is meant to reduce the noise that OCR errors introduce into pretraining and downstream tasks. While OCR-IDL uses industry documents similar to those in IIT-CDIP and RVL-CDIP, it also includes additional types of documents from various industries, which makes it more diverse in both content and layout.

Dataset Characteristics and Statistics

OCR-IDL includes documents that span over a century, providing a rich data source that captures a wide range of visual artifacts and printing technologies. The dataset is text-rich, containing 166 million words and 46 million lines. The pages are diverse, featuring a variety of layouts, including figures, tables, lists, and text blocks.

To quantify the diversity of the document layouts, the authors applied Faster-RCNN, trained on the PubLayNet dataset, to segment the documents into layout components. The results show a wide variety of layouts, with 40% of pages containing at least one figure and a high proportion containing multiple text blocks and titles.
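
The layout statistic above can be computed along the following lines once a detector has produced, for each page, a list of (label, score) pairs. This is a minimal sketch rather than the authors' code: the labels follow the five PubLayNet categories (text, title, list, table, figure), while the 0.5 score threshold, the function name `layout_statistics`, and the toy detections are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Detection = Tuple[str, float]  # (label, confidence score)


def layout_statistics(pages: List[List[Detection]],
                      min_score: float = 0.5) -> Dict[str, float]:
    """Return, per label, the fraction of pages with at least one confident detection."""
    total = len(pages)
    if total == 0:
        return {}
    pages_with_label = Counter()
    for detections in pages:
        # A set ensures each label is counted at most once per page.
        labels = {label for label, score in detections if score >= min_score}
        pages_with_label.update(labels)
    return {label: count / total for label, count in pages_with_label.items()}


# Toy detections for four pages (hypothetical values, not from the paper).
toy_pages = [
    [("text", 0.98), ("figure", 0.91)],
    [("text", 0.95), ("title", 0.88), ("figure", 0.30)],  # figure below threshold
    [("table", 0.77)],
    [("text", 0.99), ("text", 0.97), ("list", 0.81)],
]

stats = layout_statistics(toy_pages)
for label in sorted(stats):
    print(f"{label}: {stats[label]:.0%} of pages")
```

Counting pages rather than individual detections matches the way the statistic is phrased: a page with three figures still counts once toward the share of pages containing a figure.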

Implications for Document Intelligence Research

OCR-IDL is a valuable resource for the Document Intelligence field. Pretraining on it can yield more robust models, because annotations produced by a single high-quality OCR engine are both less noisy and consistent across studies. With OCR-IDL, researchers can more accurately evaluate the impact of their proposed architectures, pretraining strategies, and other innovations without the confounding factor of varying data quality.

This dataset also enables studies to better isolate the effects of OCR performance, dataset size, and pretraining loss functions on downstream tasks. Future research can leverage OCR-IDL to investigate these variables systematically, leading to deeper insights and potential advancements in document processing techniques.

Conclusion

The OCR-IDL dataset is a significant contribution to the Document Intelligence field, providing high-quality OCR annotations for a large and diverse set of industry documents. This standardization enables fairer comparisons between models and clearer conclusions about what drives their performance. The dataset, together with a description of its collection and annotation process, is publicly accessible, which promotes transparency and collaboration within the research community.

In summary, OCR-IDL offers a common foundation for pretraining in Document Intelligence. By removing differences in data and OCR quality as variables, it could support the development of more capable models for automated document processing.

GitHub

GitHub - furkanbiten/idl_data: OCR Annotations from Amazon Textract for Industry Documents Library (102 stars)
