Overview of the OCR-free Document Understanding Transformer
The paper introduces Donut (Document Understanding Transformer), an OCR-free model for Visual Document Understanding (VDU). Donut addresses the principal drawbacks of conventional OCR-based pipelines: the high computational cost of running an OCR engine, limited flexibility across languages and document types, and the propagation of OCR errors into downstream understanding tasks.
Key Contributions
The authors make several noteworthy contributions with this work:
- Introduction of an OCR-free VDU Model: Donut bypasses the conventional reliance on OCR engines by employing a transformer-based architecture that maps raw input images directly to structured outputs without intermediate text recognition stages.
- Synthetic Data Utilization: A synthetic document generator, SynthDoG, is released to support pre-training across languages and domains where real annotated documents are scarce, demonstrating the model’s adaptability (a simplified rendering sketch follows this list).
- State-of-the-Art Performance: Donut achieves superior performance in both speed and accuracy across multiple VDU tasks, evidenced by experiments on public benchmarks and private datasets, establishing its efficacy in real-world applications.
- Code and Resources Availability: The model, training code, and synthetic data are released as open source, supporting the reproducibility and extensibility of the research findings.
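To make the synthetic-data idea concrete, below is a minimal sketch of rendering corpus text onto a blank page with Pillow. This is not the actual SynthDoG pipeline, which additionally randomizes backgrounds, paper textures, fonts, and layouts; the function name and parameters here are illustrative assumptions.

```python
# Minimal illustration of synthetic document generation: paste corpus text
# onto a plain background with PIL. The real SynthDoG pipeline also varies
# backgrounds, textures, fonts, and layouts; this only shows the core idea
# of pairing a rendered image with its ground-truth text.
from PIL import Image, ImageDraw, ImageFont

def make_synthetic_doc(lines, out_path="synthetic_doc.png",
                       size=(960, 1280), font_path=None, font_size=28):
    """Render text lines onto a blank page and return the ground-truth text."""
    page = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(font_path, font_size) if font_path else ImageFont.load_default()
    y = 40
    for line in lines:
        draw.text((40, y), line, fill="black", font=font)
        y += font_size + 12
    page.save(out_path)
    return "\n".join(lines)  # ground truth for the text-reading pre-training task

# Example usage: each generated image is paired with its text as a training sample.
gt = make_synthetic_doc(["Invoice #1234", "Total: 42.00 EUR", "Thank you!"])
```

Each image/text pair provides exactly the supervision the text-reading pre-training objective requires, without any real annotated documents.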
Methodology
The Donut model operates using a Transformer-based visual encoder-decoder architecture. This approach stands out due to its simplicity and effectiveness:
- Encoder: A Swin Transformer converts the document image into patch-level embeddings, providing robust visual feature extraction.
- Decoder: A BART-style decoder, initialized from a pre-trained multilingual BART, autoregressively generates a token sequence that can be converted one-to-one into a structured output such as JSON, representing the desired extraction result (a minimal inference sketch follows this list).
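For orientation, here is a minimal inference sketch using the Hugging Face Transformers port of Donut. The checkpoint identifier and task prompt follow the publicly released naver-clova-ix models; treat them as assumptions to verify against the model hub rather than as part of the paper itself.

```python
# Minimal inference sketch with the Hugging Face port of Donut.
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"  # assumed public checkpoint
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
model.eval()

image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values  # Swin encoder input

# The decoder is prompted with a task-specific start token and generates the
# structured output token sequence autoregressively.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt token
print(processor.token2json(sequence))  # token sequence -> structured JSON
```

Note that the only input is the image tensor: no OCR text or bounding boxes are provided to the model at any point.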
Donut follows a pre-train-then-fine-tune paradigm: it is first pre-trained on a text-reading (pseudo-OCR) objective, learning to predict all the text in an image in reading order, and is then fine-tuned per downstream task so that the same architecture supports document classification, information extraction, and visual question answering.
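During fine-tuning, each task is cast as generating a token sequence that maps one-to-one to the desired structured answer. The sketch below shows one plausible linearization; the `<s_field>...</s_field>` and `<sep/>` conventions follow the paper's description, but the helper itself is an illustrative assumption, not the authors' code.

```python
# Sketch of linearizing a structured target (JSON-like dict) into the
# field-delimited token sequence the decoder is trained to emit.
# The <s_field>...</s_field> / <sep/> naming follows the paper's described
# convention; treat the details as a simplified assumption.
def json_to_token_sequence(obj):
    """Recursively wrap each field in opening/closing special tokens."""
    if isinstance(obj, dict):
        return "".join(f"<s_{k}>{json_to_token_sequence(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        return "<sep/>".join(json_to_token_sequence(item) for item in obj)
    return str(obj)

# Example: a receipt-style record becomes one flat target sequence.
record = {"menu": [{"nm": "Latte", "price": "4.50"}], "total": {"total_price": "4.50"}}
print(json_to_token_sequence(record))
# <s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu><s_total><s_total_price>4.50</s_total_price></s_total>
```

The inverse mapping turns a generated sequence back into JSON, which is how the model's raw output becomes the final structured prediction.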
Experimental Results
The paper reports impressive results across several datasets:
- Document Classification: Achieves leading accuracy on the RVL-CDIP dataset, surpassing OCR-dependent models such as LayoutLMv2, while using a comparable or smaller parameter budget and running faster because no separate OCR step is required.
- Information Extraction: Demonstrates superior performance on datasets such as CORD and Ticket, under both field-level F1 and a tree-structure-aware accuracy metric, showing that it can parse nested document structures accurately (a simplified sketch of the field-level metric follows this list).
- Document Visual Question Answering: Competes effectively with state-of-the-art models on the DocVQA dataset, excelling particularly in cases involving handwritten documents, which are traditionally more challenging for OCR-dependent pipelines.
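As a rough illustration of the field-level F1 mentioned above, the sketch below counts a predicted field as correct only when its name and value exactly match a ground-truth pair. The paper's precise matching rules may differ, so treat this as a simplified assumption.

```python
# Simplified field-level F1: a predicted (field, value) pair counts as a
# true positive only if an identical pair exists in the ground truth.
from collections import Counter

def field_level_f1(pred_pairs, gold_pairs):
    """pred_pairs / gold_pairs: lists of (field_name, value) tuples."""
    pred_counts, gold_counts = Counter(pred_pairs), Counter(gold_pairs)
    tp = sum(min(pred_counts[p], gold_counts[p]) for p in pred_counts)
    precision = tp / max(sum(pred_counts.values()), 1)
    recall = tp / max(sum(gold_counts.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one of two predicted fields matches the gold annotation -> F1 = 0.5.
print(field_level_f1([("total", "4.50"), ("nm", "Late")],
                     [("total", "4.50"), ("nm", "Latte")]))
```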
Implications and Future Directions
This work has practical implications for how document-understanding systems are built. By removing the dependency on external OCR engines, Donut offers a simpler, more adaptable approach that accommodates multilingual contexts and varied document domains without the need to maintain or tune a separate OCR component.
Future research directions could explore extending pre-training objectives beyond text reading to incorporate combined vision-language tasks, enhancing the model's holistic document comprehension capabilities. Furthermore, integrating efficient attention mechanisms could mitigate computational costs, allowing for scaling to even larger input resolutions without compromising speed.
In conclusion, the OCR-free Document Understanding Transformer sets a promising path forward in the VDU domain, demonstrating that robust document parsing can be achieved with a streamlined, end-to-end approach that obviates traditional bottlenecks associated with OCR dependencies.