TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models (2109.10282v5)

Published 21 Sep 2021 in cs.CL and cs.CV

Abstract: Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at \url{https://aka.ms/trocr}.

Authors (9)
  1. Minghao Li (44 papers)
  2. Tengchao Lv (17 papers)
  3. Jingye Chen (16 papers)
  4. Lei Cui (43 papers)
  5. Yijuan Lu (11 papers)
  6. Dinei Florencio (17 papers)
  7. Cha Zhang (23 papers)
  8. Zhoujun Li (122 papers)
  9. Furu Wei (291 papers)
Citations (277)

Summary

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

The paper "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" introduces an approach to text recognition built entirely on the Transformer architecture, capitalizing on pre-trained models from both the computer vision (CV) and natural language processing (NLP) domains. The authors propose an end-to-end optical character recognition (OCR) framework that removes the dependency on the convolutional neural networks (CNNs) and recurrent neural networks (RNNs) traditionally used for image understanding and character-level text generation.

Core Contributions

  1. Model Architecture: TrOCR employs the Transformer architecture for both encoding visual features and decoding text sequences. The encoder is initialized from pre-trained vision Transformers such as DeiT and BEiT, which operate on image patches, and the decoder from pre-trained text Transformers such as RoBERTa, which generate wordpiece-level output. This diverges from standard OCR models, which typically pair a CNN feature extractor with an RNN decoder (a composition sketch follows this list).
  2. Pre-training Strategy: The paper describes a two-stage pre-training strategy. The model is first pre-trained on a large corpus of synthetic printed text images, then further pre-trained on task-specific synthetic data, including handwritten and scene text, before fine-tuning on human-labeled sets. Because synthetic data is labeled by construction, this strategy reduces the reliance on scarce human annotations across different OCR tasks (a fine-tuning sketch also follows this list).
  3. Performance and Evaluation: Experimental results are presented across multiple benchmark datasets, including SROIE, the IAM Handwriting Database, and several scene text datasets. The TrOCR models achieve state-of-the-art performance in recognizing printed, handwritten, and scene text without relying on complex processing modules or external language models.
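
To make the first item concrete, the sketch below assembles a TrOCR-style encoder-decoder with the Hugging Face transformers library. It is a minimal illustration, not the authors' released training code: the BEiT and RoBERTa checkpoint names are stand-ins for the paper's encoder/decoder initializations.

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Warm-start an encoder-decoder in the spirit of TrOCR: a pre-trained vision
# Transformer (here BEiT) as the encoder and a pre-trained text Transformer
# (here RoBERTa) as the decoder, with cross-attention added between them.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/beit-base-patch16-384",  # image encoder over fixed-size patches
    "roberta-base",                     # autoregressive wordpiece decoder
)

# Wire up the special tokens the decoder needs for training and generation.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
```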
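
For the second item, fine-tuning reduces to ordinary sequence-to-sequence cross-entropy over wordpieces. A hedged single-step sketch, again assuming transformers; the stage-1 checkpoint name is taken from the public model release, and the blank stand-in image and toy transcription are placeholders for a real labeled example.

```python
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Checkpoint name assumed from the public release of the first-stage model;
# any TrOCR variant with a matching processor would work the same way.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id

# Stand-ins for one labeled example: a text-line image and its transcription.
image = Image.new("RGB", (384, 384), "white")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
labels = processor.tokenizer("a sample transcription", return_tensors="pt").input_ids
labels[labels == processor.tokenizer.pad_token_id] = -100  # mask padding from the loss

# One optimization step: cross-entropy over the decoder's wordpiece outputs.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
optimizer.step()
```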

Evaluation and Results

The paper evaluates the TrOCR system over several datasets, reporting consistent gains over conventional methods. On the SROIE dataset, the TrOCR models outperformed existing state-of-the-art models by pairing the image Transformer for visual feature extraction with the text Transformer for context-aware language modeling, without any auxiliary language-model post-processing. Notably, the TrOCR models achieved a Character Error Rate (CER) of 2.89 on the IAM Handwriting Database, matching or surpassing methods that rely on additional human-labeled data or external language models. The results on scene text datasets further support the paper's claims, with state-of-the-art accuracy in several categories.
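
For readers unfamiliar with the metric, CER is the character-level edit (Levenshtein) distance between hypothesis and reference, divided by the reference length, usually reported as a percentage. A minimal reference implementation (not the paper's evaluation tooling):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances from "" to each hypothesis prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)

print(cer("minds", "minos"))  # 0.2, i.e. a CER of 20%
```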
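
Since the fine-tuned models are public, the reported numbers are straightforward to probe. A minimal inference sketch, assuming the transformers library and the released handwriting checkpoint; the image path is a placeholder:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# TrOCR expects a cropped single text line, not a full page (placeholder path).
image = Image.open("handwritten_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The decoder generates wordpieces autoregressively; decode them back to text.
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```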

Implications

The TrOCR model has notable implications for the OCR domain. By using the Transformer architecture in both the encoding and decoding stages, it simplifies the traditional OCR pipeline, removing the need for a separate language model and for CNN-based feature extraction. This simplification not only streamlines implementation but also suggests a unified framework that can be adapted to other language and text recognition tasks with minimal modification.

Future Work

Future developments based on this research could extend the architecture to multilingual OCR by integrating multilingual text Transformers into the decoder. Furthermore, given the promising results achieved with pre-trained Transformers, exploring self-supervised pre-training approaches specific to OCR could further improve performance and adaptability in more diverse and complex environments, such as scripts in low-resource languages or text in heavily distorted images.

In conclusion, deploying Transformers in both the visual and text domains within the TrOCR framework marks a significant shift in OCR model design, highlighting the potential for further innovation in text recognition and related fields. The findings could steer subsequent research toward more general and efficient recognition systems built on pre-trained, scalable models.
