PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System (2206.03001v2)

Published 7 Jun 2022 in cs.CV

Abstract: Optical character recognition (OCR) technology has been widely used in various scenes, as shown in Figure 1. Designing a practical OCR system is still a meaningful but challenging task. In previous work, considering the efficiency and accuracy, we proposed a practical ultra lightweight OCR system (PP-OCR), and an optimized version PP-OCRv2. In order to further improve the performance of PP-OCRv2, a more robust OCR system PP-OCRv3 is proposed in this paper. PP-OCRv3 upgrades the text detection model and text recognition model in 9 aspects based on PP-OCRv2. For text detector, we introduce a PAN module with large receptive field named LK-PAN, a FPN module with residual attention mechanism named RSE-FPN, and DML distillation strategy. For text recognizer, the base model is replaced from CRNN to SVTR, and we introduce lightweight text recognition network SVTR LCNet, guided training of CTC by attention, data augmentation strategy TextConAug, better pre-trained model by self-supervised TextRotNet, UDML, and UIM to accelerate the model and improve the effect. Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than PP-OCRv2 under comparable inference speed. All the above mentioned models are open-sourced and the code is available in the GitHub repository PaddleOCR which is powered by PaddlePaddle.

PP-OCRv3 (Li et al., 2022) is the third iteration in a series of practical, ultra-lightweight OCR systems developed by Baidu. Building upon the foundations of PP-OCR (Du et al., 2020) and PP-OCRv2 (Du et al., 2021), PP-OCRv3 introduces nine new strategies aimed at further improving accuracy while maintaining or enhancing computational efficiency, making it suitable for deployment in resource-constrained environments like mobile devices.

The system retains the common two-stage OCR pipeline: text detection followed by text recognition on the detected text line images. Image rectification is also included as an intermediate step. PP-OCRv3 focuses on optimizing both the detection and recognition components.
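
For orientation, the released models can be driven end to end through the `paddleocr` Python package. Below is a minimal usage sketch; the flag names follow the PaddleOCR 2.x API and the exact output format varies across package versions, so treat the details as illustrative rather than definitive.

```python
# Minimal end-to-end sketch using the paddleocr pip package
# (pip install paddleocr). Flags follow the PaddleOCR 2.x API;
# check the repository README for the options in your version.
from paddleocr import PaddleOCR

# ocr_version selects the PP-OCRv3 detector/recognizer; use_angle_cls
# enables the text line direction classifier between the two stages.
ocr = PaddleOCR(ocr_version="PP-OCRv3", use_angle_cls=True, lang="en")

result = ocr.ocr("example.jpg", cls=True)
for box, (text, confidence) in result[0]:
    print(box, text, confidence)
```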

For text detection, PP-OCRv3 continues to use the DB (Differentiable Binarization) [DB] algorithm framework, enhancing it through the CML distillation scheme inherited from PP-OCRv2 and three new strategies (LK-PAN, DML, and RSE-FPN):

  1. CML Distillation: The training process utilizes Collaborative Mutual Learning (CML) distillation, where two student models learn from a more powerful teacher model and also from each other. PP-OCRv3 optimizes both the teacher and student networks within this framework.
  2. LK-PAN (Large Kernel PAN): Integrated into the teacher model, this uses larger convolution kernels (specifically 9x9) in the Path Augmentation Network (PAN) to increase the receptive field. This helps in detecting large fonts and texts with extreme aspect ratios more effectively.
  3. DML (Deep Mutual Learning): Applied during the teacher model training, DML improves the teacher's performance by having two teacher models learn mutually.
  4. RSE-FPN (Residual Squeeze-and-Excitation FPN): Incorporated into the student model's Feature Pyramid Network (FPN), RSE-FPN replaces standard convolutions with RSEConv blocks, which combine Squeeze-and-Excitation (SE) attention with a residual structure. This strengthens feature representation while mitigating the tendency of plain SE blocks to suppress features in lightweight networks with few channels (a sketch follows this list).
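
The RSEConv block in item 4 can be pictured as a convolution whose SE channel attention is applied through a residual connection, so the un-gated features survive even when attention weights are poorly estimated in narrow channels. A minimal PaddlePaddle sketch follows; the layer names, kernel size, and reduction ratio are assumptions, not the repository's exact implementation.

```python
import paddle.nn as nn

class RSEConv(nn.Layer):
    """Sketch of a Residual Squeeze-and-Excitation convolution:
    conv -> SE channel attention -> residual add, so that weak
    attention weights cannot zero out features in few-channel
    lightweight FPNs."""
    def __init__(self, in_channels, out_channels, kernel_size=3, reduction=4):
        super().__init__()
        self.conv = nn.Conv2D(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        # Squeeze: global average pool; excite: bottleneck MLP -> sigmoid.
        self.pool = nn.AdaptiveAvgPool2D(1)
        self.fc = nn.Sequential(
            nn.Conv2D(out_channels, out_channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2D(out_channels // reduction, out_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        attn = self.fc(self.pool(y))  # per-channel weights in (0, 1)
        return y + y * attn           # residual keeps the un-gated path
```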

For text recognition, PP-OCRv3 upgrades the base network from the CNN-RNN-CTC structure of CRNN to a lightweight variant of the Transformer-based SVTR [Du2022SVTRST], via six optimization strategies:

  1. SVTR-LCNet: This is the core lightweight recognition network. It combines the Transformer-based SVTR-Tiny with a lightweight CNN, PP-LCNet [cui2021pplcnet]. Recognizing that the original SVTR-Tiny was slow on CPU, the authors replaced the initial layers with PP-LCNet stages and restructured the SVTR Global Mix Blocks (reducing their number and repositioning them after a pooling layer) to improve speed while retaining accuracy advantages over purely CNN-based models. The input image height was also increased from 32 to 48 pixels for better accuracy.
  2. GTC (Guided Training of CTC by Attention): An attention module guides the training of the CTC decoder; fusing the two supervision signals improves accuracy. Crucially, the attention module is removed during inference, adding no overhead to prediction speed (a loss sketch follows this list).
  3. TextConAug (Data Augmentation): Inspired by contrastive learning methods, this strategy augments training data by concatenating parts of different text line images, enriching contextual information and increasing data diversity (a sketch follows this list).
  4. TextRotNet: A self-supervised pre-trained model initialized using a large corpus of unlabeled text line data via a rotation prediction task (similar to STR-Fewer-Labels [baek2021STRfewerlabels]). This pre-training helps SVTR-LCNet converge better and improves accuracy.
  5. U-DML (Unified-Deep Mutual Learning): Adopted from PP-OCRv2, this distillation strategy is applied to the recognition model. For SVTR-LCNet (which internally combines CNN and Transformer components), supervision is applied simultaneously to features from different modules (PP-LCNet, SVTR, Attention output) via mutual learning, improving accuracy without increasing model size.
  6. UIM (Unlabeled Images Mining): A simple pseudo-labeling technique where a high-precision recognition model predicts labels for a large set of unlabeled images; high-confidence predictions are then added to the training data for the lightweight SVTR-LCNet model (a sketch follows this list).
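
At training time, the GTC strategy in item 2 amounts to a weighted sum of the CTC loss on the branch that is exported and an attention decoder loss on an auxiliary branch sharing the same features. A schematic sketch, assuming head modules with illustrative interfaces (the attention head is taken to return its own cross-entropy loss):

```python
import paddle
import paddle.nn.functional as F

def gtc_loss(shared_feats, ctc_head, attn_head, labels, label_lens,
             attn_weight=1.0):
    """Guided Training of CTC: the attention branch supervises the
    shared features during training and is dropped at inference,
    so prediction cost is unchanged."""
    logits = ctc_head(shared_feats)              # (T, N, num_classes)
    log_probs = F.log_softmax(logits, axis=-1)
    input_lens = paddle.full([logits.shape[1]], logits.shape[0],
                             dtype="int64")
    loss_ctc = F.ctc_loss(log_probs, labels, input_lens, label_lens)
    loss_attn = attn_head(shared_feats, labels)  # assumed interface
    return loss_ctc + attn_weight * loss_attn
```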
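
The TextConAug idea in item 3 can be approximated by concatenating two text line crops and their transcriptions. A minimal sketch, assuming simple pairwise horizontal concatenation rather than the paper's exact batch-level scheme:

```python
import cv2
import numpy as np

def text_con_aug(img_a, label_a, img_b, label_b, target_h=48):
    """Resize two text line crops to a common height, concatenate them
    horizontally, and join their transcriptions, enriching the context
    seen in a single training sample."""
    def to_height(img):
        h, w = img.shape[:2]
        return cv2.resize(img, (max(1, round(w * target_h / h)), target_h))
    return np.hstack([to_height(img_a), to_height(img_b)]), label_a + label_b
```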
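
UIM in item 6 reduces to confidence-thresholded pseudo-labeling; a minimal sketch, where the `predict` interface and the 0.95 threshold are assumptions:

```python
def mine_unlabeled(images, teacher_model, conf_threshold=0.95):
    """Sketch of Unlabeled Images Mining: label a pool of unlabeled
    images with a high-precision recognizer and keep only confident
    predictions as extra training data for the lightweight model."""
    pseudo_labeled = []
    for img in images:
        text, confidence = teacher_model.predict(img)  # assumed interface
        if confidence >= conf_threshold:
            pseudo_labeled.append((img, text))
    return pseudo_labeled
```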

Experiments are conducted on large custom datasets of real-scene and synthetic images for both detection and recognition. Evaluation metrics are Hmean for the detection model and the end-to-end system and sentence accuracy for the recognition model, with inference speed measured on both CPU and GPU.

Ablation studies demonstrate the effectiveness of each proposed strategy. For detection, LK-PAN and DML significantly boost the teacher model's Hmean, and integrating RSE-FPN into the student model improves its standalone performance. CML distillation with the optimized teacher further enhances the student model, reaching an 85.4% Hmean for the detection component. For recognition, SVTR-LCNet achieves accuracy close to the PP-OCRv2 recognizer (trained with U-DML) but is faster. Adding GTC, TextConAug, TextRotNet, U-DML, and UIM incrementally improves recognition accuracy from 74.0% to 79.4%.

The final PP-OCRv3 system, combining the improved detector and recognizer, achieves an Hmean of 62.9% on the end-to-end evaluation dataset, 5.3 percentage points higher than PP-OCRv2 (57.6%), with comparable CPU inference speed (331 ms vs. 330 ms) and a 22% speedup on a T4 GPU (87 ms vs. 111 ms). The total model size of PP-OCRv3 is 15.6M, slightly larger than PP-OCRv2's 11.6M but still ultra-lightweight.

PP-OCRv3 provides a state-of-the-art, ultra-lightweight OCR solution that significantly improves accuracy over its predecessor while maintaining high efficiency. All models and code are open-sourced within the PaddleOCR GitHub repository, facilitating practical implementation and deployment.

Authors (12)
  1. Chenxia Li
  2. Weiwei Liu
  3. Ruoyu Guo
  4. Xiaoting Yin
  5. Kaitao Jiang
  6. Yongkun Du
  7. Yuning Du
  8. Lingfeng Zhu
  9. Baohua Lai
  10. Xiaoguang Hu
  11. Dianhai Yu
  12. Yanjun Ma