
PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System (2109.03144v2)

Published 7 Sep 2021 in cs.CV

Abstract: Optical Character Recognition (OCR) systems have been widely used in various application scenarios. Designing an OCR system is still a challenging task. In previous work, we proposed a practical ultra lightweight OCR system (PP-OCR) to balance accuracy against efficiency. In order to improve the accuracy of PP-OCR while keeping high efficiency, in this paper we propose a more robust OCR system, i.e. PP-OCRv2. We introduce a bag of tricks to train a better text detector and a better text recognizer, including Collaborative Mutual Learning (CML), CopyPaste, Lightweight CPU Network (LCNet), Unified-Deep Mutual Learning (U-DML) and Enhanced CTCLoss. Experiments on real data show that the precision of PP-OCRv2 is 7% higher than PP-OCR under the same inference cost. It is also comparable to the PP-OCR server models, which use ResNet-series backbones. All of the above-mentioned models are open-sourced, and the code is available in the GitHub repository PaddleOCR, which is powered by PaddlePaddle.

PP-OCRv2 is a practical ultra lightweight OCR system that aims to improve the accuracy of the previous PP-OCR system while maintaining high efficiency, especially for deployment on CPU platforms. The paper introduces a collection of enhancement strategies ("bag of tricks") for both the text detection and text recognition components of the OCR pipeline.

The system follows the typical OCR pipeline: text detection, detected box rectification, and text recognition. PP-OCRv2 retains successful strategies from PP-OCR (such as using DB for detection and CRNN for recognition) and adds new ones to boost performance.
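The three-stage pipeline can be sketched as a simple function composition. The `detect`, `rectify`, and `recognize` functions below are hypothetical stand-ins for the DB detector and CRNN recognizer; PaddleOCR's actual API differs.

```python
# Minimal sketch of the PP-OCR pipeline stages described above.
# All three stage functions are illustrative stubs, not PaddleOCR code.

def detect(image):
    # A DB-style detector would return quadrilateral text boxes.
    return [((0, 0), (10, 0), (10, 4), (0, 4))]

def rectify(image, box):
    # Perspective-crop the detected quad into an axis-aligned patch.
    return ("patch", box)

def recognize(patch):
    # A CRNN recognizer with CTC decoding would map the patch to a string.
    return "text"

def ocr(image):
    # Detection -> box rectification -> recognition, per detected box.
    results = []
    for box in detect(image):
        patch = rectify(image, box)
        results.append((box, recognize(patch)))
    return results
```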

Key enhancement strategies introduced in PP-OCRv2 include:

For Text Detection:

  1. Collaborative Mutual Learning (CML): This method addresses limitations in traditional knowledge distillation where improvements are limited if the teacher and student accuracies are similar or their structures are very different. CML involves two student networks learning from each other (using Deep Mutual Learning) and simultaneously being guided by a teacher network.
    • Implementation: The teacher network typically uses a more robust backbone (such as ResNet18), while the student networks use a lightweight backbone (such as MobileNetV3-large at 0.5x scale). The teacher's parameters are frozen. The student models are optimized using a combined loss function:

      $Loss_{total} = Loss_{gt} + Loss_{dml} + Loss_{distill}$

      where $Loss_{gt}$ is the standard supervised loss (the DB loss in this case), $Loss_{dml}$ is the KL divergence between the outputs of the two student networks, and $Loss_{distill}$ is a loss that penalizes the difference between the student's output and the (potentially dilated) teacher's output.

    • Practical Impact: CML enables the student model to potentially surpass the teacher's accuracy, leading to a more robust lightweight detector.

  2. CopyPaste: A data augmentation technique borrowed from object detection.
    • Implementation: Text instances from foreground images are copied and pasted onto randomly selected background images. Care is taken to avoid overlapping pasted texts.
    • Practical Impact: This technique helps balance the ratio of positive (text) and negative (non-text) samples in the training data, which is particularly beneficial for text detection and improves robustness to diverse scenes.
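As a rough illustration of the CML objective above, the three loss terms can be combined on toy per-pixel probability maps. This is a sketch, not the PaddleOCR implementation: the binary cross-entropy stand-in for the DB loss and the soft teacher target are simplifying assumptions.

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy, standing in for the DB supervised loss."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

def cml_loss(student1, student2, teacher, target):
    # Loss_gt: both students are supervised by the ground-truth map.
    loss_gt = bce(student1, target) + bce(student2, target)
    # Loss_dml: students mutually mimic each other's text/non-text maps
    # via a symmetrized KL divergence over per-pixel distributions.
    dist1 = [[p, 1 - p] for p in student1]
    dist2 = [[p, 1 - p] for p in student2]
    loss_dml = sum(kl_div(a, b) + kl_div(b, a)
                   for a, b in zip(dist1, dist2)) / 2
    # Loss_distill: students also mimic the frozen teacher's map
    # (treated here as a soft target for simplicity).
    loss_distill = bce(student1, teacher) + bce(student2, teacher)
    return loss_gt + loss_dml + loss_distill
```

Note that when the two students agree exactly, the DML term vanishes and only the supervised and distillation terms remain.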

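The CopyPaste augmentation can be sketched with axis-aligned boxes and rejection sampling. This is a toy version: real text instances are rotated polygons and the paste step also blends pixels, both of which are omitted here.

```python
import random

def overlaps(a, b):
    """Axis-aligned overlap test for (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def copy_paste(text_boxes, bg_w, bg_h, max_tries=50, rng=random):
    """Place text boxes (w, h) onto a bg_w x bg_h background,
    rejecting placements that overlap already-pasted boxes."""
    placed = []
    for (w, h) in text_boxes:
        for _ in range(max_tries):
            x = rng.randint(0, max(0, bg_w - w))
            y = rng.randint(0, max(0, bg_h - h))
            box = (x, y, w, h)
            if not any(overlaps(box, p) for p in placed):
                placed.append(box)
                break  # placement accepted; otherwise retry
    return placed
```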
For Text Recognition:

  1. Lightweight CPU Network (PP-LCNet): A new lightweight backbone specifically designed for better accuracy-speed trade-offs on Intel CPUs, improving upon MobileNetV1.
    • Implementation: Key modifications include:
      • Using H-Swish activation function instead of ReLU.
      • Strategically placing SE (Squeeze-and-Excitation) modules near the tail of the network where they are more effective and have less inference time overhead on CPUs.
      • Using larger $5 \times 5$ convolution kernels at the tail of the network.
      • Adding a larger (1280-dimensional) $1 \times 1$ convolution layer after Global Average Pooling (GAP) to enhance feature combination capability.
    • Practical Impact: PP-LCNet offers improved accuracy over MobileNetV3-based models at a faster inference speed on CPUs, making it suitable for efficient recognition.
  2. Unified-Deep Mutual Learning (U-DML): An adaptation of Deep Mutual Learning for text recognition, enhanced with feature map supervision.
    • Implementation: Two networks with identical structures (using PP-LCNet as backbone) are trained from scratch simultaneously. The loss function comprises three parts:

      $Loss_{total} = Loss_{ctc} + Loss_{dml} + Loss_{feat}$

      where $Loss_{ctc}$ is the standard CTC loss (used for CRNN), $Loss_{dml}$ is the KL divergence between the soft outputs of the two networks, and $Loss_{feat}$ is an L2 loss between the intermediate feature maps (backbone outputs) of the two networks. Feature map transformation is not needed because the two architectures are identical.

    • Practical Impact: U-DML improves recognition accuracy through mutual learning and intermediate feature consistency, without requiring a pre-trained, larger teacher model.

    • Additional Implementation Detail: The CTC-Head is modified to use two fully connected layers instead of one for better decoding performance.

  3. Enhanced CTCLoss: A modified loss function to handle the ambiguity of visually similar characters, common in Chinese OCR.
    • Implementation: It combines the standard CTCLoss with a form of CenterLoss [wen2016discriminative]:

      $L = L_{ctc} + \lambda \cdot L_{center}$

      where $L_{center} = \sum_{t=1}^{T} ||x_t - c_{y_t}||_2^2$, $x_t$ is the feature at timestamp $t$, and $c_{y_t}$ is the center for the predicted class $y_t$. A greedy decoding strategy is used to obtain the pseudo-label $y_t = \arg\max(W x_t)$, where $W$ denotes the CTC head parameters.

    • Practical Impact: This loss helps to pull features of the same character class closer in the feature space, improving discrimination between similar characters and boosting recognition accuracy.
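The center-loss term of the Enhanced CTCLoss can be sketched as follows. This is an illustrative toy, not the PaddleOCR code, and the $\lambda$ value used here is an arbitrary assumption.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])

def center_loss(features, W, centers):
    """Sum over timesteps of ||x_t - c_{y_t}||^2,
    where y_t = argmax(W x_t) is the greedy CTC-head pseudo-label."""
    total = 0.0
    for x in features:                 # x is the feature at timestep t
        y = argmax(matvec(W, x))       # greedy pseudo-label
        c = centers[y]                 # center of the predicted class
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total

def enhanced_ctc(ctc_loss_value, features, W, centers, lam=0.05):
    # L = L_ctc + lambda * L_center; lam = 0.05 is an assumed value,
    # not a number taken from the paper.
    return ctc_loss_value + lam * center_loss(features, W, centers)
```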

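Among the PP-LCNet modifications listed earlier, the H-Swish substitution is the easiest to illustrate. The sketch below uses the standard definition $x \cdot \mathrm{ReLU6}(x+3)/6$, a cheap piecewise approximation of swish that avoids computing exp() on CPUs.

```python
def relu6(x):
    """ReLU clipped at 6: min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)

def h_swish(x):
    """H-Swish activation: x * relu6(x + 3) / 6.
    Behaves like identity for large positive x and zero for x <= -3."""
    return x * relu6(x + 3.0) / 6.0
```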
Experimental Results & System Performance:

  • Experiments were conducted on large Chinese datasets combining real and synthetic data for both detection and recognition (97k images for detection training, 17.9M images for recognition training). System-level evaluation used 300 real application images.
  • Ablation studies confirm the effectiveness of each proposed trick individually and combined. CML + CopyPaste significantly improved detection Hmean. PP-LCNet, U-DML, and Enhanced CTCLoss collectively boosted recognition accuracy.
  • Comparing the full PP-OCRv2 mobile system with PP-OCR mobile: PP-OCRv2 achieves an Hmean 7.3 points higher (0.576 vs 0.503) with a slightly larger model size (11.6M vs 8.1M) but similar or slightly faster inference time on both CPU (330ms vs 356ms) and GPU (111ms vs 116ms).
  • PP-OCRv2 mobile performance is comparable to the much larger PP-OCR server model (0.576 Hmean for v2 mobile vs 0.570 Hmean for server) while being significantly more efficient (11.6M size vs 155.1M, much faster inference).

The authors open-sourced all models and code in the PaddleOCR repository, powered by PaddlePaddle.

In summary, PP-OCRv2 successfully improves upon PP-OCR by introducing a suite of carefully designed techniques, including advanced distillation (CML, U-DML), effective data augmentation (CopyPaste), a CPU-optimized network backbone (PP-LCNet), and a recognition-specific loss enhancement (Enhanced CTCLoss). These strategies lead to a more accurate and robust OCR system that remains highly efficient for practical deployment on lightweight platforms, achieving performance comparable to larger server models.

Authors (12)
  1. Yuning Du
  2. Chenxia Li
  3. Ruoyu Guo
  4. Cheng Cui
  5. Weiwei Liu
  6. Jun Zhou
  7. Bin Lu
  8. Yehua Yang
  9. Qiwen Liu
  10. Xiaoguang Hu
  11. Dianhai Yu
  12. Yanjun Ma