Self-Supervised Learning for Text Recognition: A Critical Survey
(2407.19889v1)
Published 29 Jul 2024 in cs.CV
Abstract: Text Recognition (TR) refers to the research area that focuses on retrieving textual information from images, a topic that has seen significant advancements in the last decade due to the use of Deep Neural Networks (DNNs). However, these solutions often necessitate vast amounts of manually labeled or synthetic data. Addressing this challenge, Self-Supervised Learning (SSL) has gained attention by utilizing large datasets of unlabeled data to train DNNs, thereby generating meaningful and robust representations. Although SSL was initially overlooked in TR because of its unique characteristics, recent years have witnessed a surge in the development of SSL methods specifically for this field. This rapid development, however, has led to many methods being explored independently, without taking previous efforts in methodology or comparison into account, thereby hindering progress in the field. This paper, therefore, seeks to consolidate the use of SSL in the field of TR, offering a critical and comprehensive overview of the current state of the art. We will review and analyze the existing methods, compare their results, and highlight inconsistencies in the current literature. This thorough analysis aims to provide general insights into the field, propose standardizations, identify new research directions, and foster its proper development.
This paper, "Self-Supervised Learning for Text Recognition: A Critical Survey" (Penarrubia et al., 29 Jul 2024), provides a comprehensive overview and critical analysis of the application of Self-Supervised Learning (SSL) to Text Recognition (TR), covering both Scene Text Recognition (STR) and Handwritten Text Recognition (HTR). The authors aim to consolidate the rapidly evolving field of SSL for TR, review existing methods, compare their performance, and highlight inconsistencies to propose standardizations and future research directions.
Fundamentals of Text Recognition
The paper defines TR as decoding text images into symbolic sequences. Modern TR models predominantly use end-to-end Deep Neural Networks (DNNs) with encoder-decoder architectures.
Encoders: Extract visual features from the input image. Common architectures include Convolutional Recurrent Neural Networks (CRNNs), which apply a CNN followed by an RNN, and Vision Transformers (ViTs), which process images as sequences of patches using attention mechanisms.
Decoders: Translate the encoded features into a sequence of characters. Popular decoders include Connectionist Temporal Classification (CTC), which aligns variable-length sequences without explicit segmentation; Attention (Att) decoders, typically RNNs that attend over encoder features to decode autoregressively; and Transformer Decoders (TDs), which use self-attention and cross-attention mechanisms. A minimal CRNN-with-CTC sketch is given below.
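To make this pipeline concrete, here is a minimal PyTorch sketch of a CRNN encoder trained with CTC loss. Layer sizes, the 37-class alphabet (36 characters plus the CTC blank), and the dummy labels are illustrative assumptions, not any specific paper's configuration.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Illustrative CRNN: CNN feature extractor -> BiLSTM -> per-frame logits."""
    def __init__(self, num_classes, img_h=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # H/2, W/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # H/4, W/4
        )
        feat_h = img_h // 4
        self.rnn = nn.LSTM(128 * feat_h, 128, bidirectional=True, batch_first=True)
        self.head = nn.Linear(256, num_classes)  # num_classes includes the CTC blank

    def forward(self, x):                        # x: (B, 1, H, W)
        f = self.cnn(x)                          # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one feature per image column ("frame")
        out, _ = self.rnn(f)
        return self.head(out)                    # (B, T, num_classes)

model = TinyCRNN(num_classes=37)                 # 36 characters + blank (assumption)
imgs = torch.randn(4, 1, 32, 128)
log_probs = model(imgs).log_softmax(-1)          # CTC expects log-probabilities
# CTC aligns the T per-frame predictions with shorter, unsegmented label sequences.
targets = torch.randint(1, 37, (4, 6))           # dummy labels; index 0 is the blank
input_lens = torch.full((4,), log_probs.size(1), dtype=torch.long)
target_lens = torch.full((4,), 6, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs.transpose(0, 1), targets, input_lens, target_lens)
```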
Taxonomy of SSL Methodologies for TR
The survey categorizes existing SSL methods for TR based on general SSL paradigms: Discriminative and Generative.
Discriminative Approaches: These methods learn representations by training models to distinguish between different aspects of the data.
Contrastive Learning: Aims to pull representations of similar samples closer and push dissimilar ones apart. TR methods adapt this by defining positive and negative pairs at different levels (frame, subword, character, patch); a minimal frame-level sketch follows the list of methods below.
SeqCLR [aberdam2021sequence] was the first to apply contrastive learning to TR, adapting SimCLR for CRNNs by defining instances at frame, subword, or word levels.
PerSec [liu2022perceiving] uses hierarchical contrastive learning, perceiving stroke-semantic context at different encoder layers to address misalignment and semantic continuity issues.
STR-CPC [jiang2022scene] uses Contrastive Predictive Coding to capture the sequential correlation within text instances, addressing information overlap in CRNN features.
ChaCo [zhang2022chaco] focuses on character-level contrastive learning using a Character Unit Crop module.
CMT-Co [zhang2022cmt] combines character-level movement prediction with word-level contrastive learning for HTR.
RCLSTR [zhang2023relational] enhances textual relations through rearrangement, hierarchy, and interaction with symmetric KL divergence.
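For illustration, a minimal sketch of frame-level contrastive learning in the spirit of SeqCLR, assuming two augmented views whose frame sequences remain aligned; SeqCLR's projection head and instance-mapping functions are omitted.

```python
import torch
import torch.nn.functional as F

def frame_level_nt_xent(z1, z2, tau=0.1):
    """NT-Xent over individual frames ('frame'-level instances).
    z1, z2: (B, T, D) frame features from two aligned augmented views.
    Corresponding frames are positives; all other frames are negatives."""
    b, t, d = z1.shape
    a = F.normalize(z1.reshape(b * t, d), dim=-1)
    p = F.normalize(z2.reshape(b * t, d), dim=-1)
    logits = a @ p.t() / tau                    # (BT, BT) cosine-similarity matrix
    labels = torch.arange(b * t)                # the positive is the matching frame
    return F.cross_entropy(logits, labels)

# usage: encode two augmentations of the same batch of word images
z_view1 = torch.randn(8, 25, 128)               # dummy encoder outputs (assumption)
z_view2 = torch.randn(8, 25, 128)
loss = frame_level_nt_xent(z_view1, z_view2)
```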
Geometric Transformation: Learns by predicting transformations applied to the input.
Flip [penarrubia2024spatial] proposes predicting horizontal/vertical flips as a relevant pretext task for HTR, leveraging the spatial characteristics of handwriting.
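A minimal sketch of such a flip-prediction pretext task, assuming a four-way classification over {none, horizontal, vertical, both}; the paper's exact label set and augmentations may differ.

```python
import torch
import torch.nn as nn

def flip_pretext_batch(images):
    """Build a 4-way flip-classification batch: 0 = none, 1 = horizontal,
    2 = vertical, 3 = both. The labels supervise the pretext head."""
    views, labels = [], []
    for flip_id in range(4):
        x = images
        if flip_id in (1, 3):
            x = torch.flip(x, dims=[-1])        # horizontal flip (width axis)
        if flip_id in (2, 3):
            x = torch.flip(x, dims=[-2])        # vertical flip (height axis)
        views.append(x)
        labels.append(torch.full((images.size(0),), flip_id))
    return torch.cat(views), torch.cat(labels)

images = torch.randn(4, 1, 32, 128)
batch, labels = flip_pretext_batch(images)
features = batch.flatten(1)                     # stand-in for any encoder's global feature
head = nn.Linear(features.size(1), 4)           # pretext classification head
loss = nn.CrossEntropyLoss()(head(features), labels)
```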
Puzzle Solvers: Learn by predicting the correct arrangement of shuffled patches.
Sorting [penarrubia2024spatial] trains a model to order shuffled vertical patches, emphasizing the sequential nature of text, although the authors found it less effective than Flip for HTR.
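A minimal sketch of this sorting pretext task; for brevity it uses one shared permutation per batch and a per-slice position classifier, which may differ from the paper's exact formulation.

```python
import torch
import torch.nn as nn

def shuffle_vertical_patches(images, k=4):
    """Split each image into k vertical slices along the width, shuffle them,
    and return the shuffled image plus the permutation used."""
    b, c, h, w = images.shape
    slices = images.reshape(b, c, h, k, w // k).permute(3, 0, 1, 2, 4)  # (k, B, C, H, w/k)
    perm = torch.randperm(k)                     # perm[i] = original index of slice at position i
    shuffled = torch.cat(list(slices[perm]), dim=-1)
    return shuffled, perm

images = torch.randn(4, 1, 32, 128)
x, perm = shuffle_vertical_patches(images, k=4)
# a shared classifier predicts, for each slice position, its original index
clf = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64), nn.ReLU(), nn.Linear(64, 4))
slice_views = x.reshape(4, 1, 32, 4, 32).permute(0, 3, 1, 2, 4)  # (B, k, 1, 32, 32)
logits = clf(slice_views.reshape(-1, 1, 32, 32))                 # (B*k, k)
labels = perm.repeat(4)                          # per sample: position -> source slice
loss = nn.CrossEntropyLoss()(logits, labels)
```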
Distillation: Uses a teacher-student framework to prevent representation collapse.
CCD [guan2023self] applies a distillation framework (inspired by DINO) to STR, incorporating a self-supervised character segmentation head.
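At the core of such frameworks is a teacher that is never trained by backpropagation but updated as an exponential moving average (EMA) of the student, as in this minimal sketch; DINO additionally uses temperature scaling and centering, omitted here.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """DINO-style teacher update: an exponential moving average of the student."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

student = torch.nn.Linear(128, 64)               # stand-in encoders (assumption)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                      # the teacher receives no gradients

x1, x2 = torch.randn(8, 128), torch.randn(8, 128)  # two augmented views of a batch
s_out = student(x1).log_softmax(-1)
t_out = teacher(x2).softmax(-1)                  # teacher outputs act as soft targets
loss = -(t_out * s_out).sum(-1).mean()           # cross-entropy to the teacher
loss.backward()
ema_update(teacher, student)
```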
Generative Approaches: These methods learn by reconstructing or generating parts of the data.
Image Colorization: Predicts the color of a grayscale image.
SimAN [luo2022siman] recovers the original format of augmented image crops using non-augmented ones, incorporating a Similarity-Aware Normalization module and adversarial loss.
Masked Image Modeling (MIM): Predicts masked patches of an input image; popularized with ViTs. A minimal sketch follows the methods below.
Text-DIAE [souibgui2023text] uses a denoising autoencoder to reconstruct text images degraded by masking, blur, or noise, focusing on masking for TR pre-training.
Dual-MAE [qiao2023decoupling] employs two masking strategies (intra-window and window) to learn visual and contextual information separately for STR.
MaskOCR [lyu2023maskocr] pre-trains both the ViT encoder (using MIM like CAE) and the TD decoder (using synthetic data to predict missing characters).
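The MIM sketch referenced above, in a SimMIM-flavored form: masked patch embeddings are replaced by a learned token, encoded, and regressed back to the original pixels, with the loss computed only on masked patches. The masking ratio, patch size, and lightweight pixel head are illustrative assumptions, and positional embeddings are omitted.

```python
import torch
import torch.nn as nn

class TinyMIM(nn.Module):
    """SimMIM-flavored sketch: mask patch embeddings with a learned token,
    encode, and regress the original pixels of the masked patches.
    Positional embeddings are omitted for brevity; real models add them."""
    def __init__(self, patch_dim=64, dim=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, patch_dim)   # lightweight pixel-regression head

    def forward(self, patches, mask):              # patches: (B, N, patch_dim); mask: (B, N) bool
        z = self.embed(patches)
        z = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(z), z)
        rec = self.decoder(self.encoder(z))
        return (rec - patches).abs()[mask].mean()  # L1 loss on masked patches only

patches = torch.randn(4, 16, 64)                   # e.g. 16 flattened 8x8 gray patches
mask = torch.rand(4, 16) < 0.6                     # ~60% masking ratio (assumption)
loss = TinyMIM()(patches, mask)
```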
Hybrid Approaches: Combine different SSL principles.
DiG [yang2022reading] unifies generative (SimMIM) and contrastive (MoCo v3) learning for STR.
SSM [gao2024self] forces the model to reconstruct original and distorted images from a superimposed input, combining generative and contrastive elements, and focusing on linguistic relationships.
Benchmarking
The paper details the standard datasets and evaluation protocols for SSL-TR.
Datasets:
STR uses large synthetic datasets (SynthText, MJSynth) for training and a set of real benchmark datasets (IIIT, SVT, IC13 for regular text; IC15, SP, CT for irregular text) for evaluation. Recent methods also use large unlabeled real datasets (UTI-100M, URD, Real-U) for pre-training and smaller labeled real datasets (ARD, Real-L) for fine-tuning.
HTR commonly uses labeled word-level datasets (IAM, CVL, RIMES, G.W.). SSL methods pre-train on the full training set (or combinations thereof) and fine-tune on the full training set or a reduced fraction of it (e.g., 5% or 10%).
SSL Evaluation Protocols:
Quality evaluation: Assesses the learned representations by freezing the pre-trained encoder layers and only fine-tuning the decoder or remaining layers (see the sketch after this list).
Semi-supervised evaluation: Fine-tunes the entire pre-trained model on labeled data. This is often used to demonstrate effectiveness with limited labeled data and provides a fairer comparison across methods with varying pre-training strategies.
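A minimal PyTorch sketch of the quality-evaluation setup; the `encoder` attribute name is an assumption about how the model is structured.

```python
import torch

def freeze_encoder(model, encoder_attr="encoder"):
    """Quality evaluation: freeze the pre-trained encoder and train only the
    decoder/head on labeled data. The attribute name is a hypothetical choice."""
    enc = getattr(model, encoder_attr)
    for p in enc.parameters():
        p.requires_grad_(False)
    enc.eval()                                   # also fixes BatchNorm/dropout behavior
    return [p for p in model.parameters() if p.requires_grad]

# the optimizer then only sees the trainable (decoder) parameters, e.g.:
# trainable = freeze_encoder(model)
# opt = torch.optim.AdamW(trainable, lr=1e-4)
```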
Evaluation Metrics: Standard TR metrics are used: Character Error Rate (CER), Word Accuracy (WAcc), and Single Edit Distance (ED1).
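For reference, a minimal implementation of these metrics; the ED1-style variant shown assumes a word counts as correct when its edit distance to the reference is at most one.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (single rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def cer(preds, refs):
    """Character Error Rate: total edit distance over total reference length."""
    edits = sum(edit_distance(p, r) for p, r in zip(preds, refs))
    return edits / sum(len(r) for r in refs)

def word_acc(preds, refs, max_ed=0):
    """Word Accuracy; with max_ed=1 this matches an ED1-style metric."""
    ok = sum(edit_distance(p, r) <= max_ed for p, r in zip(preds, refs))
    return ok / len(refs)

print(cer(["hallo"], ["hello"]))                 # 0.2 (one substitution over five chars)
```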
Comparative Analysis of Performance
The authors analyze reported results for STR and HTR.
STR: ViT-based architectures generally outperform CRNNs, attributed to their lower inductive bias and stronger ability to capture global patterns. Performance on traditional benchmarks like IIIT is nearing saturation, while IC15 shows significant recent improvements. Pre-training on real unlabeled data and fine-tuning on real labeled data shows clear benefits over purely synthetic pipelines. However, comparisons are complicated by inconsistencies in the amount and type of training data used across different methods (Table 6). Newer methods often achieve higher performance partly because they use significantly larger and more diverse datasets.
HTR: Performance on datasets like IAM and CVL shows improvement, but there is still substantial room for progress. Semi-supervised settings with limited labeled data (e.g., 5-10% of IAM) show particularly low performance, highlighting the challenge. Architectures with Attention or Transformer decoders tend to perform better. As with STR, inconsistent training data usage (e.g., merging IAM and CVL vs. using individual training sets) makes strict comparisons difficult and necessitates standardized protocols.
Conclusions and Future Research
The paper concludes that SSL for TR is a rapidly advancing field, mirroring general SSL trends towards MIM and hybrid approaches, especially with ViTs. However, several challenges remain:
Unexplored SSL Categories: Many SSL approaches (clustering, information maximization, GANs, inpainting) have not been fully explored for TR.
Theoretical Understanding: Deeper theoretical analysis is needed to understand how different SSL paradigms learn visual vs. semantic information and how specific design choices (e.g., contrastive units, handling semantic continuity) impact representation quality.
Efficient SSL: Current methods are computationally expensive. Research into more efficient SSL algorithms tailored for TR is necessary.
Standard Evaluation Protocols: Inconsistent training data usage is a major barrier to fair comparison. Standardized datasets, data splits, and reporting practices are crucial for advancing the field.
Future research directions include shifting STR evaluation to more challenging benchmarks like Union14M-Benchmark [jiang2023revisiting] to address issues like salient, multi-word, and incomplete text. For HTR, future work should explore line-level and page-level recognition using SSL, address the variability in line lengths, and prioritize the creation of large unlabeled real HTR datasets, similar to what has been done for STR. Leveraging SSL to improve robustness to data scarcity, domain shifts, and dataset imbalances in both STR and HTR remains a key opportunity.