General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
The paper presents a significant evolution in the field of Optical Character Recognition (OCR) by proposing the General OCR Theory (GOT) and introducing an innovative model that addresses the limitations of traditional OCR systems, referred to as OCR-1.0. Traditional OCR methods often rely on multi-modular pipelines of element detection, region cropping, and character recognition, whose separately optimized modules are prone to local optima and incur high maintenance costs. Such systems lack generalizability and usually require different networks tailored to specific OCR sub-tasks.
Model Architecture and Training
The proposed GOT model, with 580M parameters, adopts a unified, end-to-end encoder-decoder architecture that eschews the modular design of legacy OCR systems. The encoder, a high-compression component based on the ViTDet architecture and optimized through multi-stage training, maps optical images to a compact sequence of tokens. The decoder, a long-context Qwen-0.5B model, generates output in multiple formats, including plain text, Markdown, TikZ, and SMILES.
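To make the encoder's compression role concrete, the sketch below estimates how many vision tokens a ViT-style encoder emits for a given input size. The patching and downsampling scheme shown (16-pixel patches, a 4x grid reduction) is an illustrative assumption, not the paper's exact specification.

```python
def vision_token_count(height, width, patch_size=16, downsample=4):
    """Estimate the number of tokens a ViT-style encoder emits for an image.

    Assumption for illustration: the image is split into patch_size x
    patch_size patches, and the patch grid is then reduced by `downsample`
    along each axis before the tokens reach the language decoder.
    """
    patches_h = height // patch_size
    patches_w = width // patch_size
    return (patches_h // downsample) * (patches_w // downsample)

# A 1024x1024 input yields a 64x64 patch grid, reduced to 16x16 = 256 tokens.
print(vision_token_count(1024, 1024))
```

The point of the high compression ratio is that even a dense document page becomes a short token sequence, which keeps the decoder's context budget available for long textual outputs.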
To meet the requirements of OCR-2.0, the model supports versatile input types (scene and document-style images) and offers flexibility in output formatting. Additionally, it features interactive OCR capabilities for region-level recognition, guided by box coordinates or colors. Moreover, the model supports dynamic-resolution processing for ultra-high-resolution images and multi-page document OCR, enhancing its practical application scope.
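The interactive, region-level mode can be pictured as prompt construction: the task type and an optional box or color guide are folded into the text prompt sent to the decoder. The template below is hypothetical; the actual prompt format used by GOT is not reproduced here.

```python
def build_ocr_prompt(task="plain", box=None, color=None):
    """Compose a text prompt for region-level (fine-grained) OCR.

    `task`, `box`, and `color` mirror the model's interactive inputs in
    spirit only; this exact template is an assumption for illustration.
    """
    parts = ["OCR"]
    if task == "format":
        parts.append("with format")
    if box is not None:
        x1, y1, x2, y2 = box
        parts.append(f"in box [{x1},{y1},{x2},{y2}]")
    elif color is not None:
        parts.append(f"in {color} region")
    return " ".join(parts) + ":"

print(build_ocr_prompt(box=(80, 120, 400, 260)))
print(build_ocr_prompt(task="format", color="red"))
```

Folding the guide into the prompt is what lets a single end-to-end model serve whole-page, boxed-region, and color-cued recognition without separate detection heads.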
Data Generation and Training Strategies
The paper meticulously details the synthetic data generation process that underpins the model's training, ensuring coverage across diverse OCR tasks. The data engines generated significant amounts of plain text OCR data, fine-grained OCR datasets, and more sophisticated synthetic datasets involving math formulas, molecular formulas, tables, and charts:
- Plain OCR Data: About 5M image-text pairs covering both scene-text and document OCR, in English and Chinese, sourced and rendered from Laion-2B, Wukong, and various open-access PDFs.
- Formatted OCR Data: Utilizing Mathpix-markdown-it for math and molecular formula rendering, and LaTeX for tables, enabling the model to handle complex structured textual elements.
- General OCR Data: Introducing tasks such as sheet music recognition (rendered via Verovio), geometric shape recognition (via TikZ), and chart OCR (using Matplotlib and Pyecharts).
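The common pattern behind these data engines is that each sample is generated as a (source, label) pair: structured source code is produced programmatically, rendered to an image by an external tool, and the source itself serves as the ground truth. The sketch below generates LaTeX table source in that spirit; the helper name and table scheme are illustrative, and the rendering step (e.g. invoking LaTeX) is deliberately not performed here.

```python
import random

def synth_table_latex(rows, cols, seed=0):
    """Generate LaTeX source for a random numeric table.

    In a data engine like the one described above, this source would be
    rendered to an image and paired with itself as the ground-truth
    label for formatted-OCR training.
    """
    rng = random.Random(seed)  # seeded so pairs are reproducible
    colspec = "|".join("c" * cols)                      # e.g. "c|c|c"
    header = " & ".join(f"c{j}" for j in range(cols)) + r" \\"
    body = [
        " & ".join(str(rng.randint(0, 99)) for _ in range(cols)) + r" \\"
        for _ in range(rows)
    ]
    lines = [r"\begin{tabular}{" + colspec + "}", header] + body + [r"\end{tabular}"]
    return "\n".join(lines)

print(synth_table_latex(2, 3))
```

Charts, sheet music, and geometry follow the same recipe, swapping the renderer (Matplotlib/Pyecharts, Verovio, TikZ) while keeping the generated source as the label.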
Experimental Results and Analysis
Empirical results demonstrate that GOT substantially surpasses current state-of-the-art models across various OCR tasks:
- Plain Document OCR: The model achieves strong performance, with notable improvements in edit distance, F1-score, and BLEU metrics for both English and Chinese text recognition.
- Scene Text OCR: GOT's efficacy extends to natural image OCR, where it achieves higher precision, recall, and overall accuracy compared to competing models.
- Formatted Document OCR: The dynamic resolution approach significantly enhances its ability to interpret and reproduce complex documents, including tables and formulas with higher fidelity.
- Fine-grained OCR: GOT's capability to recognize text within specified regions, guided by coordinates or colors, is validated with robust performance metrics against established benchmarks.
- General OCR Tasks: The model exhibits competence in handling unconventional OCR applications such as sheet music and geometric shapes, further broadening its practical utility.
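The edit-distance metric cited in these results can be computed with a standard Levenshtein dynamic program. The normalization shown (dividing by the longer string's length) is one common convention and may differ from the paper's exact evaluation protocol.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1]; lower is better."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))

print(normalized_edit_distance("OCR-2.0", "OCR 2.0"))  # 1 substitution over 7 characters
```

A lower normalized edit distance means the predicted transcription is closer to the reference, which is why improvements on this metric track improvements in page-level transcription fidelity.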
Implications and Future Directions
The research introduces a major shift toward a unified OCR model that not only addresses the deficiencies of OCR-1.0 systems but also integrates advanced features typically associated with LVLMs, maintaining a reasonable computational footprint. This generalized OCR-2.0 approach signifies a substantial step forward in democratizing access to intelligent character recognition across diverse domains, from scientific publications to data visualization tools.
Future developments could further enhance the model's robustness and applicability, including support for more languages and the inclusion of more complex artificial signals. The continued evolution of synthetic data generation and the refinement of training strategies may drive further advancements in OCR technology, potentially converging toward an all-encompassing model capable of seamless text and structural element recognition.
In conclusion, the proposed GOT model represents a key innovation in OCR, promising improvements in efficiency, versatility, and accuracy, thus facilitating enhanced document analysis and text recognition capabilities across a broad spectrum of applications.