General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model (2409.01704v1)

Published 3 Sep 2024 in cs.CV

Abstract: Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.

The paper proposes the General OCR Theory and a model named GOT to address the limitations of traditional OCR systems, which it calls OCR-1.0. Traditional OCR methods typically rely on multi-module pipelines (element detection, region cropping, and character recognition) that are prone to local optima and costly to maintain. Such systems generalize poorly and usually require a separate network for each OCR sub-task.

Model Architecture and Training

The proposed GOT model has 580M parameters and uses a unified, end-to-end encoder-decoder architecture in place of the modular design of legacy OCR systems. The high-compression encoder, built on the ViTDet architecture and optimized through multi-stage training, converts input images into tokens. The long-context decoder is a Qwen-0.5B model that generates output in multiple formats, including plain text, Markdown, TikZ, SMILES, and kern.
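
To make "high-compression" concrete, the following sketch counts how many tokens a ViT-style encoder emits for one image. The input size, patch size, and number of stride-2 downsampling stages are illustrative assumptions, not values stated in this summary, and the function name is hypothetical.

```python
def encoder_token_count(image_size: int, patch_size: int, downsample_stages: int) -> int:
    """Count image tokens after patch embedding and stride-2 downsampling.

    Assumes a square input whose side is divisible by the patch size.
    """
    side = image_size // patch_size       # patches along one side
    for _ in range(downsample_stages):
        side //= 2                        # each stage halves the grid side
    return side * side


# Assumed, illustrative settings (not taken from the summary above).
IMAGE_SIZE = 1024
tokens = encoder_token_count(IMAGE_SIZE, patch_size=16, downsample_stages=2)
pixels_per_token = (IMAGE_SIZE * IMAGE_SIZE) // tokens
print(tokens, pixels_per_token)
```

Under these assumptions a whole page is handed to the decoder as a few hundred tokens, which leaves most of the decoder's context for the generated text.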

To meet the requirements of OCR-2.0, the model accepts both scene- and document-style images, as slices or whole pages, and can produce plain or formatted output. It also supports interactive OCR, that is, region-level recognition guided by coordinates or colors. In addition, dynamic-resolution processing handles ultra-high-resolution images, and multi-page OCR extends the model to multi-page documents.
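
One common way to realize dynamic resolution is to cover a large image with fixed-size crops and encode each crop separately. The sketch below computes such a crop grid; it is a generic tiling scheme under an assumed tile size of 1024 pixels, and it is not claimed to be the exact procedure GOT uses. The name `tile_boxes` is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def tile_boxes(width: int, height: int, tile: int = 1024) -> List[Box]:
    """Cover a width x height image with a row-major grid of crops.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height, and tile must be positive")
    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


# A wide 2500 x 1024 scan needs three crops; the last one is narrower.
for box in tile_boxes(2500, 1024):
    print(box)
```

Each box can be passed to an image library's crop routine. The number of crops grows with image area, so the total token count fed to the decoder grows with it as well.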

Data Generation and Training Strategies

The paper details the synthetic data engines behind the model's training, which are designed to cover a wide range of OCR tasks. They produce plain-text OCR data, fine-grained OCR data, and more specialized synthetic data for math formulas, molecular formulas, tables, and charts:

Plain OCR Data: 5M image-text pairs covering scene text and document OCR in English and Chinese, sourced and rendered from Laion-2B, Wukong, and various open-access PDFs.
Formatted OCR Data: Math and molecular formulas rendered with Mathpix-markdown-it and tables rendered with LaTeX, so that the model learns to handle complex structured text.
General OCR Data: Sheet music (rendered via Verovio), geometric shapes (via TikZ), and charts (using Matplotlib and Pyecharts).
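
The render-based engines share one idea: the source code that draws an image also serves as its ground-truth label. The sketch below illustrates this for geometric shapes by sampling small TikZ programs. It is a simplified, hypothetical generator (`make_tikz_sample` is not from the paper), and the rendering step, which needs a LaTeX toolchain, is left out.

```python
import random
from typing import Dict

SHAPES = ["circle", "rectangle", "line"]


def make_tikz_sample(rng: random.Random) -> Dict[str, str]:
    """Sample one shape and return its TikZ source as the OCR target."""
    kind = rng.choice(SHAPES)
    x, y = rng.randint(0, 5), rng.randint(0, 5)
    if kind == "circle":
        radius = rng.randint(1, 3)
        body = f"\\draw ({x},{y}) circle ({radius});"
    elif kind == "rectangle":
        w, h = rng.randint(1, 4), rng.randint(1, 4)
        body = f"\\draw ({x},{y}) rectangle ({x + w},{y + h});"
    else:
        dx, dy = rng.randint(1, 4), rng.randint(1, 4)
        body = f"\\draw ({x},{y}) -- ({x + dx},{y + dy});"
    target = "\\begin{tikzpicture}\n" + body + "\n\\end{tikzpicture}"
    return {"kind": kind, "target": target}


# A seeded generator makes the synthetic set reproducible.
rng = random.Random(0)
samples = [make_tikz_sample(rng) for _ in range(3)]
for sample in samples:
    print(sample["kind"])
    print(sample["target"])
```

Compiling each `target` to an image would yield an (image, text) training pair. The same pattern applies to charts and sheet music, with the plotting script or notation file playing the role of the label.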

Experimental Results and Analysis

According to the paper's experiments, GOT outperforms the models it is compared against across a range of OCR tasks:

Plain Document OCR: The model improves on edit distance, F1-score, and BLEU for both English and Chinese text recognition.
Scene Text OCR: On natural images, GOT achieves higher precision, recall, and overall accuracy than competing models.
Formatted Document OCR: The dynamic resolution approach improves the fidelity with which the model reproduces complex documents, including tables and formulas.
Fine-grained OCR: Recognition of text within specified regions, guided by coordinates or colors, is evaluated against established benchmarks and performs well.
General OCR Tasks: The model also handles less conventional OCR targets such as sheet music and geometric shapes, which broadens its practical utility.
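
Two of the metrics named above, edit distance and F1-score, are easy to state in code. The sketch below uses a character-level Levenshtein distance normalized by the longer string and a word-level F1 over whitespace tokens. These are common definitions chosen as assumptions; the paper's exact tokenization and normalization may differ.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0


def word_f1(pred: str, ref: str) -> float:
    """Harmonic mean of word-level precision and recall (bag of words)."""
    p, r = pred.split(), ref.split()
    if not p or not r:
        return 1.0 if p == r else 0.0
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


print(normalized_edit_distance("OCR-2.0 model", "OCR-2.0 models"))
print(word_f1("the cat sat", "the cat"))
```

Lower is better for edit distance and higher is better for F1, so the two metrics are usually reported side by side.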

Implications and Future Directions

The research marks a shift toward a unified OCR model that addresses the deficiencies of OCR-1.0 systems and integrates features typically associated with large vision-language models (LVLMs), while keeping a reasonable computational footprint. This OCR-2.0 approach broadens access to intelligent character recognition across domains ranging from scientific publications to data visualization tools.

Future work could improve the model's robustness and applicability, for example by supporting more languages and more complex artificial signals. Continued progress in synthetic data generation and training strategies may drive further advances in OCR, potentially converging on a single model that recognizes both text and structural elements.

In conclusion, the proposed GOT model promises gains in efficiency, versatility, and accuracy for document analysis and text recognition across a broad range of applications.