
OCRVerse: Unified OCR & Code Generation

Updated 30 January 2026
  • OCRVerse is a holistic optical character recognition framework that unifies text-centric reading with vision-centric code generation.
  • The model employs a two-stage SFT-RL training process with domain-specific data engineering to achieve competitive and robust performance.
  • Quantitative results demonstrate OCRVerse's effectiveness in parsing tables, charts, and complex images, matching or exceeding larger models.

OCRVerse is a holistic optical character recognition (OCR) framework designed to unify both text-centric and vision-centric recognition within a single end-to-end vision-language model (VLM) architecture. Unlike conventional OCR systems that focus primarily on extracting character-level content from scanned documents or natural scene images, OCRVerse is capable of parsing visually information-dense images such as charts, scientific plots, web pages, and composite diagrams, which require the generation of structured code (e.g., HTML, LaTeX, Python) for faithful representation. By integrating extensive domain-specific data engineering and a staged training methodology, OCRVerse achieves competitive performance comparable to much larger open-source and closed-source models, with robust cross-domain generalization (Zhong et al., 29 Jan 2026).

1. Motivations and Conceptual Foundations

OCRVerse is motivated by the dichotomy observed in real-world visual data: "text-centric" images predominantly contain direct character information (e.g., scanned documents, books), whereas "vision-centric" images encode semantic structure in visual primitives (arrows, geometric shapes, icons) and necessitate code-level output for accurate digital reconstruction. Existing OCR pipelines excel at text-centric tasks but fail to interpret vision-centric content. Modern VLMs often lack explicit modeling for layout structure or code semantics, resulting in hallucinated or malformed outputs. The foundational aim of OCRVerse is to bridge this fragmentation by providing a unified, holistic OCR solution operating entirely within a VLM framework, supporting seamless character reading and structured code generation (Zhong et al., 29 Jan 2026).

2. Data Engineering for Holistic OCR

The OCRVerse training corpus encompasses nine text-centric and six vision-centric domains:

| Text-Centric Domains | Vision-Centric Domains |
| --- | --- |
| Natural scenes, books, magazines, papers, reports, slides, exam sheets, notes, newspapers | Charts, web pages, icons, geometry diagrams, circuit schematics, molecular structures |

Text-centric data sources include open-source scene-text datasets (LSVT, TextOCR, PDF-A, DocStruct4M, DocGenome, IAM, ORAND-CAR, HME) and large-scale crawled PDFs of books, magazines, academic papers, reports, and slide decks. Synthetic augmentation is performed for complex exam questions and mathematical formulas using parameterized HTML templates and MathJax rendering. Data preprocessing involves quality filtering (removal of low-quality pages, formula extraction), annotation with OCR tools, VLM-based re-annotation (e.g., Qwen2.5-VL-72B), and headless-browser rendering for HTML/Markdown.

For vision-centric domains, OCRVerse integrates chart-to-code datasets (MCD, MSRL), web-to-HTML datasets (Web2M, Web2Code), SVG datasets (UniSVG), TikZ/geometry datasets (DaTikZ-v3, Cosyn-400k), and mermaid-based molecule datasets. Domain-specific cleaning ensures complete LaTeX environments, validated SVG primitives, and exclusion of embedded images in HTML. A bootstrapped self-annotation process leveraging specialized models scales code annotation for remaining unlabeled images (Zhong et al., 29 Jan 2026).

3. Two-Stage SFT-RL Multi-Domain Training

OCRVerse training is structured as a two-stage curriculum on the 4B-parameter Qwen3-VL backbone:

Stage 1: Supervised Fine-Tuning (SFT)

  • The image encoder and vision-language adapter are frozen.
  • The autoregressive decoder is tuned using cross-entropy minimization over mixed text- and vision-centric examples:

L_s(\theta) = -\mathbb{E}_{(x, y) \in D_s} \sum_{t=1}^{T} \log P_\theta(y_t \mid x, y_{<t})

  • All domains are shuffled into SFT batches to encourage a unified latent space capturing characters, layout cues, and code tokens.
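The SFT objective above reduces to averaging token-level negative log-likelihoods over a mixed-domain batch. A minimal sketch with toy log-probabilities (no real model; the numbers are illustrative):

```python
import math

def sft_loss(token_log_probs):
    """Mean negative log-likelihood over a batch drawn from D_s.
    Each element holds the log P(y_t | x, y_<t) values the decoder
    assigned to the ground-truth tokens of one (image, target) pair."""
    return sum(-sum(seq) for seq in token_log_probs) / len(token_log_probs)

# Toy mixed-domain batch: one plain-text target, one HTML-code target.
batch = [
    [math.log(0.9), math.log(0.8)],
    [math.log(0.7), math.log(0.6)],
]
loss = sft_loss(batch)
```

Shuffling text- and code-targeted examples into the same batches is what pushes characters, layout cues, and code tokens into one shared latent space.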

Stage 2: Reinforcement Learning (RL)

  • Domain-specific output formats are refined via Group Relative Policy Optimization (GRPO) with a custom reward function r_a(x, y) per domain.
  • For each input x, G outputs \{o_i\} are sampled, rewards \{R_i\} are computed, and normalized advantages A_i = (R_i - \mu_G)/\sigma_G enter the clipped objective:

L_\text{RL}(\theta) = \mathbb{E}_{x, \{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left(\rho_i A_i, \text{clip}(\rho_i, 1 - \epsilon, 1 + \epsilon) A_i\right) \right]

where \rho_i = \pi_\theta(o_i \mid x) / \pi_{\theta_\text{old}}(o_i \mid x).
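The group-relative advantage normalization and the clipped summand can be sketched as follows (toy rewards and ratios; the guard for σ_G = 0 is an implementation assumption, not from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: A_i = (R_i - mu_G) / sigma_G."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # assumption: fall back when sigma_G = 0
    return [(r - mu) / sigma for r in rewards]

def clipped_term(rho, adv, eps=0.2):
    """One summand of the GRPO objective:
    min(rho_i * A_i, clip(rho_i, 1 - eps, 1 + eps) * A_i)."""
    clipped = max(1.0 - eps, min(1.0 + eps, rho))
    return min(rho * adv, clipped * adv)

# One group of G = 4 sampled outputs for a single input x.
rewards = [0.9, 0.5, 0.7, 0.3]   # domain reward per sample o_i
ratios = [1.1, 0.8, 1.5, 0.9]    # rho_i = pi_theta / pi_theta_old (toy values)
advs = grpo_advantages(rewards)
objective = sum(clipped_term(r, a) for r, a in zip(ratios, advs)) / len(advs)
```

Clipping caps how far a single high-advantage sample can pull the policy in one update, which is what keeps the per-domain reward signals from destabilizing the shared decoder.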

  • Text-centric rewards:

r_\text{text} = \frac{1}{|C_\text{valid}|} \sum_{c \in C_\text{valid}} r_c(\text{Pred}_c, \text{GT}_c)

  • r_\text{plain} = 1 - \text{edit\_distance} / \text{length}
  • r_\text{formula} = \text{BLEU}(\text{normalize}_\text{LaTeX}(\text{Pred}), \text{normalize}_\text{LaTeX}(\text{GT}))
  • r_\text{table} = \text{TEDS-S}(\text{normalize}_\text{table}(\text{Pred}), \text{normalize}_\text{table}(\text{GT}))
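A minimal sketch of the plain-text reward, assuming the edit distance is Levenshtein and the normalizer is the ground-truth length (the paper writes only "length"; both choices here are assumptions):

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def r_plain(pred, gt):
    """r_plain = 1 - edit_distance / length, clamped to [0, 1].
    Assumption: normalize by the ground-truth length."""
    if not gt:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - edit_distance(pred, gt) / len(gt))
```

A perfect transcription scores 1.0, and the reward degrades smoothly with each character-level error, which gives GRPO a denser signal than exact-match accuracy.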
  • Vision-centric rewards:

r_\text{vision} = \omega_g \cdot s_\text{global}(\text{Img}, \text{Render}) + \omega_l \cdot \frac{1}{N} \sum_{i=1}^{N} s^{(i)}_\text{local}(\text{patch}_i, \text{patch}'_i)

Visual fidelity is measured using a pre-trained DINOv2 encoder. Additional format-alignment rewards enforce syntactic validity of generated code.
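The weighted global/local reward can be sketched with cosine similarity over feature vectors, which stand in for the DINOv2 embeddings of the input image and its rendered reconstruction (the weights ω_g = ω_l = 0.5 are illustrative, not from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def r_vision(global_pair, patch_pairs, w_g=0.5, w_l=0.5):
    """r_vision = w_g * s_global + w_l * mean_i s_local(patch_i, patch'_i).
    global_pair: (features of Img, features of Render);
    patch_pairs: list of (patch_i, patch'_i) feature pairs."""
    s_global = cosine(*global_pair)
    s_local = sum(cosine(p, q) for p, q in patch_pairs) / len(patch_pairs)
    return w_g * s_global + w_l * s_local
```

The global term rewards overall visual agreement, while the patch-level average penalizes local mismatches (e.g., a single wrong bar in a chart) that a global embedding can wash out.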

RL data selection employs entropy-based filtering for text domains (targeting high-uncertainty, high-complexity samples) and visual complexity-based sampling for vision domains.
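Entropy-based filtering for the text domains can be sketched as keeping samples whose mean predictive entropy exceeds a threshold; the threshold value and data layout here are illustrative, not from the paper:

```python
import math

def token_entropy(prob_dists):
    """Mean per-token entropy of the model's predictive distributions.
    High values flag uncertain, high-complexity samples worth keeping for RL."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return sum(entropy(p) for p in prob_dists) / len(prob_dists)

def select_for_rl(samples, threshold):
    """Keep (name, distributions) samples whose mean entropy >= threshold."""
    return [name for name, dists in samples if token_entropy(dists) >= threshold]

samples = [
    ("easy", [[1.0, 0.0]]),   # model already certain: near-zero entropy
    ("hard", [[0.5, 0.5]]),   # model uncertain: maximum entropy
]
kept = select_for_rl(samples, 0.5)
```

Spending RL steps only where the SFT model is still uncertain concentrates the expensive reward computation on samples that can actually move the policy.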

4. Model Architecture and Tokenization

OCRVerse employs the Qwen3-VL 4B architecture:

  • Vision Encoder: Multi-scale Transformer (Swin-like) producing patch embeddings.
  • Vision-Language Adapter: Lightweight cross-modal layers project visual features into autoregressive decoder key/value space.
  • Decoder: Three-layer autoregressive Transformer LM with causal attention augmented by cross-attention to visual features.
  • Tokenization: Unified vocabulary includes subword units for plain text, reserved tokens for code keywords, tags, and operators (HTML angle brackets, LaTeX backslashes, Python indent markers).

Distinctively, there is no separate layout or detection head; model structure is implicitly captured via shared cross-attention layers (Zhong et al., 29 Jan 2026).
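How a unified vocabulary with reserved code tokens keeps markup intact can be sketched with a toy longest-match tokenizer; the token inventory below is illustrative, not the model's actual special tokens:

```python
def build_vocab(subwords, reserved):
    """Unified vocabulary: subword units plus reserved tokens for code
    keywords, tags, and operators, so HTML tags or LaTeX commands are
    emitted as single tokens rather than fragmented characters."""
    vocab = {tok: i for i, tok in enumerate(subwords)}
    for tok in reserved:
        vocab.setdefault(tok, len(vocab))
    return vocab

def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization over the unified vocabulary."""
    tokens_by_len = sorted(vocab, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for tok in tokens_by_len:
            if text.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
        else:
            out.append(text[i])  # unknown-character fallback
            i += 1
    return out
```

Reserving whole-unit tokens for structural markers (e.g., an HTML tag or a LaTeX command) shortens code sequences and makes malformed output easier to penalize during RL.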

5. Quantitative Results

Text-Centric Tasks (OmniDocBench v1.5)

  • Metrics: Text edit distance (↓), Formula CDM (↑), Table TEDS/TEDS-S (↑), Reading-order edit distance (↓).
  • Overall = ((1 - \text{Text\_edit}) \times 100 + \text{Table\_TEDS} + \text{Formula\_CDM}) / 3
  • Performance: OCRVerse (4B) scores Overall 89.23, surpassing Qwen2.5-VL-72B (87.02), Gemini-2.5 (88.03), and matching 7–38B specialized models. Formula CDM: 87.13 (best); Table TEDS: 85.77; Reading-order edit: 0.068 (gap to layout-aware hybrids remains).
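The Overall score is simple arithmetic over the three components; a sketch with illustrative inputs (not the reported numbers):

```python
def overall(text_edit, table_teds, formula_cdm):
    """OmniDocBench composite: ((1 - Text_edit) * 100 + Table_TEDS + Formula_CDM) / 3.
    text_edit is in [0, 1]; the two other metrics are on a 0-100 scale."""
    return ((1.0 - text_edit) * 100.0 + table_teds + formula_cdm) / 3.0

# Illustrative inputs: edit distance 0.05, TEDS 86.0, CDM 87.0.
score = overall(0.05, 86.0, 87.0)
```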

Vision-Centric Tasks

| Benchmark | OCRVerse Score | Comparator |
| --- | --- | --- |
| ChartMimic | 84.8% exec. | Qwen3-VL-8B: 78.3% |
| UniSVG | 76.3 | GPT-5: 77.3 |
| Design2Code (low-level) | 85.7 | – |
| Design2Code (high-level) | 87.4 (CLIP sim.) | – |
| Image2LaTeX (render) | 88.7% success, EMS 63.1 | best overall |
| ChemDraw | 89.1% exec., Tanimoto 54.7 | – |

Performance is competitive or superior in several tasks, demonstrating notable parameter efficiency against larger models including closed-source GPT-5 and Gemini-2.5-Pro (Zhong et al., 29 Jan 2026).

6. Ablation Studies and Technical Insights

Controlled ablation reveals that SFT alone establishes strong multi-domain baselines (OmniDocBench ~85 Overall). Integration of RL yields a +4 point improvement on text tasks and a relative +6–8% boost in ChartMimic execution rate. This confirms the role of SFT in shared representation learning, while RL’s domain-specific rewards resolve format conflicts and improve structural fidelity. A trade-off is increased training complexity and potential over-specialization, which must be managed using clipping (ϵ\epsilon) and group normalization.

7. Limitations and Prospective Directions

OCRVerse does not currently model explicit region-level layout, which constrains table parsing and reading-order accuracy. Future development will target the integration of lightweight layout priors (e.g., region proposals, graph-based tokens), expand coverage to interactive UIs and 3D molecular renderings, and refine tokenization strategies for long code sequences. Prospective enhancements include gradient-based multi-task weighting during SFT to further mitigate inter-domain conflicts prior to RL. This suggests further architectural and training optimizations may yield improved parsing of visually and structurally complex content (Zhong et al., 29 Jan 2026).
