
GutenOCR Framework: Unified OCR & Grounding

Updated 12 March 2026
  • GutenOCR is a unified OCR system that integrates reading, detection, and spatial grounding via prompt-driven instructions using Qwen2.5-VL models.
  • It pairs a multimodal Transformer with cross-attention fusion and is fine-tuned in a four-stage curriculum on diverse business, scientific, and synthetic datasets.
  • Empirical results demonstrate significant OCR accuracy improvements and enhanced spatial detection, while highlighting trade-offs in page linearization and color-guided reasoning.

GutenOCR is a family of grounded OCR front-ends built upon the Qwen2.5-VL-3B and Qwen2.5-VL-7B “Instruct” vision-LLMs, specialized for document understanding through instruction-driven interaction. The models unify reading, detection, and grounding primitives within a single checkpoint, enabling prompt-based invocation for tasks including full-page reading, region-specific OCR, line and paragraph localization, and string-based querying. The design leverages extensive fine-tuning on heterogeneous business, scientific, and synthetic grounding datasets, resulting in marked improvements in composite grounded OCR accuracy on in-domain and out-of-domain benchmarks, while delineating explicit trade-offs in page linearization, color-guided reasoning, and formula-rich document handling (Heidenreich et al., 20 Jan 2026).

1. Model Architecture and Fine-Tuning

GutenOCR adopts a unified vision-language model (VLM) backbone, with Qwen2.5-VL-3B and Qwen2.5-VL-7B serving as the foundational architectures. The vision encoder is a NaViT-style Transformer that ingests single rasterized pages at 72 dpi with no cropping or tiling, preserving holistic page-level context. The language decoder is an instruction-tuned autoregressive Transformer that uses a shared tokenizer and introduces no new tokens.

Fusion between the visual and textual streams is implemented via cross-attention at each decoder layer, following the canonical Transformer paradigm:

$$A = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d}} \right) V,$$

with $Q\in\mathbb{R}^{t\times d}$ the query matrix from text embeddings and $K,V\in\mathbb{R}^{n\times d}$ the key and value matrices from image patch embeddings.
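The attention formula above can be illustrated with a minimal pure-Python sketch. It only demonstrates the arithmetic of scaled dot-product attention with text queries and image-patch keys and values; the function names and the toy matrices are illustrative assumptions and are not taken from the Qwen2.5-VL or GutenOCR code.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q, K, V):
    """Q is t x d (text queries); K and V are n x d (image patches).

    Returns the t x d matrix A = softmax(Q K^T / sqrt(d)) V.
    """
    d = len(Q[0])
    scale = math.sqrt(d)
    out = []
    for q in Q:
        # One score per image patch for this text position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example (hypothetical values): t = 3 text positions, n = 2 patches, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
A = cross_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, so a query that scores both patches equally (the third row of `Q`) receives the average of the two value vectors.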

All model parameters are trainable, with no frozen layers or adapters. Fine-tuning proceeds in a four-stage curriculum organized by total sequence length (prompt plus output), from under 2,000 tokens up to 16,000 tokens. The loss covers both OCR text prediction and grounding coordinate generation through a single next-token cross-entropy objective:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t},\,\mathrm{prompt},\,\mathrm{image})\,.$$
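As a minimal sketch of this objective, the snippet below sums the negative log-probabilities that a model assigns to the target tokens. The probability tables are hypothetical toy values; in practice they would come from the decoder conditioned on the prompt, the image, and the preceding tokens. Text tokens and coordinate tokens are treated identically, which is the point of the unified loss.

```python
import math

def next_token_loss(step_probs, target_ids):
    """Sum of -log p(y_t | y_<t, prompt, image) over the target sequence.

    step_probs[t] is the model's distribution over the vocabulary at step t
    (a list of probabilities); target_ids[t] is the index of the true token.
    """
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids))

# Hypothetical 3-token vocabulary and a 2-step target sequence.
step_probs = [
    [0.50, 0.25, 0.25],  # distribution at step 1
    [0.10, 0.80, 0.10],  # distribution at step 2
]
target_ids = [0, 1]
loss = next_token_loss(step_probs, target_ids)
```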

Optimization uses AdamW (learning rate $1\times10^{-6}$, weight decay 0.01), with batch size 128 on 8 H100 GPUs in bf16 precision.

2. Prompt-Based Interface and Task Conditioning

GutenOCR exposes its full range of OCR and grounding capabilities through a prompt-based interface, enabling flexible task specification via natural language. Four primary OCR primitives are unified within the same model checkpoint:

  • Full-page reading:
    • text: Extracts a single transcript.
    • text2d: Outputs a layout-preserving transcript, encoding spatial structure via whitespace and newlines.
    • lines / paragraphs: Returns a JSON array of text strings with associated bounding boxes.
  • Full-page detection:
    • Returns JSON arrays of detected region bounding boxes, omitting text.
  • Conditional detection:
    • Image plus query string yields bounding boxes for all matching lines.
  • Localized reading:
    • Given an image and input bounding box, returns the transcript of text within the specified region.

Prompt templates standardize these capabilities. For example:

  • "Read all text in the attached document and return the result as text2d."
  • "Without returning any text, detect all LINES in this image and return their boxes as JSON."
  • "Where does the exact string ‘invoice’ appear in this document? Return an empty array if none."
  • "What text appears inside the region [320,150,780,260] of this page? Return only the recognized text."
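The templates above could be wrapped in small helper functions so that callers select a task by name. This is a sketch only: the function names are hypothetical, the wording is copied from the example prompts, and reusing the reading and detection templates for other output formats or units is an assumption rather than something the source confirms.

```python
def reading_prompt(fmt="text2d"):
    """Full-page reading; `fmt` is one of the output formats named in this section."""
    allowed = {"text", "text2d", "lines", "paragraphs"}
    if fmt not in allowed:
        raise ValueError(f"unknown output format: {fmt}")
    return f"Read all text in the attached document and return the result as {fmt}."

def detection_prompt(unit="LINES"):
    """Full-page detection: boxes only, no text."""
    return (f"Without returning any text, detect all {unit} in this image "
            "and return their boxes as JSON.")

def conditional_prompt(query):
    """Conditional detection: locate every occurrence of a query string."""
    return (f"Where does the exact string ‘{query}’ appear in this document? "
            "Return an empty array if none.")

def localized_prompt(box):
    """Localized reading inside an [x1, y1, x2, y2] pixel box."""
    x1, y1, x2, y2 = box
    return (f"What text appears inside the region [{x1},{y1},{x2},{y2}] of this page? "
            "Return only the recognized text.")

example = localized_prompt((320, 150, 780, 260))
```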

Bounding-box encoding adopts axis-aligned integer pixel coordinates $[x_1, y_1, x_2, y_2]$, clipped to image bounds and dropped if degenerate. Outputs are fully normalized before evaluation, and irreparable JSON is treated as maximal error.
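A minimal sketch of such a normalization step follows. The exact rules are not spelled out here, so several choices are assumptions: the output is taken to be a JSON array of four-element boxes, clipping uses the range 0 to width (or height), a box counts as degenerate when it has zero or negative width or height, malformed entries are skipped, and unparseable JSON is signalled by returning `None` so that the caller can assign the maximal error.

```python
import json

def normalize_boxes(raw_json, width, height):
    """Parse model output into clipped integer boxes, or None if JSON is unusable."""
    try:
        items = json.loads(raw_json)
    except (json.JSONDecodeError, TypeError):
        return None  # caller treats this as maximal error
    if not isinstance(items, list):
        return None
    boxes = []
    for item in items:
        if not (isinstance(item, (list, tuple)) and len(item) == 4):
            continue  # assumption: silently skip malformed entries
        try:
            x1, y1, x2, y2 = (int(round(float(v))) for v in item)
        except (TypeError, ValueError):
            continue
        # Clip to the image bounds.
        x1, x2 = min(max(x1, 0), width), min(max(x2, 0), width)
        y1, y2 = min(max(y1, 0), height), min(max(y2, 0), height)
        # Drop degenerate boxes (no positive area after clipping).
        if x2 <= x1 or y2 <= y1:
            continue
        boxes.append([x1, y1, x2, y2])
    return boxes

raw = "[[-5, 10, 50, 60], [10, 10, 10, 40], [700, 900, 900, 1200]]"
cleaned = normalize_boxes(raw, width=800, height=1000)
```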

3. Training Data and Synthetic Grounding

GutenOCR’s fine-tuning corpus integrates large-scale real-world business, scientific, and synthetic datasets, engineered for document OCR and grounding fidelity.

Real-world corpora:

  • OCR-IDL: 4.6 million documents, 26 million pages of business forms and invoices, annotated via Amazon Textract.
  • TabMe++: 44,800 documents (122,500 pages) from Azure OCR.
  • PubMed-OCR: 209,500 documents, 1.5 million scholarly pages—equation-rich, annotated by Google Vision OCR.

Synthetic grounding sources:

  • Grounded LaTeX (GL, 3M pages): Raw LaTeX snippets mined from Wikimedia, normalized (KaTeX), rendered at random scales, rotations, and positions for granular formula localization.
  • SynthDoG Grounding (SDG, 1.2M pages): Synthetic text with instrumented line-level boxes/transcripts, random fonts, sizes, backgrounds, and augmentation via blur/compression/geometric jitter.

Curriculum mixture is staged by total sequence length:

| Stage    | Composition                                         | Total sequence length |
|----------|-----------------------------------------------------|-----------------------|
| Stage 1  | GL ≈ 48%, OCR-IDL ≈ 45%, SDG ≈ 6%, TabMe++ ≈ 0.4%   | <2k tokens            |
| Stage 2  | OCR-IDL ≈ 95%, TabMe++ ≈ 5%                         | 2k–8k tokens          |
| Stage 3a | OCR-IDL 65%, TabMe++ 3%, PubMed-OCR 32%             | 2k–8k tokens          |
| Stage 3b | PubMed-OCR 100%                                     | 8k–16k tokens         |
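One way to read the staged mixture is as per-stage sampling weights plus a length filter. The sketch below encodes the percentages from the curriculum description and draws a data source for a given stage. How sampling was actually implemented is not described here, so the data structure, the function names, and the half-open treatment of the length boundaries are all assumptions.

```python
import random

# Mixture weights in percent and token ranges, taken from the curriculum stages.
STAGES = {
    "1":  {"tokens": (0, 2000),
           "mix": {"GL": 48, "OCR-IDL": 45, "SDG": 6, "TabMe++": 0.4}},
    "2":  {"tokens": (2000, 8000),
           "mix": {"OCR-IDL": 95, "TabMe++": 5}},
    "3a": {"tokens": (2000, 8000),
           "mix": {"OCR-IDL": 65, "TabMe++": 3, "PubMed-OCR": 32}},
    "3b": {"tokens": (8000, 16000),
           "mix": {"PubMed-OCR": 100}},
}

def fits_stage(stage, n_tokens):
    """True if a sample's total length falls in the stage's range (assumed half-open)."""
    lo, hi = STAGES[stage]["tokens"]
    return lo <= n_tokens < hi

def sample_source(stage, rng):
    """Draw one dataset name in proportion to the stage's mixture weights."""
    mix = STAGES[stage]["mix"]
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]

rng = random.Random(0)
draws = [sample_source("2", rng) for _ in range(1000)]
share_ocr_idl = draws.count("OCR-IDL") / len(draws)
```

`random.choices` accepts unnormalized weights, so the approximate Stage 1 percentages do not need to sum to exactly 100.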

A plausible implication is that the curriculum-driven mixture balances early-stage generalization with late-stage adaptation to long-form, complex scientific layouts.

4. Grounded OCR Evaluation Protocol

GutenOCR evaluation employs metrics targeting both transcription fidelity and spatial grounding.

  • Text accuracy:

    • Character Error Rate (CER):

    $$\mathrm{CER}(\hat{y}, y) = \frac{\mathrm{EditDistance}(\mathrm{norm}(\hat{y}), \mathrm{norm}(y))}{\max\{1,\, \max(|\mathrm{norm}(\hat{y})|, |\mathrm{norm}(y)|)\}}$$

    • Word Error Rate (WER):

    $$\mathrm{WER} = \frac{S + D + I}{N}$$

    with $S$ (substitutions), $D$ (deletions), $I$ (insertions), $N$ (reference tokens).

  • Detection metrics:

    • Intersection-over-Union (IoU) at threshold $\tau=0.5$, with box prediction matching via Hungarian assignment.
    • Precision, Recall, F1 computed as

    $$\mathrm{Prec}_{\tau} = \frac{TP}{TP+FP}, \quad \mathrm{Rec}_{\tau} = \frac{TP}{|G|}, \quad \mathrm{F1}_{\tau} = \frac{2\,\mathrm{Prec}_{\tau}\mathrm{Rec}_{\tau}}{\mathrm{Prec}_{\tau} + \mathrm{Rec}_{\tau}}$$

  • End-to-end grounded reading:

    • CER@0.5: Mean CER over matched box–text pairs (IoU ≥0.5).
    • CERₑ₂ₑ: CER after linearizing all predicted boxes in reading order as a text2d string of the full page.
  • Composite grounded OCR score: Per-task aggregation over reading, detection, and conditional localization.
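The text-accuracy metrics in the list above can be sketched with a standard Levenshtein distance. The normalization function is not specified in this section, so `norm` here simply collapses whitespace, which is an assumption; the guard against an empty reference in `wer` is likewise an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def norm(s):
    """Assumed normalization: strip and collapse runs of whitespace."""
    return " ".join(s.split())

def cer(pred, ref):
    """Edit distance over characters, divided by max(1, longer normalized length)."""
    p, r = norm(pred), norm(ref)
    return edit_distance(p, r) / max(1, max(len(p), len(r)))

def wer(pred, ref):
    """(S + D + I) / N over whitespace tokens, with N the reference token count."""
    p, r = norm(pred).split(), norm(ref).split()
    return edit_distance(p, r) / max(1, len(r))

example_cer = cer("Invoice  Total", "Invoice Tota1")
example_wer = wer("total amount due", "total amount is due")
```

Because the CER denominator is the longer of the two normalized strings, the value always lies between 0 and 1, whereas WER can exceed 1 when the prediction contains many extra words.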

$$\text{composite} = \frac{1}{M}\sum_{i=1}^{M} \text{score}_i, \quad \text{score}_\text{reading} = 1 - \epsilon, \quad \text{score}_\text{detection} = \mathrm{F1}@0.5$$
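As a worked illustration of the composite formula, the sketch below averages per-task scores, converting a reading error rate into a score via one minus the error. The task names and the numeric values are hypothetical, and treating the error as a rate between 0 and 1 is an assumption.

```python
def reading_score(error_rate):
    """score_reading = 1 - epsilon, with epsilon assumed to lie in [0, 1]."""
    return 1.0 - error_rate

def composite_score(task_scores):
    """Unweighted mean of the M per-task scores."""
    if not task_scores:
        raise ValueError("at least one task score is required")
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical per-task results for one model.
scores = {
    "reading": reading_score(0.10),  # error rate of 0.10
    "detection": 0.80,               # F1 at IoU 0.5
    "conditional": 0.70,             # F1 at IoU 0.5 for string queries
}
composite = composite_score(scores)
```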

Evaluation on Fox and OmniDocBench v1.5 spans in-domain and out-of-domain generalization, assessing line/region-level OCR, text recall, and robustness to challenging layouts.
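The detection metrics used in this protocol (IoU, one-to-one matching by Hungarian assignment, then precision, recall, and F1 at a threshold) can be sketched in pure Python as follows. The assignment routine is a textbook implementation of the Hungarian algorithm minimizing the cost 1 − IoU. The conventions for empty inputs and all function names are assumptions, not details from the GutenOCR evaluation code.

```python
def iou(a, b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0, iw) * max(0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns match, where match[i] is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j]: row matched to column j (1-based)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match

def detection_prf(pred, gt, tau=0.5):
    """Precision, recall, F1 after one-to-one matching; a match counts if IoU >= tau."""
    if not pred or not gt:
        return 0.0, 0.0, 0.0  # assumed convention for empty inputs
    rows, cols = (pred, gt) if len(pred) <= len(gt) else (gt, pred)
    ious = [[iou(r, c) for c in cols] for r in rows]
    match = hungarian([[1.0 - x for x in row] for row in ious])
    tp = sum(1 for i, j in enumerate(match) if ious[i][j] >= tau)
    prec, rec = tp / len(pred), tp / len(gt)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1

pred_boxes = [[0, 0, 10, 10], [20, 20, 30, 30], [100, 100, 110, 110]]
gt_boxes = [[0, 0, 10, 10], [21, 20, 31, 30]]
prec, rec, f1 = detection_prf(pred_boxes, gt_boxes)
```

In this toy case two of the three predictions match ground-truth boxes above the threshold, so precision is 2/3 and recall is 1.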

5. Empirical Results and Trade-Off Analysis

Fine-tuning on the described curricula yields substantial score improvements:

  • In-domain (business/scientific):
    • Qwen2.5-VL-7B backbone composite = 0.396 → GutenOCR-7B = 0.819.
    • Qwen2.5-VL-3B backbone = 0.348 → GutenOCR-3B = 0.811.
  • Fox Benchmark (fine-grained OCR tasks):
    • Page-agnostic token F1: GutenOCR-3B = 0.988 (up from 0.961).
    • Region-level CER: 3B backbone 0.260 → GutenOCR-3B 0.053; 7B backbone 0.163 → 0.067.
    • Line-level CER: 3B backbone 0.817 → 0.240; 7B backbone 0.701 → 0.211.
    • Notable trade-off: Page CER degrades (e.g., GutenOCR-3B 0.138 vs. backbone 0.051), reflecting altered reading order from enhanced layout preservation.
    • Color-guided OCR displays catastrophic forgetting, with the error rising after fine-tuning: e.g., 3B backbone 0.768 → GutenOCR-3B 0.940.
  • OmniDocBench v1.5 (out-of-domain stress test):
    • Line-level text detection at IoU threshold 0.5: Qwen2.5-VL backbones ≈0.02 → GutenOCR 0.55–0.62.
    • English span CER: backbone 0.018 → GutenOCR-3B 0.028; backbone 0.011 → GutenOCR-7B 0.024, with degradation most acute on multi-color backgrounds.
    • Formula recognition: Mild negative transfer; e.g., 3B backbone CDM 0.936/CER 0.189 → GutenOCR-3B 0.866/CER 0.294.
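To put the figures above on a common footing, the short sketch below computes relative changes from the reported before and after values. The numbers are copied from the results listed in this section; the helper name and the dictionary layout are illustrative assumptions.

```python
def relative_change(before, after):
    """Signed relative change, e.g. -0.5 means the value halved."""
    return (after - before) / before

# (before, after) pairs taken from the results listed above.
reported = {
    "in-domain composite, 7B (higher is better)": (0.396, 0.819),
    "Fox region-level CER, 3B (lower is better)": (0.260, 0.053),
    "Fox line-level CER, 3B (lower is better)": (0.817, 0.240),
    "Fox page CER, 3B (lower is better)": (0.051, 0.138),
}

changes = {name: relative_change(b, a) for name, (b, a) in reported.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

Read this way, the composite score roughly doubles and region-level CER falls by about four fifths, while page-level CER more than doubles, which is the linearization trade-off noted in the list.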

Trade-offs highlighted:

  • Page-level linearization: The “text2d” mode enhances layout fidelity but increases page-level CER, because its reading order departs from the canonical Markdown rendering order.
  • Color-guided OCR: Absence of color-referring tasks in fine-tuning data erases the backbone model’s zero-shot color reasoning.
  • Formula-heavy layouts: Reduced formula OCR performance, particularly in 3B, reflects the distributional focus of training data.

6. Applications, Limitations, and Implications

GutenOCR demonstrates that a unified, single-checkpoint VLM can effectively serve as a generalist front-end for document OCR, integrating full-page and localized text extraction, spatial grounding, and conditional querying through prompt-based task specification. The empirical performance gains underscore the benefits of targeted fine-tuning on structurally and semantically diverse corpora.

Observed trade-offs, such as degradation in color-guided OCR and formula transcription, suggest the importance of balanced data coverage and indicate that the current models remain limited on document types at the edges of the training distribution. These findings motivate continued investigation into richer “hologram” document representations that jointly encode grounding, structure, and semantics, as well as strategies to mitigate catastrophic forgetting during specialization (Heidenreich et al., 20 Jan 2026).
