GutenOCR: Grounded Prompt-Based OCR
- GutenOCR is a grounded OCR system that fine-tunes Qwen2.5-VL backbones to perform unified document analysis through prompt-based interfaces.
- It exposes line- and paragraph-level detection, transcription, and region-conditioned queries through a single prompt-based interface, replacing brittle multi-stage OCR pipelines.
- Fine-tuning yields large composite-score gains over the Qwen2.5-VL baselines on business and scientific documents, with explicit, measured trade-offs.
GutenOCR is a family of grounded OCR (Optical Character Recognition) front-ends obtained by fully fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B, transforming general-purpose multimodal models into unified, prompt-based vision-language systems for document analysis. Rather than relying on brittle classical pipelines or rigid page-to-Markdown converters, GutenOCR exposes reading, detection, and spatial grounding primitives in a single checkpoint, driven by minimal prompt engineering. Fine-tuned on diverse business documents, scientific articles, and synthetic grounding schemas, GutenOCR supports granular reading and detection tasks with explicit line- and paragraph-level bounding boxes, localized transcription, and conditional region queries, establishing a competitive baseline for grounded OCR protocols (Heidenreich et al., 20 Jan 2026).
1. Model Architecture and Variants
GutenOCR builds upon publicly available, instruction-tuned Qwen2.5-VL backbones (3B and 7B parameters). Qwen2.5-VL implements a NaViT-style multimodal encoder tightly integrated with a language decoder supporting long-context page images and coordinate-based token grounding. GutenOCR fine-tunes the entire model (no adapters or frozen components) with a standard maximum-likelihood objective on concatenated prompt and output token sequences. Each training sample is structured as follows (a schematic example appears after the list):
- Input: Image of a page, optional text query or bounding box, plus a prompt template.
- Target: Token sequence encoding plain text, JSON arrays (boxes or box–text objects), or structured outputs.
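A minimal sketch of how such a sample might look on disk; the field names and target string are illustrative assumptions, not the released GutenOCR schema:

```python
# Illustrative training-sample layout (field names and values are
# assumptions for exposition, not the released GutenOCR schema).
sample = {
    "image": "page_000123.png",  # rendered page image
    "prompt": "Extract line-level JSON objects with text "
              "and [x1,y1,x2,y2] boxes from this page.",
    # The target is just a token sequence; here, a JSON array of
    # box-text objects serialized as a string.
    "target": '[{"text": "Invoice No. 4711", "bbox": [102, 88, 412, 114]}]',
}
```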
The training loss is the cumulative cross-entropy over all output tokens:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, x\right),$$

where $x$ denotes the image-plus-prompt context and $y_{1:T}$ the target token sequence.
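This is the usual prompt-masked language-modeling objective; a minimal PyTorch sketch, assuming prompt and image positions are excluded from the loss via the standard `-100` label convention:

```python
import torch
import torch.nn.functional as F

def grounded_ocr_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cumulative cross-entropy over output tokens only.

    logits: (batch, seq_len, vocab) next-token predictions.
    labels: (batch, seq_len) target ids, with prompt/image positions
            set to -100 so the loss ignores them.
    """
    # Shift so position t predicts token t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```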
Because bounding box coordinates and JSON syntax are co-tokenized with natural language, detection and grounding tasks are framed as pure language modeling. The two primary variants—GutenOCR-3B and GutenOCR-7B—share the architecture, differing only in parameter count and resultant capacity.
2. Training Data and Prompt Templates
The fine-tuning stage draws from five major public sources, encompassing both authentic and synthetic documents:
- OCR-IDL: 26 million business document pages (Amazon Textract).
- TabMe++: 122 thousand invoices and forms (Azure OCR).
- PubMed-OCR: 1.5 million scientific article pages (Google Vision OCR).
- Grounded LaTeX: 3 million synthetic pages with equation-grounded tight boxes.
- SynthDoG Grounding: 1.2 million synthetic prose pages with line-level boxes.
Evaluation is conducted on a held-out set of 10,500 uniformly sampled pages from OCR-IDL, TabMe++, and PubMed-OCR.
Each OCR primitive—full-page reading, full-page detection, conditional detection, and localized reading—leverages a shared prompt interface. Robustness is enhanced by randomly varying prompt templates in terms of determiners and document nouns during training.
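As a sketch of this augmentation, the template pool below is hypothetical, chosen only to illustrate determiner and document-noun variation:

```python
import random

# Hypothetical surface-form pools; the paper's actual template
# inventory is not published.
DETERMINERS = ["the", "this", "the provided", "the given"]
DOC_NOUNS = ["document", "page", "image", "scan"]

def sample_reading_prompt(rng: random.Random) -> str:
    """Draw one full-page reading prompt with a randomized surface form."""
    det = rng.choice(DETERMINERS)
    noun = rng.choice(DOC_NOUNS)
    return f"Read all text in {det} {noun} and return the result as text."

rng = random.Random(0)
print(sample_reading_prompt(rng))  # one randomized phrasing per call
```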
Representative Prompts
| Task | Example Prompt | Output Format |
|---|---|---|
| Full-page reading | "Read all text in the provided document and return the result as text." | Plain text |
| Reading (lines) | "Extract line-level JSON objects with text and [x1,y1,x2,y2] boxes." | JSON array |
| Full-page detection | "Without returning text, detect all LINES ... boxes as a JSON array." | JSON array of boxes |
| Conditional detection | "Where does the exact string ‘TOTAL’ ... boxes for all matching lines as JSON." | JSON array of boxes |
| Localized reading | "What text appears inside the region [100,200,400,250] ... recognized text." | Plain text |
All bounding boxes are axis-aligned in pixel coordinates.
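Because detection outputs are JSON arrays of pixel-space boxes, downstream consumers can parse and sanity-check them directly. A minimal sketch; the clipping and degenerate-box policies here are assumptions, not behavior specified by the paper:

```python
import json

def parse_boxes(raw: str, width: int, height: int) -> list[list[int]]:
    """Parse a model-emitted JSON array of [x1, y1, x2, y2] boxes,
    clipping to page bounds and dropping degenerate boxes
    (both policies are assumptions for robustness)."""
    cleaned = []
    for x1, y1, x2, y2 in json.loads(raw):
        x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
        y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
        if x2 > x1 and y2 > y1:
            cleaned.append([x1, y1, x2, y2])
    return cleaned

print(parse_boxes("[[100, 200, 400, 250]]", width=1024, height=1448))
```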
3. Grounded OCR Evaluation Protocol
GutenOCR is quantitatively assessed using a unified suite of metrics spanning four task families:
- Text accuracy (reading, localized):
  - Character Error Rate (CER): $\mathrm{CER} = (S + D + I)/N$, where $S$, $D$, and $I$ count character substitutions, deletions, and insertions against a reference of $N$ characters.
  - Word Error Rate (WER): the same edit-distance ratio computed over word tokens. Both rates are implemented in the sketch below.
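A minimal edit-distance implementation of both rates in pure Python (no external dependencies):

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(1, len(ref))

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / max(1, len(ref.split()))

print(round(cer("TOTAL 42.00", "TOTAL 42,00"), 3))  # one substitution -> 0.091
```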
- Detection quality (detection, conditional) at IoU ≥ 0.5, via maximum-weight bipartite matching between predicted and ground-truth boxes, yielding a set $M$ of matched pairs (a matching sketch follows this list):
  - Precision: $|M|$ divided by the number of predicted boxes.
  - Recall: $|M|$ divided by the number of ground-truth boxes.
  - F1: harmonic mean of precision and recall.
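A sketch of this matching using scipy's Hungarian solver; maximizing total IoU and then thresholding at 0.5 is one standard realization of the maximum-weight matching the protocol describes:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def detection_prf(pred, gt, thr=0.5):
    """Precision / recall / F1 under bipartite matching at IoU >= thr."""
    if not pred or not gt:
        return 0.0, 0.0, 0.0
    ious = np.array([[iou(p, g) for g in gt] for p in pred])
    rows, cols = linear_sum_assignment(-ious)  # maximize total IoU
    tp = int(sum(ious[r, c] >= thr for r, c in zip(rows, cols)))
    prec, rec = tp / len(pred), tp / len(gt)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```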
- End-to-end structured reading:
  - [email protected]: average CER over box–text pairs matched at IoU ≥ 0.5.
  - text2d CER: CER after a deterministic 2D reading-order linearization of the page, jointly penalizing region detection, ordering, recognition, and insertion/deletion failures.
The composite grounded OCR score is defined by averaging $1 - \mathrm{CER}$ for reading tasks and [email protected] for detection tasks across all categories.
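Under that definition (the $1 - \mathrm{CER}$ transform is the reconstruction used above, making higher uniformly better), the composite reduces to a simple mean:

```python
def composite_score(reading_cers, detection_f1s) -> float:
    """Mean of (1 - CER) over reading tasks and [email protected] over detection
    tasks; the 1 - CER transform follows the reconstruction above."""
    parts = [1.0 - c for c in reading_cers] + list(detection_f1s)
    return sum(parts) / len(parts)

# Illustrative only: pooling a few per-task numbers into one score.
print(round(composite_score([0.202, 0.147, 0.129], [0.787, 0.882]), 3))
```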
4. Experimental Outcomes and Trade-Offs
On the 10.5K held-out business and scientific pages, GutenOCR-7B lifts the composite score from 0.396 (Qwen2.5-VL-7B baseline) to 0.819, with key task-wise improvements as follows:
| Task (7B) | Qwen2.5-VL-7B baseline | GutenOCR-7B |
|---|---|---|
| Full-page text CER | 0.333 | 0.202 |
| text2d CER | 0.522 | 0.280 |
| Line-level reading CER | 0.633 | 0.147 |
| Localized reading CER | 0.530 | 0.129 |
| Full-page detection [email protected] | 0.111 | 0.787 |
| Conditional detection [email protected] | 0.285 | 0.882 |
Region- and line-level performance on external benchmarks is substantially enhanced:
- Fox OCR subtasks (7B, baseline → GutenOCR):
  - Page F1: 0.984 → 0.973; page-level CER rises due to complex layout ordering.
  - Region CER: 0.163 → 0.067.
  - Line CER: 0.701 → 0.211.
  - Color-guided CER: 0.109 → 0.963 (catastrophic forgetting).
- OmniDocBench v1.5 (English pages):
  - Text detection [email protected]: 0.02 → 0.55–0.62.
  - Component-level CER worsens marginally, especially on multi-colored backgrounds (e.g., 7B: 0.011 → 0.024).
  - Formula recognition (CDM score): 0.935 → 0.927 (7B).
This suite of results demonstrates that grounded fine-tuning yields pronounced gains in region/line detection, localized reading, and composite task scores, with explicit trade-offs in layout-sensitive linearization, color-guided extraction, and formula-heavy regions.
5. Practical Integration and Interface
GutenOCR provides a set of stable, prompt-configurable primitives that function as a flexible OCR API for downstream systems, supporting the following use patterns via greedy decoding (a runnable sketch follows the list):
- Full-page, line-level reading (JSON):
  - Prompt: "Extract line-level JSON objects with text and [x1,y1,x2,y2] boxes from this page."
  - Output: JSON array of objects of the form {"text": ..., "bbox": [...]}.
- Conditional detection:
  - Prompt: "Where does the exact string ‘Total Amount’ appear ...?"
  - Output: JSON array of matching line boxes.
- Localized reading:
  - Prompt: "What text appears inside the region [400,300,850,400] ...?"
  - Output: Transcribed text string.
- Full-page detection:
  - Prompt: "Detect all PARAGRAPHS ... return a JSON array of boxes."
  - Output: JSON array of bounding boxes.
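A minimal inference sketch using Hugging Face transformers, assuming a Qwen2.5-VL-compatible release; the checkpoint ID "GutenOCR-7B" is a placeholder, as the source gives no hub ID:

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

CKPT = "GutenOCR-7B"  # placeholder; no hub ID is given in the source

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    CKPT, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(CKPT)

image = Image.open("page.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract line-level JSON objects with text "
                             "and [x1,y1,x2,y2] boxes from this page."},
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image],
                   return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=2048, do_sample=False)  # greedy
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

Switching tasks is then just a matter of swapping the prompt string, per the table in Section 2.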
All functions operate from a single checkpoint, allowing seamless prompt-based task switching without model reloading or routing logic. This suggests high utility for modular document intelligence pipelines.
6. Limitations and Prospective Directions
GutenOCR’s targeted specialization yields a recall-oriented, grounded OCR system that exhibits several notable limitations:
- Total failure on color-pointer tasks (“color-guided CER”).
- Slightly reduced performance in multi-colored-region OCR (OmniDocBench).
- Negative transfer impact on formula recognition (CDM score drop).
- Absence of table structure extraction, math layout parsing, cross-page linking, and discourse segmentation.
- Increased CER under page-to-Markdown reading order, despite maintaining high page-level F1.
- Unknown runtime/throughput at high-resolution inference.
A plausible implication is that broadening the training mixture (including math/table-rich supervision and color-cue grounding), adopting richer decoding interfaces, and developing integration modules for retrieval and QA atop GutenOCR’s grounded outputs could enable a “document hologram”: a unified, queryable representation capturing content, layout, semantics, and direct pixel provenance. Future research will need to address interface expansion, algorithmic robustness across diverse document genres, and throughput scalability (Heidenreich et al., 20 Jan 2026).